Compressing text files using emoji

The opinions stated here are my own, not those of my company.

Several weeks ago I had a dream. In this dream I was at my computer writing code (very true-to-life), working on a file compression tool that would find the most frequent words in a file, replace them with emoji, and leave you with a smaller file.

When I woke up and ruminated on this dream, I thought that it could actually work. So I took some time to sit down and write a proof-of-concept.

In order to really prove out this idea, I needed a large text file. I ended up grabbing the HTML version of 20,000 Leagues Under the Sea from Project Gutenberg.

To make the proof-of-concept easier, I grabbed the words-frequency package from npm. It very simply gives me a mapping of how often each word is used. Then, I can sort by the most common and start replacing them with emoji.
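As a rough sketch of that counting step, here is a Python version that uses `collections.Counter` instead of the words-frequency package (whose API is not shown in this post). The `word_frequencies` helper and its simple tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word appears, ignoring case."""
    # Treat runs of letters and apostrophes as words. This tokenizer is an
    # assumption and may not match what the npm package does.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


sample = "the sea and the ship and the captain"
freqs = word_frequencies(sample)

# most_common() yields (word, count) pairs from most to least frequent.
for word, count in freqs.most_common(2):
    print(word, count)
```

The words at the top of that list are the first candidates for replacement.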

So my algorithm is pretty straightforward. Do a series of string replacements and save the output as a new file. I decided to call the extension .ezip, standing for emozip.

The mappings of emoji to text are placed at the top of the file, between BEGINHEADER and ENDHEADER. Each emoji is then followed by the original string on a new line. Below the header is the text itself, with the replacements applied. This makes it easy to go backwards and decompress the file.
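The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the original proof-of-concept: it assumes one header entry per line (the emoji immediately followed by the word it replaces), that every emoji is a single code point, and that no chosen emoji already appears in the input. The `compress` and `decompress` names and the emoji assignments are hypothetical.

```python
def compress(text, mapping):
    """Replace each word with its emoji and prepend the lookup header."""
    header = ["BEGINHEADER"]
    body = text
    for word, emoji in mapping.items():
        # Assumed entry layout: emoji immediately followed by the word.
        header.append(emoji + word)
        body = body.replace(word, emoji)
    header.append("ENDHEADER")
    return "\n".join(header) + "\n" + body


def decompress(data):
    """Read the header back and undo every replacement."""
    header, body = data.split("\nENDHEADER\n", 1)
    entries = header.split("\n")[1:]  # skip the BEGINHEADER line
    for entry in entries:
        # Assumes the emoji is exactly one code point long.
        emoji, word = entry[0], entry[1:]
        body = body.replace(emoji, word)
    return body


mapping = {"captain": "🚢", "the": "🌊"}  # made-up assignments
original = "the captain watched the sea"
packed = compress(original, mapping)
print(packed)
print(decompress(packed) == original)
```

Because the header carries every emoji-to-word pair, decompression needs nothing but the file itself.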

A shortened version of the compiled file looks like this:

So with some quick iterating I finally got my compressed text. Now how compressed is it?

Oh

As it turns out, my ‘compression’ actually made the file larger. Why is that?

Characters are stored in a file as numbers, and in an encoding like UTF-8 different characters take up different numbers of bytes. Plain Latin letters, like our alphabet, take a single byte each. Most emoji, on the other hand, take four bytes each (and some multi-part emoji take more). So replacing common words like ‘a’ and ‘the’ actually turns 1-byte and 3-byte words into 4-byte emoji. In effect, I’m replacing smaller data with larger data in the file.
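The size difference is easy to check. A small Python sketch, assuming the file is saved as UTF-8 (the `utf8_size` helper is just for illustration):

```python
def utf8_size(s):
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


for token in ["a", "the", "🌊"]:
    print(repr(token), utf8_size(token))
# 'a' 1
# 'the' 3
# '🌊' 4
```

Every ‘the’ swapped for an emoji therefore adds a byte instead of saving one, before even counting the header.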

I tried a second iteration. This time, I mandated that only the most common words with more than four letters would be turned into emoji. That way, words of five bytes or more would actually shrink when replaced by a four-byte emoji.
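Here is a hedged sketch of that filter in Python, along with a back-of-the-envelope estimate of what each replacement saves. The word counts are invented for the example, the helper names are hypothetical, and the header cost assumes one "emoji + word + newline" entry per replaced word.

```python
from collections import Counter

EMOJI_BYTES = 4  # assumed UTF-8 size of one emoji


def estimated_savings(word, count):
    """Rough net bytes saved by replacing `count` copies of `word`."""
    word_bytes = len(word.encode("utf-8"))
    saved_in_body = count * (word_bytes - EMOJI_BYTES)
    # Each replaced word also costs one header entry: emoji + word + newline.
    header_cost = EMOJI_BYTES + word_bytes + 1
    return saved_in_body - header_cost


def pick_candidates(freqs, limit):
    """Most common words longer than four letters, up to `limit`."""
    long_words = [(w, c) for w, c in freqs.most_common() if len(w) > 4]
    return long_words[:limit]


# Made-up counts, purely for illustration.
freqs = Counter({"the": 500, "captain": 120, "nautilus": 90, "sea": 80, "under": 3})
for word, count in pick_candidates(freqs, limit=3):
    print(word, count, estimated_savings(word, count))
# captain 120 348
# nautilus 90 347
# under 3 -7
```

Two things stand out: a five-letter word saves only one byte per occurrence, and a rare word can cost more in the header than it saves in the body.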

There are far fewer emoji scattered about, but let’s see if there is a large improvement in file size.

Hmm

There is actually an improvement, but it isn’t that significant. It amounts to about a 5.5% reduction. That’s not bad but it really doesn’t suggest that I should make everything emoji.

File compression is a huge field. There’s an ever-growing collection of data and a need to store it somewhere. If we can perform lossless compression, we can reduce storage costs and deliver huge savings.
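For a sense of what an established lossless compressor does, here is a minimal sketch using Python's standard-library `zlib` module (DEFLATE). It is not something measured in this experiment, and the repetitive sample text is made up, so the sizes it prints say nothing about the novel itself.

```python
import zlib

# A deliberately repetitive sample; real prose compresses less dramatically.
line = "the captain watched the sea from the deck of the ship\n"
text = (line * 200).encode("utf-8")

packed = zlib.compress(text)
restored = zlib.decompress(packed)

print(len(text), len(packed))  # original size vs. compressed size in bytes
print(restored == text)        # lossless: the round trip returns the input
```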

There are many text and binary data compression algorithms. But emoji compression? This ain’t it, chief.

Social Media Expert -- Rowan University 2017 -- IoT & Assistant @ Google