Compressing text files using emoji
The opinions stated here are my own, not those of my company.
Several weeks ago I had a dream. In this dream I was at my computer writing code (very true-to-life), working on a file compression tool that would count word frequencies, replace the most common words with emoji, and wind up with a smaller file.
When I woke up and ruminated on this dream, I thought that it could actually work. So I took some time to sit down and write a proof-of-concept.
In order to really prove out this idea, I needed a large text file. I ended up grabbing the HTML version of 20,000 Leagues Under the Sea from Project Gutenberg.
To make the proof-of-concept easier, I grabbed the words-frequency package from npm. It very simply gives me a mapping of how often each word is used. Then, I can sort by the most common and start replacing them with emoji.
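The counting step can also be sketched without any package. This is a minimal illustration, not the words-frequency API: the function name rankWords and the tokenizing regex are assumptions made for the example.

```javascript
// Count how often each word appears and return [word, count] pairs,
// most frequent first. Lowercasing and the /[a-z']+/ tokenizer are
// simplifying assumptions, not what the words-frequency package does.
function rankWords(text) {
  const counts = new Map();
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const ranked = rankWords("the sea is the sea and the ship is small");
console.log(ranked.slice(0, 3));
```

The sorted pairs are what the replacement step walks through, assigning one emoji per word from the top of the list down.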
So my algorithm is pretty straightforward: do a series of string replacements and save the output as a new file. I decided to call the extension .ezip, standing for emozip.
The mappings of emoji to text are placed at the top of the file, between BEGINHEADER and ENDHEADER. Each emoji is then followed by the original string on a new line. Below the header is the text with the replacements applied. This makes it easy to go backwards and decompress the file.
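Here is a rough sketch of that round trip. It is not the original proof-of-concept code: the function names, the [emoji, word] input shape, and the reading of the header as alternating emoji and word lines are all assumptions. It also assumes the source text contains none of the chosen emoji.

```javascript
const BEGIN = "BEGINHEADER";
const END = "ENDHEADER";

// Escape a string so it can be used literally inside a RegExp.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// mapping: array of [emoji, word] pairs (hypothetical input shape).
function compress(text, mapping) {
  let body = text;
  const headerLines = [BEGIN];
  for (const [emoji, word] of mapping) {
    // Replace whole words only, so "the" does not match inside "there".
    body = body.replace(new RegExp(`\\b${escapeRegExp(word)}\\b`, "g"), emoji);
    headerLines.push(emoji, word);
  }
  headerLines.push(END);
  return headerLines.join("\n") + "\n" + body;
}

function decompress(ezip) {
  const lines = ezip.split("\n");
  const endIndex = lines.indexOf(END);
  if (lines[0] !== BEGIN || endIndex === -1) {
    throw new Error("missing header");
  }
  let body = lines.slice(endIndex + 1).join("\n");
  // Header entries come in pairs: an emoji line, then its word line.
  for (let i = 1; i < endIndex; i += 2) {
    body = body.split(lines[i]).join(lines[i + 1]);
  }
  return body;
}

const packed = compress("the sea and the ship", [["🌊", "sea"], ["🚢", "ship"]]);
console.log(packed);
console.log(decompress(packed));
```

Keeping the header in plain text means decompression needs nothing but the file itself, at the cost of storing every replaced word once more.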
A shortened version of the compiled file looks like this:
So with some quick iterating I finally got my compressed text. Now how compressed is it?
As it turns out, my ‘compression’ actually made the file larger. Why is that?
Characters are stored in a file as numbers, and in an encoding like UTF-8 the space each number takes varies. Basic Latin characters, like our alphabet, take a single byte each. Emoji, on the other hand, typically take four bytes each, and some take more because they combine several code points. So replacing common words like ‘a’ and ‘the’ actually turns 1-byte and 3-byte words into 4-byte emoji. In effect, I’m replacing smaller data with larger data in the file.
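The byte counts are easy to check in Node. This small sketch uses Buffer.byteLength; the helper name utf8Bytes is made up for the example, and the emoji are written as escapes so the exact code points are unambiguous.

```javascript
// Number of bytes a string occupies when encoded as UTF-8.
function utf8Bytes(s) {
  return Buffer.byteLength(s, "utf8");
}

console.log(utf8Bytes("a"));            // 1
console.log(utf8Bytes("the"));          // 3
console.log(utf8Bytes("\u{1F30A}"));    // 4 (water wave emoji, one code point)
console.log(utf8Bytes("\u2764\uFE0F")); // 6 (red heart: two code points, 3 bytes each)
```

Any word shorter than its emoji in bytes gets bigger, not smaller, when replaced.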
I tried a second iteration. This time, I mandated that only the most common words with more than four letters would be turned into emoji. That way, words of five bytes or more would actually shrink when swapped for a four-byte emoji.
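A sketch of that filter, with one extra step that goes beyond the rule above: it estimates the net bytes saved per word, charging each word for its header entry, and drops words that would not pay for themselves. The function name, the [word, count] input shape, the fixed four-byte emoji size, and the header cost (emoji, word, and two newlines) are assumptions.

```javascript
// ranked: [word, count] pairs sorted by frequency (hypothetical input shape).
function pickCandidates(ranked, minLength = 5, emojiBytes = 4) {
  return ranked
    .filter(([word]) => word.length >= minLength)
    .map(([word, count]) => {
      const wordBytes = Buffer.byteLength(word, "utf8");
      // Each replacement saves (wordBytes - emojiBytes) bytes; the header
      // entry costs the emoji, the word, and two newline bytes.
      const saved = count * (wordBytes - emojiBytes) - (emojiBytes + wordBytes + 2);
      return { word, count, saved };
    })
    .filter((entry) => entry.saved > 0);
}

console.log(pickCandidates([["the", 100], ["captain", 10], ["ocean", 3]]));
```

In this made-up input, "the" is too short to qualify, "captain" saves 3 bytes per use, and "ocean" saves only 1 byte per use, which is not enough to cover its header entry at 3 uses. Per-word savings this small are one reason the overall gain stays modest.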
There are far fewer emoji scattered about, but let’s see if there is a large improvement in file size.
There is actually an improvement, but it isn’t that significant. It amounts to about a 5.5% reduction. That’s not bad but it really doesn’t suggest that I should make everything emoji.
File compression is a huge field. There’s an ever-growing collection of data and a need to store it somewhere. If we can perform lossless compression, we can reduce storage costs and deliver huge savings.
There are many text and binary data compression algorithms. But emoji compression? This ain’t it, chief.