A Chrome Extension that understands emoji art for screen readers
We are in the era of emoji art.
It getting dark at 4:30pm
￣￣┗ My ability to stay awake
Person is kicked down the stairs with the caption “It getting dark at 4:30pm My ability to stay awake”
As fun as they can be to look at, emoji art is an awful experience for people who are blind or otherwise use screen readers. Screen-reading software is fairly simple: it converts the words on screen into speech. Every emoji character is read out individually, and other non-alphanumeric characters can make the result hard to follow.
At the same time, I am not convinced the solution is to stop posting them. New art forms are good things, and previous forms of art have not had to go away because of accessibility. Rather, they adapted to expand how they could be consumed.
Television has captions. Photos have descriptions which can be heard audibly. Books can be printed in braille.
Machine Learning to Identify Emoji Patterns
Each classification is stored as a pair of files in a data/ directory. The first file contains examples for training, and the second describes how to transform the original text into a usable caption.
A model file, named <art-name>.training.json, contains a series of examples for a given art classification. The file holds a JSON array of objects, each with two properties:
- text — The original text art
- attribution — A link to the source material
A caption file named <art-name>.regex.txt contains the substitution and replacement values separated by a newline. This lets the system generate an appropriate caption by capturing specific key information. Additional placeholder phrases, like ALL_TEXT, may be used in the regular expression to return all of the readable text on the page.
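For example, a training file for a hypothetical "stairs" classification might look like the following (the entry uses the art from the top of this post; the attribution URL is a placeholder):

```json
[
  {
    "text": "It getting dark at 4:30pm\n￣￣┗ My ability to stay awake",
    "attribution": "https://example.com/original-tweet"
  }
]
```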
Using regular expressions here is a bit tricky and may change in the future, but it was an easy way to encode the necessary transformations into something readable.
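A minimal sketch of how a two-line caption file could be applied, assuming the pattern-then-replacement format described above (the file contents here are illustrative, not from the actual project):

```javascript
// Apply a caption file: the first line is a regular expression with
// capture groups, the second line is the replacement text that
// references them.
function applyCaptionFile(fileContents, artText) {
  const [pattern, replacement] = fileContents.split("\n");
  return artText.replace(new RegExp(pattern), replacement);
}

// Illustrative caption file for the "stairs" art at the top of this post.
const file =
  "(.+)\\n.*┗\\s*(.+)\n" +
  "Person falls down the stairs with the caption “$1 $2”";

applyCaptionFile(file, "It getting dark at 4:30pm\n￣￣┗ My ability to stay awake");
// → Person falls down the stairs with the caption “It getting dark at
//   4:30pm My ability to stay awake”
```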
There are a handful of exceptions. A _training.json file contains a set of training data used to verify the ML model, and an emoji-art.regex.txt file represents the caption for an unidentified text art input. Later, this unidentified data can be crowdsourced into useful labels.
After going through the work of training the model, the first implementation was in a command-line program just to verify that the inference and captioning worked.
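That verification step can be sketched as a loop over held-out examples. `classify` here is a stand-in for the real TensorFlow inference, and the `expectedLabel` field is my own invention for illustration:

```javascript
// Run each held-out example through the classifier and collect the ones
// whose inferred label does not match the expected label.
function verifyExamples(examples, classify) {
  return examples.filter((ex) => classify(ex.text) !== ex.expectedLabel);
}

// Demo with a stubbed classifier that keys off a distinctive character.
const failures = verifyExamples(
  [
    { text: "just an ordinary sentence", expectedLabel: "none" },
    { text: "￣￣┗ something falling", expectedLabel: "stairs" },
  ],
  (text) => (text.includes("┗") ? "stairs" : "none")
);
// failures.length === 0 when the classifier agrees on every example
```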
From here, I wrapped the TensorFlow model into a Chrome Extension. When on Twitter, the extension checks each tweet against the model. If the ‘none’ label is inferred, no action is taken; any ordinary sentence is already readable.
For those with a different label, the contents of the tweet are replaced inline with the caption. That way, when a user with a screen reader listens to that tweet, they get something that is contextually the same in a format they understand.
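The per-tweet logic reduces to a small pure function. The classifier, captioner, and the selector mentioned in the comment are assumptions; the real extension's internals may differ:

```javascript
// Decide what a tweet's text should become: ordinary text passes through
// untouched, while recognized art is swapped for its generated caption.
function processTweetText(text, classify, caption) {
  const label = classify(text);
  if (label === "none") return text; // already readable
  return caption(label, text);
}

// In a content script, this would run over each tweet element — for
// example, document.querySelectorAll('[data-testid="tweetText"]') —
// replacing the element's textContent with the returned value.
```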
This extension works fairly well, though I’m sure there are optimizations to make it run better in parallel. Running ML inference synchronously is not ideal, since it can take a second or longer to interpret each tweet.
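One possible mitigation, sketched here as an assumption rather than something the extension currently does, is to classify tweets concurrently instead of one at a time:

```javascript
// Classify a batch of tweet texts concurrently so that no single slow
// inference blocks the others. classifyAsync stands in for the model.
async function classifyBatch(texts, classifyAsync) {
  return Promise.all(texts.map((text) => classifyAsync(text)));
}
```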
But with some headway, it can greatly improve the accessibility of any text art without each person having to remember to provide alternate text.
As alluded to earlier, having to provide text for every type of emoji art is not easily scalable by one person. Ideally this technology can be implemented by a company providing accessibility software or moved to a crowdsourced model so that maintenance can be sustained.
Since TensorFlow is fairly adaptable, this model could, with just a few tweaks, be embedded into your phone’s accessibility software or integrated more deeply into an operating system.
This isn’t just limited to Twitter and text art, either. Unicode more generally includes wide sets of alphanumeric characters in slightly different text styles that cannot be interpreted by screen readers. This Chrome Extension can additionally replace 𝕁 with J, 𝕒 with a, and so on.
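For this class of characters, no model is even necessary: the mathematical alphanumeric symbols carry Unicode compatibility mappings back to plain ASCII, so JavaScript's built-in NFKC normalization handles them directly (a sketch of the mapping, not necessarily how the extension implements it):

```javascript
// NFKC normalization folds compatibility characters, including the
// styled "mathematical" letters, down to their plain equivalents.
const plain = "𝕁𝕒𝕧𝕒".normalize("NFKC");
// → "Java"
```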
Unlike photos, which are made up of thousands of pixels, text is smaller in size and encoded more precisely. As such, reaching high accuracy requires fewer examples.
After putting together this proof of concept, I stopped working on it and pushed the project to GitHub for anyone who might be interested.