A very simple program that generates a list of emojis and their textual descriptions. The list can be used for various tasks, but it is intended for natural language processing tasks (e.g., sentiment analysis of tweets).
All emoji data is pulled from the official Unicode Emoji list and converted into a .py file.
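As a rough illustration of that conversion, the sketch below parses a single data line into an emoji string and its description. It is not the code in run.py. The function name `parse_line` is hypothetical, and the semicolon-separated layout and the sample line are assumptions modelled on the Unicode emoji data files.

```python
def parse_line(line):
    """Parse one data line into (emoji, description), or return None.

    Assumed layout: code points ; type field ; description # comment
    """
    # Drop the trailing comment; blank and comment-only lines yield nothing.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 3:
        return None
    code_points, _type_field, description = fields[:3]
    # Ranges such as "231A..231B" cover several emojis; this sketch skips them.
    if ".." in code_points:
        return None
    # Each code point is a hex number; a sequence joins several into one emoji.
    emoji = "".join(chr(int(cp, 16)) for cp in code_points.split())
    return emoji, description


# Illustrative sample line (an assumption, not copied from the repository).
sample = "1F1E6 1F1E8 ; RGI_Emoji_Flag_Sequence ; flag: Ascension Island # E2.0 [1]"
print(parse_line(sample))
```

Skipping ranges keeps the sketch short; a complete converter would need to expand them and split the combined description.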
Several precompiled lists can be found in the emoji_list folder, including Unicode (\u231a) and emoji (⌚) representations.
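The two representations name the same character and differ only in how it is written down. The following sketch shows the round trip in plain Python; the helper names `to_unicode_escape` and `from_unicode_escape` are hypothetical and are not part of this repository.

```python
def to_unicode_escape(text):
    """Return the backslash-escaped form of text, e.g. the 6 characters \\u231a."""
    return text.encode("unicode_escape").decode("ascii")


def from_unicode_escape(escaped):
    """Turn a backslash-escaped ASCII string back into the rendered characters."""
    return escaped.encode("ascii").decode("unicode_escape")


# In Python source, "\u231a" is already the single character U+231A (watch).
print("\u231a" == "⌚")
print(to_unicode_escape("⌚"))
print(from_unicode_escape("\\u231a"))
```

The escaped form is plain ASCII, which can be convenient when a file has to survive tools that mishandle non-ASCII text.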
The lists can be generated either by running run.sh or by running run.py with arguments:
- Running the .sh file with preconfigured arguments:

      chmod +x run.sh  # give permissions
      ./run.sh

- Running the .py file:

      python3 run.py \
        --preprocess yes \
        --file unicode_data_files/emoji-sequences.txt \
        --render_unicode False \
        --save_path emoji_list/emoji-sequence_v16.py

  You are given the option to render emojis in unicode or emoji representation and only need to modify the --render_unicode arg.
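Once a list has been generated, it can be used to replace emojis in text with their descriptions. The sketch below assumes the list can be loaded as a dictionary from emoji to description. The name `EMOJI_DESCRIPTIONS`, its entries, and the function `replace_emojis` are hypothetical stand-ins, since the exact structure of the generated .py file is not shown here.

```python
import re

# Hypothetical stand-in for a generated list: emoji -> textual description.
EMOJI_DESCRIPTIONS = {
    "⌚": "watch",
    "😀": "grinning face",
    "👍": "thumbs up",
    "👍🏽": "thumbs up: medium skin tone",
}


def replace_emojis(text, mapping):
    """Replace every emoji found in mapping with its description."""
    # Try longer sequences first so "👍🏽" is not matched as "👍" plus a modifier.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # Pad with spaces so descriptions do not fuse with neighbouring words.
    replaced = pattern.sub(lambda match: f" {mapping[match.group(0)]} ", text)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(replaced.split())


print(replace_emojis("Great job 👍🏽", EMOJI_DESCRIPTIONS))
print(replace_emojis("I love my ⌚", EMOJI_DESCRIPTIONS))
```

Sorting the keys by length is the main design choice: without it, multi-code-point sequences would be split into their base emoji and a leftover modifier.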
I wrote this program because I needed to preprocess text containing emojis and convert them into their textual descriptions. I searched briefly for an existing tool but found nothing that could be rerun automatically for future releases. The current release, Emoji v16.0, is dated 2024-08-25.
Emoji 17.0 is expected to be the next release, adding new emojis alongside Unicode 17.0 in September 2025.
The paper that inspired the program is cited below:
@inproceedings{singh-etal-2019-incorporating,
title = "Incorporating Emoji Descriptions Improves Tweet Classification",
author = "Singh, Abhishek and
Blanco, Eduardo and
Jin, Wei",
editor = "Burstein, Jill and
Doran, Christy and
Solorio, Thamar",
booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
month = jun,
year = "2019",
address = "Minneapolis, Minnesota",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/N19-1214/",
doi = "10.18653/v1/N19-1214",
pages = "2096--2101",
abstract = "Tweets are short messages that often include specialized language such as hashtags and emojis. In this paper, we present a simple strategy to process emojis: replace them with their natural language description and use pretrained word embeddings as normally done with standard words. We show that this strategy is more effective than using pretrained emoji embeddings for tweet classification. Specifically, we obtain new state-of-the-art results in irony detection and sentiment analysis despite our neural network is simpler than previous proposals."
}