weezymatt/generate-emoji-list


generate-emoji-list

A very simple program that generates a list of emojis and their textual descriptions. The list can be used for various tasks, but it is intended for natural language processing tasks (e.g., sentiment analysis of tweets).

All emoji data is pulled from the official Unicode Emoji list and converted into a .py file.
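
As a rough illustration of that conversion, here is a minimal sketch of parsing lines laid out like the Unicode emoji-sequences.txt data file (code points ; type field ; description # comment). It is not taken from run.py: the names `SAMPLE` and `parse_line` are hypothetical, and the handling of ranges (only the first and last code points are named on a range line, so only those two are kept) is an assumption made for this example.

```python
# Excerpt in the layout of emoji-sequences.txt (illustrative sample only).
SAMPLE = """\
# code points ; type field ; description # comments
231A..231B    ; Basic_Emoji    ; watch..hourglass done    # E0.6   [2]
23F0          ; Basic_Emoji    ; alarm clock              # E0.6   [1]
1F44B 1F3FB   ; RGI_Emoji_Modifier_Sequence ; waving hand: light skin tone # E1.0 [1]
"""


def parse_line(line):
    """Return a list of (emoji, description) pairs for one data line."""
    body = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not body:
        return []  # blank or comment-only line
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 3:
        return []
    codepoints, description = fields[0], fields[2]
    if ".." in codepoints:
        # A range line names only its first and last entries.
        first, last = codepoints.split("..")
        names = description.split("..")
        pairs = [(chr(int(first, 16)), names[0])]
        if len(names) == 2:
            pairs.append((chr(int(last, 16)), names[1]))
        return pairs
    # A sequence line lists one or more code points separated by spaces.
    emoji = "".join(chr(int(cp, 16)) for cp in codepoints.split())
    return [(emoji, description)]


mapping = dict(pair for line in SAMPLE.splitlines() for pair in parse_line(line))
print(mapping["\u231a"])  # watch
```

Ranges longer than two code points would need a second data source for the intermediate names, which this sketch does not attempt.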

Several precompiled lists can be found in the emoji_list folder, including unicode (\u231a) and emoji (⌚) representations.
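
To make the difference between the two representations concrete, the sketch below converts between them in plain Python. The helper name `to_escape` is hypothetical and is not part of this repository; it only shows that the escape form and the emoji form spell the same character.

```python
def to_escape(emoji):
    """Render each code point of the string as a Python-style escape sequence."""
    return "".join(
        f"\\u{ord(ch):04x}" if ord(ch) <= 0xFFFF else f"\\U{ord(ch):08x}"
        for ch in emoji
    )


watch = "\u231a"                    # escape form written in source code
print(watch == chr(0x231A))         # True: same character as the emoji form
print(to_escape(watch))             # \u231a
print(to_escape("\U0001F44B"))      # \U0001f44b
```

The escape form is plain ASCII, which can be convenient when a file has to survive tools that mishandle non-ASCII text; the emoji form is easier to read at a glance.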

Generating a list

The lists can be generated either by running the .sh file or by running the .py file with arguments.

  1. Running the .sh file with preconfigured arguments:
       chmod +x run.sh  # make the script executable
       ./run.sh
  2. Running the .py file:
       python3 run.py \
           --preprocess yes \
           --file unicode_data_files/emoji-sequences.txt \
           --render_unicode False \
           --save_path emoji_list/emoji-sequence_v16.py
     The emojis can be rendered in either unicode or emoji representation; only the --render_unicode arg needs to change.
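
For readers curious how string flags such as `yes` and `False` might be handled, here is a minimal argparse sketch. It is an assumption for illustration, not the parser used in run.py: the flag names come from the command above, while `str_to_bool`, `build_parser`, and the defaults are hypothetical. The explicit converter is there because argparse's `type=bool` would treat the string "False" as true.

```python
import argparse


def str_to_bool(value):
    """Interpret common yes/no spellings as a boolean."""
    text = value.strip().lower()
    if text in {"true", "yes", "1"}:
        return True
    if text in {"false", "no", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # Hypothetical parser: defaults and help text are assumptions.
    parser = argparse.ArgumentParser(description="Generate an emoji list (sketch).")
    parser.add_argument("--preprocess", type=str_to_bool, default=True)
    parser.add_argument("--file", required=True)
    parser.add_argument("--render_unicode", type=str_to_bool, default=False)
    parser.add_argument("--save_path", required=True)
    return parser


# Parse the same arguments shown in the command above.
args = build_parser().parse_args([
    "--preprocess", "yes",
    "--file", "unicode_data_files/emoji-sequences.txt",
    "--render_unicode", "False",
    "--save_path", "emoji_list/emoji-sequence_v16.py",
])
print(args.render_unicode)  # False
```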

Why?

I wrote this program because I needed to preprocess text containing emojis and convert them into their textual descriptions. I briefly searched for something similar but did not find anything that could be rerun automatically for future releases. The current release, Emoji v16.0, is dated 2024-08-25.

Emoji 17.0 is the presumed next release; it is expected to provide new emojis alongside Unicode 17.0 in September 2025.

The paper that inspired the program is cited below:

@inproceedings{singh-etal-2019-incorporating,
    title = "Incorporating Emoji Descriptions Improves Tweet Classification",
    author = "Singh, Abhishek  and
      Blanco, Eduardo  and
      Jin, Wei",
    editor = "Burstein, Jill  and
      Doran, Christy  and
      Solorio, Thamar",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N19-1214/",
    doi = "10.18653/v1/N19-1214",
    pages = "2096--2101",
    abstract = "Tweets are short messages that often include specialized language such as hashtags and emojis. In this paper, we present a simple strategy to process emojis: replace them with their natural language description and use pretrained word embeddings as normally done with standard words. We show that this strategy is more effective than using pretrained emoji embeddings for tweet classification. Specifically, we obtain new state-of-the-art results in irony detection and sentiment analysis despite our neural network is simpler than previous proposals."
}
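
The strategy described in the abstract above (replace each emoji with its natural language description) can be sketched in a few lines. This is only an illustration under stated assumptions: the dictionary `EMOJI_DESCRIPTIONS` and the function `describe` are hypothetical, and the generated files in the emoji_list folder may be structured differently.

```python
import re

# Hypothetical mapping from emoji to description (a tiny sample).
EMOJI_DESCRIPTIONS = {
    "\u23f0": "alarm clock",
    "\u231a": "watch",
    "\U0001F44B": "waving hand",
    "\U0001F44B\U0001F3FB": "waving hand: light skin tone",
}


def describe(text, mapping=EMOJI_DESCRIPTIONS):
    """Replace every known emoji in text with its description."""
    # Try longer keys first so multi-code-point sequences win over their prefixes.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    replaced = pattern.sub(lambda match: f" {mapping[match.group(0)]} ", text)
    return " ".join(replaced.split())  # collapse the extra whitespace


tweet = "late again \u23f0 see you soon \U0001F44B\U0001F3FB"
print(describe(tweet))
# late again alarm clock see you soon waving hand: light skin tone
```

Sorting the keys by length matters for sequences such as skin-tone variants, which begin with the same code point as the base emoji.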

About

Generates a list of emojis rendered in text (unicode) or in emoji format.
