This project focuses on analyzing and categorizing English words based on their CEFR levels (from A1 to C2). I computed CEFR levels for every valid English word and its corresponding part of speech by considering various factors, including the average levels of other parts of speech for the same word, lemma levels, and stem levels, as well as lemma, stem, and word frequencies.
The Words CEFR Dataset is now integrated into the cefrpy Python module! At about 900 KB, cefrpy can be used on its own or together with spaCy, which makes it easy to exclude named entities (cities, countries, people's names, etc.) from CEFR analysis.
Here is a demo:
To perform a basic word CEFR analysis, run the Text-Analizer.ipynb notebook. It demonstrates how to analyze the words in a text and determine their CEFR levels.
In the heart of every forest, a hidden world thrives among the towering trees. Trees, those silent giants, are more than just passive observers of nature's drama; they are active participants in an intricate dance of life.
Did you know that trees communicate with each other? It's not through words or gestures like ours, but rather through a complex network of fungi that connect their roots underground. This network, often called the "wood wide web," allows trees to share nutrients, water, and even warnings about potential threats.
But trees are not just generous benefactors; they are also masters of adaptation. Take the mighty sequoias, for example, towering giants that have stood the test of time for thousands of years. These giants have evolved thick, fire-resistant bark to withstand the frequent wildfires of their native California.
And speaking of longevity, did you know that some trees have been around for centuries, witnessing history unfold? The ancient bristlecone pines of the American West, for instance, can live for over 5,000 years, making them some of the oldest living organisms on Earth.
So the next time you find yourself wandering through a forest, take a moment to appreciate the remarkable world of trees. They may seem like silent spectators, but their lives are full of fascinating stories waiting to be discovered.
NLP: 318 ms
CEFR levels: 3 ms
Text length: 1370
Total tokens: 275
CEFR statistic (total words):
A1: 136
A2: 37
B1: 27
B2: 11
C1: 2
C2: 7
CEFR statistic (unique words):
A1: 69
A2: 34
B1: 23
B2: 11
C1: 2
C2: 7
Not found words: 0
Words with level B2 and higher: 17
mighty JJ 4.00 B2
potential JJ 4.00 B2
bristlecone NN 6.00 C2
living NN 4.00 B2
longevity NN 5.97 C2
california NNP 6.00 C2
benefactors NNS 6.00 C2
fungi NNS 5.19 C1
masters NNS 4.00 B2
observers NNS 4.00 B2
pines NNS 4.00 B2
sequoias NNS 6.00 C2
wildfires NNS 6.00 C2
underground RB 4.00 B2
withstand VB 5.12 C1
evolved VBN 4.00 B2
thrives VBZ 5.86 C2
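In the output above, the numeric column appears to map onto CEFR labels by rounding to the nearest whole level (4.00 is B2, 5.19 is C1, 5.86 is C2). Here is a minimal sketch of that mapping, assuming levels run from 1 (A1) to 6 (C2) and that halfway values round up. The name `level_to_cefr` is hypothetical and is not taken from the notebook.

```python
import math
from collections import Counter

CEFR_LABELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def level_to_cefr(level: float) -> str:
    """Map a numeric level to a CEFR label (assumption: round half up)."""
    rounded = math.floor(level + 0.5)
    # Clamp so out-of-range inputs still map to the nearest valid label.
    rounded = max(1, min(6, rounded))
    return CEFR_LABELS[rounded - 1]


# A few (word, level) pairs copied from the sample output.
sample = [
    ("mighty", 4.00),
    ("fungi", 5.19),
    ("withstand", 5.12),
    ("longevity", 5.97),
    ("thrives", 5.86),
]

labels = {word: level_to_cefr(level) for word, level in sample}
counts = Counter(labels.values())
print(labels)
print(dict(counts))
```

Counting the labels with `Counter` gives the same kind of per-level totals as the "CEFR statistic" lines.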
For this project I created a list of valid English words. It includes each word, its frequency count, and its stems, along with the associated probabilities of being valid words. I used the valid_words_sorted_by_frequency.csv file, so all words in the words table of the SQLite database are sorted by frequency.
The data processing pipeline, implemented in the Word-CEFR.ipynb notebook, involves the following steps:
Parsing Google 1-grams Dataset: Extracting frequency data for each valid word and part of speech whose frequency count exceeds 10,000. I used spaCy for more accurate part-of-speech (POS) tagging based on the Penn Treebank Project's list of POS tags, and LemmInflect to obtain word lemmas.
Parsing CEFR-J Dataset: Extracting the CEFR level of each word it covers, keyed by POS. This step also parses the core usage categories.
Calculation of Average Frequencies for each CEFR level and Interpolation.
Assigning CEFR Levels: Determining the CEFR level for each word's POS based on average levels of other POS for the same word, lemma levels, stem levels, as well as lemma, stem and word frequencies.
Database Optimization: Minimizing database size by consolidating word frequency data from 1900 to 2019 into a single total value, and calculating the average POS level from all available sources. The SQLite database is now optimized, with a reduced size of 20 MB. Refer to the Minify_db.ipynb file for more details.
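The pipeline steps do not spell out how the interpolation over average frequencies works. One plausible reading, sketched here as an assumption rather than the notebook's actual method, is to take the average frequency of known words at each level and place an unknown word on that scale by interpolating linearly in log-frequency. The averages in `AVG_FREQ` are invented for illustration, and `interpolate_level` is a hypothetical name.

```python
import math

# Hypothetical average frequency counts per level (1 = A1 ... 6 = C2).
# The numbers are made up; the only property relied on is that more
# frequent words sit at lower levels.
AVG_FREQ = {1: 5_000_000, 2: 1_200_000, 3: 400_000,
            4: 120_000, 5: 40_000, 6: 12_000}


def interpolate_level(freq: int) -> float:
    """Estimate a fractional level from a frequency count."""
    levels = sorted(AVG_FREQ)
    # Frequencies outside the known range are clamped to A1 or C2.
    if freq >= AVG_FREQ[levels[0]]:
        return float(levels[0])
    if freq <= AVG_FREQ[levels[-1]]:
        return float(levels[-1])
    log_f = math.log(freq)
    for lo, hi in zip(levels, levels[1:]):
        f_lo, f_hi = AVG_FREQ[lo], AVG_FREQ[hi]
        if f_hi <= freq <= f_lo:
            # Position of freq between the two averages on a log scale.
            t = (math.log(f_lo) - log_f) / (math.log(f_lo) - math.log(f_hi))
            return lo + t * (hi - lo)
    raise ValueError("frequency not covered by AVG_FREQ")


print(round(interpolate_level(200_000), 2))
```

A log scale is used because word frequencies span several orders of magnitude; a purely linear scale would push nearly every word toward the rarest levels.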
Incorporating Additional Datasets: To obtain more precise data, consider parsing the Octanove Vocabulary profile dataset, which provides C1 and C2 level vocabulary data. Note, however, that this dataset is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. You can also parse the World level survey by Zenodo dataset to enrich the data further; it is licensed under the Creative Commons Attribution 4.0 International license.
Filtering Personal Names and Geographical Entities: You can improve accuracy by identifying and excluding personal names, countries, cities, and other such entities, so that no CEFR levels are reported for them. This keeps the analysis focused on linguistic content. This filtering is already implemented in my cefrpy project.
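The entity filtering described above can be sketched using the attributes spaCy exposes on each token: `ent_type_` (an empty string when the token is not part of a named entity) and `is_alpha`. The stand-in `Token` class below mimics only those two attributes so the example runs without downloading a model. The excluded label set and the helper name `words_for_cefr` are assumptions and do not reflect how cefrpy implements this.

```python
from dataclasses import dataclass

# Entity labels to skip; the names follow spaCy's English label scheme.
# Which labels to exclude is a judgment call (assumption).
EXCLUDED_ENTITY_TYPES = {"PERSON", "GPE", "LOC", "NORP", "FAC", "ORG"}


@dataclass
class Token:
    """Stand-in exposing the same attribute names as a spaCy token."""
    text: str
    ent_type_: str = ""

    @property
    def is_alpha(self) -> bool:
        return self.text.isalpha()


def words_for_cefr(tokens):
    """Return lowercased words, skipping punctuation and named entities."""
    return [
        t.text.lower()
        for t in tokens
        if t.is_alpha and t.ent_type_ not in EXCLUDED_ENTITY_TYPES
    ]


doc = [
    Token("The"), Token("sequoias"), Token("of"),
    Token("California", "GPE"), Token("withstand"),
    Token("wildfires"), Token("."),
]
print(words_for_cefr(doc))
```

With a real pipeline, the same function should accept the `Doc` returned by a loaded model such as `en_core_web_sm`, since iterating over a `Doc` yields tokens with these attributes.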
- word_cefr_minified.db: SQLite3 database.
- word_id
- word
- stem_word_id
- word_pos_id
- word_id
- pos_tag_id
- lemma_word_id
- frequency_count
- level
- word_pos_id
- category_id
- tag_id
- tag
- description
- category_id
- category_title
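To show how these columns fit together, here is a small sketch that builds an in-memory SQLite database and looks up the level of a word per POS tag. Only the `words` table name is mentioned in this README; the names `word_pos` and `pos_tags` are assumptions chosen to match the column lists above, and the frequency counts are made up. Check the real table names in word_cefr_minified.db before adapting the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (
    word_id INTEGER PRIMARY KEY, word TEXT, stem_word_id INTEGER);
-- Table names below are hypothetical; the columns follow the list above.
CREATE TABLE pos_tags (
    tag_id INTEGER PRIMARY KEY, tag TEXT, description TEXT);
CREATE TABLE word_pos (
    word_pos_id INTEGER PRIMARY KEY, word_id INTEGER, pos_tag_id INTEGER,
    lemma_word_id INTEGER, frequency_count INTEGER, level REAL);

INSERT INTO words VALUES (1, 'withstand', 1), (2, 'fungi', 2);
INSERT INTO pos_tags VALUES
    (1, 'VB', 'Verb, base form'), (2, 'NNS', 'Noun, plural');
-- Levels come from the sample output; frequency counts are invented.
INSERT INTO word_pos VALUES
    (1, 1, 1, 1, 100, 5.12), (2, 2, 2, 2, 100, 5.19);
""")


def lookup(connection, word):
    """Return (POS tag, level) pairs for a word, ordered by tag."""
    query = """
        SELECT p.tag, wp.level
        FROM words AS w
        JOIN word_pos AS wp ON wp.word_id = w.word_id
        JOIN pos_tags AS p ON p.tag_id = wp.pos_tag_id
        WHERE w.word = ?
        ORDER BY p.tag
    """
    return connection.execute(query, (word,)).fetchall()


print(lookup(conn, "withstand"))
```

The lookup passes the word as a bound parameter (`?`) rather than formatting it into the SQL string, which avoids quoting problems with words that contain apostrophes.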
This project is licensed under the MIT License - see the LICENSE file for details.
I would like to acknowledge the contributions of the following resources:
- spaCy
- LemmInflect
- The Google Books Ngram Viewer (used 1-grams dataset, version 20200217)
- List of POS tags from the Penn Treebank Project
Also I used these resources to create my valid English words list: