Vocabulary Extractor is a program to split any text into individual words, summarizing information about each unique word. The information is presented in the form of a tab-delimited matrix, so that the results can be easily copied and pasted into a spreadsheet program like Excel.
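Because the output is plain tab-delimited text, it can also be read by a script instead of a spreadsheet. The following is a minimal sketch using Python's standard csv module. The sample data, the column names (Word, Count, Definition), and the presence of a header row are assumptions made for illustration; the actual columns depend on the dictionaries and extra column files in use.

```python
import csv
import io

# Hypothetical sample output. The real column names and order are not
# specified here and depend on the program's configuration.
SAMPLE_TSV = "Word\tCount\tDefinition\nxin\t3\tto ask\nchào\t2\thello\n"


def read_rows(stream):
    """Parse tab-delimited text with a header row into a list of dicts."""
    # QUOTE_NONE keeps quotation marks inside definitions as literal text.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_rows(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["Word"], row["Count"])
```

To read a real results file, pass an open file handle (opened with encoding="utf-8" and newline="") in place of the io.StringIO object.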
The program can be extended in three ways: dictionaries, extra columns, and filtered words. Dictionaries can be changed by adding extra files to certain directories. The distribution includes copies of CC-CEDICT and VNEDICT, but alternative dictionaries can be used as a replacement or in combination.
The word summary after text analysis can be modified by adding extra word data files, which will be incorporated into the output as extra columns.
If you need to filter out words from the output (for example, to eliminate words already learned), word lists can be added, and will be used to filter out matching words.
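The idea behind filtering can be illustrated in a few lines of Python. This is a conceptual sketch only, not the program's own implementation: the function name filter_known is hypothetical, and exact string matching is an assumption.

```python
def filter_known(words, known_words):
    """Return the words that do not appear in known_words, keeping order."""
    known = set(known_words)  # set lookup is fast for large word lists
    return [word for word in words if word not in known]


extracted = ["xin", "chào", "bạn", "xin"]
already_learned = ["xin"]
remaining = filter_known(extracted, already_learned)
print(remaining)
```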
Current version: Vocabulary_Extractor_0.9.0-Windows.zip (2026-04-30)
This project is hosted on GitHub, and the source tree can be cloned using Git tools.
After completing Windows setup below:
.\Build-Exe.ps1
If the script is blocked because it is accessed via a UNC path (e.g. \\wsl.localhost\...), set the execution policy for the current session first:
Set-ExecutionPolicy -ExecutionPolicy Unrestricted -Scope Process
.\Build-Exe.ps1
This produces dist\Vocabulary Extractor\ containing Vocabulary Extractor.exe and all required data files. Zip that folder to distribute.
python -m venv venv
venv\Scripts\Activate.ps1
pip install -r requirements.txt
If you get the error:
File ...\venv\Scripts\Activate.ps1 cannot be loaded because running scripts is disabled on this system
Run the following command first, then re-run venv\Scripts\Activate.ps1:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
When prompted, choose Run Once or Always Run as appropriate.
python main.py
On Debian- or Ubuntu-based Linux, tkinter is part of the Python standard library but requires a separate system package:
sudo apt install python3-tk
Then set up the virtual environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python main.py
The --headless flag runs the program without any UI and writes tab-delimited results to a file or stdout. This is useful for scripting or batch processing.
# Output to stdout
python main.py --headless -i samples/VN/mytext.txt
# Output to a file
python main.py --headless -i samples/VN/mytext.txt -o results.tsv
Additional options let you specify dictionaries, charset, filters, and extra column data directly on the command line, overriding whatever is in the config file:
python main.py --headless \
-i samples/VN/mytext.txt \
-o results.tsv \
--dict dict/VN/vnedict.txt \
--charset Vietnamese \
--filter filter/VN/known-words.txt \
--extracolumn data/VN/Freq_per_Million.txt
Run python main.py --help for the full list of options. See doc/help.html for detailed documentation.
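For batch processing, headless mode can be driven from a small Python script. The sketch below uses only the flags shown above (--headless, -i, -o, and optional extras such as --charset). The helper names build_headless_command and run_batch, and the results directory, are hypothetical. It assumes the script is run from the project root with the virtual environment activated.

```python
import subprocess
import sys
from pathlib import Path


def build_headless_command(input_path, output_path=None, extra_args=()):
    """Return the argument list for one headless run of main.py."""
    cmd = [sys.executable, "main.py", "--headless", "-i", str(input_path)]
    if output_path is not None:
        cmd += ["-o", str(output_path)]
    cmd += list(extra_args)  # e.g. ["--charset", "Vietnamese"]
    return cmd


def run_batch(input_dir, output_dir, extra_args=()):
    """Run the extractor once for each .txt file in input_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for text_file in sorted(Path(input_dir).glob("*.txt")):
        target = output_dir / (text_file.stem + ".tsv")
        # check=True raises CalledProcessError if a run exits with an error.
        subprocess.run(
            build_headless_command(text_file, target, extra_args), check=True
        )


# Example call (not executed here):
# run_batch("samples/VN", "results", ["--charset", "Vietnamese"])
```

Passing the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.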
This project is released under the GNU General Public License v3.0.
© 2026 zhtoolkit.com
