This script processes a folder of .txt files to count and analyze word occurrences, with optional tokenization using a T5 model (google/t5-v1_1-small). It saves the sorted word counts to a file named tokenizer_word_counts.txt for further use, such as training or analysis.
- Folder Processing: Scans all .txt files in a specified folder.
- Word Count: Counts words with at least 3 characters.
- Sorting: Outputs words sorted by frequency in descending order.
- Tokenization Integration: Includes optional debugging to tokenize words using a T5 model.
- Dynamic Input: Allows specifying the folder path via a command-line argument.
- Output: Saves results to a text file (tokenizer_word_counts.txt).
- Python 3.7 or newer.
- Required packages:
  - transformers
  - re (part of the Python standard library; no installation needed)
- Hugging Face Token:
  - Ensure that you have a valid Hugging Face token set up.
  - Log in with the Hugging Face CLI:
    huggingface-cli login
- Download the T5 Model:
  - The script automatically downloads the google/t5-v1_1-small model if it is not already available.
Install missing dependencies using:
pip install transformers
- Run the script:
  python script_name.py -p "path/to/your/folder"
  Replace path/to/your/folder with the absolute path to the folder containing the .txt files.
- Output:
  - A file named tokenizer_word_counts.txt containing words and their frequencies, sorted by count in descending order.
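As a rough illustration of how the -p flag could be wired up, here is a minimal sketch using argparse. Only the -p flag itself comes from this README; the long option --path, the default of the current directory, and the helper names parse_args and list_txt_files are assumptions made for the example, not the script's actual code.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line. Only "-p" is documented; the rest is assumed."""
    parser = argparse.ArgumentParser(
        description="Count words in a folder of .txt files."
    )
    # "--path" and the default "." are hypothetical choices for this sketch.
    parser.add_argument(
        "-p", "--path", type=Path, default=Path("."),
        help="folder containing .txt files",
    )
    return parser.parse_args(argv)


def list_txt_files(folder):
    """Return the .txt files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix == ".txt"
    )
```

Calling parse_args(["-p", "path/to/your/folder"]) yields a namespace whose path attribute can be passed straight to list_txt_files.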
- Input:
  - Folder containing .txt files.
- Processing:
  - Reads text files, converts content to lowercase, and extracts words with at least 3 characters.
  - (Optional) Tokenizes words using the google/t5-v1_1-small model for debugging purposes.
- Output:
  - A sorted list of words and their counts in tokenizer_word_counts.txt.
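The input, processing, and output steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the script's actual code: the regular expression, the function names count_words and write_counts, and the "word count" line layout are guesses based on the description and the sample output.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed pattern: runs of 3 or more word characters (letters, digits, underscore).
# The script's real regular expression is not shown in this README.
WORD_RE = re.compile(r"\b\w{3,}\b")


def count_words(folder):
    """Count lowercase words of at least 3 characters across all .txt files."""
    counts = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        # The README states that UTF-8 is assumed for input files.
        text = path.read_text(encoding="utf-8").lower()
        counts.update(WORD_RE.findall(text))
    return counts


def write_counts(counts, out_path="tokenizer_word_counts.txt"):
    """Write one "word count" line per word, most frequent first."""
    with open(out_path, "w", encoding="utf-8") as fh:
        # Counter.most_common() returns pairs sorted by count, descending.
        for word, n in counts.most_common():
            fh.write(f"{word} {n}\n")
```

Counter keeps the counting and the descending sort in the standard library, so this part of the workflow needs no third-party package.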
Content of tokenizer_word_counts.txt:
word1 123
word2 98
word3 45
...
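For later training or analysis, a file in this shape could be read back with a few lines of Python. The sketch below assumes each line holds a word and its count separated by whitespace, as in the sample above; load_counts is a hypothetical helper, not part of the script.

```python
def load_counts(path="tokenizer_word_counts.txt"):
    """Read "word count" lines into a dict, skipping lines that do not parse."""
    counts = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            # Split on the last run of whitespace: word on the left, count on the right.
            parts = line.rsplit(maxsplit=1)
            if len(parts) == 2 and parts[1].isdecimal():
                counts[parts[0]] = int(parts[1])
    return counts
```

Lines without a numeric count, such as blank lines, are ignored rather than raising an error.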
- Ensure that the folder path provided contains .txt files.
- Use the -p flag to specify a custom folder path.
- Debugging outputs can be removed or commented out for production use.
- Missing Dependencies:
  - Ensure transformers is installed (pip install transformers).
- File Encoding Issues:
  - The script assumes UTF-8 encoding for .txt files.
- Hugging Face Token:
  - Log in with huggingface-cli login before running the script.
- Model Download Issues:
  - Ensure the environment can download the google/t5-v1_1-small model from Hugging Face.
- Tokenizer Errors:
  - Errors during tokenization are logged for debugging and do not interrupt execution.
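The "log and continue" behavior described for tokenizer errors could look roughly like the sketch below. The function name debug_tokenize and the logger name are hypothetical, and the tokenizer is passed in as a plain callable so the example runs without downloading a model; how the script itself structures this is not shown in this README.

```python
import logging

logger = logging.getLogger("word_counts")  # hypothetical logger name


def debug_tokenize(tokenize, words):
    """Tokenize each word; log failures and keep going.

    `tokenize` is any callable that maps a string to a list of tokens,
    for example the `tokenize` method of a Hugging Face tokenizer.
    """
    results = {}
    for word in words:
        try:
            results[word] = tokenize(word)
        except Exception as exc:  # broad on purpose: debugging must not stop the run
            logger.warning("Tokenization failed for %r: %s", word, exc)
    return results
```

With transformers installed, the tokenize method of AutoTokenizer.from_pretrained("google/t5-v1_1-small") could be passed as the callable; words that fail are logged and left out of the result.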