This script processes a folder of .txt files to count and analyze word occurrences, with optional tokenization using a T5 model (google/t5-v1_1-small). It saves the sorted word counts into a file named tokenizer_word_counts.txt for further use, such as training or analysis.
- Folder Processing: Scans all .txt files in a specified folder.
- Word Count: Counts words with at least 3 characters.
- Sorting: Outputs words sorted by frequency in descending order.
- Tokenization Integration: Includes optional debugging to tokenize words using a T5 model.
- Dynamic Input: Allows specifying the folder path via a command-line argument.
- Output: Saves results to a text file (tokenizer_word_counts.txt).
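The counting and sorting features above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names `count_words` and `save_counts` are hypothetical, and the regular expression is one assumed reading of "words with at least 3 characters".

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a "word" is a run of 3 or more word characters
# (letters, digits, underscore). The real script may define it differently.
WORD_RE = re.compile(r"\b\w{3,}\b")


def count_words(folder):
    """Count words of 3+ characters across every .txt file in `folder`."""
    counts = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        # Lowercase first so "The" and "the" are counted together.
        text = path.read_text(encoding="utf-8").lower()
        counts.update(WORD_RE.findall(text))
    return counts


def save_counts(counts, out_path="tokenizer_word_counts.txt"):
    """Write one "word count" pair per line, most frequent first."""
    with open(out_path, "w", encoding="utf-8") as f:
        for word, n in counts.most_common():
            f.write(f"{word} {n}\n")
```

`Counter.most_common()` returns the entries in descending order of count, which matches the sorted output described above.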
- Python 3.7 or newer.
- Required packages:
  - transformers
  - re (part of the Python standard library; no installation needed)
- Hugging Face Token:
  - Ensure that you have a valid Hugging Face token set up.
  - Log in to the Hugging Face CLI:
    huggingface-cli login
- Download the T5 Model:
  - The script automatically downloads the google/t5-v1_1-small model if not already available.
Install missing dependencies using:
pip install transformers
Run the script:
python script_name.py -p "path/to/your/folder"
Replace path/to/your/folder with the absolute path to the folder containing .txt files.
Output:
- A file named tokenizer_word_counts.txt containing words and their frequencies, sorted by count.
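For reference, a `-p` flag like the one used in the command above could be handled with `argparse` along these lines. This is only a sketch: the long option `--path`, the default of the current directory, and the function name `parse_args` are assumptions, not details taken from the script.

```python
import argparse


def parse_args(argv=None):
    """Parse the folder path from the command line (or from `argv` if given)."""
    parser = argparse.ArgumentParser(
        description="Count words in a folder of .txt files."
    )
    parser.add_argument(
        "-p",
        "--path",      # assumed long form; the script may only accept -p
        default=".",   # assumed default when -p is omitted
        help="Folder containing the .txt files to process.",
    )
    return parser.parse_args(argv)
```

Calling `parse_args()` with no arguments reads from `sys.argv`, while passing an explicit list makes the function easy to exercise in isolation.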
- Input:
  - Folder containing .txt files.
- Processing:
  - Reads text files, converts content to lowercase, and extracts words with at least 3 characters.
  - (Optional) Tokenizes words using the google/t5-v1_1-small model for debugging purposes.
- Output:
  - A sorted list of words and their counts in tokenizer_word_counts.txt.
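The optional tokenization step might look something like the sketch below. It assumes the tokenizer is loaded through `AutoTokenizer` from `transformers`. The helper names `load_tokenizer` and `debug_tokenize` are hypothetical, and logging a warning for a failed word is one possible way to keep errors from interrupting the run.

```python
import logging

logger = logging.getLogger(__name__)


def load_tokenizer(model_name="google/t5-v1_1-small"):
    """Load the tokenizer; the model files are downloaded on first use."""
    # Imported lazily so word counting still works without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name)


def debug_tokenize(words, tokenizer):
    """Return {word: tokens}, logging and skipping any word that fails."""
    result = {}
    for word in words:
        try:
            result[word] = tokenizer.tokenize(word)
        except Exception as exc:  # broad on purpose: debugging aid only
            logger.warning("Could not tokenize %r: %s", word, exc)
    return result
```

Passing the tokenizer in as an argument keeps `debug_tokenize` independent of the model download, so it can be tried with any object that exposes a `tokenize` method.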
Content of tokenizer_word_counts.txt:
word1 123
word2 98
word3 45
...
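To reuse the counts in later training or analysis, a file in the format shown above can be read back into a dictionary. The sketch below assumes each line holds a word and an integer count separated by whitespace. `load_counts` is a hypothetical helper, not part of the script.

```python
def load_counts(path="tokenizer_word_counts.txt"):
    """Read "word count" lines back into a dict, skipping malformed lines."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # blank line or unexpected layout
            word, raw_count = parts
            try:
                counts[word] = int(raw_count)
            except ValueError:
                continue  # second field is not an integer
    return counts
```

Because the file is already sorted by frequency and Python dictionaries preserve insertion order, iterating over the result yields the most frequent words first.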
- Ensure that the folder path provided contains .txt files.
- Use the -p flag to specify a custom folder path.
- Debugging outputs can be removed or commented out for production use.
- Missing Dependencies:
  - Ensure transformers is installed (pip install transformers).
- File Encoding Issues:
  - The script assumes UTF-8 encoding for .txt files.
- Hugging Face Token:
  - Log in with huggingface-cli login before running the script.
- Model Download Issues:
  - Ensure the environment can download the google/t5-v1_1-small model from Hugging Face.
- Tokenizer Errors:
  - Errors during tokenization are logged for debugging and do not interrupt execution.
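For the file encoding issue listed above, one possible workaround is to read files leniently, so that a stray non-UTF-8 byte does not abort the whole run. The sketch below is an assumption about how this could be done, not the script's current behavior. `read_text_lenient` is a hypothetical helper.

```python
from pathlib import Path


def read_text_lenient(path):
    """Decode a file as UTF-8, replacing undecodable bytes instead of failing."""
    data = Path(path).read_bytes()
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Invalid bytes become U+FFFD so the rest of the text is still usable.
        return data.decode("utf-8", errors="replace")
```

The trade-off is that words containing replaced bytes will be counted in a damaged form, so converting the input files to UTF-8 beforehand remains the cleaner fix.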