Tokenized-Word-Counter

Overview

This script processes a folder of .txt files to count and analyze word occurrences, with optional tokenization using a T5 model (google/t5-v1_1-small). It saves the sorted word counts into a file named tokenizer_word_counts.txt for further use, such as training or analysis.

Features

Folder Processing: Scans all .txt files in a specified folder.
Word Count: Counts words with at least 3 characters.
Sorting: Outputs words sorted by frequency in descending order.
Tokenization Integration: Optionally tokenizes words with a T5 model as a debugging aid.
Dynamic Input: Allows specifying the folder path via a command-line argument.
Output: Saves results to a text file (tokenizer_word_counts.txt).

Requirements

Python 3.7 or newer.
Required packages:
- transformers
- re (part of the Python standard library; no installation needed)
Hugging Face Token:
- Ensure that you have a valid Hugging Face token set up.
- Log in to the Hugging Face CLI:
```
huggingface-cli login
```
Download the T5 Model:
- The script automatically downloads the google/t5-v1_1-small model if not already available.

Install missing dependencies using:

pip install transformers

Usage

Run the script:
```
python script_name.py -p "path/to/your/folder"
```
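Replace script_name.py with the script's file name (tokenizer_word_counter_auto.py in this repository) and path/to/your/folder with the absolute path to the folder containing the .txt files.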
Output:
- A file named tokenizer_word_counts.txt containing words and their frequencies, sorted by count.
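
The -p flag described above could be parsed with argparse roughly as follows. This is a minimal sketch, not the script's actual code: the long option name --path, the default of the current directory, and the helper name parse_args are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the folder path from the command line (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Count word occurrences in a folder of .txt files."
    )
    # -p is the documented flag; --path and the "." default are assumptions.
    parser.add_argument(
        "-p", "--path",
        default=".",
        help="Folder containing the .txt files to process.",
    )
    # argv=None makes argparse read sys.argv[1:], the normal CLI behaviour.
    return parser.parse_args(argv)
```

Calling parse_args(["-p", "data"]) returns a namespace whose path attribute is "data"; passing an explicit list like this is convenient for testing without touching the real command line.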

Script Details

Input:
- Folder containing .txt files.
Processing:
- Reads text files, converts content to lowercase, and extracts words with at least 3 characters.
- (Optional) Tokenizes words using the google/t5-v1_1-small model for debugging purposes.
Output:
- A sorted list of words and their counts in tokenizer_word_counts.txt.
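
The processing steps above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the script's actual implementation: the regular expression, the non-recursive *.txt glob, and the function names count_words and sorted_counts are all hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed definition of a "word": a run of 3 or more word characters.
# Note that \w also matches digits and underscores.
WORD_PATTERN = re.compile(r"\b\w{3,}\b")


def count_words(folder):
    """Count lowercase words of 3+ characters across all .txt files in folder."""
    counts = Counter()
    # Only files directly inside the folder are scanned (no recursion).
    for txt_file in sorted(Path(folder).glob("*.txt")):
        text = txt_file.read_text(encoding="utf-8").lower()
        counts.update(WORD_PATTERN.findall(text))
    return counts


def sorted_counts(counts):
    """Return (word, count) pairs ordered from most to least frequent."""
    return counts.most_common()
```

Counter.most_common() already returns the pairs in descending order of count, so no separate sort call is needed.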

Example Output

Content of tokenizer_word_counts.txt:

word1 123
word2 98
word3 45
...
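
A sketch of how this format might be written and read back, for example when feeding the counts into a later analysis step. It assumes one word and one count per line separated by a single space, as in the sample above; the helper names write_counts and read_counts are hypothetical.

```python
from pathlib import Path


def write_counts(pairs, output_path="tokenizer_word_counts.txt"):
    """Write one 'word count' pair per line, in the order given."""
    lines = [f"{word} {count}" for word, count in pairs]
    Path(output_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_counts(input_path="tokenizer_word_counts.txt"):
    """Read the file back into a list of (word, count) tuples."""
    pairs = []
    for line in Path(input_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last space so the count is always the final field.
        word, count = line.rsplit(" ", 1)
        pairs.append((word, int(count)))
    return pairs
```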

Notes

Ensure that the folder path provided contains .txt files.
Use the -p flag to specify a custom folder path.
Debugging outputs can be removed or commented out for production use.

Troubleshooting

Missing Dependencies:
- Ensure transformers is installed (pip install transformers).
File Encoding Issues:
- The script assumes UTF-8 encoding for .txt files.
Hugging Face Token:
- Log in with huggingface-cli login before running the script.
Model Download Issues:
- Ensure the environment can download the google/t5-v1_1-small model from Hugging Face.
Tokenizer Errors:
- Errors during tokenization are logged for debugging and do not interrupt execution.
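
Two of the items above, file encoding issues and tokenizer errors, can be handled defensively. The following is a minimal sketch, not the script's actual code: the helper names read_text_lenient and safe_tokenize and the logger name are assumptions, and replacing undecodable bytes is one possible policy among several.

```python
import logging
from pathlib import Path

# Hypothetical logger name, chosen for this illustration.
logger = logging.getLogger("tokenizer_word_counter")


def read_text_lenient(path):
    """Read a file as UTF-8, replacing undecodable bytes instead of raising."""
    # errors="replace" substitutes U+FFFD for invalid byte sequences.
    return Path(path).read_text(encoding="utf-8", errors="replace")


def safe_tokenize(tokenizer, word):
    """Tokenize one word; log and return None on failure instead of raising."""
    try:
        return tokenizer.tokenize(word)
    except Exception as exc:  # broad on purpose: debugging must not stop the run
        logger.warning("Tokenization failed for %r: %s", word, exc)
        return None
```

Any object with a tokenize(text) method can be passed as tokenizer. With transformers installed, one could be obtained via AutoTokenizer.from_pretrained("google/t5-v1_1-small").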
