Preprocess

Karahan Sarıtaş edited this page Mar 24, 2023 · 1 revision

To learn embedding vectors for Turkish words, we need a corpus. Put your corpus file into the working directory.
If the corpus is a 7-zip archive, first use the corresponding script to convert it into a txt file. For example, to convert the wiki.tr.txt.7z file, use the following command (--output indicates where the txt file will be stored; if you do not specify it, the output file is stored in the working directory):
```
python preprocess/7z_to_txt.py --input wiki.tr.txt.7z --output wiki.tr.txt
```
If the txt version includes redundant lines, you can format the txt file using the txt_formatter script, which re-creates the file using the provided stride and offset values. For example, to format the wiki.tr.txt.7z file, you can use the following command:
```
python preprocess/txt_formatter.py -i wiki.tr.txt.7z -s 4 -f 1
```
If not provided, --output defaults to the input file, and --stride and --offset default to 1 and 0, respectively. stride is the number of lines to skip between consecutive sentences, and offset is the number of lines to skip at the beginning of the file.
Additionally, you can use the analyzer script to get the vocabulary size and maximum sequence length of your corpus. For example, to get the vocabulary size and maximum sequence length of the wiki.tr.txt.7z file, you can use the following command:
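To make the stride and offset behaviour concrete, here is a minimal sketch of one plausible interpretation: skip the first offset lines, then keep every stride-th line after that. This is an assumption about how the values combine, not the actual code of txt_formatter.py, and the select_lines helper and sample data are hypothetical.

```python
from itertools import islice


def select_lines(lines, stride=1, offset=0):
    """Keep the line at index `offset`, then every `stride`-th line after it.

    Assumed semantics only; the real txt_formatter script may differ.
    """
    if stride < 1 or offset < 0:
        raise ValueError("stride must be >= 1 and offset must be >= 0")
    return list(islice(lines, offset, None, stride))


# Hypothetical corpus where each sentence is surrounded by three redundant lines.
sample = [
    "header", "sentence one", "", "meta",
    "header", "sentence two", "", "meta",
]

# Mirrors the flags in the command above: stride 4, offset 1.
kept = select_lines(sample, stride=4, offset=1)
print(kept)
```

With the defaults (stride 1, offset 0) this sketch returns every line unchanged, which is consistent with the documented default values.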
```
python preprocess/txt_analyzer.py -i wiki.tr.txt.7z
```
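As a rough illustration of what these two statistics mean, the sketch below computes them for a tiny in-memory corpus. It assumes one sentence per line and whitespace tokenization; the analyze function is hypothetical, and the real txt_analyzer.py may tokenize or count differently.

```python
def analyze(lines):
    """Return (vocabulary size, maximum sequence length) for an iterable of lines.

    Assumes whitespace tokenization; this is an illustrative sketch,
    not the implementation of txt_analyzer.py.
    """
    vocab = set()
    max_len = 0
    for line in lines:
        tokens = line.split()
        vocab.update(tokens)                  # collect distinct tokens
        max_len = max(max_len, len(tokens))   # track the longest sentence
    return len(vocab), max_len


# Hypothetical three-sentence corpus.
corpus = ["bir iki üç", "iki dört", "beş"]
vocab_size, max_seq_len = analyze(corpus)
print(vocab_size, max_seq_len)
```

For a real corpus file, the same function can be fed an open file handle, since iterating over a text file yields one line at a time without loading the whole file into memory.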