Preprocess
Karahan Sarıtaş edited this page Mar 24, 2023
·
1 revision
- To learn the embedding vectors for Turkish words, we have to use a corpus. Put your corpus file into the working directory.
- If your corpus files are `7-zip` archives, first convert them to `txt` with the corresponding script. For example, to convert the `wiki.tr.txt.7z` file, use the following command: `python preprocess/7z_to_txt.py --input wiki.tr.txt.7z --output wiki.tr.txt`. `--output` indicates the output folder in which your `txt` file is stored; if you do not specify it, the output file is stored in the working directory.
- If the `txt` version includes redundant lines, you can format your `txt` file using the `txt_formatter` script, which re-creates the file using the provided stride and offset values. For example, to format the `wiki.tr.txt.7z` file, you can use the following command: `python preprocess/txt_formatter.py -i wiki.tr.txt.7z -s 4 -f 1`. If not provided, `--output` defaults to the input file, and `--stride` and `--offset` default to 1 and 0, respectively. Stride is the number of lines to skip between consecutive sentences, and offset is the number of lines to skip at the beginning of the file.
- Additionally, you can use the `analyzer` script to get the vocabulary size and maximum sequence length of your corpus. For example, to get the vocabulary size and maximum sequence length of the `wiki.tr.txt.7z` file, you can use the following command: `python preprocess/txt_analyzer.py -i wiki.tr.txt.7z`
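
To make the stride and offset options more concrete, here is a minimal sketch of one plausible reading: skip `offset` lines at the start, then keep every `stride`-th line. This reading is an assumption, chosen because it leaves the file unchanged under the defaults (stride 1, offset 0). The function name `select_lines` and the sample data are hypothetical and not taken from `txt_formatter.py`.

```python
from itertools import islice


def select_lines(lines, stride=1, offset=0):
    """Skip `offset` lines, then keep every `stride`-th line.

    Assumed semantics for illustration only; the real script may differ.
    """
    if stride < 1 or offset < 0:
        raise ValueError("stride must be >= 1 and offset must be >= 0")
    return list(islice(lines, offset, None, stride))


# Hypothetical corpus layout: each sentence sits in a block of four lines,
# with the sentence itself on the second line of the block.
sample = [
    "<doc>", "sentence one", "", "</doc>",
    "<doc>", "sentence two", "", "</doc>",
]

kept = select_lines(sample, stride=4, offset=1)
print(kept)  # ['sentence one', 'sentence two']
```

Under this reading, `-s 4 -f 1` would keep lines 2, 6, 10, and so on of the input file.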
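
The two statistics reported by the analyzer can be sketched as follows. This is an illustration only: it assumes whitespace tokenization and one sentence per line, which may not match what `txt_analyzer.py` does. The name `corpus_stats` and the sample sentences are hypothetical.

```python
def corpus_stats(lines):
    """Return (vocabulary size, maximum sequence length) for an iterable of lines.

    Assumes tokens are separated by whitespace and each line is one sequence.
    """
    vocabulary = set()
    max_len = 0
    for line in lines:
        tokens = line.split()
        vocabulary.update(tokens)          # collect distinct tokens
        max_len = max(max_len, len(tokens))  # track the longest line in tokens
    return len(vocabulary), max_len


# Tiny hypothetical corpus, one sentence per line.
sample_corpus = ["bir iki üç", "iki dört", ""]

vocab_size, max_len = corpus_stats(sample_corpus)
print(vocab_size, max_len)  # 4 3
```

Because the function only iterates over lines, it could also be passed an open file object, so a large corpus would not need to be loaded into memory at once.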