Algorithms and models for Bilibili Search Engine (blbl.top).
Extract English words from video texts (titles, tags and desc):
python -m models.word.eng -ec -en -mf 6
Extract Chinese words from video texts (tags):
python -m models.word.eng -ec -zh -mf 6
See more example usages in comments of eng.py.
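To illustrate the kind of extraction described above, here is a minimal sketch, not the code in eng.py. It assumes that -mf denotes a minimum-frequency threshold (a guess from the flag name) and that an "English word" is a run of ASCII letters and digits starting with a letter. The function name extract_english_words and the regex are hypothetical.

```python
import re
from collections import Counter

# Assumption: an English word starts with an ASCII letter and continues
# with ASCII letters or digits. The real rules in eng.py may differ.
ENG_WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def extract_english_words(texts, min_freq=6):
    """Count lowercased English-looking words and keep the frequent ones.

    min_freq mirrors the assumed meaning of the -mf flag.
    """
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in ENG_WORD.findall(text))
    return {word: n for word, n in counts.items() if n >= min_freq}
```

Calling extract_english_words(titles, min_freq=6) on a list of title strings returns a word-to-count mapping with rare words dropped.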
Train sentencepiece model from video texts.
python -m models.sentencepiece.train -m sp_507m_400k_0.9995_0.9 -ec -vs 400000 -cc 0.9995 -sf 0.9 -e
or use train.sh with pre-grouped regions and pre-defined input-output paths:
./models/sentencepiece/train.sh 1
./models/sentencepiece/train.sh 2
./models/sentencepiece/train.sh 3
./models/sentencepiece/train.sh 4
./models/sentencepiece/train.sh r
Test:
python -m models.sentencepiece.train -m sp_507m_400k_0.9995_0.9 -t
See more example usages in comments of train.py.
Merge sentencepiece models which are trained on different data_utils.
python -m models.sentencepiece.merge
or with more params:
python -m models.sentencepiece.merge -vs 1000000 -i sp_518m_ -o sp_merged
See more example usages in comments of merge.py.
Convert sentencepiece vocab to txt:
python -m models.sentencepiece.convert ...
See more example usages in comments of convert.py.
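For orientation, a sentencepiece .vocab file stores one piece per line, followed by a tab and its score. The sketch below shows one way to turn such a file into a plain word list. It is an assumption about what "convert to txt" means here, not the logic of convert.py. The function name vocab_to_txt and the choice to drop scores are hypothetical.

```python
from pathlib import Path


def vocab_to_txt(vocab_path, txt_path):
    """Write the pieces of a sentencepiece .vocab file, one per line.

    Each input line is expected to look like "<piece>\t<score>";
    the score column is discarded. Returns the number of pieces written.
    """
    pieces = []
    with open(vocab_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            piece, _, _score = line.partition("\t")
            pieces.append(piece)
    Path(txt_path).write_text("\n".join(pieces) + "\n", encoding="utf-8")
    return len(pieces)
```

For example, vocab_to_txt("sp_merged.vocab", "sp_merged.txt") would write the pieces of a merged model, assuming a vocab file with that name exists.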
Tokenize video texts from database, and save to parquets. Used by data_utils.videos.freq and models.fasttext.train.
python -m data_utils.videos.cache -ec -dn video_texts_tid_all -fw 200 -bw 100 -bs 10000
or use cache.sh with pre-grouped regions and pre-defined input-output paths:
./data_utils/videos/cache.sh 1
./data_utils/videos/cache.sh 2
Count video terms freqs from database or parquets (-ds) with dataset name (-dn) and filters (-td, -pd, -fg), and save to csv and pickle with prefix (-o). Used by models.fasttext.train.
Specify region tid:
python -m data_utils.videos.freq -o video_texts_freq_tid_17_nt -dn "video_texts_tid_17" -td 17 -nt
All regions:
python -m data_utils.videos.freq -o video_texts_freq_tid_all_nt -dn "video_texts_tid_all" -nt
or use freq.sh with pre-grouped regions and pre-defined input-output paths:
./data_utils/videos/freq.sh 1
./data_utils/videos/freq.sh 2
Merge token freqs with total vocab size limit (-mv) from different regions (generated by freq.sh), and save to csv with prefix (-o).
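Conceptually, this merge sums per-region counts and truncates the result to the vocab limit. The following is a minimal sketch of that idea, assuming each region's freqs are available as a token-to-count mapping. The name merge_freqs and the tie-breaking rule are illustrative and not taken from models.fasttext.vocab.

```python
from collections import Counter


def merge_freqs(region_freqs, max_vocab):
    """Sum token counts across regions and keep the most frequent tokens.

    region_freqs: iterable of {token: count} mappings, one per region.
    max_vocab: upper bound on the merged vocab size (cf. the -mv flag).
    Returns a list of (token, count) pairs, most frequent first.
    """
    total = Counter()
    for freqs in region_freqs:
        total.update(freqs)  # adds counts for tokens seen in several regions
    # Sort by descending count, then by token, so ties are deterministic.
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_vocab]
```

The repository's own command for this step is: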
python -m models.fasttext.vocab -mv 1000000 -o merged_video_texts
Train FastText model from video texts, with (merged) vocab freqs (.csv) and cached tokens (.parquet).
python -m models.fasttext.train -m fasttext_other_game_vf_merged_csv -ep 1 -dr "parquets" -dn "video_texts_other_game" -vf "merged_video_texts" -vl csv -bs 20000 -mv 900000
Add pos tags to original tokens freqs data (merged_video_texts.csv), and save to new csv file (merged_video_texts_pos.csv).
python -m models.fasttext.pos
Run fasttext model (word or doc) as server or client.
-m: model_prefix, default is fasttext_merged
-v: vocab_limit, default is 150000
-w: vector weighted, must be specified to enable
-r: run as server
-ms: model class, default is word, but doc is recommended for most cases
-l: list models
-t: test model
-tc: test client
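The options above could be declared with argparse roughly as follows. This is a sketch for reading the flags, not the parser in run.py: the long option names (such as --model-prefix) and the function name build_parser are hypothetical, while the short flags, defaults and the word/doc choices come from the list above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented short flags and defaults."""
    parser = argparse.ArgumentParser(prog="models.fasttext.run")
    parser.add_argument("-m", "--model-prefix", default="fasttext_merged")
    parser.add_argument("-v", "--vocab-limit", type=int, default=150000)
    # Boolean switches are off unless given on the command line.
    parser.add_argument("-w", "--vector-weighted", action="store_true")
    parser.add_argument("-r", "--run-server", action="store_true")
    parser.add_argument("-ms", "--model-class", choices=["word", "doc"], default="word")
    parser.add_argument("-l", "--list-models", action="store_true")
    parser.add_argument("-t", "--test", action="store_true")
    parser.add_argument("-tc", "--test-client", action="store_true")
    return parser
```

With this sketch, build_parser().parse_args(["-r", "-ms", "doc", "-w"]) yields run_server=True, model_class="doc" and vector_weighted=True, with model_prefix and vocab_limit left at their defaults.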
List models:
python -m models.fasttext.run -l
Test models:
python -m models.fasttext.run -t -m fasttext_tid_all_mv_60w -v 150000
Run as server:
python -m models.fasttext.run -r -m fasttext_merged -v 150000 -w
python -m models.fasttext.run -r -ms doc -m fasttext_merged -v 150000 -w
Test remote client:
python -m models.fasttext.run -tc
python -m models.fasttext.run -tc -ms doc