It seems a stray \n is shifting the token indices after line 10295 in vocab.txt.
$ less -N vocab.txt
...
10294 ##錄
10295
10296
10297 する
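To locate such lines without paging through the file, a small check like the following could help. It is only a sketch: `find_blank_lines` is a hypothetical helper, and the `sample` list is a toy stand-in for the neighbourhood of line 10295, not the real file.

```python
from typing import Iterable, List


def find_blank_lines(lines: Iterable[str]) -> List[int]:
    """Return 1-based line numbers whose token is empty once the newline is stripped."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.rstrip("\n") == ""
    ]


# Toy stand-in for the lines around 10295 (assumed layout, not the real vocab.txt).
sample = ["##錄\n", "\n", "\n", "する\n"]
print(find_blank_lines(sample))  # [2, 3]
```

For the real file, passing an open handle should work the same way, e.g. `find_blank_lines(open("vocab.txt", encoding="utf-8"))`.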
Fortunately, I did not find any performance degradation in downstream tasks caused by this index shift, but I got an error message when calling save_pretrained().
https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/bert/tokenization_bert.py#L357
Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!
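One plausible explanation for the message, assuming the vocabulary is loaded by stripping the trailing newline from each line and mapping the token to its zero-based line index (as the slow BertTokenizer appears to do): two blank lines both become the empty token, the later index overwrites the earlier one, and one index is left unused. The sketch below mimics that with hypothetical helpers `build_vocab` and `missing_indices` on toy data.

```python
from collections import OrderedDict
from typing import Dict, Iterable, List


def build_vocab(lines: Iterable[str]) -> Dict[str, int]:
    """Map each token to its zero-based line index; a repeated token keeps only its last index."""
    vocab: Dict[str, int] = OrderedDict()
    for index, line in enumerate(lines):
        vocab[line.rstrip("\n")] = index
    return vocab


def missing_indices(vocab: Dict[str, int]) -> List[int]:
    """Return every index in 0..max that no token maps to."""
    used = set(vocab.values())
    if not used:
        return []
    return [i for i in range(max(used) + 1) if i not in used]


# Toy data: two consecutive blank lines collapse into a single empty token.
toy_vocab = build_vocab(["##錄\n", "\n", "\n", "する\n"])
print(missing_indices(toy_vocab))  # [1]
```

A non-empty result means the indices are not consecutive, which is the condition the message complains about.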
I think line 10295 in vocab.txt should contain some non-existent placeholder word such as !!!DIFECTED!!!.
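A minimal sketch of that kind of repair, under a few assumptions: `fill_blank_lines` is a hypothetical helper, the line number is appended to the placeholder only so that several blank lines do not collide with each other, and no real token already uses these placeholder strings. It keeps the line count unchanged, so every other token stays on its current line.

```python
from typing import Iterable, List


def fill_blank_lines(lines: Iterable[str]) -> List[str]:
    """Replace each blank line with a unique placeholder token, keeping the line count."""
    repaired = []
    for number, line in enumerate(lines, start=1):
        token = line.rstrip("\n")
        if token == "":
            # Hypothetical naming scheme: embed the 1-based line number for uniqueness.
            token = f"!!!DIFECTED{number}!!!"
        repaired.append(token + "\n")
    return repaired


# Toy data standing in for the affected region of vocab.txt.
print(fill_blank_lines(["##錄\n", "\n", "\n", "する\n"]))
```

Whether filling the blank entries is the right fix for the published model is for the maintainers to decide; this only shows that the gaps can be closed without moving any other token.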
Also see #57.