The entry of \n in vocab.txt is causing token index shifting #64

Description

@hiroshi-matsuda-rit

It seems that the \n entry is shifting the token indices after line 10295 of vocab.txt.

$ less -N vocab.txt
...
  10294 ##錄
  10295 
  10296 
  10297 する
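To locate lines like these without paging through the file, a check along the following lines might help. This is only a minimal sketch: `find_colliding_lines` and the four-line `sample` are hypothetical names and toy data made up for illustration, standing in for the region around line 10295.

```python
from collections import defaultdict


def find_colliding_lines(lines):
    """Return {token: [1-based line numbers]} for tokens found on more than one line."""
    seen = defaultdict(list)
    for line_no, raw in enumerate(lines, start=1):
        token = raw.rstrip("\n")  # compare the token without its trailing newline
        seen[token].append(line_no)
    return {tok: nums for tok, nums in seen.items() if len(nums) > 1}


# Toy stand-in for the lines around the two blank entries (not the real vocab.txt).
sample = ["##錄\n", "\n", "\n", "する\n"]
print(find_colliding_lines(sample))  # {'': [2, 3]}
```

On the real file, the same function could be fed the result of `open("vocab.txt", encoding="utf-8").readlines()`; any token reported with two or more line numbers is a candidate for the shift.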

Fortunately, I did not find any performance degradation in downstream tasks caused by this index shift, but I did get an error message when executing save_pretrained().
https://github.com/huggingface/transformers/blob/v4.28.1/src/transformers/models/bert/tokenization_bert.py#L357

Saving vocabulary to vocab.txt: vocabulary indices are not consecutive. Please check that the vocabulary is not corrupted!
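My guess at why this message appears can be sketched as follows. This assumes the loader maps each line, with its trailing newline stripped, to its 0-based line index; `load_vocab_from_lines` and `missing_indices` are simplified stand-ins written for this illustration, not the library's own code.

```python
def load_vocab_from_lines(lines):
    """Simplified stand-in for a line-based loader: token -> 0-based line index."""
    vocab = {}
    for index, raw in enumerate(lines):
        # A token seen twice keeps only its last index, so the earlier index is lost.
        vocab[raw.rstrip("\n")] = index
    return vocab


def missing_indices(vocab):
    """Indices that no token maps to; a non-empty result means the indices are not consecutive."""
    used = set(vocab.values())
    if not used:
        return []
    return [i for i in range(max(used) + 1) if i not in used]


# Toy stand-in for the lines around the two blank entries (not the real vocab.txt).
lines = ["##錄\n", "\n", "\n", "する\n"]
vocab = load_vocab_from_lines(lines)
print(vocab)                   # {'##錄': 0, '': 2, 'する': 3}
print(missing_indices(vocab))  # [1]
```

In this toy case the two blank lines collapse into a single empty-string token, which leaves index 1 unused. A writer that walks the vocabulary in index order would hit that gap, which would be consistent with the "not consecutive" message above.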

I think line 10295 in vocab.txt should hold a placeholder token that never occurs in real text, such as !!!DIFECTED!!!.
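A possible way to apply that replacement is sketched below. `replace_blank_line` is a hypothetical helper, and the toy `sample` again stands in for the real file; whether the patched vocabulary still matches what the model was trained with would need to be verified separately.

```python
def replace_blank_line(lines, line_no, placeholder):
    """Return a copy of lines with the blank entry at 1-based line_no replaced by placeholder."""
    if lines[line_no - 1].rstrip("\n") != "":
        raise ValueError(f"line {line_no} is not blank; refusing to overwrite it")
    patched = list(lines)  # leave the caller's list untouched
    patched[line_no - 1] = placeholder + "\n"
    return patched


# Toy stand-in for the lines around the two blank entries (not the real vocab.txt).
sample = ["##錄\n", "\n", "\n", "する\n"]
patched = replace_blank_line(sample, 2, "!!!DIFECTED!!!")
print(patched)  # ['##錄\n', '!!!DIFECTED!!!\n', '\n', 'する\n']
```

The line count is unchanged, so every later token keeps its current index, and each line now maps to a distinct token.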

Also see #57.
