See https://huggingface.co/TsinghuaAI/CPM-Generate/discussions/1
For LM fine-tuning or generation, how do I prepare my input data? The candidate formats are:

1. `[token_id_1, token_id_2, ..., eod_token_id]`, where `eod_token_id` is the id of the `<eod>` token in `transformers.CpmTokenizer`
2. `[token_id_1, token_id_2, ..., eos_token_id]`, where `eos_token_id` is the id of the `</s>` token in `transformers.CpmTokenizer`
3. `[token_id_1, token_id_2, ..., eos_token_id]`, where `eos_token_id` is the id of the `<|endoftext|>` token in `transformers.GPT2Tokenizer`
4. `[token_id_1, token_id_2, ..., sep_token_id, cls_token_id]`, i.e. just call `CpmTokenizer` with its defaults
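Whichever terminator the model was trained with, the mechanical step is the same: encode the text, then append the chosen terminator id. A minimal sketch of that step is below; the tokenizer-loading lines are shown only as comments because they download files from the Hub, and the concrete ids (`10, 20, 30`, terminator `7`) are hypothetical placeholders, not real CPM token ids.

```python
# Illustrative only -- uncomment to resolve the real terminator id:
# from transformers import CpmTokenizer
# tokenizer = CpmTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")
# token_ids = tokenizer.encode("some text", add_special_tokens=False)
# terminator_id = tokenizer.convert_tokens_to_ids("<eod>")

def prepare_example(token_ids, terminator_id):
    """Append the end-of-document/end-of-sequence id to one training sample."""
    return token_ids + [terminator_id]

# Hypothetical ids: encoded tokens [10, 20, 30], terminator id 7.
print(prepare_example([10, 20, 30], 7))  # [10, 20, 30, 7]
```

For option 4, you would instead let the tokenizer add its own special tokens (`add_special_tokens=True`), which for `CpmTokenizer` appends `sep_token_id` and `cls_token_id` rather than a single terminator.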