Skip to content

Cannot recognize <|endoftext|> #24

Open
@wangkuiyi

Description

@wangkuiyi

Thank you for this project! It is very helpful for me to understand how GPT2 synthesize text.

I also noticed that the GPT2/encoder.py does not implement the capability of recognizing special tokens as the HuggingFace tokenzier could.

The part of source code in HuggingFace's repo is at https://github.com/huggingface/transformers/blob/c836f77266be9ace47bff472f63caf71c0d11333/src/transformers/tokenization_utils.py#L516-L520

I understand that it is not critical, because there is only one special token <|endoftext|> in use wangkuiyi/huggingface-tokenizer-in-cxx#11

So, just saying.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions