
Conversation

@nico-martin
Collaborator

I compared our implementation with https://huggingface.co/docs/tokenizers/quicktour and it seems like they return not only the token_ids but also the attention_mask and the tokens. I think we should adapt ours so that it matches the Python implementation.

I also added function overloads for return_token_type_ids so we have more precise return types based on the config.
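As a rough illustration of the idea being described (not the actual transformers.js code — the names `encode`, `Encoding`, and the toy whitespace tokenizer here are all hypothetical), overloads keyed on `return_token_type_ids` could look like this:

```typescript
// Sketch only: shapes assumed from the discussion above, not taken
// from the real implementation.
interface Encoding {
  input_ids: number[];
  attention_mask: number[];
  tokens: string[];
}

interface EncodingWithTypeIds extends Encoding {
  token_type_ids: number[];
}

// Overloads give callers a precise return type based on the option.
function encode(text: string, options: { return_token_type_ids: true }): EncodingWithTypeIds;
function encode(text: string, options?: { return_token_type_ids?: false }): Encoding;
function encode(
  text: string,
  options?: { return_token_type_ids?: boolean },
): Encoding | EncodingWithTypeIds {
  // Toy whitespace tokenizer standing in for the real one.
  const tokens = text.split(/\s+/).filter(Boolean);
  const base: Encoding = {
    input_ids: tokens.map((_, i) => i), // placeholder ids
    attention_mask: tokens.map(() => 1),
    tokens,
  };
  if (options?.return_token_type_ids) {
    return { ...base, token_type_ids: tokens.map(() => 0) };
  }
  return base;
}
```

With this shape, `encode("a b", { return_token_type_ids: true }).token_type_ids` type-checks, while accessing `token_type_ids` on a plain `encode("a b")` result is a compile-time error.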

@nico-martin nico-martin requested a review from xenova November 3, 2025 08:01
Collaborator

@xenova xenova left a comment


Nice! This looks good to me.

It might also be good to see how the Rust library handles or delegates padding/truncation of batched inputs to the Python library, since we might need to move this to transformers.js.

e.g., there is encode_batch (https://huggingface.co/docs/tokenizers/api/encode-inputs)
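For reference, batch padding of the kind `encode_batch` performs could be sketched like this (a hypothetical helper, assuming the `input_ids`/`attention_mask` shape discussed above; whether this belongs in transformers.js is exactly the open question):

```typescript
// Sketch: pad a batch of encodings to the longest sequence, extending
// the attention_mask with 0s for the padded positions. `padTokenId` is
// an assumed parameter, not a real transformers.js option.
function padBatch(
  batch: { input_ids: number[]; attention_mask: number[] }[],
  padTokenId = 0,
): { input_ids: number[]; attention_mask: number[] }[] {
  const maxLen = Math.max(...batch.map((e) => e.input_ids.length));
  return batch.map((e) => {
    const padCount = maxLen - e.input_ids.length;
    return {
      input_ids: [...e.input_ids, ...Array(padCount).fill(padTokenId)],
      attention_mask: [...e.attention_mask, ...Array(padCount).fill(0)],
    };
  });
}
```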

@xenova xenova merged commit b6e12b6 into main Nov 3, 2025
5 checks passed
@xenova xenova deleted the export-attention-mask branch November 3, 2025 18:32