
Conversation

@nico-martin
Collaborator

I compared our implementation with https://huggingface.co/docs/tokenizers/quicktour and it seems like they return not only the token_ids but also the attention_mask and the tokens. I think we should adapt ours so that it matches the Python implementation.

I also added function overloads for return_token_type_ids so we have more precise return types based on the config.
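As a rough illustration of the idea being described (not the actual transformers.js code — the names `encode`, `Encoding`, and the toy whitespace tokenizer here are all hypothetical), overloads keyed on `return_token_type_ids` could look like this:

```typescript
// Sketch only: shapes assumed from the discussion above, not taken
// from the real implementation.
interface Encoding {
  input_ids: number[];
  attention_mask: number[];
  tokens: string[];
}

interface EncodingWithTypeIds extends Encoding {
  token_type_ids: number[];
}

// Overloads give callers a precise return type based on the option.
function encode(text: string, options: { return_token_type_ids: true }): EncodingWithTypeIds;
function encode(text: string, options?: { return_token_type_ids?: false }): Encoding;
function encode(
  text: string,
  options?: { return_token_type_ids?: boolean },
): Encoding | EncodingWithTypeIds {
  // Toy whitespace tokenizer standing in for the real one.
  const tokens = text.split(/\s+/).filter(Boolean);
  const base: Encoding = {
    input_ids: tokens.map((_, i) => i), // placeholder ids
    attention_mask: tokens.map(() => 1),
    tokens,
  };
  if (options?.return_token_type_ids) {
    return { ...base, token_type_ids: tokens.map(() => 0) };
  }
  return base;
}
```

With this shape, `encode("a b", { return_token_type_ids: true }).token_type_ids` type-checks, while accessing `token_type_ids` on a plain `encode("a b")` result is a compile-time error.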

@nico-martin nico-martin requested a review from xenova November 3, 2025 08:01
Collaborator

@xenova xenova left a comment


Nice! This looks good to me.

It might also be good to see how the Rust library handles or delegates padding/truncation of batched inputs to the Python library, since we might need to move this to transformers.js.

e.g., there is encode_batch (https://huggingface.co/docs/tokenizers/api/encode-inputs)
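For reference, batch padding of the kind `encode_batch` performs could be sketched like this (a hypothetical helper, assuming the `input_ids`/`attention_mask` shape discussed above; whether this belongs in transformers.js is exactly the open question):

```typescript
// Sketch: pad a batch of encodings to the longest sequence, extending
// the attention_mask with 0s for the padded positions. `padTokenId` is
// an assumed parameter, not a real transformers.js option.
function padBatch(
  batch: { input_ids: number[]; attention_mask: number[] }[],
  padTokenId = 0,
): { input_ids: number[]; attention_mask: number[] }[] {
  const maxLen = Math.max(...batch.map((e) => e.input_ids.length));
  return batch.map((e) => {
    const padCount = maxLen - e.input_ids.length;
    return {
      input_ids: [...e.input_ids, ...Array(padCount).fill(padTokenId)],
      attention_mask: [...e.attention_mask, ...Array(padCount).fill(0)],
    };
  });
}
```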

@xenova xenova merged commit b6e12b6 into main Nov 3, 2025
5 checks passed
@xenova xenova deleted the export-attention-mask branch November 3, 2025 18:32