
Feature request: compact vocabulary remapping for domain-specific training with pretrained tokenizers #2023

@saslifat-gif

Description


When fine-tuning or training small models on domain-specific corpora with a pretrained tokenizer, most of the vocabulary goes unused. This creates a dilemma: size the embedding table by `max(ids)` and waste memory on a huge table, or size it by `len(set(ids))` and hit index-out-of-range errors. For a tiny model like mine, an oversized embedding table wastes memory and makes it harder for the optimizer to update the weights effectively, slowing convergence and increasing the risk of overfitting.

My corpus only uses about 13k unique tokens.

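For concreteness, here is roughly how I measure the gap. This is a minimal sketch assuming the Hugging Face `tokenizers` library; `corpus` is a stand-in for my actual domain documents:

```python
from tokenizers import Tokenizer

# Hypothetical domain corpus; substitute your own documents.
corpus = ["aspirin inhibits cyclooxygenase", "ibuprofen is an NSAID"]
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

ids = [i for doc in corpus for i in tokenizer.encode(doc).ids]
print("rows if sized by max id:", max(ids) + 1)   # near the full pretrained vocab
print("tokens actually used:   ", len(set(ids)))  # far fewer; ~13k on my corpus
```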

If I size the embedding table directly by the max token index, the model becomes much larger than I need, and most of the embedding rows are never used.


And if I size it by the number of unique ids instead, I get out-of-range errors, because some raw ids are larger than the number of unique ids.

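A tiny repro of that failure mode, sketched with PyTorch and hand-picked ids (the values are illustrative, not from my corpus):

```python
import torch
import torch.nn as nn

ids = [101, 7592, 2088, 102]  # raw pretrained ids, only 4 unique values
emb = nn.Embedding(num_embeddings=len(set(ids)), embedding_dim=8)

try:
    emb(torch.tensor(ids))  # ids like 7592 exceed the table size of 4
except IndexError as err:
    print("out of range:", err)
```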

So I added a method that reindexes the ids into a compact, contiguous range.

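Since the method itself only appears in the screenshot, here is a minimal sketch of the idea; the names `build_compact_mapping` and `reindex` are placeholders of mine, not necessarily what a PR would use:

```python
def build_compact_mapping(ids):
    """Map each used token id to a dense index in 0..n_used-1."""
    old_to_new = {old: new for new, old in enumerate(sorted(set(ids)))}
    new_to_old = {new: old for old, new in old_to_new.items()}
    return old_to_new, new_to_old

def reindex(ids, old_to_new):
    """Rewrite a sequence of original ids into the compact id space."""
    return [old_to_new[i] for i in ids]

# Sparse pretrained ids collapse to a contiguous 0..3 range.
ids = [101, 7592, 2088, 102, 101, 2088, 102]
old_to_new, new_to_old = build_compact_mapping(ids)
print(reindex(ids, old_to_new))  # [0, 3, 2, 1, 0, 2, 1]
print(len(old_to_new))           # embedding rows needed: 4, not 7593
```

With this, the embedding table can be sized `len(old_to_new)`, and `new_to_old` translates model outputs back for decoding with the original tokenizer. One design note: ids unseen during training would need a reserved fallback slot (e.g. an `<unk>` row) at inference time.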

If needed, I can open a PR that exposes this as a standalone method.
