
Feature request: compact vocabulary remapping for domain-specific training with pretrained tokenizers #2023

@saslifat-gif

Description


When fine-tuning or training small models on domain-specific corpora with a pretrained tokenizer, most of the vocabulary goes unused. This creates a dilemma: size the embedding table by `max(ids)` and waste memory on a huge table, or size it by `len(set(ids))` and hit index-out-of-range errors. For a tiny model like mine, an oversized embedding table wastes memory and makes it harder for the optimizer to update the weights effectively, slowing convergence and increasing the risk of overfitting.

My corpus only uses about 13k unique tokens.

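For concreteness, here is roughly how I measure the gap. This is a minimal sketch assuming the Hugging Face `tokenizers` library; `corpus` is a stand-in for my actual domain documents:

```python
from tokenizers import Tokenizer

# Hypothetical domain corpus; substitute your own documents.
corpus = ["aspirin inhibits cyclooxygenase", "ibuprofen is an NSAID"]
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

ids = [i for doc in corpus for i in tokenizer.encode(doc).ids]
print("rows if sized by max id:", max(ids) + 1)   # near the full pretrained vocab
print("tokens actually used:   ", len(set(ids)))  # far fewer; ~13k on my corpus
```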

If I size the embedding table directly by the max token index, the model becomes much larger than I need, and most of the embedding rows are never used.


And if I size it by the number of unique ids instead, I get out-of-range errors, because some raw ids are larger than the number of unique ids.

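A tiny repro of that failure mode, sketched with PyTorch and hand-picked ids (the values are illustrative, not from my corpus):

```python
import torch
import torch.nn as nn

ids = [101, 7592, 2088, 102]  # raw pretrained ids, only 4 unique values
emb = nn.Embedding(num_embeddings=len(set(ids)), embedding_dim=8)

try:
    emb(torch.tensor(ids))  # ids like 7592 exceed the table size of 4
except IndexError as err:
    print("out of range:", err)
```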

So I added a method that reindexes the ids into a compact, contiguous range.

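Since the method itself only appears in the screenshot, here is a minimal sketch of the idea; the names `build_compact_mapping` and `reindex` are placeholders of mine, not necessarily what a PR would use:

```python
def build_compact_mapping(ids):
    """Map each used token id to a dense index in 0..n_used-1."""
    old_to_new = {old: new for new, old in enumerate(sorted(set(ids)))}
    new_to_old = {new: old for old, new in old_to_new.items()}
    return old_to_new, new_to_old

def reindex(ids, old_to_new):
    """Rewrite a sequence of original ids into the compact id space."""
    return [old_to_new[i] for i in ids]

# Sparse pretrained ids collapse to a contiguous 0..3 range.
ids = [101, 7592, 2088, 102, 101, 2088, 102]
old_to_new, new_to_old = build_compact_mapping(ids)
print(reindex(ids, old_to_new))  # [0, 3, 2, 1, 0, 2, 1]
print(len(old_to_new))           # embedding rows needed: 4, not 7593
```

With this, the embedding table can be sized `len(old_to_new)`, and `new_to_old` translates model outputs back for decoding with the original tokenizer. One design note: ids unseen during training would need a reserved fallback slot (e.g. an `<unk>` row) at inference time.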

If needed, I can open a PR that exposes this as a standalone method.
