When fine-tuning or training small models on domain-specific corpora using pretrained tokenizers, most of the vocabulary goes unused. This causes a dilemma: use max(ids) and waste memory on a huge embedding table, or use len(set(ids)) and get index-out-of-range errors. For a tiny model like mine, an oversized embedding table wastes memory and makes it harder for the optimizer to correctly update weights — slowing convergence and increasing the risk of overfitting.
My corpus only needs about 13k tokens.
If I size the embedding table directly by the max token id, the model becomes much larger than I need and most of the embedding rows are never used.
If I instead size it by the number of unique ids, I get index-out-of-range errors, because some ids are larger than that count.
So I added a method to reindex the ids into a contiguous range.
If needed, I can open a PR to split this out as a standalone method.
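For illustration, here is a minimal sketch of what such a reindexing step could look like (an assumed implementation for clarity, not the exact method added here — names like `build_reindex` and `reindex` are hypothetical):

```python
# Sketch only: map the sparse set of token ids that actually occur in the corpus
# onto a dense, contiguous range, so the embedding table only needs
# len(unique ids) rows instead of max(ids) + 1.

def build_reindex(ids):
    """Return old->new and new->old id mappings for the token ids that occur."""
    unique_ids = sorted(set(ids))  # e.g. ~13k ids out of a 50k+ pretrained vocab
    old_to_new = {old: new for new, old in enumerate(unique_ids)}
    new_to_old = {new: old for old, new in old_to_new.items()}
    return old_to_new, new_to_old

def reindex(ids, old_to_new):
    """Rewrite a token id sequence into the compact id space."""
    return [old_to_new[i] for i in ids]

# Usage: train with the compact ids; keep new_to_old so generated tokens can be
# mapped back to the original tokenizer vocabulary for decoding.
ids = [50256, 11, 50256, 318, 11]
old_to_new, new_to_old = build_reindex(ids)
compact = reindex(ids, old_to_new)   # [2, 0, 2, 1, 0]
vocab_size = len(old_to_new)         # embedding rows needed: 3 here, ~13k in my case
```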