The Unigram trainer's prune_sentence_pieces is a port of SentencePiece's PruneSentencePieces, but the loss computation diverges from it in one spot: it uses the size of the whole alternatives vector (i.e. the total number of pieces) where it should use the number of alternatives for the one piece being scored. This was reported back in #1536 and closed without a response; it's still on main, so I wanted to re-raise it with a reproduction.
The code
In tokenizers/src/models/unigram/trainer.rs (current main, 1da56fa670007969b3e173d14a40d7f75306fd78, https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/unigram/trainer.rs#L284), alternatives is the outer, per-piece vector:
let mut alternatives: Vec<Vec<usize>> = vec![Vec::new(); pieces.len()];
so alternatives.len() is just pieces.len(). alternatives[id] holds the pieces that id would be re-segmented into if it were removed (usually one or two of them). The loss line uses the wrong one (https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/unigram/trainer.rs#L402):
// After removing the sentencepiece[i], its frequency freq[i] is
// re-assigned to alternatives.
// new_sum = current_sum - freq[i] + freq[i] * alternatives.size()
// = current_sum + freq[i] (alternatives - 1)
let logsum_alt = (sum + freq[id] * (alternatives.len() - 1) as f64).ln();
The comment is the giveaway. It's copied from SentencePiece, where the corresponding lines read (src/unigram_model_trainer.cc, currently 3993c81443ba19e701e4eb62aa79b4e50259becd, https://github.com/google/sentencepiece/blob/master/src/unigram_model_trainer.cc#L534):
// new_sum = current_sum - freq[i] + freq[i] * alternatives[i].size()
const float logsum_alt = std::log(
static_cast<double>(sum + freq[i] * (alternatives[i].size() - 1)));
SentencePiece uses alternatives[i].size(); the comment here says alternatives.size() (the [i] got dropped when it was ported), and the code followed the comment. It should be alternatives[id].len().
Why it matters
logsum_alt is the log of the corpus token total after removing piece id and handing its frequency to its alternatives. That total should only grow by freq[id] * (number of alternatives − 1). Using the total piece count instead multiplies that growth term by the whole vocabulary size, which inflates the piece's loss. Pieces are kept in order of descending loss. Because the inflation scales with freq[id], it systematically over-values high-frequency pieces when deciding what to keep.
A small concrete case (sum=1000, freq[id]=10, one alternative with freq=5, 500 pieces total): the correct loss is −0.0041, the buggy loss is +0.0139... a sign flip and ~3.4×.
Does it actually change the trained vocabulary? Yes.
I wasn't sure at first whether this washed out downstream (the EM step and finalize both drop low-scoring pieces afterward), so I compared the two formulas end-to-end through the public Python API (Tokenizer(Unigram()) + UnigramTrainer(...).train_from_iterator(...)) on real text: Pride and Prejudice (Project Gutenberg 1342), tokenized into words. The trained vocabularies differ, deterministically (and the Rust trainer gives the identical result):
| vocab_size |
shrinking_factor |
result |
| 1000 |
0.9 |
12 pieces differ (Jaccard 0.976) |
| 2000 |
0.9 |
9 pieces differ (Jaccard 0.991) |
| 4000 |
0.8 |
identical |
At vocab_size=1000 the swapped entries are mostly whole words: the correct version keeps commendation, expectation, conscious, prefer, rank; the buggy one keeps admit, affected, ball, friends, mere instead. It's the same story on synthetic corpora with heavy substring overlap (Jaccard 0.85–0.98 depending on the size). The effect is biggest when pruning does most of the work (seed vocab ≫ target) and disappears when there's little pruning to do.. which is why it can look inert on some corpora but clearly isn't in general.
So: not a crash, not "broken vocabularies" but a real divergence from the reference algorithm that changes which pieces a trained tokenizer ends up with.
The fix
- let logsum_alt = (sum + freq[id] * (alternatives.len() - 1) as f64).ln();
+ let logsum_alt = (sum + freq[id] * (alternatives[id].len() - 1) as f64).ln();
This branch only runs when alternatives[id] is non-empty, so the - 1 can't underflow.
One thing I noticed while digging in: prune_sentence_pieces has no test coverage today. test_unigram_chars only touches the seeding stage, and the training tests use a tiny 2-word corpus that breaks out of the EM loop before pruning ever runs so nothing exercises this path, which is probably why it slipped through. I'm happy to add a small unit test for it alongside a PR.
Prior report
This is the same thing flagged in #1536 (2024), which suggested exactly this fix and was closed as not planned with no reply. Re-filing with the SentencePiece reference pinned and the reproduction above.
The Unigram trainer's
prune_sentence_piecesis a port of SentencePiece'sPruneSentencePieces, but the loss computation diverges from it in one spot: it uses the size of the wholealternativesvector (i.e. the total number of pieces) where it should use the number of alternatives for the one piece being scored. This was reported back in #1536 and closed without a response; it's still onmain, so I wanted to re-raise it with a reproduction.The code
In
tokenizers/src/models/unigram/trainer.rs(currentmain,1da56fa670007969b3e173d14a40d7f75306fd78, https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/unigram/trainer.rs#L284),alternativesis the outer, per-piece vector:so
alternatives.len()is justpieces.len().alternatives[id]holds the pieces thatidwould be re-segmented into if it were removed (usually one or two of them). The loss line uses the wrong one (https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/unigram/trainer.rs#L402):The comment is the giveaway. It's copied from SentencePiece, where the corresponding lines read (
src/unigram_model_trainer.cc, currently3993c81443ba19e701e4eb62aa79b4e50259becd, https://github.com/google/sentencepiece/blob/master/src/unigram_model_trainer.cc#L534):SentencePiece uses
alternatives[i].size(); the comment here saysalternatives.size()(the[i]got dropped when it was ported), and the code followed the comment. It should bealternatives[id].len().Why it matters
logsum_altis the log of the corpus token total after removing pieceidand handing its frequency to its alternatives. That total should only grow byfreq[id] * (number of alternatives − 1). Using the total piece count instead multiplies that growth term by the whole vocabulary size, which inflates the piece's loss. Pieces are kept in order of descending loss. Because the inflation scales withfreq[id], it systematically over-values high-frequency pieces when deciding what to keep.A small concrete case (
sum=1000,freq[id]=10, one alternative withfreq=5, 500 pieces total): the correct loss is−0.0041, the buggy loss is+0.0139... a sign flip and ~3.4×.Does it actually change the trained vocabulary? Yes.
I wasn't sure at first whether this washed out downstream (the EM step and
finalizeboth drop low-scoring pieces afterward), so I compared the two formulas end-to-end through the public Python API (Tokenizer(Unigram())+UnigramTrainer(...).train_from_iterator(...)) on real text: Pride and Prejudice (Project Gutenberg 1342), tokenized into words. The trained vocabularies differ, deterministically (and the Rust trainer gives the identical result):At
vocab_size=1000the swapped entries are mostly whole words: the correct version keepscommendation,expectation,conscious,prefer,rank; the buggy one keepsadmit,affected,ball,friends,mereinstead. It's the same story on synthetic corpora with heavy substring overlap (Jaccard 0.85–0.98 depending on the size). The effect is biggest when pruning does most of the work (seed vocab ≫ target) and disappears when there's little pruning to do.. which is why it can look inert on some corpora but clearly isn't in general.So: not a crash, not "broken vocabularies" but a real divergence from the reference algorithm that changes which pieces a trained tokenizer ends up with.
The fix
This branch only runs when
alternatives[id]is non-empty, so the- 1can't underflow.One thing I noticed while digging in:
prune_sentence_pieceshas no test coverage today.test_unigram_charsonly touches the seeding stage, and the training tests use a tiny 2-word corpus that breaks out of the EM loop before pruning ever runs so nothing exercises this path, which is probably why it slipped through. I'm happy to add a small unit test for it alongside a PR.Prior report
This is the same thing flagged in #1536 (2024), which suggested exactly this fix and was closed as not planned with no reply. Re-filing with the SentencePiece reference pinned and the reproduction above.