Unigram trainer: prune loss uses `alternatives.len()` instead of `alternatives[id].len()`

The Unigram trainer's `prune_sentence_pieces` is a port of SentencePiece's `PruneSentencePieces`, but the loss computation diverges from it in one spot: it uses the size of the whole `alternatives` vector (i.e. the total number of pieces) where it should use the number of alternatives for the *one* piece being scored. This was reported back in #1536 and closed without a response; it's still on `main`, so I wanted to re-raise it with a reproduction.

## The code

In `tokenizers/src/models/unigram/trainer.rs` (current `main`, `1da56fa670007969b3e173d14a40d7f75306fd78`, https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/unigram/trainer.rs#L284), `alternatives` is the outer, per-piece vector:

```rust
let mut alternatives: Vec<Vec<usize>> = vec![Vec::new(); pieces.len()];
```

so `alternatives.len()` is just `pieces.len()`. `alternatives[id]` holds the pieces that `id` would be re-segmented into if it were removed (usually one or two of them). The loss line uses the wrong one (https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/unigram/trainer.rs#L402):

```rust
// After removing the sentencepiece[i], its frequency freq[i] is
// re-assigned to alternatives.
// new_sum = current_sum - freq[i] + freq[i] * alternatives.size()
//         = current_sum + freq[i] (alternatives - 1)
let logsum_alt = (sum + freq[id] * (alternatives.len() - 1) as f64).ln();
```

The comment is the giveaway. It's copied from SentencePiece, where the corresponding lines read (`src/unigram_model_trainer.cc`, currently `3993c81443ba19e701e4eb62aa79b4e50259becd`, https://github.com/google/sentencepiece/blob/master/src/unigram_model_trainer.cc#L534):

```cpp
// new_sum = current_sum - freq[i] + freq[i] * alternatives[i].size()
const float logsum_alt = std::log(
    static_cast<double>(sum + freq[i] * (alternatives[i].size() - 1)));
```

SentencePiece uses `alternatives[i].size()`; the comment here says `alternatives.size()` (the `[i]` got dropped when it was ported), and the code followed the comment. It should be `alternatives[id].len()`.

## Why it matters

`logsum_alt` is the log of the corpus token total *after* removing piece `id` and handing its frequency to its alternatives. That total should only grow by `freq[id] * (number of alternatives − 1)`. Using the total piece count instead multiplies that growth term by the whole vocabulary size, which inflates the piece's loss. Pieces are kept in order of descending loss. Because the inflation scales with `freq[id]`, it systematically over-values high-frequency pieces when deciding what to keep.

A small concrete case (`sum=1000`, `freq[id]=10`, one alternative with `freq=5`, 500 pieces total): the correct loss is `−0.0041`, the buggy loss is `+0.0139`... a sign flip and ~3.4×.

## Does it actually change the trained vocabulary? Yes.

I wasn't sure at first whether this washed out downstream (the EM step and `finalize` both drop low-scoring pieces afterward), so I compared the two formulas end-to-end through the public Python API (`Tokenizer(Unigram())` + `UnigramTrainer(...).train_from_iterator(...)`) on real text: *Pride and Prejudice* (Project Gutenberg 1342), tokenized into words. The trained vocabularies differ, deterministically (and the Rust trainer gives the identical result):

| vocab_size | shrinking_factor | result |
|---|---|---|
| 1000 | 0.9 | 12 pieces differ (Jaccard 0.976) |
| 2000 | 0.9 | 9 pieces differ (Jaccard 0.991) |
| 4000 | 0.8 | identical |

At `vocab_size=1000` the swapped entries are mostly whole words: the correct version keeps `commendation`, `expectation`, `conscious`, `prefer`, `rank`; the buggy one keeps `admit`, `affected`, `ball`, `friends`, `mere` instead. It's the same story on synthetic corpora with heavy substring overlap (Jaccard 0.85–0.98 depending on the size). The effect is biggest when pruning does most of the work (seed vocab ≫ target) and disappears when there's little pruning to do.. which is why it can look inert on some corpora but clearly isn't in general.

So: not a crash, not "broken vocabularies" but a real divergence from the reference algorithm that changes which pieces a trained tokenizer ends up with.

## The fix

```diff
- let logsum_alt = (sum + freq[id] * (alternatives.len() - 1) as f64).ln();
+ let logsum_alt = (sum + freq[id] * (alternatives[id].len() - 1) as f64).ln();
```

This branch only runs when `alternatives[id]` is non-empty, so the `- 1` can't underflow.

One thing I noticed while digging in: `prune_sentence_pieces` has no test coverage today. `test_unigram_chars` only touches the seeding stage, and the training tests use a tiny 2-word corpus that breaks out of the EM loop before pruning ever runs so nothing exercises this path, which is probably why it slipped through. I'm happy to add a small unit test for it alongside a PR.

## Prior report

This is the same thing flagged in #1536 (2024), which suggested exactly this fix and was closed as not planned with no reply. Re-filing with the SentencePiece reference pinned and the reproduction above.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unigram trainer: prune loss uses `alternatives.len()` instead of `alternatives[id].len()` #2069

The code

Why it matters

Does it actually change the trained vocabulary? Yes.

The fix

Prior report

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

vocab_size	shrinking_factor	result
1000	0.9	12 pieces differ (Jaccard 0.976)
2000	0.9	9 pieces differ (Jaccard 0.991)
4000	0.8	identical

Unigram trainer: prune loss uses alternatives.len() instead of alternatives[id].len() #2069

Description

The code

Why it matters

Does it actually change the trained vocabulary? Yes.

The fix

Prior report

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Unigram trainer: prune loss uses `alternatives.len()` instead of `alternatives[id].len()` #2069