Commit 8984a3d

Copilot and tarekgh committed
Add comment noting that CurRank assumes rank == token Id (Tiktoken-specific)
Co-authored-by: tarekgh <10833894+tarekgh@users.noreply.github.com>
1 parent bd86b94 commit 8984a3d

1 file changed, +3 -0 lines changed

src/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs

Lines changed: 3 additions & 0 deletions
@@ -131,6 +131,9 @@ private struct State
     public int End;
     public int NextEnd;
     public int NextRank;
+    // Note: In the Tiktoken tokenizer, the rank is also the token Id.
+    // This field is used to cache the rank/Id after a merge so we don't need to re-look it up.
+    // Using this code with a different tokenizer where rank != token Id would produce wrong results.
     public int CurRank;
 }
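
The assumption described in the added comment can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: the class name, the three-token vocabularies, and the helper IdFromCachedRank are invented for illustration and are not part of Microsoft.ML.Tokenizers. It only shows why reusing a cached rank as the token Id is safe when the two coincide (as the comment says they do for Tiktoken) and wrong when they do not.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; these names do not exist in Microsoft.ML.Tokenizers.
public static class RankVersusIdSketch
{
    // Merge priority (rank) of each token. A lower rank is merged first.
    private static readonly Dictionary<string, int> Ranks = new Dictionary<string, int>
    {
        ["a"] = 0, ["b"] = 1, ["ab"] = 2,
    };

    // Tiktoken-style vocabulary (assumed here): every token's Id equals its rank.
    private static readonly Dictionary<string, int> TiktokenStyleIds = new Dictionary<string, int>
    {
        ["a"] = 0, ["b"] = 1, ["ab"] = 2,
    };

    // A made-up vocabulary whose Ids are assigned independently of rank.
    private static readonly Dictionary<string, int> OtherIds = new Dictionary<string, int>
    {
        ["a"] = 10, ["b"] = 11, ["ab"] = 12,
    };

    // Mimics the shortcut: return the rank cached after a merge as if it were the Id.
    public static int IdFromCachedRank(string token) => Ranks[token];

    public static void Main()
    {
        string merged = "ab";
        int cached = IdFromCachedRank(merged);

        // The shortcut agrees with the vocabulary when rank == Id.
        Console.WriteLine(cached == TiktokenStyleIds[merged]); // prints "True"

        // It silently returns the wrong Id when rank != Id.
        Console.WriteLine(cached == OtherIds[merged]); // prints "False"
    }
}
```

In the second case nothing fails at runtime; the encoder would simply emit Id 2 where the vocabulary expects 12, which is the kind of wrong result the comment warns about.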
