Commit 8c0543d

Copilot and stephentoub committed
Add explanatory comments for threshold and heap capacity choices
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
1 parent 604bc90 commit 8c0543d

1 file changed: +7 -0 lines changed

src/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs

Lines changed: 7 additions & 0 deletions
@@ -20,6 +20,10 @@ public static (int Id, int TokenIndex, int TokenLength)[] BytePairEncode(ReadOnl
     return [(ranks[mergingBytes], 0, 1)];
 }

+// For large inputs, use heap-based algorithm to avoid O(n²) behavior.
+// Threshold of 128 chosen empirically: linear scan is cache-friendly for small inputs,
+// while heap overhead (O(log n) per operation) becomes worthwhile for larger inputs.
+// Based on upstream tiktoken using 100, adjusted upward for C#'s efficient span operations.
 if (mergingBytes.Length > 128)
 {
     return BytePairEncodeLarge(mergingBytes, ranks, indexMappingSpan);
@@ -166,6 +170,9 @@ private static (int Id, int TokenIndex, int TokenLength)[] BytePairEncodeLarge(R
     CurRank = int.MaxValue
 };

+// Initial capacity: in the worst case, every adjacent pair is a valid merge candidate.
+// In practice, many pairs won't be in the vocabulary, so this over-allocates slightly,
+// but List resizing is cheap and this avoids multiple reallocations during initialization.
 var heap = new PriorityQueue<MergeEntry>(mergingBytes.Length - 1);

 for (int i = 0; i < mergingBytes.Length - 1; i++)
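
The two comments added in this commit can be illustrated with a small standalone sketch. It is not the code from BytePairEncoder.cs: the class MergeHeapSketch, its members, and the pair-keyed rank dictionary are hypothetical. It also uses the BCL PriorityQueue<TElement, TPriority> (.NET 6+) rather than the single-type-parameter PriorityQueue<MergeEntry> that appears in the diff. It shows only the length cutoff and why Length - 1 is an upper bound on the number of initial heap entries.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of Microsoft.ML.Tokenizers.
public static class MergeHeapSketch
{
    // Assumed cutoff mirroring the diff: inputs longer than this take the heap path.
    public const int LargeInputThreshold = 128;

    public static bool UsesHeapPath(int inputLength) => inputLength > LargeInputThreshold;

    // Seeds a min-heap with every adjacent byte pair that has a rank.
    // Element: start index of the pair. Priority: its rank (lower rank dequeues first).
    // Keying ranks by (byte, byte) is a simplification made for this sketch.
    public static PriorityQueue<int, int> SeedHeap(
        ReadOnlySpan<byte> bytes,
        IReadOnlyDictionary<(byte, byte), int> pairRanks)
    {
        // n bytes have exactly n - 1 adjacent pairs, so n - 1 bounds the initial entries.
        var heap = new PriorityQueue<int, int>(Math.Max(bytes.Length - 1, 0));

        for (int i = 0; i < bytes.Length - 1; i++)
        {
            // Pairs missing from the vocabulary are skipped, so the heap is
            // usually smaller than the reserved capacity.
            if (pairRanks.TryGetValue((bytes[i], bytes[i + 1]), out int rank))
            {
                heap.Enqueue(i, rank);
            }
        }

        return heap;
    }
}
```

For the four bytes of "abab" with only the pair (a, b) ranked, the reserved capacity is 3 while just two entries (start indices 0 and 2) are enqueued, which is the slight over-allocation the second comment describes.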
