Expose Encoder in TiktokenTokenizer #7313

Open
@razshare

Description

Hello, first of all, thank you very much for this project!

Is your feature request related to a problem? Please describe.
Yes, it is.
Some of our clients may be running a client application that ships outdated encodings.
We still want those clients to have access to new encodings even when their application is not up to date, so we want to serve the encoder dictionary from a server endpoint.
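
To illustrate the use case, here is a minimal sketch of how a server might serialize such a dictionary once it can be read. `EncoderExport` and `WriteTiktokenLines` are hypothetical names, not part of Microsoft.ML.Tokenizers, and obtaining the dictionary itself is assumed (the property is not public today). The output uses the base64-token-then-id line layout of `.tiktoken` vocabulary files.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public static class EncoderExport
{
    // Hypothetical helper (not a library API): writes one
    // "<base64 of token bytes> <id>" line per entry, ordered by id.
    public static void WriteTiktokenLines(
        IReadOnlyDictionary<ReadOnlyMemory<byte>, int> encoder,
        TextWriter writer)
    {
        foreach (var pair in encoder.OrderBy(p => p.Value))
        {
            // Token bytes are arbitrary binary data, so they are base64-encoded.
            writer.Write(Convert.ToBase64String(pair.Key.Span));
            writer.Write(' ');
            writer.WriteLine(pair.Value.ToString(CultureInfo.InvariantCulture));
        }
    }
}
```

An endpoint could pass its response stream, wrapped in a `StreamWriter`, as `writer`, and an outdated client could then rebuild a tokenizer from the downloaded text.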

A clear and concise description of what the problem is.
The problem is that, currently, the Encoder property in TiktokenTokenizer is internal.

/// <summary>
/// Gets the dictionary mapping token bytes to Ids.
/// </summary>
internal IReadOnlyDictionary<ReadOnlyMemory<byte>, int> Encoder => _encoder;

Describe the solution you'd like
I would like to expose this Encoder property.
A comment in the tests suggests there is already an intent to expose this property at some point:

// We are not exposing the Encoder, Decoder, or Vocabulary so far. For now, use reflection to test it.
private static IReadOnlyDictionary<ReadOnlyMemory<byte>, int>? GetEncoder(TiktokenTokenizer tiktoken)
    => typeof(TiktokenTokenizer).GetProperty("Encoder", BindingFlags.Instance | BindingFlags.NonPublic)?.GetValue(tiktoken) as IReadOnlyDictionary<ReadOnlyMemory<byte>, int>;
private static IReadOnlyDictionary<int, ReadOnlyMemory<byte>>? GetDecoder(TiktokenTokenizer tiktoken)
    => typeof(TiktokenTokenizer).GetProperty("Decoder", BindingFlags.Instance | BindingFlags.NonPublic)?.GetValue(tiktoken) as IReadOnlyDictionary<int, ReadOnlyMemory<byte>>;
private static IReadOnlyDictionary<string, int>? GetVocabulary(TiktokenTokenizer tiktoken)
    => typeof(TiktokenTokenizer).GetProperty("Vocabulary", BindingFlags.Instance | BindingFlags.NonPublic)?.GetValue(tiktoken) as IReadOnlyDictionary<string, int>;

Maybe this is the time to do it. What do you think?

Describe alternatives you've considered
A separate public method that does what the test helper above does, using reflection.
That seems like overkill, though, and reflection adds overhead.
Exposing the property is probably the best way to deal with this.

Additional context
I'm sending a PR your way with the changes. Feel free to request or make any modifications you think are necessary.
