
Conversation

Contributor

Copilot AI commented Nov 4, 2025

Plan to Add Additional Exports

  • Explore repository structure and understand build process
  • Identify current exports and missing exports
  • Add Encoding export to main index (src/index.ts)
  • Create src/pre-tokenizers.ts to export pre-tokenizer classes
  • Create src/models.ts to export model classes (BPE, Unigram, WordPiece)
  • Create src/normalizers.ts to export normalizer classes
  • Create src/decoders.ts to export decoder classes
  • Create src/post-processors.ts to export post-processor classes
  • Export all components from main index with unique names (no renaming, use full names)
  • Build only 4 main bundle files (same as before)
  • Update tests to import from main export
  • Remove submodule export paths from package.json
  • Remove wrapper file generation from build script
  • Remove custom Jest module mappers
  • Remove unnecessary renaming - use final names throughout all export files
  • Final validation

Final Implementation

All components are now exported from the main @huggingface/tokenizers export with unique names:

  • Pre-tokenizers: ByteLevelPreTokenizer, MetaspacePreTokenizer, PreTokenizerSequence, etc.
  • Models: BPE, Unigram, WordPiece, Model
  • Normalizers: NormalizerSequence, Strip, Replace, etc.
  • Decoders: ByteLevelDecoder, MetaspaceDecoder, DecoderSequence, etc.
  • Post-processors: ByteLevelPostProcessor, PostProcessorSequence, etc.

The export chain is now clean:

  1. Submodule files (e.g., decoders.ts) export with final names: export { default as ByteLevelDecoder } from "./core/decoder/ByteLevelDecoder"
  2. Main index.ts re-exports without renaming: export { ByteLevelDecoder } from "./decoders"

Only 4 build files are generated (tokenizers.mjs/cjs/min.mjs/min.cjs), all exports come from the single main bundle.
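The chain can be illustrated with a small self-contained sketch; plain objects stand in for the actual modules, and the class name matches the PR's examples:

```typescript
// Illustrative sketch of the no-renaming export chain (objects stand in for modules).

// core/normalizer/NormalizerSequence.ts default-exports the class:
class NormalizerSequence {}

// normalizers.ts exports it under its final name directly, e.g.
//   export { default as NormalizerSequence } from "./core/normalizer/NormalizerSequence";
const normalizersModule = { NormalizerSequence };

// index.ts re-exports without renaming, e.g.
//   export { NormalizerSequence } from "./normalizers";
const indexModule = { NormalizerSequence: normalizersModule.NormalizerSequence };

// The same class is reachable from the main entry under one stable name.
console.log(indexModule.NormalizerSequence === NormalizerSequence); // → true
```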

Original prompt

This section details the original issue to resolve.

<issue_title>Add additional exports</issue_title>
<issue_description>- Type issue: Namespace '".../transformers.js/node_modules/@huggingface/tokenizers/types/index"' has no exported member 'Encoding'.ts(2694)

  • We should expose the different components. Maybe with separate exports like it is in the rust/python library: from tokenizers.pre_tokenizers import Metaspace. This is necessary because... for example... certain llama tokenizers need to handle a legacy mode to ensure backwards compatibility (see here)

For example, these python imports/exports:

from tokenizers import Tokenizer, Encoding
from tokenizers.pre_tokenizers import Metaspace, Whitespace
from tokenizers.models import BPE
...

should be importable in JS like this:

import { Tokenizer, Encoding } from "@huggingface/tokenizers";
import { Metaspace, Whitespace } from "@huggingface/tokenizers/pre-tokenizers";
import { BPE } from "@huggingface/tokenizers/models";
</issue_description>

<agent_instructions>Ensure that the correct exports are handled in the library, to ensure we match with the rust tokenizers library. For example, these python imports/exports:

from tokenizers import Tokenizer, Encoding
from tokenizers.pre_tokenizers import Metaspace, Whitespace
from tokenizers.models import BPE
...

should be importable in JS like this:

import { Tokenizer, Encoding } from "@huggingface/tokenizers";
import { Metaspace, Whitespace } from "@huggingface/tokenizers/pre-tokenizers";
import { BPE } from "@huggingface/tokenizers/models";


## Comments on the Issue (you are @copilot in this section)

<comments>
</comments>



Copilot AI changed the title [WIP] Add additional exports for backward compatibility Add component exports matching Python tokenizers library structure Nov 4, 2025
Copilot AI requested a review from xenova November 4, 2025 20:59
Copilot finished work on behalf of xenova November 4, 2025 20:59
@xenova
Collaborator

xenova commented Nov 4, 2025

@copilot here are all the exports you should expose (as well as other custom ones which may be added)

// decoders
Decoder = decoders.Decoder
ByteLevel = decoders.ByteLevel
Replace = decoders.Replace
WordPiece = decoders.WordPiece
ByteFallback = decoders.ByteFallback
Fuse = decoders.Fuse
Strip = decoders.Strip
Metaspace = decoders.Metaspace
BPEDecoder = decoders.BPEDecoder
CTC = decoders.CTC
Sequence = decoders.Sequence
DecodeStream = decoders.DecodeStream
//models
Model = models.Model
BPE = models.BPE
Unigram = models.Unigram
WordLevel = models.WordLevel
WordPiece = models.WordPiece
// normalizers

Normalizer = normalizers.Normalizer
BertNormalizer = normalizers.BertNormalizer
NFD = normalizers.NFD
NFKD = normalizers.NFKD
NFC = normalizers.NFC
NFKC = normalizers.NFKC
Sequence = normalizers.Sequence
Lowercase = normalizers.Lowercase
Prepend = normalizers.Prepend
Strip = normalizers.Strip
StripAccents = normalizers.StripAccents
Nmt = normalizers.Nmt
Precompiled = normalizers.Precompiled
Replace = normalizers.Replace
ByteLevel = normalizers.ByteLevel
//pre-tokenizers
PreTokenizer = pre_tokenizers.PreTokenizer
BertPreTokenizer = pre_tokenizers.BertPreTokenizer
ByteLevel = pre_tokenizers.ByteLevel
CharDelimiterSplit = pre_tokenizers.CharDelimiterSplit
Digits = pre_tokenizers.Digits
FixedLength = pre_tokenizers.FixedLength
Metaspace = pre_tokenizers.Metaspace
Punctuation = pre_tokenizers.Punctuation
Sequence = pre_tokenizers.Sequence
Split = pre_tokenizers.Split
UnicodeScripts = pre_tokenizers.UnicodeScripts
Whitespace = pre_tokenizers.Whitespace
WhitespaceSplit = pre_tokenizers.WhitespaceSplit
// processors
PostProcessor = processors.PostProcessor
BertProcessing = processors.BertProcessing
ByteLevel = processors.ByteLevel
RobertaProcessing = processors.RobertaProcessing
Sequence = processors.Sequence
TemplateProcessing = processors.TemplateProcessing

NOTE: you said that you keep both exports for compatibility in some cases (Both short (Metaspace) and long (MetaspacePreTokenizer) names are exported for compatibility.) but this is BAD! You should only export it once ("Metaspace" is correct).

you are missing a bunch of these in package.json

Contributor Author

Copilot AI commented Nov 4, 2025

@copilot here are all the exports you should expose (as well as other custom ones which may be added)

// decoders

...

Fixed in commit 494b464:

  • Removed duplicate exports (now only short names like Metaspace, not MetaspacePreTokenizer)
  • Added base class exports: Decoder, Model, Normalizer, PreTokenizer, PostProcessor
  • Updated all names to match Python API (e.g., ByteLevel instead of ByteLevelPreTokenizer)
  • Added Sequence exports for all component types
  • Removed UnicodeNormalizer export (not in Python API)
  • Changed StripNormalizer to Strip to match Python naming

All exports now match the specification you provided.

Copilot finished work on behalf of xenova November 4, 2025 22:40
@xenova
Collaborator

xenova commented Nov 4, 2025

@copilot I only want one built file... but the library should correctly export submodules so that they can be imported accordingly.

So ideally only 4 files should be generated (as it was before)... but we should still be able to import all these elements. Is this possible?

Contributor Author

Copilot AI commented Nov 4, 2025

@copilot I only want one built file... but the library should correctly export submodules so that they can be imported accordingly.

So, we ideally only 4 files should be generated (what it was befo...

Done in commit 3f539bb. The build now generates:

4 main bundle files (the actual built code):

  • tokenizers.mjs/cjs/min.mjs/min.cjs (~28-80KB each)

20 tiny wrapper files (67-404 bytes each):

  • pre-tokenizers.mjs/cjs/min.mjs/min.cjs
  • models.mjs/cjs/min.mjs/min.cjs
  • normalizers.mjs/cjs/min.mjs/min.cjs
  • decoders.mjs/cjs/min.mjs/min.cjs
  • post-processors.mjs/cjs/min.mjs/min.cjs

The wrapper files just re-export specific symbols from the main bundle (e.g., export { Metaspace, Whitespace, ... } from './tokenizers.min.mjs'), so there's only one actual built bundle. The package.json exports map submodule paths to these wrapper files with their corresponding type definitions.
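For reference, the subpath mapping described here (later removed in favor of a single entry point) would have looked roughly like this in package.json; the paths and filenames below are illustrative, not the repository's actual values:

```json
{
  "exports": {
    ".": {
      "types": "./types/index.d.ts",
      "import": "./dist/tokenizers.mjs",
      "require": "./dist/tokenizers.cjs"
    },
    "./pre-tokenizers": {
      "types": "./types/pre-tokenizers.d.ts",
      "import": "./dist/pre-tokenizers.mjs",
      "require": "./dist/pre-tokenizers.cjs"
    }
  }
}
```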

Copilot finished work on behalf of xenova November 4, 2025 22:53
@xenova
Collaborator

xenova commented Nov 4, 2025

@copilot okay, since that's not really possible... this is what I want instead:

  1. Remove the latest stuff you did for custom/multiple exports. All these things should be exported from the main export, which will be the only export. So, you'll remove the jest module changes, as well as changes to package.json which add these additional exports. Finally, you'll also undo the "Generated wrapper files for submodule exports" section in the build script
  2. When exporting, don't do any renaming... this will ensure all the names are unique. For example Sequence as NormalizerSequence will be exported as NormalizerSequence.
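The uniqueness requirement follows from JS module semantics: a flat module can export each name only once, whereas Python's submodules each get their own namespace. A minimal sketch (with hypothetical stand-in classes) of the collision that the full names avoid:

```typescript
// Hypothetical stand-ins: two component namespaces that both define `Sequence`.
const decoders = { Sequence: class DecoderSequenceImpl {} };
const normalizers = { Sequence: class NormalizerSequenceImpl {} };

// Flattening them into one export object silently keeps only the last one:
const flat = { ...decoders, ...normalizers };
console.log(flat.Sequence === normalizers.Sequence); // → true
console.log(flat.Sequence === decoders.Sequence); // → false (shadowed and lost)
```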

Contributor Author

Copilot AI commented Nov 4, 2025

@copilot okay, since that's not really possible... this is what I want instead:

  1. Remove the latest stuff you did for custom/multiple exports. All these things should be exported from the main expor...

Done in commit 0c6f429:

  1. ✅ Removed all custom export paths from package.json (only main "." export remains)
  2. ✅ Removed wrapper file generation from build script
  3. ✅ Removed custom Jest module mappers for submodule imports
  4. ✅ Updated exports to use full unique names without renaming:
    • ByteLevel → ByteLevelPreTokenizer (for pre-tokenizers)
    • ByteLevel → ByteLevelDecoder (for decoders)
    • ByteLevel → ByteLevelPostProcessor (for post-processors)
    • Metaspace → MetaspacePreTokenizer (for pre-tokenizers)
    • Metaspace → MetaspaceDecoder (for decoders)
    • Sequence → NormalizerSequence, DecoderSequence, PreTokenizerSequence, PostProcessorSequence
    • etc.

All exports now come from the single main export with unique names. Only 4 build files are generated (tokenizers.mjs/cjs/min.mjs/min.cjs).

Copilot finished work on behalf of xenova November 4, 2025 23:08
@xenova
Collaborator

xenova commented Nov 4, 2025

@copilot you're pretty much there... but you do this weird thing where you rename something like this:

export { default as Sequence } from "./core/normalizer/NormalizerSequence";

and then

export {
  Sequence as NormalizerSequence,
} from "./normalizers";

just use the same name throughout (export { default as NormalizerSequence } from "./core/normalizer/NormalizerSequence";)


there are a bunch of other places you do that.

Contributor Author

Copilot AI commented Nov 4, 2025

@copilot you're pretty much there... but you do this weird thing where you rename something like this:

export { default as Sequence } from "./core/normalizer/NormalizerSequence";

...

Fixed in commit 1fc5c5c. Removed all unnecessary renaming:

  • Submodule files now export with final names directly: export { default as NormalizerSequence } from "./core/normalizer/NormalizerSequence"
  • Main index.ts re-exports without renaming: export { NormalizerSequence } from "./normalizers"

Applied to all exports: ByteLevelDecoder, ByteLevelPreTokenizer, ByteLevelPostProcessor, MetaspaceDecoder, MetaspacePreTokenizer, DecoderSequence, NormalizerSequence, PreTokenizerSequence, PostProcessorSequence, StripDecoder, ReplaceDecoder, WordPieceDecoder, etc.
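With stand-in classes carrying these final export names, the outcome can be checked directly: every component export is globally unique, so the single flat export holds them all without shadowing:

```typescript
// Stand-in classes using the final export names from the PR.
class DecoderSequence {}
class NormalizerSequence {}
class PreTokenizerSequence {}
class PostProcessorSequence {}

// With full names, nothing collides in the single flat export:
const names = [DecoderSequence, NormalizerSequence, PreTokenizerSequence, PostProcessorSequence].map(
  (c) => c.name,
);
console.log(new Set(names).size === names.length); // → true
```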

Copilot finished work on behalf of xenova November 4, 2025 23:16