
Conversation

Contributor

Copilot AI commented Nov 4, 2025

Plan to Add Additional Exports

  • Explore repository structure and understand build process
  • Identify current exports and missing exports
  • Add Encoding export to main index (src/index.ts)
  • Create src/pre-tokenizers.ts to export pre-tokenizer classes
  • Create src/models.ts to export model classes (BPE, Unigram, WordPiece)
  • Create src/normalizers.ts to export normalizer classes
  • Create src/decoders.ts to export decoder classes
  • Create src/post-processors.ts to export post-processor classes
  • Export all components from main index with unique names (no renaming, use full names)
  • Build only 4 main bundle files (same as before)
  • Update tests to import from main export
  • Remove submodule export paths from package.json
  • Remove wrapper file generation from build script
  • Remove custom Jest module mappers
  • Remove unnecessary renaming - use final names throughout all export files
  • Final validation

Final Implementation

All components are now exported from the main @huggingface/tokenizers export with unique names:

  • Pre-tokenizers: ByteLevelPreTokenizer, MetaspacePreTokenizer, PreTokenizerSequence, etc.
  • Models: BPE, Unigram, WordPiece, Model
  • Normalizers: NormalizerSequence, Strip, Replace, etc.
  • Decoders: ByteLevelDecoder, MetaspaceDecoder, DecoderSequence, etc.
  • Post-processors: ByteLevelPostProcessor, PostProcessorSequence, etc.

The export chain is now clean:

  1. Submodule files (e.g., decoders.ts) export with final names: export { default as ByteLevelDecoder } from "./core/decoder/ByteLevelDecoder"
  2. Main index.ts re-exports without renaming: export { ByteLevelDecoder } from "./decoders"

Only 4 build files are generated (tokenizers.mjs/cjs/min.mjs/min.cjs), all exports come from the single main bundle.
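The chain can be illustrated with a small self-contained sketch; plain objects stand in for the actual modules, and the class name matches the PR's examples:

```typescript
// Illustrative sketch of the no-renaming export chain (objects stand in for modules).

// core/normalizer/NormalizerSequence.ts default-exports the class:
class NormalizerSequence {}

// normalizers.ts exports it under its final name directly, e.g.
//   export { default as NormalizerSequence } from "./core/normalizer/NormalizerSequence";
const normalizersModule = { NormalizerSequence };

// index.ts re-exports without renaming, e.g.
//   export { NormalizerSequence } from "./normalizers";
const indexModule = { NormalizerSequence: normalizersModule.NormalizerSequence };

// The same class is reachable from the main entry under one stable name.
console.log(indexModule.NormalizerSequence === NormalizerSequence); // → true
```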

Original prompt

This section details the original issue to resolve.

<issue_title>Add additional exports</issue_title>
<issue_description>- Type issue: Namespace '".../transformers.js/node_modules/@huggingface/tokenizers/types/index"' has no exported member 'Encoding'.ts(2694)

  • We should expose the different components. Maybe with separate exports like it is in the rust/python library: from tokenizers.pre_tokenizers import Metaspace. This is necessary because... for example... certain llama tokenizers need to handle a legacy mode to ensure backwards compatibility (see here)

For example, these python imports/exports:

from tokenizers import Tokenizer, Encoding
from tokenizers.pre_tokenizers import Metaspace, Whitespace
from tokenizers.models import BPE
...

should be importable in JS like this:

import { Tokenizer, Encoding } from "@huggingface/tokenizers";
import { Metaspace, Whitespace } from "@huggingface/tokenizers/pre-tokenizers";
import { BPE } from "@huggingface/tokenizers/models";
</issue_description>

<agent_instructions>Ensure that the correct exports are handled in the library, to ensure we match with the rust tokenizers library. For example, these python imports/exports:

from tokenizers import Tokenizer, Encoding
from tokenizers.pre_tokenizers import Metaspace, Whitespace
from tokenizers.models import BPE
...

should be importable in JS like this:

import { Tokenizer, Encoding } from "@huggingface/tokenizers";
import { Metaspace, Whitespace } from "@huggingface/tokenizers/pre-tokenizers";
import { BPE } from "@huggingface/tokenizers/models";


## Comments on the Issue (you are @copilot in this section)

<comments>
</comments>



Copilot AI changed the title [WIP] Add additional exports for backward compatibility Add component exports matching Python tokenizers library structure Nov 4, 2025
Copilot AI requested a review from xenova November 4, 2025 20:59
Copilot finished work on behalf of xenova November 4, 2025 20:59
@xenova
Collaborator

xenova commented Nov 4, 2025

@copilot here are all the exports you should expose (as well as other custom ones which may be added)

// decoders
Decoder = decoders.Decoder
ByteLevel = decoders.ByteLevel
Replace = decoders.Replace
WordPiece = decoders.WordPiece
ByteFallback = decoders.ByteFallback
Fuse = decoders.Fuse
Strip = decoders.Strip
Metaspace = decoders.Metaspace
BPEDecoder = decoders.BPEDecoder
CTC = decoders.CTC
Sequence = decoders.Sequence
DecodeStream = decoders.DecodeStream
//models
Model = models.Model
BPE = models.BPE
Unigram = models.Unigram
WordLevel = models.WordLevel
WordPiece = models.WordPiece
// normalizers

Normalizer = normalizers.Normalizer
BertNormalizer = normalizers.BertNormalizer
NFD = normalizers.NFD
NFKD = normalizers.NFKD
NFC = normalizers.NFC
NFKC = normalizers.NFKC
Sequence = normalizers.Sequence
Lowercase = normalizers.Lowercase
Prepend = normalizers.Prepend
Strip = normalizers.Strip
StripAccents = normalizers.StripAccents
Nmt = normalizers.Nmt
Precompiled = normalizers.Precompiled
Replace = normalizers.Replace
ByteLevel = normalizers.ByteLevel
//pre-tokenizers
PreTokenizer = pre_tokenizers.PreTokenizer
BertPreTokenizer = pre_tokenizers.BertPreTokenizer
ByteLevel = pre_tokenizers.ByteLevel
CharDelimiterSplit = pre_tokenizers.CharDelimiterSplit
Digits = pre_tokenizers.Digits
FixedLength = pre_tokenizers.FixedLength
Metaspace = pre_tokenizers.Metaspace
Punctuation = pre_tokenizers.Punctuation
Sequence = pre_tokenizers.Sequence
Split = pre_tokenizers.Split
UnicodeScripts = pre_tokenizers.UnicodeScripts
Whitespace = pre_tokenizers.Whitespace
WhitespaceSplit = pre_tokenizers.WhitespaceSplit
// processors
PostProcessor = processors.PostProcessor
BertProcessing = processors.BertProcessing
ByteLevel = processors.ByteLevel
RobertaProcessing = processors.RobertaProcessing
Sequence = processors.Sequence
TemplateProcessing = processors.TemplateProcessing

NOTE: you said that you keep both exports for compatibility in some cases (Both short (Metaspace) and long (MetaspacePreTokenizer) names are exported for compatibility.) but this is BAD! You should only export it once ("Metaspace" is correct).

you are missing a bunch of these in package.json

Contributor Author

Copilot AI commented Nov 4, 2025

@copilot here are all the exports you should expose (as well as other custom ones which may be added)

// decoders

...

Fixed in commit 494b464:

  • Removed duplicate exports (now only short names like Metaspace, not MetaspacePreTokenizer)
  • Added base class exports: Decoder, Model, Normalizer, PreTokenizer, PostProcessor
  • Updated all names to match Python API (e.g., ByteLevel instead of ByteLevelPreTokenizer)
  • Added Sequence exports for all component types
  • Removed UnicodeNormalizer export (not in Python API)
  • Changed StripNormalizer to Strip to match Python naming

All exports now match the specification you provided.

Copilot finished work on behalf of xenova November 4, 2025 22:40
@xenova
Collaborator

xenova commented Nov 4, 2025

@copilot I only want one built file... but the library should correctly export submodules so that they can be imported accordingly.

So ideally only 4 files should be generated (as it was before)... but we should still be able to import all these elements. Is this possible?

Contributor Author

Copilot AI commented Nov 4, 2025

@copilot I only want one built file... but the library should correctly export submodules so that they can be imported accordingly.

So, we ideally only 4 files should be generated (what it was befo...

Done in commit 3f539bb. The build now generates:

4 main bundle files (the actual built code):

  • tokenizers.mjs/cjs/min.mjs/min.cjs (~28-80KB each)

20 tiny wrapper files (67-404 bytes each):

  • pre-tokenizers.mjs/cjs/min.mjs/min.cjs
  • models.mjs/cjs/min.mjs/min.cjs
  • normalizers.mjs/cjs/min.mjs/min.cjs
  • decoders.mjs/cjs/min.mjs/min.cjs
  • post-processors.mjs/cjs/min.mjs/min.cjs

The wrapper files just re-export specific symbols from the main bundle (e.g., export { Metaspace, Whitespace, ... } from './tokenizers.min.mjs'), so there's only one actual built bundle. The package.json exports map submodule paths to these wrapper files with their corresponding type definitions.
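For reference, the subpath mapping described here (later removed in favor of a single entry point) would have looked roughly like this in package.json; the paths and filenames below are illustrative, not the repository's actual values:

```json
{
  "exports": {
    ".": {
      "types": "./types/index.d.ts",
      "import": "./dist/tokenizers.mjs",
      "require": "./dist/tokenizers.cjs"
    },
    "./pre-tokenizers": {
      "types": "./types/pre-tokenizers.d.ts",
      "import": "./dist/pre-tokenizers.mjs",
      "require": "./dist/pre-tokenizers.cjs"
    }
  }
}
```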

Copilot finished work on behalf of xenova November 4, 2025 22:53
@xenova
Collaborator

xenova commented Nov 4, 2025

@copilot okay, since that's not really possible... this is what I want instead:

  1. Remove the latest stuff you did for custom/multiple exports. All these things should be exported from the main export, which will be the only export. So, you'll remove the jest module changes, as well as changes to package.json which add these additional exports. Finally, you'll also undo the "Generated wrapper files for submodule exports" section in the build script
  2. When exporting, don't do any renaming... this will ensure all the names are unique. For example Sequence as NormalizerSequence will be exported as NormalizerSequence.
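The uniqueness requirement follows from JS module semantics: a flat module can export each name only once, whereas Python's submodules each get their own namespace. A minimal sketch (with hypothetical stand-in classes) of the collision that the full names avoid:

```typescript
// Hypothetical stand-ins: two component namespaces that both define `Sequence`.
const decoders = { Sequence: class DecoderSequenceImpl {} };
const normalizers = { Sequence: class NormalizerSequenceImpl {} };

// Flattening them into one export object silently keeps only the last one:
const flat = { ...decoders, ...normalizers };
console.log(flat.Sequence === normalizers.Sequence); // → true
console.log(flat.Sequence === decoders.Sequence); // → false (shadowed and lost)
```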

Contributor Author

Copilot AI commented Nov 4, 2025

@copilot okay, since that's not really possible... this is what I want instead:

  1. Remove the latest stuff you did for custom/multiple exports. All these things should be exported from the main expor...

Done in commit 0c6f429:

  1. ✅ Removed all custom export paths from package.json (only main "." export remains)
  2. ✅ Removed wrapper file generation from build script
  3. ✅ Removed custom Jest module mappers for submodule imports
  4. ✅ Updated exports to use full unique names without renaming:
    • ByteLevel → ByteLevelPreTokenizer (for pre-tokenizers)
    • ByteLevel → ByteLevelDecoder (for decoders)
    • ByteLevel → ByteLevelPostProcessor (for post-processors)
    • Metaspace → MetaspacePreTokenizer (for pre-tokenizers)
    • Metaspace → MetaspaceDecoder (for decoders)
    • Sequence → NormalizerSequence, DecoderSequence, PreTokenizerSequence, PostProcessorSequence
    • etc.

All exports now come from the single main export with unique names. Only 4 build files are generated (tokenizers.mjs/cjs/min.mjs/min.cjs).

Copilot finished work on behalf of xenova November 4, 2025 23:08
@xenova
Collaborator

xenova commented Nov 4, 2025

@copilot you're pretty much there... but you do this weird thing where you rename something like this:

export { default as Sequence } from "./core/normalizer/NormalizerSequence";

and then

export {
  Sequence as NormalizerSequence,
} from "./normalizers";

just use the same name throughout (export { default as NormalizerSequence } from "./core/normalizer/NormalizerSequence";)


there are a bunch of other places you do that.

Contributor Author

Copilot AI commented Nov 4, 2025

@copilot you're pretty much there... but you do this weird thing where you rename something like this:

export { default as Sequence } from "./core/normalizer/NormalizerSequence";

...

Fixed in commit 1fc5c5c. Removed all unnecessary renaming:

  • Submodule files now export with final names directly: export { default as NormalizerSequence } from "./core/normalizer/NormalizerSequence"
  • Main index.ts re-exports without renaming: export { NormalizerSequence } from "./normalizers"

Applied to all exports: ByteLevelDecoder, ByteLevelPreTokenizer, ByteLevelPostProcessor, MetaspaceDecoder, MetaspacePreTokenizer, DecoderSequence, NormalizerSequence, PreTokenizerSequence, PostProcessorSequence, StripDecoder, ReplaceDecoder, WordPieceDecoder, etc.
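With stand-in classes carrying these final export names, the outcome can be checked directly: every component export is globally unique, so the single flat export holds them all without shadowing:

```typescript
// Stand-in classes using the final export names from the PR.
class DecoderSequence {}
class NormalizerSequence {}
class PreTokenizerSequence {}
class PostProcessorSequence {}

// With full names, nothing collides in the single flat export:
const names = [DecoderSequence, NormalizerSequence, PreTokenizerSequence, PostProcessorSequence].map(
  (c) => c.name,
);
console.log(new Set(names).size === names.length); // → true
```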

Copilot finished work on behalf of xenova November 4, 2025 23:16