Add component exports matching Python tokenizers library structure #10
Co-authored-by: xenova <[email protected]>
@copilot here are all the exports you should expose (as well as other custom ones which may be added) NOTE: you said that you keep both exports for compatibility in some cases (Both short (Metaspace) and long (MetaspacePreTokenizer) names are exported for compatibility.) but this is BAD! You should only export it once ("Metaspace" is correct). You are also missing a bunch of these in package.json
… use correct short names Co-authored-by: xenova <[email protected]>
... Fixed in commit 494b464:
All exports now match the specification you provided.
@copilot I only want one built file... but the library should correctly export submodules so that they can be imported accordingly. So ideally only 4 files should be generated (what it was before)... but we should still be able to import multiple elements. Is this possible?
Co-authored-by: xenova <[email protected]>
Done in commit 3f539bb. The build now generates: 4 main bundle files (the actual built code):
20 tiny wrapper files (67-404 bytes each):
The wrapper files just re-export specific symbols from the main bundle (e.g.,
@copilot okay, since that's not really possible... this is what I want instead:
Co-authored-by: xenova <[email protected]>
Done in commit 0c6f429:
All exports now come from the single main export with unique names. Only 4 build files are generated (tokenizers.mjs/cjs/min.mjs/min.cjs).
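To illustrate why a single flat export needs unique names, here is a minimal sketch. It assumes the short names follow the Python tokenizers layout, where ByteLevel, Metaspace, and Sequence each exist in more than one submodule. The helper findCollisions and the two name tables are hypothetical and are not part of @huggingface/tokenizers; they only model the naming problem.

```typescript
// Hypothetical helper (not part of the library): returns every export name
// that is claimed by more than one category, sorted alphabetically.
function findCollisions(categories: Record<string, string[]>): string[] {
  // Prototype-free object so names like "constructor" cannot match by accident.
  const owner: Record<string, string> = Object.create(null);
  const collisions: string[] = [];
  for (const category of Object.keys(categories)) {
    for (const name of categories[category]) {
      if (name in owner && owner[name] !== category) {
        if (collisions.indexOf(name) === -1) collisions.push(name);
      } else {
        owner[name] = category;
      }
    }
  }
  return collisions.sort();
}

// Short names as used per submodule in the Python library (illustrative subset).
const shortNames: Record<string, string[]> = {
  "pre-tokenizers": ["ByteLevel", "Metaspace", "Sequence"],
  decoders: ["ByteLevel", "Metaspace", "Sequence"],
  "post-processors": ["ByteLevel", "Sequence"],
};

// Unique names as listed in this PR for the single main export.
const uniqueNames: Record<string, string[]> = {
  "pre-tokenizers": ["ByteLevelPreTokenizer", "MetaspacePreTokenizer", "PreTokenizerSequence"],
  decoders: ["ByteLevelDecoder", "MetaspaceDecoder", "DecoderSequence"],
  "post-processors": ["ByteLevelPostProcessor", "PostProcessorSequence"],
};

console.log(findCollisions(shortNames));  // [ 'ByteLevel', 'Metaspace', 'Sequence' ]
console.log(findCollisions(uniqueNames)); // []
```

With submodule imports the short names are unambiguous because the import path carries the category; once everything is flattened into one namespace, the category has to move into the name itself.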
@copilot you're pretty much there... but you do this weird thing where you rename something like this: and then just use the same name throughout (export NormalizerSequence from "./core/normalizer/NormalizerSequence";) there are a bunch of other places you do that.
Co-authored-by: xenova <[email protected]>
... Fixed in commit 1fc5c5c. Removed all unnecessary renaming:
Applied to all exports:
Plan to Add Additional Exports
- Encoding export to main index (src/index.ts)
- src/pre-tokenizers.ts to export pre-tokenizer classes
- src/models.ts to export model classes (BPE, Unigram, WordPiece)
- src/normalizers.ts to export normalizer classes
- src/decoders.ts to export decoder classes
- src/post-processors.ts to export post-processor classes

Final Implementation
All components are now exported from the main @huggingface/tokenizers export with unique names:
- Pre-tokenizers: ByteLevelPreTokenizer, MetaspacePreTokenizer, PreTokenizerSequence, etc.
- Models: BPE, Unigram, WordPiece, Model
- Normalizers: NormalizerSequence, Strip, Replace, etc.
- Decoders: ByteLevelDecoder, MetaspaceDecoder, DecoderSequence, etc.
- Post-processors: ByteLevelPostProcessor, PostProcessorSequence, etc.

The export chain is now clean:
- Category files (such as decoders.ts) export with final names: export { default as ByteLevelDecoder } from "./core/decoder/ByteLevelDecoder"
- The main index re-exports them without renaming: export { ByteLevelDecoder } from "./decoders"

Only 4 build files are generated (tokenizers.mjs/cjs/min.mjs/min.cjs); all exports come from the single main bundle.
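The two-step chain can be modelled in a single file to show that the name is fixed once and never changes on the way to the consumer. This is only a sketch: plain objects stand in for the real modules (decodersModule for src/decoders.ts, mainIndex for src/index.ts), and the class body is a placeholder, not the library's ByteLevelDecoder implementation.

```typescript
// Stand-in for ./core/decoder/ByteLevelDecoder (a default export in the PR).
// Placeholder body only; the real class does actual decoding work.
class ByteLevelDecoder {
  readonly kind = "ByteLevelDecoder";
}

// Stand-in for src/decoders.ts: the final public name is chosen here, once.
const decodersModule = { ByteLevelDecoder };

// Stand-in for src/index.ts: passes the same binding through under the same name.
const mainIndex = { ...decodersModule };

// Consumer side: the only name ever seen is the final one.
const { ByteLevelDecoder: Imported } = mainIndex;

const sameClass = Imported === ByteLevelDecoder;
const exportedNames = Object.keys(mainIndex);
console.log(sameClass, exportedNames);
```

Because no alias is introduced between the category file and the index, searching the codebase for ByteLevelDecoder finds every link in the chain.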