Remove padding and truncation logic #8
Merged (+0 −53)
Padding and truncation are tensor-shaping concerns that don't belong in a library with no tensor concept. Moving this logic to transformers.js where it's actually needed.
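For context, a downstream consumer such as transformers.js can reimplement this logic on the plain token-id arrays returned by `encode()`. The sketch below is illustrative only: `padOrTruncate`, `maxLength`, and `padTokenId` are hypothetical names, not part of this library's API.

```typescript
// Sketch: pad or truncate a token-id sequence to a fixed length downstream,
// after the tokenizer has produced the ids. Right-padding/right-truncation
// is shown; a real implementation would also honor padding_side/truncation_side.
function padOrTruncate(ids: number[], maxLength: number, padTokenId: number): number[] {
  if (ids.length > maxLength) {
    return ids.slice(0, maxLength); // truncate to maxLength
  }
  // pad on the right up to maxLength
  return ids.concat(Array(maxLength - ids.length).fill(padTokenId));
}

console.log(padOrTruncate([1, 2, 3], 5, 0));          // [1, 2, 3, 0, 0]
console.log(padOrTruncate([1, 2, 3, 4, 5, 6], 5, 0)); // [1, 2, 3, 4, 5]
```

Keeping this shaping step out of the tokenizer means the library's output stays a variable-length array, and batching concerns live next to the tensor code that needs them.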
Changes

Type definitions (`src/static/tokenizer.d.ts`):
- Removed the `TruncationConfig` and `PaddingConfig` interfaces
- Removed the `truncation` and `padding` fields from `TokenizerJSON`
- Removed from `TokenizerConfig`: `padding_side`, `truncation_side`, `model_max_length`, `max_length`, `stride`, `pad_to_multiple_of`

Implementation (`src/core/Tokenizer.ts`):
- Removed the `model_max_length` field

Notes
- Special tokens (`pad_token`, `bos_token`, etc.) remain unchanged; these are tokenization concerns, not padding logic
- `encode()` already accepts only `string`, not `string[]`, so no changes were needed there

Warning
Firewall rules blocked me from connecting to one or more addresses.
I tried to connect to the following addresses, but was blocked by firewall rules:
- `huggingface.co` (dns block)
- `node -e '…' --input-type=module` (dns block), where the blocked script was:

```js
import { Tokenizer } from "@huggingface/tokenizers";

const modelId = "hf-internal-testing/tiny-random-LlamaForCausalLM";
const tokenizerJson = await fetch(`REDACTED${modelId}/resolve/main/tokenizer.json`).then(res => res.json());
const tokenizerConfig = await fetch(`REDACTED${modelId}/resolve/main/tokenizer_config.json`).then(res => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize('Hello World');
const encoded = tokenizer.encode('Hello World');
const decoded = tokenizer.decode(encoded.ids);
console.log(tokens);
console.log(encoded);
console.log(decoded);
```