Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Nov 4, 2025

Padding and truncation are tensor-shaping concerns that don't belong in a library with no tensor concept. Moving this logic to transformers.js where it's actually needed.

Changes

Type definitions (src/static/tokenizer.d.ts):

  • Removed TruncationConfig and PaddingConfig interfaces
  • Removed truncation and padding fields from TokenizerJSON
  • Removed from TokenizerConfig: padding_side, truncation_side, model_max_length, max_length, stride, pad_to_multiple_of

Implementation (src/core/Tokenizer.ts):

  • Removed unused model_max_length field

Notes

  • Special token definitions (pad_token, bos_token, etc.) remain unchanged—these are tokenization concerns, not padding logic
  • encode() already only accepts string, not string[]—no changes needed

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • huggingface.co
    • Triggering command: node -e import { Tokenizer } from "@huggingface/tokenizers"; const modelId = "hf-internal-testing/tiny-random-LlamaForCausalLM"; const tokenizerJson = await fetch(REDACTED${modelId}/resolve/main/tokenizer.json).then(res => res.json()); const tokenizerConfig = await fetch(REDACTED${modelId}/resolve/main/tokenizer_config.json`).then(res => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize('Hello World');
const encoded = tokenizer.encode('Hello World');
const decoded = tokenizer.decode(encoded.ids);

console.log(tokens);
console.log(encoded);
console.log(decoded);
--input-type=module` (dns block)

  • Triggering command: node -e const { Tokenizer } = require("@huggingface/tokenizers");(async function() { const modelId = "hf-internal-testing/tiny-random-LlamaForCausalLM"; const tokenizerJson = await fetch(REDACTED${modelId}/resolve/main/tokenizer.json).then(res => res.json()); const tokenizerConfig = await fetch(REDACTED${modelId}/resolve/main/tokenizer_config.json`).then(res => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize('Hello World');
const encoded = tokenizer.encode('Hello World');
const decoded = tokenizer.decode(encoded.ids);

console.log(tokens);
console.log(encoded);
console.log(decoded);
})();` (dns block)

  • Triggering command: node -e (async function() { const { Tokenizer } = await import("@huggingface/tokenizers"); const modelId = "hf-internal-testing/tiny-random-LlamaForCausalLM"; const tokenizerJson = await fetch(REDACTED${modelId}/resolve/main/tokenizer.json).then(res => res.json()); const tokenizerConfig = await fetch(REDACTED${modelId}/resolve/main/tokenizer_config.json`).then(res => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize('Hello World');
const encoded = tokenizer.encode('Hello World');
const decoded = tokenizer.decode(encoded.ids);

console.log(tokens);
console.log(encoded);
console.log(decoded);
})();` (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section details on the original issue you should resolve

<issue_title>Remove padding and truncation logic</issue_title>
<issue_description>Remove any padding/truncation logic from tokenizers.js, and move that to transformers.js. This makes sense actually since the only purpose of padding/truncation is to create a structure which can be converted to a tensor (fixed shape). But since tokenizers.js has no concept of Tensor, we don't need this logic in it.</issue_description>

<agent_instructions>Remove any padding/truncation logic from the entire library. We should only be able to encode string, not string[]. We will eventually move this logic to a dependent (transformers.js, but you don't need to worry about that).</agent_instructions>

Comments on the Issue (you are @copilot in this section)


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI changed the title [WIP] Remove padding and truncation logic from tokenizers.js Remove padding and truncation logic Nov 4, 2025
Copilot AI requested a review from xenova November 4, 2025 20:47
Copilot finished work on behalf of xenova November 4, 2025 20:47
@xenova xenova marked this pull request as ready for review November 4, 2025 22:25
Copy link
Collaborator

@xenova xenova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to remove truncate helper, which I've now removed

@xenova xenova merged commit 3c1b582 into main Nov 4, 2025
6 checks passed
@xenova xenova deleted the copilot/remove-padding-truncation-logic branch November 4, 2025 23:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove padding and truncation logic

2 participants