Remove padding and truncation logic #8

Copilot · 2025-11-04T20:35:30Z

Padding and truncation are tensor-shaping concerns that don't belong in a library with no tensor concept. Moving this logic to transformers.js where it's actually needed.
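
To illustrate why padding is a tensor-shaping concern, here is a minimal sketch of what a downstream consumer might do with ragged token id lists. The helper name `padToLongest`, the pad id `0`, and the sample ids are assumptions made for illustration; none of them come from tokenizers.js or transformers.js.

```typescript
// Hypothetical helper (not part of tokenizers.js): pads ragged id sequences
// on the right so every row has the same length, i.e. a fixed [batch, length] shape.
function padToLongest(batch: number[][], padId: number): number[][] {
  // Length of the longest sequence in the batch (0 for an empty batch).
  const maxLen = batch.reduce((m, ids) => Math.max(m, ids.length), 0);
  return batch.map((ids) => [
    ...ids,
    ...new Array<number>(maxLen - ids.length).fill(padId),
  ]);
}

// Arbitrary example ids; real ids depend on the tokenizer's vocabulary.
const ragged: number[][] = [[1, 2, 3], [4, 5]];
const rectangular = padToLongest(ragged, 0);
console.log(rectangular);
```

A plain tokenizer can return the ragged lists as-is; only the code that builds a fixed-shape tensor needs the rectangular form.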

Changes

Type definitions (src/static/tokenizer.d.ts):

Removed TruncationConfig and PaddingConfig interfaces
Removed truncation and padding fields from TokenizerJSON
Removed from TokenizerConfig: padding_side, truncation_side, model_max_length, max_length, stride, pad_to_multiple_of

Implementation (src/core/Tokenizer.ts):

Removed unused model_max_length field
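
As a rough sketch of where the removed fields could live instead, a dependent library might keep its own options type and apply truncation to already-encoded ids. The interface name `ShapingOptions`, the function `truncateIds`, and the field types are assumptions; only the field names `max_length` and `truncation_side` are taken from the list above.

```typescript
// Assumed shape for options owned by a downstream library, reusing two of the
// field names removed from TokenizerConfig. The types are guesses for illustration.
interface ShapingOptions {
  max_length: number;
  truncation_side: "left" | "right";
}

// Hypothetical helper: returns a copy of `ids` cut down to at most max_length entries.
function truncateIds(ids: number[], opts: ShapingOptions): number[] {
  if (ids.length <= opts.max_length) return ids.slice();
  return opts.truncation_side === "left"
    ? ids.slice(ids.length - opts.max_length) // drop from the start, keep the tail
    : ids.slice(0, opts.max_length); // drop from the end, keep the head
}

const keptTail = truncateIds([1, 2, 3, 4, 5], { max_length: 3, truncation_side: "left" });
const keptHead = truncateIds([1, 2, 3, 4, 5], { max_length: 3, truncation_side: "right" });
console.log(keptTail, keptHead);
```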

Notes

Special token definitions (pad_token, bos_token, etc.) remain unchanged—these are tokenization concerns, not padding logic
encode() already only accepts string, not string[]—no changes needed
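
Since `encode()` takes a single string, a caller that needs a batch can map over its inputs. The sketch below assumes only that `encode()` returns an object with an `ids` array (as used in the commands logged later in this PR); `SingleStringEncoder`, `encodeBatch`, and `fakeTokenizer` are hypothetical names, and the stand-in encoder is not a real tokenizer.

```typescript
// Minimal structural type: just the part of a tokenizer this sketch relies on.
interface SingleStringEncoder {
  encode(text: string): { ids: number[] };
}

// Hypothetical wrapper a dependent library could provide for string[] input.
function encodeBatch(tokenizer: SingleStringEncoder, texts: string[]): number[][] {
  return texts.map((text) => tokenizer.encode(text).ids);
}

// Stand-in encoder for demonstration only: one id per character code point.
const fakeTokenizer: SingleStringEncoder = {
  encode: (text) => ({ ids: Array.from(text, (ch) => ch.codePointAt(0) ?? 0) }),
};

const encodedBatch = encodeBatch(fakeTokenizer, ["Hi", "Hello"]);
console.log(encodedBatch.map((ids) => ids.length));
```

The result is intentionally ragged; padding or truncating it to a fixed shape is left to whichever layer builds tensors.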

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

huggingface.co
- Triggering command: node -e import { Tokenizer } from "@huggingface/tokenizers"; const modelId = "hf-internal-testing/tiny-random-LlamaForCausalLM"; const tokenizerJson = await fetch(REDACTED${modelId}/resolve/main/tokenizer.json).then(res => res.json()); const tokenizerConfig = await fetch(REDACTED${modelId}/resolve/main/tokenizer_config.json`).then(res => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize('Hello World');
const encoded = tokenizer.encode('Hello World');
const decoded = tokenizer.decode(encoded.ids);

console.log(tokens);
console.log(encoded);
console.log(decoded);
--input-type=module` (dns block)

Triggering command: node -e const { Tokenizer } = require("@huggingface/tokenizers");(async function() { const modelId = "hf-internal-testing/tiny-random-LlamaForCausalLM"; const tokenizerJson = await fetch(REDACTED${modelId}/resolve/main/tokenizer.json).then(res => res.json()); const tokenizerConfig = await fetch(REDACTED${modelId}/resolve/main/tokenizer_config.json`).then(res => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize('Hello World');
const encoded = tokenizer.encode('Hello World');
const decoded = tokenizer.decode(encoded.ids);

console.log(tokens);
console.log(encoded);
console.log(decoded);
})();` (dns block)

Triggering command: node -e (async function() { const { Tokenizer } = await import("@huggingface/tokenizers"); const modelId = "hf-internal-testing/tiny-random-LlamaForCausalLM"; const tokenizerJson = await fetch(REDACTED${modelId}/resolve/main/tokenizer.json).then(res => res.json()); const tokenizerConfig = await fetch(REDACTED${modelId}/resolve/main/tokenizer_config.json`).then(res => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize('Hello World');
const encoded = tokenizer.encode('Hello World');
const decoded = tokenizer.decode(encoded.ids);

console.log(tokens);
console.log(encoded);
console.log(decoded);
})();` (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled

Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Original prompt

This section details the original issue you should resolve

<issue_title>Remove padding and truncation logic</issue_title>
<issue_description>Remove any padding/truncation logic from tokenizers.js, and move that to transformers.js. This makes sense actually since the only purpose of padding/truncation is to create a structure which can be converted to a tensor (fixed shape). But since tokenizers.js has no concept of Tensor, we don't need this logic in it.</issue_description>

<agent_instructions>Remove any padding/truncation logic from the entire library. We should only be able to encode string, not string[]. We will eventually move this logic to a dependent (transformers.js, but you don't need to worry about that).</agent_instructions>

Comments on the Issue (you are @copilot in this section)

Fixes Remove padding and truncation logic #7

Co-authored-by: xenova <[email protected]>

xenova

Forgot to remove truncate helper, which I've now removed

Initial plan

35454f0

Copilot AI assigned Copilot and xenova Nov 4, 2025

Copilot started work on behalf of xenova November 4, 2025 20:36

Remove padding and truncation logic from tokenizers.js

4078aaa

Co-authored-by: xenova <[email protected]>

Copilot AI changed the title ~~[WIP] Remove padding and truncation logic from tokenizers.js~~ Remove padding and truncation logic Nov 4, 2025

Copilot AI requested a review from xenova November 4, 2025 20:47

Copilot finished work on behalf of xenova November 4, 2025 20:47

Remove truncate helper

f90206f

xenova marked this pull request as ready for review November 4, 2025 22:25

xenova approved these changes Nov 4, 2025

xenova merged commit 3c1b582 into main Nov 4, 2025
6 checks passed

xenova deleted the copilot/remove-padding-truncation-logic branch November 4, 2025 23:11
