Skip to content

Conversation

@CaptainArshia
Copy link

@CaptainArshia CaptainArshia commented Nov 30, 2025

image image

This commit fixes a data processing bug in the tokenization examples across all language translations of Chapter 3, Section 2.

The Problem:
The code was passing dataset columns directly to the tokenizer, which caused compatibility issues.

The Fix:
Converted the dataset columns to lists before tokenization by wrapping them in list():

Changed: raw_datasets["train"]["sentence1"]
To: list(raw_datasets["train"]["sentence1"])

Impact:
This change was applied consistently across all languages versions to ensure the code examples work correctly when tokenizing sentence pairs from the MRPC dataset.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants