chapters/en/chapter3/3.mdx (+4 −4)
@@ -58,7 +58,7 @@ model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_label
You will notice that unlike in [Chapter 2](/course/chapter2), you get a warning after instantiating this pretrained model. This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.
-Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now — the `model`, the `training_args`, the training and validation datasets, our `data_collator`, and our `tokenizer`:
+Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now — the `model`, the `training_args`, the training and validation datasets, our `data_collator`, and our `processing_class` (e.g., a tokenizer, feature extractor, or processor):
```py
from transformers import Trainer
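For context, the warning described in the paragraph above comes from instantiating the classification model shown in the hunk header. Here is a minimal sketch of that setup; the `bert-base-uncased` checkpoint and `num_labels=2` (for the two MRPC classes) are taken from the surrounding chapter, not from this diff, so treat them as assumptions:

```py
from transformers import AutoModelForSequenceClassification

# Assumed from the surrounding chapter (not shown in this diff):
checkpoint = "bert-base-uncased"

# Loading a bare pretrained checkpoint with a sequence-classification head
# discards the pretraining head and randomly initializes a new one, which
# is what triggers the warning discussed above.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```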
@@ -69,11 +69,11 @@ trainer = Trainer(
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
-tokenizer=tokenizer,
+processing_class=tokenizer,
)
```
-Note that when you pass the `tokenizer` as we did here, the default `data_collator` used by the `Trainer` will be a `DataCollatorWithPadding` as defined previously, so you can skip the line `data_collator=data_collator` in this call. It was still important to show you this part of the processing in section 2!
+Note that when you pass a tokenizer or feature extractor as the `processing_class`, as we did here, the default `data_collator` used by the `Trainer` will be a `DataCollatorWithPadding`, so you can skip the line `data_collator=data_collator` in this call. It was still important to show you this part of the processing in section 2!
To fine-tune the model on our dataset, we just have to call the `train()` method of our `Trainer`:
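Putting the hunks together, here is a sketch of how the `Trainer` construction and the training call read after this change. It assumes a recent version of `transformers` (where `processing_class` superseded the deprecated `tokenizer` argument) and reuses `model`, `training_args`, `tokenized_datasets`, and `tokenizer` as defined earlier in the chapter:

```py
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    processing_class=tokenizer,
    # `data_collator=data_collator` is omitted: as the note above explains,
    # a tokenizer passed as `processing_class` makes the Trainer default to
    # DataCollatorWithPadding.
)

# Fine-tune on the dataset, as the last context line of the diff describes.
trainer.train()
```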