Commit 5117a3f

Author: Grzegorz Pluto-Prondzinski
Update dataset names in README files (#2330)
1 parent 8a53c69 · commit 5117a3f

File tree

5 files changed (+21 −36 lines)

examples/audio-classification/README.md

Lines changed: 7 additions & 9 deletions
````diff
@@ -29,13 +29,13 @@ pip install -r requirements.txt
 
 ## Single-HPU
 
-The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on the 🗣️ [Keyword Spotting subset](https://huggingface.co/datasets/superb#ks) of the SUPERB dataset on a single HPU.
+The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on the 🗣️ [Keyword Spotting subset](https://huggingface.co/datasets/regisss/superb_ks) of the SUPERB dataset on a single HPU.
 
 ```bash
 PT_HPU_LAZY_MODE=1 python run_audio_classification.py \
     --model_name_or_path facebook/wav2vec2-base \
-    --dataset_name superb \
-    --dataset_config_name ks \
+    --dataset_name regisss/superb_ks \
+    --dataset_config_name default \
     --output_dir /tmp/wav2vec2-base-ft-keyword-spotting \
     --overwrite_output_dir \
     --remove_unused_columns False \
@@ -58,7 +58,6 @@ PT_HPU_LAZY_MODE=1 python run_audio_classification.py \
     --throughput_warmup_steps 3 \
     --sdp_on_bf16 \
     --bf16 \
-    --trust_remote_code True \
     --attn_implementation gaudi_fused_sdpa
 ```
 
@@ -69,13 +68,13 @@ On a single HPU, this script should run in ~13 minutes and yield an accuracy of
 
 ## Multi-HPU
 
-The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) for 🌎 **Language Identification** on the [CommonLanguage dataset](https://huggingface.co/datasets/anton-l/common_language) on 8 HPUs.
+The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) for 🌎 **Language Identification** on the [CommonLanguage dataset](https://huggingface.co/datasets/regisss/common_language) on 8 HPUs.
 
 ```bash
 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_audio_classification.py \
     --model_name_or_path facebook/wav2vec2-base \
-    --dataset_name common_language \
+    --dataset_name regisss/common_language \
     --audio_column_name audio \
     --label_column_name language \
     --output_dir /tmp/wav2vec2-base-lang-id \
@@ -117,8 +116,8 @@ For instance, you can run inference with Wav2Vec2 on the Keyword Spotting subset
 ```bash
 PT_HPU_LAZY_MODE=1 python run_audio_classification.py \
     --model_name_or_path facebook/wav2vec2-base \
-    --dataset_name superb \
-    --dataset_config_name ks \
+    --dataset_name regisss/superb_ks \
+    --dataset_config_name default \
     --output_dir /tmp/wav2vec2-base-ft-keyword-spotting \
     --overwrite_output_dir \
     --remove_unused_columns False \
@@ -133,7 +132,6 @@ PT_HPU_LAZY_MODE=1 python run_audio_classification.py \
     --use_hpu_graphs_for_inference \
     --throughput_warmup_steps 3 \
     --gaudi_config_name Habana/wav2vec2 \
-    --trust_remote_code
 ```
 
 
````
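Every hunk above follows the same mechanical pattern: a script-backed dataset (which required `--trust_remote_code`) is swapped for a data-only mirror under the `regisss` namespace whose single config is `default`, and the trust flag is dropped. A rough sketch of that rewrite as a flag-list transformation — the mapping table is read off this diff and is illustrative, not exhaustive, and the helper name is invented:

```python
# Sketch of the rewrite this commit applies to the example commands: swap
# script-backed dataset flags for their data-only mirrors and drop
# --trust_remote_code. The mapping is read off this diff, not exhaustive.
REMAP = {
    ("superb", "ks"): ("regisss/superb_ks", "default"),
    ("common_language", None): ("regisss/common_language", None),
}

def rewrite_flags(tokens):
    """Return the flag list with dataset names remapped and trust flags removed."""
    out, kv, i = [], {}, 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("--dataset_name", "--dataset_config_name"):
            kv[tok] = tokens[i + 1]
            i += 2
        elif tok == "--trust_remote_code":
            # The flag may appear bare or with an explicit True/False value.
            i += 2 if i + 1 < len(tokens) and tokens[i + 1] in ("True", "False") else 1
        else:
            out.append(tok)
            i += 1
    key = (kv.get("--dataset_name"), kv.get("--dataset_config_name"))
    name, config = REMAP.get(key, key)
    if name is not None:
        out += ["--dataset_name", name]
    if config is not None:
        out += ["--dataset_config_name", config]
    return out
```

Applied to the first hunk's flags, this turns `--dataset_name superb --dataset_config_name ks --trust_remote_code True` into `--dataset_name regisss/superb_ks --dataset_config_name default`.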
examples/contrastive-image-text/README.md

Lines changed: 4 additions & 10 deletions
````diff
@@ -35,7 +35,7 @@ pip install -r requirements.txt
 **Recommended (datasets>=4.0.0):** use the COCO captions dataset hosted on the Hub. It provides image–caption pairs and does **not** require `trust_remote_code`:
 ```python
 import datasets
-ds = datasets.load_dataset("sentence-transformers/coco-captions", split="train")
+ds = datasets.load_dataset("regisss/coco_2017", split="train")
 ```
 This dataset exposes at least the columns `image` (PIL image) and `caption` (string).
 If you prefer local files, you can also use the built-in Datasets `imagefolder` builder (not a placeholder) to load images/captions from a directory (it typically expects a small CSV/JSON with columns such as `image_path` and `caption`).
@@ -84,7 +84,7 @@ Run the following command for single-device training:
 python run_clip.py \
     --output_dir ./clip-roberta-finetuned \
     --model_name_or_path ./clip-roberta \
-    --dataset_name sentence-transformers/coco-captions \
+    --dataset_name regisss/coco_2017 \
     --image_column image \
     --caption_column caption \
     --remove_unused_columns=False \
@@ -100,7 +100,6 @@ python run_clip.py \
     --dataloader_num_workers 16 \
     --sdp_on_bf16 \
     --bf16 \
-    --trust_remote_code \
     --torch_compile_backend=hpu_backend \
     --torch_compile
 ```
@@ -115,7 +114,7 @@ PT_ENABLE_INT64_SUPPORT=1 \
 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_clip.py \
     --output_dir=/tmp/clip_roberta \
     --model_name_or_path=./clip-roberta \
-    --dataset_name sentence-transformers/coco-captions \
+    --dataset_name regisss/coco_2017 \
     --image_column image \
     --caption_column caption \
     --remove_unused_columns=False \
@@ -136,7 +135,6 @@ python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_clip.py \
     --torch_compile_backend=hpu_backend \
     --torch_compile \
     --logging_nan_inf_filter \
-    --trust_remote_code
 ```
 
 > [!NOTE]
@@ -159,7 +157,6 @@ PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridget
     --output_dir /tmp/bridgetower-test \
     --model_name_or_path BridgeTower/bridgetower-large-itm-mlm-itc \
     --dataset_name jmhessel/newyorker_caption_contest --dataset_config_name matching \
-    --dataset_revision 3c6c4f6c0ff7e902833d3afa5f8f3875c2b036e6 \
     --image_column image --caption_column image_description \
     --remove_unused_columns=False \
     --do_train --do_eval --do_predict \
@@ -172,8 +169,6 @@ PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridget
     --throughput_warmup_steps 3 \
     --logging_steps 10 \
     --dataloader_num_workers 1 \
-    --mediapipe_dataloader \
-    --trust_remote_code \
     --sdp_on_bf16
 ```
 
@@ -190,7 +185,7 @@ For instance, you can run inference with CLIP on COCO on 1 Gaudi card with the f
 PT_HPU_LAZY_MODE=1 python run_clip.py \
     --output_dir ./clip-roberta-finetuned \
     --model_name_or_path ./clip-roberta \
-    --dataset_name sentence-transformers/coco-captions \
+    --dataset_name regisss/coco_2017 \
     --image_column image \
     --caption_column caption \
     --remove_unused_columns=False \
@@ -204,7 +199,6 @@ PT_HPU_LAZY_MODE=1 python run_clip.py \
     --bf16 \
     --sdp_on_bf16 \
     --mediapipe_dataloader \
-    --trust_remote_code
 ```
 
 > [!NOTE]
````
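The README text in the first hunk mentions the `imagefolder` builder as a local-files alternative. A minimal sketch of the directory layout it reads — note that `imagefolder` keys its metadata on a `file_name` column (the `image_path`/`caption` CSV mentioned there is the same idea); the file names and captions below are made up:

```python
import json
import os
import tempfile

# Sketch: a directory layout the `datasets` imagefolder builder can read.
# imagefolder keys its metadata on a `file_name` column; any extra columns
# (here `caption`) become dataset columns. File names and captions are made up.
root = tempfile.mkdtemp()
rows = [
    {"file_name": "cat.jpg", "caption": "a cat on a sofa"},
    {"file_name": "dog.jpg", "caption": "a dog in the park"},
]
with open(os.path.join(root, "metadata.jsonl"), "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# With real image files placed next to metadata.jsonl, loading would then be:
# ds = datasets.load_dataset("imagefolder", data_dir=root, split="train")
```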

examples/speech-recognition/README.md

Lines changed: 6 additions & 12 deletions
````diff
@@ -231,13 +231,11 @@ recognition on one of the well known speech recognition datasets similar to show
 We can load all components of the Whisper model directly from the pretrained checkpoint, including the pretrained model weights, feature extractor and tokenizer. We simply have to specify our fine-tuning dataset and training hyperparameters.
 
 ### Single HPU Whisper Fine tuning with Seq2Seq
-The following example shows how to fine-tune the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using a single HPU device in bf16 precision:
+The following example shows how to fine-tune the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/regisss/common_voice_11_0_hi) using a single HPU device in bf16 precision:
 ```bash
 PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
     --model_name_or_path="openai/whisper-small" \
-    --dataset_name="mozilla-foundation/common_voice_11_0" \
-    --trust_remote_code \
-    --dataset_config_name="hi" \
+    --dataset_name="regisss/common_voice_11_0_hi" \
     --language="hindi" \
     --task="transcribe" \
     --train_split_name="train+validation" \
@@ -277,14 +275,12 @@ If training on a different language, you should be sure to change the `language`
 
 
 ### Multi HPU Whisper Training with Seq2Seq
-The following example shows how to fine-tune the [Whisper large](https://huggingface.co/openai/whisper-large) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 8 HPU devices in half-precision:
+The following example shows how to fine-tune the [Whisper large](https://huggingface.co/openai/whisper-large) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/regisss/common_voice_11_0_hi) using 8 HPU devices in half-precision:
 ```bash
 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_speech_recognition_seq2seq.py \
     --model_name_or_path="openai/whisper-large" \
-    --dataset_name="mozilla-foundation/common_voice_11_0" \
-    --trust_remote_code \
-    --dataset_config_name="hi" \
+    --dataset_name="regisss/common_voice_11_0_hi" \
     --language="hindi" \
     --task="transcribe" \
     --train_split_name="train+validation" \
@@ -317,14 +313,12 @@ PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
 
 #### Single HPU Seq2Seq Inference
 
-The following example shows how to do inference with the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 1 HPU devices in half-precision:
+The following example shows how to do inference with the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/regisss/common_voice_11_0_hi) using 1 HPU devices in half-precision:
 
 ```bash
 PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
     --model_name_or_path="openai/whisper-small" \
-    --dataset_name="mozilla-foundation/common_voice_11_0" \
-    --trust_remote_code \
-    --dataset_config_name="hi" \
+    --dataset_name="regisss/common_voice_11_0_hi" \
     --language="hindi" \
     --task="transcribe" \
     --eval_split_name="test" \
````
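The `--train_split_name="train+validation"` argument in the commands above uses the `datasets` split syntax, where `+` concatenates splits in order. A toy model of that semantics in plain Python — the split contents are invented, and the real library also supports slicing such as `train[:10%]`, which is not modeled here:

```python
# Toy model of the `datasets` split-concatenation syntax used by
# --train_split_name="train+validation": "+" joins splits in order.
# The split contents here are invented placeholders.
SPLITS = {
    "train": ["ex1", "ex2"],
    "validation": ["ex3"],
    "test": ["ex4"],
}

def resolve_split(spec, splits):
    """Concatenate the splits named in a '+'-separated spec."""
    rows = []
    for part in spec.split("+"):
        rows.extend(splits[part.strip()])
    return rows
```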

examples/text-generation/README.md

Lines changed: 2 additions & 3 deletions
````diff
@@ -215,19 +215,18 @@ You can also provide the name of a dataset from the Hugging Face Hub to perform
 
 By default, the first column in the dataset of type `string` will be used as prompts. You can also select the column you want with the argument `--column_name`.
 
-Here is an example with [JulesBelveze/tldr_news](https://huggingface.co/datasets/JulesBelveze/tldr_news):
+Here is an example with [dim/tldr_news](https://huggingface.co/datasets/dim/tldr_news):
 ```bash
 PT_HPU_LAZY_MODE=1 python run_generation.py \
     --model_name_or_path gpt2 \
     --batch_size 2 \
     --max_new_tokens 100 \
     --use_hpu_graphs \
     --use_kv_cache \
-    --dataset_name JulesBelveze/tldr_news \
+    --dataset_name dim/tldr_news \
     --column_name content \
     --bf16 \
     --sdp_on_bf16 \
-    --trust_remote_code
 ```
 
 > The prompt length is limited to 16 tokens. Prompts longer than this will be truncated.
````
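The default described in this hunk's context (use the first `string`-typed column unless `--column_name` is passed) can be sketched as a small selection rule. This is a toy reimplementation, not the script's actual code, and the schema below is a made-up approximation of a news dataset:

```python
# Toy reimplementation of the documented default, not the script's code:
# use --column_name when given, otherwise the first string-typed column.
def pick_prompt_column(features, column_name=None):
    if column_name is not None:
        return column_name
    for name, dtype in features.items():
        if dtype == "string":
            return name
    raise ValueError("no string-typed column found")

# Made-up schema loosely modeled on a news dataset.
FEATURES = {"headline": "string", "content": "string", "category": "int64"}
```

With this schema, the default would pick `headline`, which is why a command that wants article bodies passes `--column_name content` explicitly.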

examples/translation/README.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -103,7 +103,7 @@ The task of translation supports only custom JSONLINES files, with each line bei
 ```
 Here the languages are Romanian (`ro`) and English (`en`).
 
-If you want to use a pre-processed dataset that leads to high BLEU scores, but for the `en-de` language pair, you can use `--dataset_name stas/wmt14-en-de-pre-processed`, as follows:
+If you want to use a pre-processed dataset that leads to high BLEU scores, but for the `en-de` language pair, you can use `--dataset_name regisss/wmt14-en-de-pre-processed`, as follows:
 
 ```bash
 PT_HPU_LAZY_MODE=1 python run_translation.py \
@@ -113,7 +113,7 @@ PT_HPU_LAZY_MODE=1 python run_translation.py \
     --source_lang en \
     --target_lang de \
     --source_prefix "translate English to German: " \
-    --dataset_name stas/wmt14-en-de-pre-processed \
+    --dataset_name regisss/wmt14-en-de-pre-processed \
     --output_dir /tmp/tst-translation \
     --per_device_train_batch_size 4 \
     --per_device_eval_batch_size 4 \
````
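The hunk header above refers to the custom JSONLINES format ("each line being a dictionary"); in that format each line's text pairs live under a `translation` key mapping language codes to strings. A minimal sketch with an invented sentence pair:

```python
import json

# One record per line; the "translation" dict maps language codes to text.
# The sentence pair below is invented for illustration.
record = {"translation": {"en": "Others ignored him.", "ro": "Alții l-au ignorat."}}
line = json.dumps(record, ensure_ascii=False)

# Reading a line back recovers the same nested structure.
round_tripped = json.loads(line)
```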
