Commit 5117a3f

Author: Grzegorz Pluto-Prondzinski
Update dataset names in README files (#2330)
1 parent 8a53c69 · commit 5117a3f

File tree

5 files changed (+21 −36 lines)

examples/audio-classification/README.md

Lines changed: 7 additions & 9 deletions
````diff
@@ -29,13 +29,13 @@ pip install -r requirements.txt
 
 ## Single-HPU
 
-The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on the 🗣️ [Keyword Spotting subset](https://huggingface.co/datasets/superb#ks) of the SUPERB dataset on a single HPU.
+The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on the 🗣️ [Keyword Spotting subset](https://huggingface.co/datasets/regisss/superb_ks) of the SUPERB dataset on a single HPU.
 
 ```bash
 PT_HPU_LAZY_MODE=1 python run_audio_classification.py \
     --model_name_or_path facebook/wav2vec2-base \
-    --dataset_name superb \
-    --dataset_config_name ks \
+    --dataset_name regisss/superb_ks \
+    --dataset_config_name default \
     --output_dir /tmp/wav2vec2-base-ft-keyword-spotting \
     --overwrite_output_dir \
     --remove_unused_columns False \
@@ -58,7 +58,6 @@ PT_HPU_LAZY_MODE=1 python run_audio_classification.py \
     --throughput_warmup_steps 3 \
     --sdp_on_bf16 \
     --bf16 \
-    --trust_remote_code True \
     --attn_implementation gaudi_fused_sdpa
 ```
 
@@ -69,13 +68,13 @@ On a single HPU, this script should run in ~13 minutes and yield an accuracy of
 
 ## Multi-HPU
 
-The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) for 🌎 **Language Identification** on the [CommonLanguage dataset](https://huggingface.co/datasets/anton-l/common_language) on 8 HPUs.
+The following command shows how to fine-tune [wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) for 🌎 **Language Identification** on the [CommonLanguage dataset](https://huggingface.co/datasets/regisss/common_language) on 8 HPUs.
 
 ```bash
 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_audio_classification.py \
     --model_name_or_path facebook/wav2vec2-base \
-    --dataset_name common_language \
+    --dataset_name regisss/common_language \
     --audio_column_name audio \
     --label_column_name language \
     --output_dir /tmp/wav2vec2-base-lang-id \
@@ -117,8 +116,8 @@ For instance, you can run inference with Wav2Vec2 on the Keyword Spotting subset
 ```bash
 PT_HPU_LAZY_MODE=1 python run_audio_classification.py \
     --model_name_or_path facebook/wav2vec2-base \
-    --dataset_name superb \
-    --dataset_config_name ks \
+    --dataset_name regisss/superb_ks \
+    --dataset_config_name default \
     --output_dir /tmp/wav2vec2-base-ft-keyword-spotting \
     --overwrite_output_dir \
     --remove_unused_columns False \
@@ -133,7 +132,6 @@ PT_HPU_LAZY_MODE=1 python run_audio_classification.py \
     --use_hpu_graphs_for_inference \
     --throughput_warmup_steps 3 \
     --gaudi_config_name Habana/wav2vec2 \
-    --trust_remote_code
 ```
 
 
````
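Every hunk above follows the same mechanical pattern: a script-backed dataset (which required `--trust_remote_code`) is swapped for a data-only mirror under the `regisss` namespace whose single config is `default`, and the trust flag is dropped. A rough sketch of that rewrite as a flag-list transformation — the mapping table is read off this diff and is illustrative, not exhaustive, and the helper name is invented:

```python
# Sketch of the rewrite this commit applies to the example commands: swap
# script-backed dataset flags for their data-only mirrors and drop
# --trust_remote_code. The mapping is read off this diff, not exhaustive.
REMAP = {
    ("superb", "ks"): ("regisss/superb_ks", "default"),
    ("common_language", None): ("regisss/common_language", None),
}

def rewrite_flags(tokens):
    """Return the flag list with dataset names remapped and trust flags removed."""
    out, kv, i = [], {}, 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("--dataset_name", "--dataset_config_name"):
            kv[tok] = tokens[i + 1]
            i += 2
        elif tok == "--trust_remote_code":
            # The flag may appear bare or with an explicit True/False value.
            i += 2 if i + 1 < len(tokens) and tokens[i + 1] in ("True", "False") else 1
        else:
            out.append(tok)
            i += 1
    key = (kv.get("--dataset_name"), kv.get("--dataset_config_name"))
    name, config = REMAP.get(key, key)
    if name is not None:
        out += ["--dataset_name", name]
    if config is not None:
        out += ["--dataset_config_name", config]
    return out
```

Applied to the first hunk's flags, this turns `--dataset_name superb --dataset_config_name ks --trust_remote_code True` into `--dataset_name regisss/superb_ks --dataset_config_name default`.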
examples/contrastive-image-text/README.md

Lines changed: 4 additions & 10 deletions
````diff
@@ -35,7 +35,7 @@ pip install -r requirements.txt
 **Recommended (datasets>=4.0.0):** use the COCO captions dataset hosted on the Hub. It provides image–caption pairs and does **not** require `trust_remote_code`:
 ```python
 import datasets
-ds = datasets.load_dataset("sentence-transformers/coco-captions", split="train")
+ds = datasets.load_dataset("regisss/coco_2017", split="train")
 ```
 This dataset exposes at least the columns `image` (PIL image) and `caption` (string).
 If you prefer local files, you can also use the built-in Datasets `imagefolder` builder (not a placeholder) to load images/captions from a directory (it typically expects a small CSV/JSON with columns such as `image_path` and `caption`).
@@ -84,7 +84,7 @@ Run the following command for single-device training:
 python run_clip.py \
     --output_dir ./clip-roberta-finetuned \
     --model_name_or_path ./clip-roberta \
-    --dataset_name sentence-transformers/coco-captions \
+    --dataset_name regisss/coco_2017 \
     --image_column image \
     --caption_column caption \
     --remove_unused_columns=False \
@@ -100,7 +100,6 @@ python run_clip.py \
     --dataloader_num_workers 16 \
     --sdp_on_bf16 \
     --bf16 \
-    --trust_remote_code \
     --torch_compile_backend=hpu_backend \
     --torch_compile
 ```
@@ -115,7 +114,7 @@ PT_ENABLE_INT64_SUPPORT=1 \
 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_clip.py \
     --output_dir=/tmp/clip_roberta \
     --model_name_or_path=./clip-roberta \
-    --dataset_name sentence-transformers/coco-captions \
+    --dataset_name regisss/coco_2017 \
     --image_column image \
     --caption_column caption \
     --remove_unused_columns=False \
@@ -136,7 +135,6 @@ python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_clip.py \
     --torch_compile_backend=hpu_backend \
     --torch_compile \
     --logging_nan_inf_filter \
-    --trust_remote_code
 ```
 
 > [!NOTE]
@@ -159,7 +157,6 @@ PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridget
     --output_dir /tmp/bridgetower-test \
     --model_name_or_path BridgeTower/bridgetower-large-itm-mlm-itc \
     --dataset_name jmhessel/newyorker_caption_contest --dataset_config_name matching \
-    --dataset_revision 3c6c4f6c0ff7e902833d3afa5f8f3875c2b036e6 \
     --image_column image --caption_column image_description \
     --remove_unused_columns=False \
     --do_train --do_eval --do_predict \
@@ -172,8 +169,6 @@ PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridget
     --throughput_warmup_steps 3 \
     --logging_steps 10 \
     --dataloader_num_workers 1 \
-    --mediapipe_dataloader \
-    --trust_remote_code \
     --sdp_on_bf16
 ```
 
@@ -190,7 +185,7 @@ For instance, you can run inference with CLIP on COCO on 1 Gaudi card with the f
 PT_HPU_LAZY_MODE=1 python run_clip.py \
     --output_dir ./clip-roberta-finetuned \
     --model_name_or_path ./clip-roberta \
-    --dataset_name sentence-transformers/coco-captions \
+    --dataset_name regisss/coco_2017 \
     --image_column image \
     --caption_column caption \
     --remove_unused_columns=False \
@@ -204,7 +199,6 @@ PT_HPU_LAZY_MODE=1 python run_clip.py \
     --bf16 \
     --sdp_on_bf16 \
     --mediapipe_dataloader \
-    --trust_remote_code
 ```
 
 > [!NOTE]
````
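The README text in the first hunk mentions the `imagefolder` builder as a local-files alternative. A minimal sketch of the directory layout it reads — note that `imagefolder` keys its metadata on a `file_name` column (the `image_path`/`caption` CSV mentioned there is the same idea); the file names and captions below are made up:

```python
import json
import os
import tempfile

# Sketch: a directory layout the `datasets` imagefolder builder can read.
# imagefolder keys its metadata on a `file_name` column; any extra columns
# (here `caption`) become dataset columns. File names and captions are made up.
root = tempfile.mkdtemp()
rows = [
    {"file_name": "cat.jpg", "caption": "a cat on a sofa"},
    {"file_name": "dog.jpg", "caption": "a dog in the park"},
]
with open(os.path.join(root, "metadata.jsonl"), "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# With real image files placed next to metadata.jsonl, loading would then be:
# ds = datasets.load_dataset("imagefolder", data_dir=root, split="train")
```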

examples/speech-recognition/README.md

Lines changed: 6 additions & 12 deletions
````diff
@@ -231,13 +231,11 @@ recognition on one of the well known speech recognition datasets similar to show
 We can load all components of the Whisper model directly from the pretrained checkpoint, including the pretrained model weights, feature extractor and tokenizer. We simply have to specify our fine-tuning dataset and training hyperparameters.
 
 ### Single HPU Whisper Fine tuning with Seq2Seq
-The following example shows how to fine-tune the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using a single HPU device in bf16 precision:
+The following example shows how to fine-tune the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/regisss/common_voice_11_0_hi) using a single HPU device in bf16 precision:
 ```bash
 PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
     --model_name_or_path="openai/whisper-small" \
-    --dataset_name="mozilla-foundation/common_voice_11_0" \
-    --trust_remote_code \
-    --dataset_config_name="hi" \
+    --dataset_name="regisss/common_voice_11_0_hi" \
     --language="hindi" \
     --task="transcribe" \
     --train_split_name="train+validation" \
@@ -277,14 +275,12 @@ If training on a different language, you should be sure to change the `language`
 
 
 ### Multi HPU Whisper Training with Seq2Seq
-The following example shows how to fine-tune the [Whisper large](https://huggingface.co/openai/whisper-large) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 8 HPU devices in half-precision:
+The following example shows how to fine-tune the [Whisper large](https://huggingface.co/openai/whisper-large) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/regisss/common_voice_11_0_hi) using 8 HPU devices in half-precision:
 ```bash
 PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
     --world_size 8 --use_mpi run_speech_recognition_seq2seq.py \
     --model_name_or_path="openai/whisper-large" \
-    --dataset_name="mozilla-foundation/common_voice_11_0" \
-    --trust_remote_code \
-    --dataset_config_name="hi" \
+    --dataset_name="regisss/common_voice_11_0_hi" \
     --language="hindi" \
     --task="transcribe" \
     --train_split_name="train+validation" \
@@ -317,14 +313,12 @@ PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
 
 #### Single HPU Seq2Seq Inference
 
-The following example shows how to do inference with the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 1 HPU devices in half-precision:
+The following example shows how to do inference with the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/regisss/common_voice_11_0_hi) using 1 HPU devices in half-precision:
 
 ```bash
 PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
     --model_name_or_path="openai/whisper-small" \
-    --dataset_name="mozilla-foundation/common_voice_11_0" \
-    --trust_remote_code \
-    --dataset_config_name="hi" \
+    --dataset_name="regisss/common_voice_11_0_hi" \
     --language="hindi" \
     --task="transcribe" \
     --eval_split_name="test" \
````
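The `--train_split_name="train+validation"` argument in the commands above uses the `datasets` split syntax, where `+` concatenates splits in order. A toy model of that semantics in plain Python — the split contents are invented, and the real library also supports slicing such as `train[:10%]`, which is not modeled here:

```python
# Toy model of the `datasets` split-concatenation syntax used by
# --train_split_name="train+validation": "+" joins splits in order.
# The split contents here are invented placeholders.
SPLITS = {
    "train": ["ex1", "ex2"],
    "validation": ["ex3"],
    "test": ["ex4"],
}

def resolve_split(spec, splits):
    """Concatenate the splits named in a '+'-separated spec."""
    rows = []
    for part in spec.split("+"):
        rows.extend(splits[part.strip()])
    return rows
```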

examples/text-generation/README.md

Lines changed: 2 additions & 3 deletions
````diff
@@ -215,19 +215,18 @@ You can also provide the name of a dataset from the Hugging Face Hub to perform
 
 By default, the first column in the dataset of type `string` will be used as prompts. You can also select the column you want with the argument `--column_name`.
 
-Here is an example with [JulesBelveze/tldr_news](https://huggingface.co/datasets/JulesBelveze/tldr_news):
+Here is an example with [dim/tldr_news](https://huggingface.co/datasets/dim/tldr_news):
 ```bash
 PT_HPU_LAZY_MODE=1 python run_generation.py \
     --model_name_or_path gpt2 \
     --batch_size 2 \
     --max_new_tokens 100 \
     --use_hpu_graphs \
     --use_kv_cache \
-    --dataset_name JulesBelveze/tldr_news \
+    --dataset_name dim/tldr_news \
     --column_name content \
     --bf16 \
     --sdp_on_bf16 \
-    --trust_remote_code
 ```
 
 > The prompt length is limited to 16 tokens. Prompts longer than this will be truncated.
````
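The default described in this hunk's context (use the first `string`-typed column unless `--column_name` is passed) can be sketched as a small selection rule. This is a toy reimplementation, not the script's actual code, and the schema below is a made-up approximation of a news dataset:

```python
# Toy reimplementation of the documented default, not the script's code:
# use --column_name when given, otherwise the first string-typed column.
def pick_prompt_column(features, column_name=None):
    if column_name is not None:
        return column_name
    for name, dtype in features.items():
        if dtype == "string":
            return name
    raise ValueError("no string-typed column found")

# Made-up schema loosely modeled on a news dataset.
FEATURES = {"headline": "string", "content": "string", "category": "int64"}
```

With this schema, the default would pick `headline`, which is why a command that wants article bodies passes `--column_name content` explicitly.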

examples/translation/README.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -103,7 +103,7 @@ The task of translation supports only custom JSONLINES files, with each line bei
 ```
 Here the languages are Romanian (`ro`) and English (`en`).
 
-If you want to use a pre-processed dataset that leads to high BLEU scores, but for the `en-de` language pair, you can use `--dataset_name stas/wmt14-en-de-pre-processed`, as follows:
+If you want to use a pre-processed dataset that leads to high BLEU scores, but for the `en-de` language pair, you can use `--dataset_name regisss/wmt14-en-de-pre-processed`, as follows:
 
 ```bash
 PT_HPU_LAZY_MODE=1 python run_translation.py \
@@ -113,7 +113,7 @@ PT_HPU_LAZY_MODE=1 python run_translation.py \
     --source_lang en \
     --target_lang de \
     --source_prefix "translate English to German: " \
-    --dataset_name stas/wmt14-en-de-pre-processed \
+    --dataset_name regisss/wmt14-en-de-pre-processed \
     --output_dir /tmp/tst-translation \
     --per_device_train_batch_size 4 \
     --per_device_eval_batch_size 4 \
````
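The hunk header above refers to the custom JSONLINES format ("each line being a dictionary"); in that format each line's text pairs live under a `translation` key mapping language codes to strings. A minimal sketch with an invented sentence pair:

```python
import json

# One record per line; the "translation" dict maps language codes to text.
# The sentence pair below is invented for illustration.
record = {"translation": {"en": "Others ignored him.", "ro": "Alții l-au ignorat."}}
line = json.dumps(record, ensure_ascii=False)

# Reading a line back recovers the same nested structure.
round_tripped = json.loads(line)
```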
