
Commit 3f43050
Author: Grzegorz Pluto-Prondzinski
Bump datasets to ≥4.0.0 across examples; keep LM-Eval on <4.0.0 (#2250)
Parent: 67629d3

55 files changed: 98 additions & 119 deletions

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,4 +1,4 @@
-datasets[audio]>=1.14.0
+datasets[audio]>=4.0.0
 evaluate
 numba==0.60.0
 librosa
```

examples/audio-classification/run_audio_classification.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -48,7 +48,7 @@ def check_optimum_habana_min_version(*a, **b):
 check_min_version("4.55.0")
 check_optimum_habana_min_version("1.19.0.dev0")
 
-require_version("datasets>=1.14.0", "To fix: pip install -r examples/pytorch/audio-classification/requirements.txt")
+require_version("datasets>=4.0.0", "To fix: pip install -r examples/pytorch/audio-classification/requirements.txt")
 
 
 def random_subsample(wav: np.ndarray, max_length: float, sample_rate: int = 16000):
```
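The hunk above ends at the signature of `random_subsample`. As a sketch of what such a helper does (one plausible body for illustration, not necessarily the exact code in the example script), it crops a waveform to at most `max_length` seconds at a random offset:

```python
import numpy as np

def random_subsample(wav: np.ndarray, max_length: float, sample_rate: int = 16000) -> np.ndarray:
    """Randomly crop `wav` to at most `max_length` seconds of audio."""
    sample_length = int(round(sample_rate * max_length))
    if len(wav) <= sample_length:
        # Clips already shorter than the target are returned unchanged.
        return wav
    start = np.random.randint(0, len(wav) - sample_length)
    return wav[start : start + sample_length]
```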

examples/contrastive-image-text/README.md

Lines changed: 11 additions & 32 deletions

````diff
@@ -30,30 +30,15 @@ First, you should install the requirements:
 pip install -r requirements.txt
 ```
 
-## Download COCO dataset (2017)
-This example uses COCO dataset (2017) through a custom dataset script, which requires users to manually download the
-COCO dataset before training.
-
-```bash
-mkdir data
-cd data
-wget http://images.cocodataset.org/zips/train2017.zip
-wget http://images.cocodataset.org/zips/val2017.zip
-wget http://images.cocodataset.org/zips/test2017.zip
-wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
-wget http://images.cocodataset.org/annotations/image_info_test2017.zip
-cd ..
-```
-
-Having downloaded COCO dataset manually you should be able to load with the `ydshieh/coco_dataset_script` dataset loading script:
+## Dataset
 
+**Recommended (datasets>=4.0.0):** use the COCO captions dataset hosted on the Hub. It provides image–caption pairs and does **not** require `trust_remote_code`:
 ```python
-import os
 import datasets
-
-COCO_DIR = os.path.join(os.getcwd(), "data")
-ds = datasets.load_dataset("ydshieh/coco_dataset_script", "2017", data_dir=COCO_DIR)
+ds = datasets.load_dataset("sentence-transformers/coco-captions", split="train")
 ```
+This dataset exposes at least the columns `image` (PIL image) and `caption` (string).
+If you prefer local files, you can also use the built-in Datasets `imagefolder` builder (not a placeholder) to load images/captions from a directory (it typically expects a small CSV/JSON with columns such as `image_path` and `caption`).
 
 ## CLIP-like models
 
@@ -99,10 +84,8 @@ Run the following command for single-device training:
 python run_clip.py \
     --output_dir ./clip-roberta-finetuned \
     --model_name_or_path ./clip-roberta \
-    --data_dir $PWD/data \
-    --dataset_name ydshieh/coco_dataset_script \
-    --dataset_config_name=2017 \
-    --image_column image_path \
+    --dataset_name sentence-transformers/coco-captions \
+    --image_column image \
     --caption_column caption \
     --remove_unused_columns=False \
     --do_train --do_eval \
@@ -132,10 +115,8 @@ PT_ENABLE_INT64_SUPPORT=1 \
 python3 ../gaudi_spawn.py --world_size 8 --use_mpi run_clip.py \
     --output_dir=/tmp/clip_roberta \
     --model_name_or_path=./clip-roberta \
-    --data_dir $PWD/data \
-    --dataset_name ydshieh/coco_dataset_script \
-    --dataset_config_name 2017 \
-    --image_column image_path \
+    --dataset_name sentence-transformers/coco-captions \
+    --image_column image \
     --caption_column caption \
     --remove_unused_columns=False \
     --do_train --do_eval \
@@ -209,10 +190,8 @@ For instance, you can run inference with CLIP on COCO on 1 Gaudi card with the f
 PT_HPU_LAZY_MODE=1 python run_clip.py \
     --output_dir ./clip-roberta-finetuned \
     --model_name_or_path ./clip-roberta \
-    --data_dir $PWD/data \
-    --dataset_name ydshieh/coco_dataset_script \
-    --dataset_config_name=2017 \
-    --image_column image_path \
+    --dataset_name sentence-transformers/coco-captions \
+    --image_column image \
     --caption_column caption \
     --remove_unused_columns=False \
     --do_eval \
````
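The README text in this commit also mentions the `imagefolder` builder as a local-files alternative. As a hedged sketch of how such a directory could be prepared (the file names and captions below are made up; note that the builder's `metadata.csv` conventionally keys images on a `file_name` column, so check the Datasets docs rather than relying on this sketch):

```python
import csv
import os
import tempfile

# Hypothetical layout for the Datasets `imagefolder` builder:
#   <root>/train/metadata.csv  +  <root>/train/*.jpg
# Extra metadata columns (here `caption`) become dataset features.
root = tempfile.mkdtemp()
train_dir = os.path.join(root, "train")
os.makedirs(train_dir, exist_ok=True)

rows = [
    {"file_name": "0001.jpg", "caption": "a dog running on grass"},
    {"file_name": "0002.jpg", "caption": "two people riding bicycles"},
]
with open(os.path.join(train_dir, "metadata.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "caption"])
    writer.writeheader()
    writer.writerows(rows)

# With the referenced image files in place, loading would then be
# (requires the `datasets` package installed):
# from datasets import load_dataset
# ds = load_dataset("imagefolder", data_dir=root)
```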
Lines changed: 1 addition & 1 deletion

```diff
@@ -1 +1 @@
-datasets>=1.8.0
+datasets>=4.0.0
```

examples/contrastive-image-text/run_bridgetower.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -60,7 +60,7 @@ def check_optimum_habana_min_version(*a, **b):
 check_min_version("4.55.0")
 check_optimum_habana_min_version("1.19.0.dev0")
 
-require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/contrastive-image-text/requirements.txt")
+require_version("datasets>=4.0.0", "To fix: pip install -r examples/pytorch/contrastive-image-text/requirements.txt")
 
 
 @dataclass
```

examples/contrastive-image-text/run_clip.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -63,7 +63,7 @@ def check_optimum_habana_min_version(*a, **b):
 check_min_version("4.55.0")
 check_optimum_habana_min_version("1.19.0.dev0")
 
-require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/contrastive-image-text/requirements.txt")
+require_version("datasets>=4.0.0", "To fix: pip install -r examples/pytorch/contrastive-image-text/requirements.txt")
 
 
 @dataclass
```
Lines changed: 1 addition & 1 deletion

```diff
@@ -1,6 +1,6 @@
 torch>=1.5.0
 torchvision>=0.6.0
-datasets>=2.14.0
+datasets>=4.0.0
 evaluate
 scikit-learn == 1.5.2
 timm>=0.9.16
```

examples/image-classification/run_image_classification.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -66,7 +66,7 @@ def check_optimum_habana_min_version(*a, **b):
 check_min_version("4.55.0")
 check_optimum_habana_min_version("1.19.0.dev0")
 
-require_version("datasets>=2.14.0", "To fix: pip install -r examples/pytorch/image-classification/requirements.txt")
+require_version("datasets>=4.0.0", "To fix: pip install -r examples/pytorch/image-classification/requirements.txt")
 
 MODEL_CONFIG_CLASSES = list(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING.keys())
 MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
```
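The `require_version` guard being bumped in these hunks comes from transformers' version utilities. As a simplified stdlib-only sketch of the pattern (the helper name `require_min_version` and the `>=`-only parsing are assumptions for illustration; the real helper supports full version specifiers):

```python
from importlib import metadata

def require_min_version(requirement: str, hint: str) -> None:
    """Simplified `pkg>=X.Y.Z` check; raises ImportError with a fix hint."""
    name, minimum = requirement.split(">=")
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        raise ImportError(f"{name} is not installed. {hint}")
    # Naive numeric comparison on the leading X.Y.Z components only.
    as_tuple = lambda v: tuple(int(p) for p in v.split(".")[:3] if p.isdigit())
    if as_tuple(installed) < as_tuple(minimum):
        raise ImportError(f"{requirement} is required, found {name}=={installed}. {hint}")
```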

examples/image-to-text/requirements.txt

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,4 +3,4 @@ Levenshtein
 sentencepiece != 0.1.92
 tiktoken
 blobfile
-datasets
+datasets>=4.0.0
```

examples/language-modeling/peft_poly_seq2seq_with_generate.py

Lines changed: 5 additions & 5 deletions

```diff
@@ -61,7 +61,7 @@ def check_optimum_habana_min_version(*a, **b):
 check_min_version("4.38.0")
 check_optimum_habana_min_version("1.10.0")
 
-require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt")
+require_version("datasets>=4.0.0", "To fix: pip install -r examples/pytorch/language-modeling/requirements.txt")
 
 
 @dataclass
@@ -233,7 +233,7 @@ def main():
 
     # boolq
     boolq_dataset = (
-        load_dataset("super_glue", "boolq", trust_remote_code=model_args.trust_remote_code)
+        load_dataset("super_glue", "boolq")
         .map(
             lambda x: {
                 "input": f"{x['passage']}\nQuestion: {x['question']}\nA. Yes\nB. No\nAnswer:",
@@ -248,7 +248,7 @@ def main():
 
     # multirc
     multirc_dataset = (
-        load_dataset("super_glue", "multirc", trust_remote_code=model_args.trust_remote_code)
+        load_dataset("super_glue", "multirc")
         .map(
             lambda x: {
                 "input": (
@@ -266,7 +266,7 @@ def main():
 
     # rte
     rte_dataset = (
-        load_dataset("super_glue", "rte", trust_remote_code=model_args.trust_remote_code)
+        load_dataset("super_glue", "rte")
         .map(
             lambda x: {
                 "input": (
@@ -284,7 +284,7 @@ def main():
 
     # wic
     wic_dataset = (
-        load_dataset("super_glue", "wic", trust_remote_code=model_args.trust_remote_code)
+        load_dataset("super_glue", "wic")
        .map(
             lambda x: {
                 "input": (
```
