Commit 3547a26

add ultrafeedback and fineweb #4085 #4132
Former-commit-id: 12d79f8
1 parent de9e773 commit 3547a26

4 files changed: +38 -23 lines changed

.github/workflows/tests.yml

+1 -23
@@ -17,7 +17,7 @@ on:
       - ".github/workflows/*.yml"
 
 jobs:
-  check_code_quality:
+  tests:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
@@ -34,28 +34,6 @@ jobs:
       - name: Check quality
         run: |
           make style && make quality
-
-  pytest:
-    needs: check_code_quality
-    strategy:
-      matrix:
-        python-version:
-          - "3.8"
-        os:
-          - "ubuntu-latest"
-    runs-on: ${{ matrix.os }}
-    steps:
-      - uses: actions/checkout@v4
-      - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v5
-        with:
-          python-version: ${{ matrix.python-version }}
-          cache: "pip"
-          cache-dependency-path: "setup.py"
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          python -m pip install .[torch,dev]
       - name: Test with pytest
         run: |
           make test

README.md

+3
@@ -214,6 +214,8 @@ You also can add a custom chat template to [template.py](src/llamafactory/data/t
 - [Wikipedia (zh)](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered)
 - [Pile (en)](https://huggingface.co/datasets/EleutherAI/pile)
 - [SkyPile (zh)](https://huggingface.co/datasets/Skywork/SkyPile-150B)
+- [FineWeb (en)](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
+- [FineWeb-Edu (en)](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
 - [The Stack (en)](https://huggingface.co/datasets/bigcode/the-stack)
 - [StarCoder (en)](https://huggingface.co/datasets/bigcode/starcoderdata)
 
@@ -273,6 +275,7 @@ You also can add a custom chat template to [template.py](src/llamafactory/data/t
 <details><summary>Preference datasets</summary>
 
 - [DPO mixed (en&zh)](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k)
+- [UltraFeedback (en)](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)
 - [Orca DPO Pairs (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [HH-RLHF (en)](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)

README_zh.md

+3
@@ -214,6 +214,8 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 - [Wikipedia (zh)](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered)
 - [Pile (en)](https://huggingface.co/datasets/EleutherAI/pile)
 - [SkyPile (zh)](https://huggingface.co/datasets/Skywork/SkyPile-150B)
+- [FineWeb (en)](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
+- [FineWeb-Edu (en)](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
 - [The Stack (en)](https://huggingface.co/datasets/bigcode/the-stack)
 - [StarCoder (en)](https://huggingface.co/datasets/bigcode/starcoderdata)
 
@@ -273,6 +275,7 @@ https://github.com/hiyouga/LLaMA-Factory/assets/16256802/ec36a9dd-37f4-4f72-81bd
 <details><summary>偏好数据集</summary>
 
 - [DPO mixed (en&zh)](https://huggingface.co/datasets/hiyouga/DPO-En-Zh-20k)
+- [UltraFeedback (en)](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)
 - [Orca DPO Pairs (en)](https://huggingface.co/datasets/Intel/orca_dpo_pairs)
 - [HH-RLHF (en)](https://huggingface.co/datasets/Anthropic/hh-rlhf)
 - [Nectar (en)](https://huggingface.co/datasets/berkeley-nest/Nectar)

data/dataset_info.json

+31
@@ -391,6 +391,16 @@
       "rejected": "rejected"
     }
   },
+  "ultrafeedback": {
+    "hf_hub_url": "llamafactory/ultrafeedback_binarized",
+    "ms_hub_url": "llamafactory/ultrafeedback_binarized",
+    "ranking": true,
+    "columns": {
+      "prompt": "instruction",
+      "chosen": "chosen",
+      "rejected": "rejected"
+    }
+  },
   "orca_pairs": {
     "hf_hub_url": "Intel/orca_dpo_pairs",
     "ranking": true,
@@ -448,6 +458,15 @@
       "assistant_tag": "assistant"
     }
   },
+  "ultrafeedback_kto": {
+    "hf_hub_url": "argilla/ultrafeedback-binarized-preferences-cleaned-kto",
+    "ms_hub_url": "AI-ModelScope/ultrafeedback-binarized-preferences-cleaned-kto",
+    "columns": {
+      "prompt": "prompt",
+      "response": "completion",
+      "kto_tag": "label"
+    }
+  },
   "wiki_demo": {
     "file_name": "wiki_demo.txt",
     "columns": {
@@ -501,6 +520,18 @@
       "prompt": "text"
     }
   },
+  "fineweb": {
+    "hf_hub_url": "HuggingFaceFW/fineweb",
+    "columns": {
+      "prompt": "text"
+    }
+  },
+  "fineweb_edu": {
+    "hf_hub_url": "HuggingFaceFW/fineweb-edu",
+    "columns": {
+      "prompt": "text"
+    }
+  },
   "the_stack": {
     "hf_hub_url": "bigcode/the-stack",
     "ms_hub_url": "AI-ModelScope/the-stack",
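
Note on the new data/dataset_info.json entries: the "columns" objects appear to map a role name (for example "prompt") to the column name used in the source dataset (for example "instruction"). The sketch below illustrates that reading with two entries copied from this commit. The helper `apply_columns` and the sample records are hypothetical and are not part of LLaMA-Factory; the role-to-source direction of the mapping is an assumption.

```python
import json

# Two entries copied from the data/dataset_info.json hunks in this commit.
DATASET_INFO = json.loads("""
{
  "ultrafeedback": {
    "hf_hub_url": "llamafactory/ultrafeedback_binarized",
    "ms_hub_url": "llamafactory/ultrafeedback_binarized",
    "ranking": true,
    "columns": {"prompt": "instruction", "chosen": "chosen", "rejected": "rejected"}
  },
  "ultrafeedback_kto": {
    "hf_hub_url": "argilla/ultrafeedback-binarized-preferences-cleaned-kto",
    "ms_hub_url": "AI-ModelScope/ultrafeedback-binarized-preferences-cleaned-kto",
    "columns": {"prompt": "prompt", "response": "completion", "kto_tag": "label"}
  }
}
""")


def apply_columns(record, columns):
    """Hypothetical helper: rename a raw record's fields to role names.

    Assumes each "columns" entry reads as role -> source column name.
    """
    missing = [src for src in columns.values() if src not in record]
    if missing:
        raise KeyError(f"record is missing source columns: {missing}")
    return {role: record[src] for role, src in columns.items()}


# Made-up records for illustration only; these are not real dataset rows.
raw_pair = {"instruction": "Say hi.", "chosen": "Hello!", "rejected": "No."}
raw_kto = {"prompt": "Say hi.", "completion": "Hello!", "label": True}

pair = apply_columns(raw_pair, DATASET_INFO["ultrafeedback"]["columns"])
kto = apply_columns(raw_kto, DATASET_INFO["ultrafeedback_kto"]["columns"])
print(pair)
print(kto)
```

Under this reading, the pairwise entry (with "ranking": true) yields prompt/chosen/rejected fields, while the KTO entry yields a single response plus a boolean tag.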
