Merged
113 changes: 101 additions & 12 deletions README-zh.md
@@ -64,7 +64,7 @@ DataFlex integrates seamlessly with LLaMA-Factory, providing researchers and developers with more
| Method | Category | Requires Model-in-the-Loop? | Official Repo |
|:----:|:----:|:-------------------------------------:|:-------------:|
| **DOREMI** | Offline Mixture | ✅ Yes | ⚠️[official code](https://github.com/sangmichaelxie/doremi) |
| **ODM** | Online Mixture | ✅ Yes | |
| **ODM** | Online Mixture | ✅ Yes | ⚠️[official code](https://github.com/alon-albalak/online-data-mixing) |
</div>

- **Dynamic Weight Trainer (dynamic sample-weighting trainer)**:
@@ -114,17 +114,106 @@ dataflex-cli train examples/train_lora/selectors/less.yaml

<div align="center">

| | Acc ↑ | | Perplexity (PPL) ↓ | | | | | |
|:------:|:--------:|:------:|:---:|:---:|:---:|:---:|:---:|:---:|
| **Method** | **MMLU** | **ALL** | **CC** | **C4** | **SE** | **Wiki** | **GitHub** | **ArXiv** | **Book** |
| | | | **Slim-Pajama-6B** | | | | | |
| Baseline | 25.27 | 4.217 | 4.278 | 4.532 | 3.402 | **3.546** | **2.640** | 3.508 | 4.778 |
| DoReMi | 25.84 | **4.134** | **4.108** | **4.358** | 3.788 | 3.997 | 3.420 | 3.413 | 4.661 |
| ODM | **26.04** | 4.244 | 4.326 | 4.555 | **3.243** | 3.699 | 2.704 | **2.904** | **4.613** |
| | | | **Slim-Pajama-30B** | | | | | |
| Baseline | 25.51 | 3.584 | 3.723 | 3.505 | 2.850 | 3.215 | 3.163 | 4.540 | 5.329 |
| DoReMi | **25.97** | 3.562 | 3.731 | **3.503** | 2.706 | 2.985 | 2.973 | 4.441 | 5.214 |
| ODM | 25.63 | **3.429** | **3.598** | 3.519 | **2.382** | **2.713** | **2.255** | **3.487** | **4.746** |
<table>
<thead>
<tr>
<th rowspan="2">Method</th>
<th colspan="1">Acc ↑</th>
<th colspan="8">Perplexity (PPL) ↓</th>
</tr>
<tr>
<th>MMLU</th>
<th>ALL</th>
<th>CC</th>
<th>C4</th>
<th>SE</th>
<th>Wiki</th>
<th>GitHub</th>
<th>ArXiv</th>
<th>Book</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="10"><b>Slim-Pajama-6B</b></td>
</tr>
<tr>
<td>Baseline</td>
<td>25.27</td>
<td>4.217</td>
<td>4.278</td>
<td>4.532</td>
<td>3.402</td>
<td><b>3.546</b></td>
<td><b>2.640</b></td>
<td>3.508</td>
<td>4.778</td>
</tr>
<tr>
<td>DoReMi</td>
<td>25.84</td>
<td><b>4.134</b></td>
<td><b>4.108</b></td>
<td><b>4.358</b></td>
<td>3.788</td>
<td>3.997</td>
<td>3.420</td>
<td>3.413</td>
<td>4.661</td>
</tr>
<tr>
<td>ODM</td>
<td><b>26.04</b></td>
<td>4.244</td>
<td>4.326</td>
<td>4.555</td>
<td><b>3.243</b></td>
<td>3.699</td>
<td>2.704</td>
<td><b>2.904</b></td>
<td><b>4.613</b></td>
</tr>
<tr>
<td colspan="10"><b>Slim-Pajama-30B</b></td>
</tr>
<tr>
<td>Baseline</td>
<td>25.51</td>
<td>3.584</td>
<td>3.723</td>
<td>3.505</td>
<td>2.850</td>
<td>3.215</td>
<td>3.163</td>
<td>4.540</td>
<td>5.329</td>
</tr>
<tr>
<td>DoReMi</td>
<td><b>25.97</b></td>
<td>3.562</td>
<td>3.731</td>
<td><b>3.503</b></td>
<td>2.706</td>
<td>2.985</td>
<td>2.973</td>
<td>4.441</td>
<td>5.214</td>
</tr>
<tr>
<td>ODM</td>
<td>25.63</td>
<td><b>3.429</b></td>
<td><b>3.598</b></td>
<td>3.519</td>
<td><b>2.382</b></td>
<td><b>2.713</b></td>
<td><b>2.255</b></td>
<td><b>3.487</b></td>
<td><b>4.746</b></td>
</tr>
</tbody>
</table>

</div>

113 changes: 101 additions & 12 deletions README.md
@@ -64,7 +64,7 @@ We summarize repositories related to Data Selection, Data Mixture, and Data Rewe
| Method | Category | Requires Model-in-the-Loop? | Official Repo |
|:------:|:--------:|:---------------------------:|:-------------:|
| **DOREMI** | Offline Mixture | ✅ Yes | ⚠️[official code](https://github.com/sangmichaelxie/doremi) |
| **ODM** | Online Mixture | ✅ Yes | |
| **ODM** | Online Mixture | ✅ Yes | ⚠️[official code](https://github.com/alon-albalak/online-data-mixing) |

</div>

@@ -117,17 +117,106 @@ We use subsets of [SlimPajama-627B](https://huggingface.co/datasets/cerebras/Sli

<div align="center">

| | Acc ↑ | | Perplexity (PPL) ↓ | | | | | |
|:------:|:--------:|:------:|:---:|:---:|:---:|:---:|:---:|:---:|
| **Method** | **MMLU** | **ALL** | **CC** | **C4** | **SE** | **Wiki** | **GitHub** | **ArXiv** | **Book** |
| | | | **Slim-Pajama-6B** | | | | | |
| Baseline | 25.27 | 4.217 | 4.278 | 4.532 | 3.402 | **3.546** | **2.640** | 3.508 | 4.778 |
| DoReMi | 25.84 | **4.134** | **4.108** | **4.358** | 3.788 | 3.997 | 3.420 | 3.413 | 4.661 |
| ODM | **26.04** | 4.244 | 4.326 | 4.555 | **3.243** | 3.699 | 2.704 | **2.904** | **4.613** |
| | | | **Slim-Pajama-30B** | | | | | |
| Baseline | 25.51 | 3.584 | 3.723 | 3.505 | 2.850 | 3.215 | 3.163 | 4.540 | 5.329 |
| DoReMi | **25.97** | 3.562 | 3.731 | **3.503** | 2.706 | 2.985 | 2.973 | 4.441 | 5.214 |
| ODM | 25.63 | **3.429** | **3.598** | 3.519 | **2.382** | **2.713** | **2.255** | **3.487** | **4.746** |
<table>
<thead>
<tr>
<th rowspan="2">Method</th>
<th colspan="1">Acc ↑</th>
<th colspan="8">Perplexity (PPL) ↓</th>
</tr>
<tr>
<th>MMLU</th>
<th>ALL</th>
<th>CC</th>
<th>C4</th>
<th>SE</th>
<th>Wiki</th>
<th>GitHub</th>
<th>ArXiv</th>
<th>Book</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="10"><b>Slim-Pajama-6B</b></td>
</tr>
<tr>
<td>Baseline</td>
<td>25.27</td>
<td>4.217</td>
<td>4.278</td>
<td>4.532</td>
<td>3.402</td>
<td><b>3.546</b></td>
<td><b>2.640</b></td>
<td>3.508</td>
<td>4.778</td>
</tr>
<tr>
<td>DoReMi</td>
<td>25.84</td>
<td><b>4.134</b></td>
<td><b>4.108</b></td>
<td><b>4.358</b></td>
<td>3.788</td>
<td>3.997</td>
<td>3.420</td>
<td>3.413</td>
<td>4.661</td>
</tr>
<tr>
<td>ODM</td>
<td><b>26.04</b></td>
<td>4.244</td>
<td>4.326</td>
<td>4.555</td>
<td><b>3.243</b></td>
<td>3.699</td>
<td>2.704</td>
<td><b>2.904</b></td>
<td><b>4.613</b></td>
</tr>
<tr>
<td colspan="10"><b>Slim-Pajama-30B</b></td>
</tr>
<tr>
<td>Baseline</td>
<td>25.51</td>
<td>3.584</td>
<td>3.723</td>
<td>3.505</td>
<td>2.850</td>
<td>3.215</td>
<td>3.163</td>
<td>4.540</td>
<td>5.329</td>
</tr>
<tr>
<td>DoReMi</td>
<td><b>25.97</b></td>
<td>3.562</td>
<td>3.731</td>
<td><b>3.503</b></td>
<td>2.706</td>
<td>2.985</td>
<td>2.973</td>
<td>4.441</td>
<td>5.214</td>
</tr>
<tr>
<td>ODM</td>
<td>25.63</td>
<td><b>3.429</b></td>
<td><b>3.598</b></td>
<td>3.519</td>
<td><b>2.382</b></td>
<td><b>2.713</b></td>
<td><b>2.255</b></td>
<td><b>3.487</b></td>
<td><b>4.746</b></td>
</tr>
</tbody>
</table>

</div>
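
The PPL columns in the table above are lower-is-better. The evaluation code is not part of this diff, so the following is only a sketch that assumes the standard token-level definition: perplexity is the exponential of the mean negative log-likelihood per token, measured in nats. The function name `perplexity` is illustrative.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean NLL) for a list of per-token negative log-likelihoods (nats)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every token has perplexity 4.
    print(perplexity([math.log(4.0)] * 3))
```

Under this definition, a drop in the ALL column from 3.584 to 3.429 corresponds to a drop of about 0.044 nats in mean per-token loss.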

84 changes: 84 additions & 0 deletions data/dataset_info.json
@@ -638,6 +638,90 @@
"prompt": "text"
}
},
"RedPajamaArXiv-6B": {
"file_name": "RedPajamaArXiv-6B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaBook-6B": {
"file_name": "RedPajamaBook-6B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaC4-6B": {
"file_name": "RedPajamaC4-6B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaCommonCrawl-6B": {
"file_name": "RedPajamaCommonCrawl-6B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaGithub-6B": {
"file_name": "RedPajamaGithub-6B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaStackExchange-6B": {
"file_name": "RedPajamaStackExchange-6B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaWikipedia-6B": {
"file_name": "RedPajamaWikipedia-6B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaArXiv-30B": {
"file_name": "RedPajamaArXiv-30B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaBook-30B": {
"file_name": "RedPajamaBook-30B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaC4-30B": {
"file_name": "RedPajamaC4-30B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaCommonCrawl-30B": {
"file_name": "RedPajamaCommonCrawl-30B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaGithub-30B": {
"file_name": "RedPajamaGithub-30B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaStackExchange-30B": {
"file_name": "RedPajamaStackExchange-30B.jsonl",
"columns": {
"prompt": "text"
}
},
"RedPajamaWikipedia-30B": {
"file_name": "RedPajamaWikipedia-30B.jsonl",
"columns": {
"prompt": "text"
}
},
"c4_demo": {
"file_name": "c4_demo.jsonl",
"columns": {
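
The fourteen entries added in this hunk share one pattern: a domain name, a size suffix, a matching `.jsonl` file, and a `text` column mapped to `prompt`. As a minimal sketch, they could be generated instead of written by hand. The helper `build_entries` is hypothetical and not part of DataFlex or LLaMA-Factory; the domain and size names are copied from the entries above.

```python
import json

# Domain and size names as they appear in the entries added by this diff.
DOMAINS = ["ArXiv", "Book", "C4", "CommonCrawl", "Github", "StackExchange", "Wikipedia"]
SIZES = ["6B", "30B"]


def build_entries(domains=DOMAINS, sizes=SIZES):
    """Return dataset_info.json-style entries keyed by dataset name (hypothetical helper)."""
    entries = {}
    for size in sizes:
        for domain in domains:
            name = f"RedPajama{domain}-{size}"
            entries[name] = {
                "file_name": f"{name}.jsonl",
                "columns": {"prompt": "text"},
            }
    return entries


if __name__ == "__main__":
    print(json.dumps(build_entries(), indent=2))
```

The output can be merged into the existing `dataset_info.json` dictionary; generating the entries keeps the key and `file_name` in sync for every domain.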
@@ -27,7 +27,7 @@ output_dir: ../dataflex_saves/Qwen2.5-0.5B/doremi_step1_static_qwen_pt_full_resu
logging_steps: 10
save_steps: 100
plot_loss: true
-save_only_model: false
+save_only_model: true
overwrite_output_dir: true

### swanlab
@@ -53,9 +53,9 @@ ddp_timeout: 180000000
### dynamic_train - DoReMi Step 1: Reference Model Training
train_type: dynamic_mix
components_cfg_file: src/dataflex/configs/components.yaml
-component_name: static # use the static mixer
-mixture_sample_rule: mixture # initial sampling rule; "mixture" mixes according to the init_mixture_proportions ratios
-init_mixture_proportions: [0.5, 0.5] # the corresponding initial proportions; a uniform distribution is used here as the reference weights
+component_name: static # use static mixer
+mixture_sample_rule: mixture # initial sampling rule, mixture is according to the init_mixture_proportions ratio
+init_mixture_proportions: [0.5, 0.5] # corresponding initial proportions, here use uniform distribution as reference weight
static_mix: true
warmup_step: 100
update_step: 200
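
With `mixture_sample_rule: mixture` and `init_mixture_proportions: [0.5, 0.5]`, the config can be read as drawing each training sample from one of the two sources with equal probability. The sketch below illustrates that reading, assuming one proportion per source. `sample_domains` is a hypothetical helper, not DataFlex's actual sampler.

```python
import random


def sample_domains(proportions, num_samples, seed=0):
    """Draw one source index per sample according to the mixture proportions."""
    if abs(sum(proportions) - 1.0) > 1e-6:
        raise ValueError("proportions must sum to 1")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    domain_ids = list(range(len(proportions)))
    return rng.choices(domain_ids, weights=proportions, k=num_samples)


if __name__ == "__main__":
    picks = sample_domains([0.5, 0.5], num_samples=10_000)
    # The share of samples from source 0 should be close to 0.5.
    print(picks.count(0) / len(picks))
```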
@@ -27,7 +27,7 @@ output_dir: ../dataflex_saves/Qwen2.5-0.5B/doremi_step2_dynamic_qwen_pt_full_res
logging_steps: 10
save_steps: 100
plot_loss: true
-save_only_model: false
+save_only_model: true
overwrite_output_dir: true

### swanlab
@@ -42,7 +42,7 @@ report_to: none # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 8
-gradient_accumulation_steps: 1
+gradient_accumulation_steps: 8
learning_rate: 5.0e-5
num_train_epochs: 1.0
lr_scheduler_type: linear
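
Raising `gradient_accumulation_steps` from 1 to 8 while `per_device_train_batch_size` stays at 8 multiplies the number of sequences per optimizer step by eight. The small calculation below makes that explicit. The device count is not stated in this diff, so `num_devices` is an assumed parameter.

```python
def effective_batch_size(per_device_batch_size, grad_accum_steps, num_devices=1):
    """Sequences consumed per optimizer step across all devices."""
    return per_device_batch_size * grad_accum_steps * num_devices


if __name__ == "__main__":
    print(effective_batch_size(8, 1))  # before the change, single device: 8
    print(effective_batch_size(8, 8))  # after the change, single device: 64
```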
@@ -54,8 +54,8 @@ ddp_timeout: 180000000
train_type: dynamic_mix
components_cfg_file: src/dataflex/configs/components.yaml
component_name: doremi
-mixture_sample_rule: mixture # initial sampling rule: "mixture" mixes according to the init_mixture_proportions ratios (dynamically adjustable), "stratified" is fixed and proportional to the source dataset sizes, "uniform" is a fixed uniform distribution
-init_mixture_proportions: [0.5, 0.5] # the corresponding initial proportions; can be adjusted by an additional algorithm
+mixture_sample_rule: mixture # initial sampling rule, mixture is according to the init_mixture_proportions ratio (can be dynamically adjusted), stratified is fixed according to the source dataset size ratio, uniform is fixed uniform distribution
+init_mixture_proportions: [0.5, 0.5] # reference weight
warmup_step: 100
update_step: 200
update_times: 3
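
This step uses the `doremi` component to adjust the mixture during training. The diff does not show how that component computes its updates, so the sketch below follows the update rule described in the DoReMi paper rather than DataFlex's code. Domain weights are multiplied by the exponential of the clipped excess loss (proxy loss minus reference loss), renormalised, and then smoothed toward the uniform distribution. The function name, `step_size`, and `smoothing` are assumptions chosen for illustration.

```python
import math


def doremi_update(weights, proxy_losses, ref_losses, step_size=1.0, smoothing=1e-3):
    """One exponentiated-gradient update of the domain weights.

    weights, proxy_losses and ref_losses each hold one value per domain.
    """
    # Excess loss: how much worse the proxy model is than the reference, clipped at 0.
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, ref_losses)]
    # Multiplicative update, then renormalise to a probability vector.
    scaled = [w * math.exp(step_size * e) for w, e in zip(weights, excess)]
    total = sum(scaled)
    normalised = [s / total for s in scaled]
    # Mix with the uniform distribution so that no domain weight collapses to zero.
    uniform = 1.0 / len(weights)
    return [(1.0 - smoothing) * n + smoothing * uniform for n in normalised]


if __name__ == "__main__":
    # Domain 0 lags the reference model, so its weight should increase.
    print(doremi_update([0.5, 0.5], proxy_losses=[3.2, 2.9], ref_losses=[3.0, 3.0]))
```

Domains where the proxy model already matches or beats the reference contribute zero excess loss, so their weights change only through renormalisation and smoothing.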