DataFlex is a data-centric training system built on top of LLaMA-Factory. It supports dynamic data selection, dynamic data mixture, and dynamic data reweighting during LLM training.
git clone https://github.com/OpenDCAI/DataFlex.git
cd DataFlex
pip install -e .
pip install llamafactory==0.9.3

DataFlex provides a single CLI entry point:
# Check version
dataflex-cli version
# Run training
dataflex-cli train <config.yaml> [key=value overrides ...]

OmegaConf-style overrides can be appended after the YAML path:
dataflex-cli train examples/train_lora/selectors/less.yaml learning_rate=5e-5 warmup_step=20

For multi-GPU training, use FORCE_TORCHRUN:
FORCE_TORCHRUN=1 dataflex-cli train examples/train_lora/selectors/less.yaml

A DataFlex training config is a standard LlamaFactory YAML with additional DataFlex-specific fields.
| Section | Key Fields |
|---|---|
| Model | model_name_or_path, trust_remote_code |
| Method | stage (pt/sft), do_train, finetuning_type (lora/freeze/full), lora_rank, lora_alpha, lora_target |
| Dataset | dataset, template, cutoff_len, overwrite_cache, preprocessing_num_workers |
| Output | output_dir, logging_steps, save_steps, overwrite_output_dir |
| Training | per_device_train_batch_size, gradient_accumulation_steps, learning_rate, num_train_epochs, lr_scheduler_type, bf16 |
| DeepSpeed | deepspeed (path to ds config JSON) |

DataFlex-specific fields:

| Field | Type | Description |
|---|---|---|
| train_type | str | Training mode. One of: dynamic_select, dynamic_mix, dynamic_weight, static |
| components_cfg_file | str | Path to the components config YAML (default: src/dataflex/configs/components.yaml) |
| component_name | str | Which algorithm to use, matching a key in components_cfg_file |
| warmup_step | int | Number of warmup steps before dynamic behavior kicks in |
| update_step | int | Interval (in steps) between dynamic updates |
| update_times | int | For dynamic_select, number of dynamic updates per Flex epoch. Use -1 for dataset-sized epochs with continuous updates |
| static_mix | bool | If true with dynamic_mix, use fixed proportions (no dynamic updates). Used in DoReMi Step 1 & 3 |
| train_step | int | Optional total training steps. If set to a positive value, it overrides num_train_epochs-derived steps |

DataFlex dynamic trainers run a step-based training loop internally. Prefer eval_strategy: "steps" / save_strategy: "steps" or disable them with "no" when using dynamic training. Epoch-based evaluation or saving depends on the internal step-to-epoch bookkeeping and may not align with Flex epoch boundaries.
For dynamic_select, num_train_epochs repeats Flex epochs. One Flex epoch contains:
warmup_step + update_step * update_times
For example:
warmup_step: 10
update_step: 10
update_times: 2
num_train_epochs: 3

This runs 3 * (10 + 10 * 2) = 90 optimization steps. To keep the old single-Flex-epoch behavior, set:

num_train_epochs: 1.0

If train_step > 0, DataFlex uses train_step as the exact total number of optimization steps and does not derive total steps from num_train_epochs.
For multi-epoch tests, make sure example configs do not leave a positive train_step; pass train_step=0 on the CLI if you want num_train_epochs to control training length.
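
The step arithmetic above can be sketched as a small helper. This is only a reading of the rules stated in this section, not DataFlex's internal code: the function name `total_optimization_steps` is hypothetical, truncating fractional results with `int()` is an assumption, and the `update_times: -1` mode is deliberately not modeled.

```python
def total_optimization_steps(warmup_step, update_step, update_times,
                             num_train_epochs=1.0, train_step=0):
    """Estimate total optimization steps for a dynamic_select run.

    Hypothetical helper mirroring the documented rules:
    - a positive train_step wins outright;
    - otherwise each Flex epoch has warmup_step + update_step * update_times
      steps, repeated num_train_epochs times.
    """
    if train_step > 0:
        return train_step
    if update_times < 0:
        # update_times = -1 means dataset-sized epochs; its length depends on
        # the dataset, so this sketch does not attempt to compute it.
        raise ValueError("update_times < 0 is not modeled in this sketch")
    flex_epoch_steps = warmup_step + update_step * update_times
    # Assumption: fractional epoch counts are truncated.
    return int(num_train_epochs * flex_epoch_steps)


# The example from the text: 3 Flex epochs of 10 + 10 * 2 steps each.
print(total_optimization_steps(10, 10, 2, num_train_epochs=3))  # 90
# A positive train_step overrides the epoch-derived value.
print(total_optimization_steps(10, 10, 2, num_train_epochs=3, train_step=50))  # 50
```

Running a config's values through a helper like this before launching is a quick way to notice a leftover positive train_step.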
For dynamic_weight, warmup_step is a global step threshold. Reweighting starts when global_step >= warmup_step and does not reset at epoch boundaries.

Mixture-specific fields (used with dynamic_mix):

| Field | Type | Description |
|---|---|---|
| mixture_sample_rule | str | Sampling rule: mixture (proportional), stratified (by dataset size), uniform |
| init_mixture_proportions | list[float] | Initial proportions for each source dataset, e.g. [0.5, 0.5] |

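
To make the three sampling rules concrete, the sketch below turns each rule into per-source sampling probabilities. It is an interpretation of the one-line descriptions in the table, not DataFlex's sampler: the function name `source_probabilities` and the normalization of the given proportions are assumptions.

```python
def source_probabilities(rule, dataset_sizes, proportions=None):
    """Return one sampling probability per source dataset.

    Hypothetical reading of mixture_sample_rule:
    - "mixture":    follow the given proportions (normalized to sum to 1)
    - "stratified": proportional to each dataset's size
    - "uniform":    equal probability for every source
    """
    n = len(dataset_sizes)
    if n == 0:
        raise ValueError("at least one source dataset is required")
    if rule == "mixture":
        if proportions is None or len(proportions) != n:
            raise ValueError("mixture needs one proportion per dataset")
        total = sum(proportions)
        return [p / total for p in proportions]
    if rule == "stratified":
        total = sum(dataset_sizes)
        return [size / total for size in dataset_sizes]
    if rule == "uniform":
        return [1.0 / n] * n
    raise ValueError(f"unknown rule: {rule}")


sizes = [1000, 3000]  # made-up dataset sizes for illustration
print(source_probabilities("mixture", sizes, [0.5, 0.5]))  # [0.5, 0.5]
print(source_probabilities("stratified", sizes))           # [0.25, 0.75]
print(source_probabilities("uniform", sizes))              # [0.5, 0.5]
```
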
Dynamically selects a subset of training samples at regular intervals based on model state.
train_type: dynamic_select
components_cfg_file: src/dataflex/configs/components.yaml
component_name: less # choices: less, nice, loss, delta_loss, tsds, near, random, custom
warmup_step: 10
update_step: 10
update_times: 2
num_train_epochs: 1.0

How it works:
- Warmup phase: train on randomly sampled data for warmup_step steps.
- At warmup_step and every update_step steps: pause training, run the selector to pick new samples, rebuild the dataloader.
- One Flex epoch has warmup_step + update_step * update_times steps.
- Total steps are derived from num_train_epochs unless train_step > 0.
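
The schedule described in the list above can be sketched as follows. The helper names are hypothetical, and the exact step index at which the selector fires (here: the step counter equal to warmup_step, then every update_step steps after that) is an assumption about off-by-one behavior that the text does not pin down.

```python
def selector_steps(warmup_step, update_step, update_times):
    """Step offsets within one Flex epoch at which the selector would run."""
    return [warmup_step + i * update_step for i in range(update_times)]


def flex_epoch_length(warmup_step, update_step, update_times):
    """Number of optimization steps in one Flex epoch."""
    return warmup_step + update_step * update_times


# Values from the config above: warmup_step=10, update_step=10, update_times=2.
print(selector_steps(10, 10, 2))     # [10, 20]
print(flex_epoch_length(10, 10, 2))  # 30
```

Under this reading, steps 0-9 train on randomly sampled data, the selector runs at steps 10 and 20, and each selected subset is used for the following 10 steps.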
Example:
dataflex-cli train examples/train_lora/selectors/less.yaml

Dynamically adjusts the proportions of data from multiple source datasets. Your dataset field should list multiple datasets separated by commas (e.g., dataset: wiki_demo,c4_demo).
train_type: dynamic_mix
components_cfg_file: src/dataflex/configs/components.yaml
component_name: doremi # choices: doremi, odm, random, static
mixture_sample_rule: mixture
init_mixture_proportions: [0.5, 0.5]
warmup_step: 100
update_step: 200
update_times: 3

For fixed proportions (static mixing, as used in DoReMi Step 1 & 3):

train_type: dynamic_mix
components_cfg_file: src/dataflex/configs/components.yaml
component_name: static
static_mix: true
mixture_sample_rule: mixture
init_mixture_proportions: [0.5, 0.5]
train_step: 1000

Example:
dataflex-cli train examples/train_lora/mixers/doremi_step2_dynamic_qwen_pt_lora.yaml

Dynamically adjusts per-sample loss weights during backpropagation based on sample characteristics.
train_type: dynamic_weight
components_cfg_file: src/dataflex/configs/components.yaml
component_name: loss # choices: loss, custom
warmup_step: 100
train_step: 500 # fixed-step example; set to 0 for num_train_epochs-based multi-epoch runs

How it works:
- Standard training for warmup_step steps (no reweighting).
- After warmup: each training step computes per-sample losses and applies the weighting strategy.
- warmup_step is measured in global optimization steps and does not reset per epoch.
- If train_step > 0, total steps = train_step; otherwise total steps follow the standard num_train_epochs calculation.
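
The warmup gate and per-sample weighting described in the list above can be illustrated with plain Python. The weighting used here (weights proportional to each sample's loss) is a stand-in chosen for brevity, not one of the strategies the loss weighter actually ships, and `reweighted_loss` is a hypothetical name.

```python
def reweighted_loss(per_sample_losses, global_step, warmup_step):
    """Combine per-sample losses into one scalar for the optimizer step.

    Before warmup_step the plain mean is used (no reweighting). From
    global_step >= warmup_step on, a weighting strategy is applied.
    """
    if not per_sample_losses:
        raise ValueError("per_sample_losses must not be empty")
    if global_step < warmup_step:
        return sum(per_sample_losses) / len(per_sample_losses)
    total = sum(per_sample_losses)
    if total == 0:
        return 0.0
    # Stand-in strategy (assumption): weight each sample by its share of the
    # batch loss, so harder samples contribute more.
    weights = [loss / total for loss in per_sample_losses]
    return sum(w * loss for w, loss in zip(weights, per_sample_losses))


batch = [1.0, 3.0]
print(reweighted_loss(batch, global_step=5, warmup_step=10))   # 2.0 (plain mean)
print(reweighted_loss(batch, global_step=10, warmup_step=10))  # 2.5 (reweighted)
```

Because the gate compares against the global step, it stays open after the first epoch instead of re-entering warmup.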
Example:
dataflex-cli train examples/train_lora/weighters/loss.yaml

The components.yaml file defines algorithm-specific parameters. It has three top-level sections:
selectors:
  algorithm_name:
    name: algorithm_name
    params:
      param1: value1
      param2: value2
mixers:
  algorithm_name:
    name: algorithm_name
    params:
      ...
weighters:
  algorithm_name:
    name: algorithm_name
    params:
      ...

You select which algorithm to use via component_name in your training YAML.
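
As a rough illustration of that lookup, the sketch below resolves a component from an already-parsed components config. Everything here is an assumption made for illustration: the `resolve_component` helper is hypothetical, the mapping from train_type to top-level section is inferred from the section names, and the dictionary stands in for the parsed YAML with the placeholder parameters shown above.

```python
# Assumed mapping from train_type to the top-level section of components.yaml.
SECTION_BY_TRAIN_TYPE = {
    "dynamic_select": "selectors",
    "dynamic_mix": "mixers",
    "dynamic_weight": "weighters",
}


def resolve_component(components, train_type, component_name):
    """Return (name, params) for the requested algorithm."""
    if train_type not in SECTION_BY_TRAIN_TYPE:
        raise KeyError(f"train_type {train_type!r} has no component section")
    section = SECTION_BY_TRAIN_TYPE[train_type]
    entries = components.get(section) or {}
    if component_name not in entries:
        raise KeyError(f"{component_name!r} not found under {section!r}")
    entry = entries[component_name]
    return entry["name"], entry.get("params") or {}


# Stand-in for the parsed YAML; param1/value1 are the placeholders from above.
components = {
    "selectors": {"less": {"name": "less", "params": {"param1": "value1"}}},
    "mixers": {},
    "weighters": {},
}
print(resolve_component(components, "dynamic_select", "less"))
```
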

Selectors:

| Algorithm | component_name | Category | Description |
|---|---|---|---|
| LESS | less | Gradient-based | Selects samples based on gradient similarity to validation set |
| NICE | nice | Gradient-based | Neural network-based importance sampling with reward model |
| Loss | loss | Loss-based | Selects samples based on current training loss |
| Delta Loss | delta_loss | Loss-based | Selects based on loss change over a sliding window |
| TSDS | tsds | Distribution-based | Task-specific data selection using pre-computed probabilities |
| NEAR | near | Distribution-based | Nearest-neighbor based selection using pre-computed indices |
| Random | random | Random | Uniform random sampling |
| Custom | custom | Custom | Template for user-defined selection logic |


Mixers:

| Algorithm | component_name | Category | Description |
|---|---|---|---|
| DoReMi | doremi | Offline | Domain reweighting with minimax optimization (3-step pipeline) |
| ODM | odm | Online | Online data mixing using Exp3 multi-armed bandit |
| Static | static | Fixed | Fixed proportions throughout training |
| Random | random | Random | Random domain proportions |

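
Since ODM is described as using the Exp3 multi-armed bandit, a textbook Exp3 sketch may help show how per-domain sampling probabilities can shift online. This is the generic algorithm, with each source dataset treated as an arm; it is not DataFlex's ODM implementation, and how a reward in [0, 1] would be derived from training signals is left open here. The class name and the gamma value are illustrative.

```python
import math
import random


class Exp3:
    """Textbook Exp3 bandit; each arm stands for one source dataset."""

    def __init__(self, n_arms, gamma=0.1):
        self.n_arms = n_arms
        self.gamma = gamma              # exploration rate in (0, 1]
        self.weights = [1.0] * n_arms

    def probabilities(self):
        """Mix the normalized weights with uniform exploration."""
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def sample(self, rng=random):
        """Draw the arm (domain) to sample the next batch from."""
        return rng.choices(range(self.n_arms), weights=self.probabilities())[0]

    def update(self, arm, reward):
        """Update the chosen arm with a reward assumed to lie in [0, 1]."""
        p = self.probabilities()[arm]
        estimated = reward / p          # importance-weighted reward estimate
        self.weights[arm] *= math.exp(self.gamma * estimated / self.n_arms)


bandit = Exp3(n_arms=2)
bandit.update(arm=0, reward=1.0)        # pretend domain 0 was useful
print(bandit.probabilities())           # domain 0 now has the larger share
```
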

Weighters:

| Algorithm | component_name | Category | Description |
|---|---|---|---|
| Loss Reweighting | loss | Loss-based | Strategies: linupper, uniform, quadratic, extremes |
| Custom | custom | Custom | Template for user-defined weighting logic |

Some selectors require offline preprocessing before training:
TSDS: generates sampling probabilities based on embedding similarity between candidate and target data.

python src/dataflex/offline_selector/offline_tsds_selector.py

Produces tsds_probs.npy; set the path in components.yaml under selectors.tsds.params.probs_path.

NEAR: computes nearest-neighbor indices between candidate and query datasets.

python src/dataflex/offline_selector/offline_near_selector.py

Produces top_indices.npy; set the path in components.yaml under selectors.near.params.indices_path.
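
For intuition about what a nearest-neighbor index computation involves, here is a minimal pure-Python sketch. It does not reproduce the offline script: the use of cosine similarity, the direction of the search (top-k candidates per query), and the function names are all assumptions, and a real run would work on embedding arrays and write them to a .npy file.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_indices(queries, candidates, k):
    """For each query embedding, indices of the k most similar candidates."""
    result = []
    for query in queries:
        ranked = sorted(range(len(candidates)),
                        key=lambda i: cosine(query, candidates[i]),
                        reverse=True)
        result.append(ranked[:k])
    return result


# Toy 2-D embeddings, made up for illustration.
queries = [[1.0, 0.0]]
candidates = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
print(top_k_indices(queries, candidates, k=2))  # [[1, 2]]
```
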
DoReMi requires a 3-step pipeline:
- Step 1: Train a reference model with static uniform/given proportions:
  dataflex-cli train examples/train_full/mixers/doremi_step1_static_qwen_pt_full.yaml
- Step 2: Train a proxy model with dynamic DoReMi mixing, using the Step 1 checkpoint as reference:
  dataflex-cli train examples/train_full/mixers/doremi_step2_dynamic_qwen_pt_full.yaml
  This outputs optimized domain weights.
- Step 3: Train the final model with static proportions set to the optimized weights from Step 2:
  dataflex-cli train examples/train_full/mixers/doremi_step3_static_qwen_pt_full.yaml
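
To show what "optimized domain weights" means in Step 2, the sketch below applies one domain-weight update in the style of the exponentiated-gradient rule from the DoReMi method: domains where the proxy model's loss exceeds the reference model's loss gain weight. It is a simplified illustration, not DataFlex's mixer; the function name, the learning rate, and the smoothing constant are arbitrary choices, and the per-domain losses are made-up numbers.

```python
import math


def doremi_update(weights, proxy_losses, reference_losses,
                  lr=1.0, smoothing=1e-3):
    """One exponentiated-gradient step on the domain weights."""
    # Excess loss per domain, clipped at zero.
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, reference_losses)]
    # Multiplicative update, then renormalize to a probability vector.
    scaled = [w * math.exp(lr * e) for w, e in zip(weights, excess)]
    total = sum(scaled)
    n = len(weights)
    # Mix with the uniform distribution so no domain collapses to zero.
    return [(1 - smoothing) * s / total + smoothing / n for s in scaled]


weights = [0.5, 0.5]
# Made-up losses: the proxy lags the reference on domain 0 only.
new_weights = doremi_update(weights, proxy_losses=[2.0, 1.0],
                            reference_losses=[1.0, 1.0])
print(new_weights)  # domain 0 receives the larger weight
```

The weights obtained at the end of Step 2 are what Step 3 would plug in as its fixed proportions.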
All example configs are in the examples/ directory:
examples/
├── train_lora/
│ ├── selectors/ # LESS, NICE, Loss, Delta Loss, TSDS, NEAR, Random, Custom
│ ├── mixers/ # DoReMi Step 2 (LoRA), Random
│ └── weighters/ # Loss, Custom
├── train_full/
│ └── mixers/ # DoReMi Steps 1-3 (full), ODM (full)
├── test/ # minimal smoke-test configs
├── merge_lora/ # LoRA merge configs (use llamafactory-cli export)
├── deepspeed/ # DeepSpeed ZeRO configs
└── accelerate/ # FSDP configs
DataFlex is fully compatible with LlamaFactory: any standard LlamaFactory YAML works with dataflex-cli train. If train_type is not specified or is set to static, DataFlex uses the default LlamaFactory trainer with no modifications.
For operations like model export/merge, continue using llamafactory-cli:
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml