209 changes: 209 additions & 0 deletions docs/source/BestPractices/Elastic.md
@@ -0,0 +1,209 @@
# Elastic



## Installing Dependencies

Deploy Kubernetes for the cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in it, then install the Python dependencies:
`pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`

Other dependencies and versions in the training image, verified through repeated testing:
- deepspeed 0.16.5 (see https://github.com/deepspeedai/DeepSpeed/pull/7585/files for the fix to the universal-checkpoint issue)
- pytorch 2.6.0
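
A minimal install sketch pinning the verified versions above (assuming a standard pip environment; the DeepSpeed universal-checkpoint fix still has to be applied separately):

```bash
# Install the elastic-training stack (package names as used in this document).
pip install dlrover tornado kubernetes ms-swift
# Pin the versions verified in the training image.
pip install "deepspeed==0.16.5" "torch==2.6.0"
# Note: the universal-checkpoint fix (DeepSpeed PR #7585) must still be applied on top.
```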


## How to Launch
A launch command is composed of `dlrover-run` + dlrover arguments + the swift launch command + swift arguments. Apart from its own custom arguments, `dlrover-run` takes the same arguments as `torchrun`.
The `dlrover-run` arguments are as follows:
```
usage: dlrover-run [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE]
[--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT] [--rdzv-id RDZV_ID]
[--rdzv-conf RDZV_CONF] [--standalone] [--max-restarts MAX_RESTARTS]
[--monitor-interval MONITOR_INTERVAL] [--start-method {spawn,fork,forkserver}]
[--role ROLE] [-m] [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS]
[-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER] [--node-rank NODE_RANK]
[--master-addr MASTER_ADDR] [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
[--logs-specs LOGS_SPECS] [--precheck {0,1,2}] [--node_unit NODE_UNIT]
[--auto_config] [--auto_tunning] [--exclude-straggler] [--save_at_breakpoint]
[--accelerator {nvidia.com/gpu,ascend-npu}] [--training_port TRAINING_PORT]
[--switchbox-check] [--box-pairs PAIR [PAIR ...]] [--min-bandwidth MIN_BANDWIDTH]
[--min-channels MIN_CHANNELS] [--numa-affinity] [--network-check]
[--comm-perf-test] [--ucp_device_type UCP_DEVICE_TYPE]
training_script

```
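Before submitting an elastic job to the cluster, the launcher can be sanity-checked on a single node. A hypothetical smoke test using the `--standalone` flag from the usage above (the script path follows the example below; `--help` only prints the swift argument list):

```bash
# Launch one local worker without an external rendezvous service.
dlrover-run --standalone --nproc_per_node=1 \
    /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --help
```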
For elastic training, the arguments to pay attention to are:

- `--nnodes NNODES`: number of nodes, or a range of nodes in the form `<minimum_nodes>:<maximum_nodes>`.
- `--nproc-per-node NPROC_PER_NODE`: number of processes per node.

Example:

```bash
model=your_model_path
dataset=your_dataset
output=your_output_dir
export CUDA_VISIBLE_DEVICES=0  # set according to the GPUs actually in use
deepspeed_config_or_type=zero1  # DeepSpeed type or path to a config file, e.g. zero1 or /xxx/ms-swift/swift/llm/ds_config/zero1.json

# NODE_NUM is the maximum number of nodes in the elastic range
dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
/opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
--model_type qwen3 \
--train_type lora \
--torch_dtype bfloat16 \
--dataset $dataset \
--num_train_epochs 4 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 5e-7 \
--gradient_accumulation_steps 8 \
--eval_steps 500 \
--save_steps 10 \
--save_total_limit 20 \
--logging_steps 1 \
--output_dir $output \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--temperature 1.0 \
--system 'You are a helpful assistant.' \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--dataset_num_proc 1 \
--use_flash_ckpt true \
--deepspeed $deepspeed_config_or_type \
--elastic
```
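
Since the example enables `--use_flash_ckpt true`, the parameter documentation recommends setting expandable CUDA segments alongside it; e.g., export this before `dlrover-run` in the launch script:

```bash
# Recommended together with --use_flash_ckpt to avoid CUDA OOM during training.
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
```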

## Example Configuration File
By default, the `zero1` type corresponds to the following example configuration:

```json
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},

"bf16": {
"enabled": "auto"
},

"zero_optimization": {
"stage": 1,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},

"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false,
"elasticity": {
"ignore_non_elastic_batch_info": true,
"enabled": true,
"max_train_batch_size": 8,
"micro_batch_sizes": [
4,
2
],
"min_gpus": 1,
"max_gpus": 4,
"min_time": 20,
"version": 0.1
}
}
```

To customize, set `deepspeed_config_or_type` in the launch command to the path of your own zero1.json. The elasticity-related configuration is:
```json
...

"elasticity": {
"ignore_non_elastic_batch_info": true,
"enabled": true,
"max_train_batch_size": 8,
"micro_batch_sizes": [
4,
2
],
"min_gpus": 1,
"max_gpus": 4,
"min_time": 20,
"version": 0.1
}
```

- ignore_non_elastic_batch_info: the settings inside `elasticity` take precedence over the outer batch-size settings; during training, batch_size and related parameters are adjusted on the fly according to the actual number of training processes (see the sketch after this list). The calculation rule is:
  `global-training-batch-size = micro-batch-size * gradient-accumulation-steps * world-size`
- max_train_batch_size: maximum global batch size
- micro_batch_sizes: candidate values for train_micro_batch_size_per_gpu
- min_gpus: minimum number of GPUs
- max_gpus: maximum number of GPUs

For more details, see the [DeepSpeed documentation](https://www.deepspeed.ai/docs/config-json/#elastic-training-config-v01-and-v02).
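
A minimal sketch of this rule under the example configuration above (the variable names are illustrative, not DLRover or DeepSpeed identifiers):

```bash
# Illustrative only: recompute the global batch size after a rescale.
world_size=4    # actual number of training processes after scaling
micro_batch=2   # picked from micro_batch_sizes
grad_accum=1    # chosen so the product stays within max_train_batch_size
echo $(( micro_batch * grad_accum * world_size ))   # prints 8 = global training batch size
```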


## Launching Training

```yaml
---
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: deepspeed-elastic-swift
  namespace: dlrover
spec:
  distributionStrategy: AllreduceStrategy
  optimizeMode: single-job
  replicaSpecs:
    worker:
      replicas: 1  # must match the maximum of --nnodes NNODES in the launch command
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image:  # training image; deepspeed, dlrover, and swift must be installed
              imagePullPolicy: IfNotPresent
              command:
                - /bin/bash
                - -c
                - sh start.sh  # launch script
              resources:
                limits:
                  cpu: '8'
                  memory: 16Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - mountPath: /model
                  name: volume-model
                - mountPath: /dev/shm
                  name: volume-shm
          volumes:
            - hostPath:
                path: /model
                type: Directory
              name: volume-model
            - emptyDir:
                medium: Memory
                sizeLimit: 200Gi
              name: volume-shm
```
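With the manifest saved to a file (assumed here to be named `elastic-job.yaml`), the job can be submitted and observed with standard kubectl commands:

```bash
# Submit the ElasticJob to the cluster.
kubectl apply -f elastic-job.yaml
# Watch the worker pods created by the DLRover job controller.
kubectl get pods -n dlrover -w
# Follow the training log of a worker (the pod name is illustrative).
kubectl logs -f -n dlrover deepspeed-elastic-swift-worker-0
```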
3 changes: 2 additions & 1 deletion docs/source/Instruction/Command-line-parameters.md
@@ -484,7 +484,8 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`,
- eval_dataset_args: parameters of the evaluation dataset, in JSON format; parameters for multiple datasets can be set.
- eval_limit: number of samples drawn from the evaluation dataset.
- eval_generation_config: model inference configuration during evaluation, in JSON format; defaults to `{'max_tokens': 512}`.
- use_flash_ckpt: whether to enable the flash checkpoint feature of [DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover). Defaults to `false`. When enabled, weights are first saved to shared memory and then persisted asynchronously; the safetensors format is not yet supported. It is recommended to use this together with `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` to avoid CUDA OOM during training.
- use_flash_ckpt: whether to enable the flash checkpoint feature of [DLRover Flash Checkpoint](https://github.com/intelligent-machine-learning/dlrover). Defaults to `false`. When enabled, weights are first saved to shared memory and then persisted asynchronously. It is recommended to use this together with `PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"` to avoid CUDA OOM during training.
- elastic: whether to enable elastic training; depends on [DLRover](https://github.com/intelligent-machine-learning/dlrover) (`pip install dlrover && pip install tornado && pip install kubernetes`); see the [example](../BestPractices/Elastic.md) for usage.
- early_stop_interval: early-stopping interval; training terminates when best_metric has not improved within early_stop_interval cycles (based on `save_steps`; it is recommended to set `eval_steps` and `save_steps` to the same value). The implementation is in the [callback plugin](https://github.com/modelscope/ms-swift/blob/main/swift/plugin/callback.py); for more complex early-stopping needs, simply override the existing implementation in callback.py.

#### SWANLAB