Elastic

Installing Dependencies

Deploy a K8S cluster and deploy DLRover in the cluster, and install the required packages using pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift

Other dependencies and versions verified through repeated testing in the training image: deepspeed 0.16.5 (refer to this PR to fix issues related to universal checkpoint) pytorch 2.6.0

How to Start

Enable elastic training by adding the deepspeed_elastic callback (optionally graceful_exit) in --callbacks, and configure DeepSpeed elasticity settings.

The command format is dlrover-run + DLrover command parameters + Swift startup command + Swift parameters.dlrover-run behaves like torchrun for most arguments, except for its custom parameters.

The dlrover-run arguments are as follows:

usage: dlrover-run [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE]
                   [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT] [--rdzv-id RDZV_ID]
                   [--rdzv-conf RDZV_CONF] [--standalone] [--max-restarts MAX_RESTARTS]
                   [--monitor-interval MONITOR_INTERVAL] [--start-method {spawn,fork,forkserver}]
                   [--role ROLE] [-m] [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS]
                   [-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER] [--node-rank NODE_RANK]
                   [--master-addr MASTER_ADDR] [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
                   [--logs-specs LOGS_SPECS] [--precheck {0,1,2}] [--node_unit NODE_UNIT]
                   [--auto_config] [--auto_tunning] [--exclude-straggler] [--save_at_breakpoint]
                   [--accelerator {nvidia.com/gpu,ascend-npu}] [--training_port TRAINING_PORT]
                   [--switchbox-check] [--box-pairs PAIR [PAIR ...]] [--min-bandwidth MIN_BANDWIDTH]
                   [--min-channels MIN_CHANNELS] [--numa-affinity] [--network-check]
                   [--comm-perf-test] [--ucp_device_type UCP_DEVICE_TYPE]
                   training_script

In elastic training, the parameters you may pay attention to focus on are:

--nnodes NNODES Number of nodes, or the range of nodes in the form <minimum_nodes>:<maximum_nodes>.

--nproc-per-node NPROC_PER_NODE Number of processes per node.

Example:

model=your model path
dataset=your dataset
output= your output dir
export CUDA_VISIBLE_DEVICES=0 # Set according to the actual GPU usage
deepspeed_config_or_type=deepspeed type or configuration file path, e.g., zero1 or /xxx/ms-swift/swift/llm/ds_config/zero1.json

dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1  \
/opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
--model_type qwen3 \
--tuner_type lora  \
--torch_dtype bfloat16 \
--dataset $dataset \
--num_train_epochs 4 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 5e-7 \
--gradient_accumulation_steps 8 \
--eval_steps 500 \
--save_steps 10 \
--save_total_limit 20 \
--logging_steps 1 \
--output_dir $output \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--temperature 1.0 \
--system 'You are a helpful assistant.' \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--dataset_num_proc 1 \
--use_flash_ckpt true \
--callbacks deepspeed_elastic graceful_exit \
--deepspeed $deepspeed_config_or_type \

Configuration

By default, the zero1 configuration is as follows:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": false,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "elasticity": {
        "ignore_non_elastic_batch_info": true,
        "enabled": true,
        "max_train_batch_size": 8,
        "micro_batch_sizes": [
          4,
          2
        ],
        "min_gpus": 1,
        "max_gpus": 4,
        "min_time": 20,
        "version": 0.1
      }
}

If users need custom configurations, they can specify the path to the custom zero1.json file in the deepspeed_config_or_type parameter. The elasticity-related configuration is as follows:

...

  "elasticity": {
        "ignore_non_elastic_batch_info": true,
        "enabled": true,
        "max_train_batch_size": 8,
        "micro_batch_sizes": [
          4,
          2
        ],
        "min_gpus": 1,
        "max_gpus": 4,
        "min_time": 20,
        "version": 0.1
      }

ignore_non_elastic_batch_info：Indicates that the batch size configurations outside the elasticity settings will be ignored. During training, the batch size and related parameters will be dynamically adjusted based on the number of training processes. Calculation principle： global-training-batch-size = micro-batch-size * gradient-accumulation-steps * world-size
max_train_batch_size： Maximum batch size
micro_batch_sizes：List of allowed per-GPU micro-batch sizes under elasticity; candidates for train_micro_batch_size_per_gpu.
min_gpus：Minimum number of GPUs.
max_gpus：Maximum number of GPUs. For more details, see: Deepspeed

Starting Training

---
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: deepspeed-elastic-swift
  namespace: dlrover
spec:
  distributionStrategy: AllreduceStrategy
  optimizeMode: single-job
  replicaSpecs:
    worker:
      replicas: 1 # This should match the maximum value of --nnodes NNODES in the startup command
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: #【Training image, needs to have deepspeed, dlrover, and swift installed】
              imagePullPolicy: IfNotPresent
              command:
                - /bin/bash
                - -c
                - sh start.sh # Startup script
              resources:
                limits:
                  cpu: '8'
                  memory: 16Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - mountPath: /model
                  name: volume-model
                - mountPath: /dev/shm
                  name: volume-shm
          restartPolicy: Never
          volumes:
            - hostPath:
                path: /model
                type: Directory
              name: volume-model
            - emptyDir:
                medium: Memory
                sizeLimit: 200Gi
              name: volume-shm

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Elastic

Installing Dependencies

How to Start

Configuration

Starting Training

FilesExpand file tree

Elastic.md

Latest commit

History

Elastic.md

File metadata and controls

Elastic

Installing Dependencies

How to Start

Configuration

Starting Training