support CCE, tiled MLP, activation CPU offload #7169
New file (+26 lines): `examples/train/activation_cpu_offload/fsdp2.json`

```json
{
  "_description": "FSDP2 configuration for distributed training (PyTorch native FSDP v2)",
  "_requires": "torch>=2.4.0",
  "_note": "This is the recommended configuration for multi-GPU training with activation CPU offloading. NOTE: When using FSDP2, do NOT use --gradient_checkpointing; use activation_checkpointing in fsdp_config instead.",
  "_param_docs": {
    "fsdp": "FSDP strategy string. Options: 'full_shard' (ZeRO-3 style, shards params+grads+optimizer), 'shard_grad_op' (ZeRO-2 style, shards grads+optimizer only). Add 'auto_wrap' to enable automatic layer wrapping. Add 'offload' to enable CPU offloading.",
    "fsdp_version": "FSDP version. Use 2 for PyTorch native FSDP2 (recommended). FSDP2 uses DTensor for per-parameter sharding and supports LoRA/QLoRA natively.",
    "auto_wrap_policy": "How to wrap model layers. 'TRANSFORMER_BASED_WRAP' wraps transformer decoder layers (from model._no_split_modules). 'SIZE_BASED_WRAP' wraps modules exceeding min_num_params.",
    "cpu_ram_efficient_loading": "If true, only rank 0 loads the full model weights, then broadcasts them to the other ranks. Reduces CPU RAM usage during initialization.",
    "state_dict_type": "'SHARDED_STATE_DICT' (recommended): each rank saves its own shard without extra communication. 'FULL_STATE_DICT': gathers the full model on rank 0 (higher memory, slower).",
    "reshard_after_forward": "true = FULL_SHARD (ZeRO-3), reshards params after the forward pass. false = SHARD_GRAD_OP (ZeRO-2), keeps params gathered during forward/backward.",
    "activation_checkpointing": "Use FSDP's native activation checkpointing instead of gradient_checkpointing. This is the correct way to save memory with FSDP.",
    "activation_cpu_offload": "true = offload activations to CPU. false = keep activations on GPU. Can be enabled when using activation_checkpointing."
  },
  "fsdp": "full_shard auto_wrap",
  "fsdp_config": {
    "fsdp_version": 2,
    "reshard_after_forward": true,
    "auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    "cpu_ram_efficient_loading": true,
    "state_dict_type": "SHARDED_STATE_DICT",
    "activation_checkpointing": false,
    "activation_cpu_offload": true
  }
}
```
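The `activation_cpu_offload` switch corresponds to PyTorch's saved-tensor offloading mechanism. As a rough illustration of that mechanism (not ms-swift's actual code path), `torch.autograd.graph.save_on_cpu` moves every tensor saved for backward to host RAM during the forward pass and copies it back on demand during backward, leaving gradients unchanged:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model standing in for a transformer block.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
x = torch.randn(4, 8)

# Baseline: tensors saved for backward stay in device memory.
model(x).sum().backward()
grads_ref = [p.grad.clone() for p in model.parameters()]
for p in model.parameters():
    p.grad = None

# Offloaded: tensors saved for backward are moved to host memory
# during forward and fetched back during backward.
with torch.autograd.graph.save_on_cpu():
    out = model(x).sum()
out.backward()
grads_off = [p.grad for p in model.parameters()]
# Gradients are numerically identical; only peak activation memory changes.
```

On a real GPU run the saved activations would otherwise occupy device memory for the whole forward/backward interval; this context trades that memory for host-device transfer time.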
New file (+27 lines): LoRA training script with activation CPU offload

```bash
#!/bin/bash
CUDA_VISIBLE_DEVICES=0,1 \
swift sft \
    --model 'Qwen/Qwen3-0.6B' \
    --dataset 'swift/self-cognition#1000' \
    --load_from_cache_file true \
    --split_dataset_ratio 0.01 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps 16 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --fsdp './examples/train/activation_cpu_offload/fsdp2.json'
```
New file (+17 lines): CCE (Cut Cross-Entropy) example

```bash
# test env: 1 * A10
# Using use_cce:     2.62GB
# Not using use_cce: 16.24GB

# Install the CCE dependency
pip install "cut-cross-entropy[transformers] @ git+https://github.com/axolotl-ai-cloud/ml-cross-entropy.git@f643b88"

# Run ms-swift (example)
swift sft \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --dataset gsm8k#1024 \
    --train_type lora \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --use_hf true \
    --use_cce true \
    "$@"
```
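The memory saving comes from never materializing the full `[batch, vocab]` logits tensor. The cross-entropy loss `logsumexp(logits) - logits[target]` can be folded over vocabulary chunks, so only one chunk of logits needs to exist at a time. A pure-Python sketch of that folding (illustrative only; the actual library uses a fused CUDA kernel):

```python
import math

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) over a list of floats.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def cross_entropy_full(logits, target):
    # Standard CE: needs the whole logits row in memory at once.
    return logsumexp(logits) - logits[target]

def cross_entropy_chunked(logits, target, chunk=4):
    # CCE-style: fold logsumexp over vocab chunks; only one chunk of
    # logits is "materialized" (sliced) at a time.
    running = float("-inf")
    for i in range(0, len(logits), chunk):
        part = logsumexp(logits[i:i + chunk])
        m = max(running, part)
        running = m + math.log(math.exp(running - m) + math.exp(part - m))
    return running - logits[target]
```

In the real setting, "chunk" is a tile of the vocabulary projection `hidden @ W_vocab^T`, computed on the fly and discarded, which is why the logits never dominate peak memory.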
New file (+25 lines): accelerate config for FSDP2

```json
{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "FSDP",
  "downcast_bf16": "no",
  "fsdp_config": {
    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    "fsdp_cpu_ram_efficient_loading": true,
    "fsdp_reshard_after_forward": true,
    "fsdp_state_dict_type": "FULL_STATE_DICT",
    "fsdp_activation_checkpointing": true,
    "fsdp_version": 2
  },
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "bf16",
  "num_machines": 1,
  "num_processes": 2,
  "rdzv_backend": "static",
  "same_network": true,
  "tpu_env": [],
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}
```
New file (+24 lines): tiled MLP with DeepSpeed ZeRO-3

```bash
CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=2 \
swift sft \
    --model Qwen/Qwen3-4B \
    --dataset swift/self-cognition#200 \
    --train_type full \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --gradient_accumulation_steps 1 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --max_length 2048 \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --use_tiled_mlp true \
    --tiled_mlp_num_shards 4 \
    --deepspeed zero3
```
New file (+30 lines): tiled MLP with FSDP2

```bash
#!/bin/bash
# FSDP2 training with tiled MLP
# Requires an accelerate config with fsdp_version: 2

# First, create the accelerate config (fsdp2.json) or use the one in examples/train/multi-gpu/fsdp2_lora/

# FSDP2 with tiled MLP
accelerate launch --config_file fsdp2.json \
    -m swift sft \
    --model Qwen/Qwen3-4B \
    --dataset swift/self-cognition#200 \
    --train_type full \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --learning_rate 1e-5 \
    --gradient_checkpointing false \
    --weight_decay 0.1 \
    --gradient_accumulation_steps 1 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --max_length 2048 \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --use_tiled_mlp true \
    --tiled_mlp_num_shards 4
```
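The `--use_tiled_mlp` flag exploits the fact that a transformer MLP is position-wise: the sequence can be split into `tiled_mlp_num_shards` tiles that are pushed through the MLP one at a time, so only one tile's intermediate (up-projected) activations are live at once. A toy pure-Python sketch of the idea (helper names are hypothetical; this is not ms-swift's implementation):

```python
def mlp(token, w1, w2):
    # Position-wise MLP on one token vector: up-projection + ReLU, then down-projection.
    hidden = [max(0.0, sum(t * w for t, w in zip(token, col))) for col in w1]
    return [sum(h * w for h, w in zip(hidden, col)) for col in w2]

def mlp_full(tokens, w1, w2):
    # Baseline: all tokens' hidden activations exist simultaneously.
    return [mlp(t, w1, w2) for t in tokens]

def mlp_tiled(tokens, w1, w2, num_shards):
    # Tiled: process the sequence shard by shard; only one shard's
    # intermediate activations are alive at a time, shrinking peak memory.
    shard = (len(tokens) + num_shards - 1) // num_shards
    out = []
    for i in range(0, len(tokens), shard):
        out.extend(mlp_full(tokens[i:i + shard], w1, w2))
    return out
```

Because the MLP mixes nothing across positions, the tiled result is bit-identical to the untiled one; the trade-off is extra kernel launches (and, in training, recomputation bookkeeping) rather than any change in the math.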
Reviewer comment (on `fsdp2.json`):

> The `_note` field is a bit confusing. It states that this configuration is for training "without CPU offloading", but the file is in a directory named `activation_cpu_offload` and the configuration itself enables `activation_cpu_offload`. This should be corrected to avoid confusion.