Checklist
Bug Description
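When running SFT training with ms-swift, I observe that GPU memory allocated is unstable over the course of training: memory changes abruptly during the validation passes that run between training steps, and this can even lead to OOM. The validation set is split off via split_dataset_ratio, so it should be identical at every evaluation.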
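To make the instability easier to quantify, readings of allocated memory can be tagged by phase and compared across evaluation rounds. The following is a minimal sketch, not part of ms-swift: the helper names `read_gpu_memory_gib`, `peak_by_phase`, and `eval_spread` are hypothetical, and the only library calls assumed are PyTorch's `torch.cuda.memory_allocated()` and `torch.cuda.max_memory_allocated()`.

```python
def read_gpu_memory_gib():
    """Return (allocated, peak) in GiB for the current CUDA device.

    Returns None when torch or CUDA is unavailable, so the helper
    can be imported on a machine without a GPU.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    gib = 1024 ** 3
    return (torch.cuda.memory_allocated() / gib,
            torch.cuda.max_memory_allocated() / gib)


def peak_by_phase(readings):
    """Map each phase to its (highest reading in GiB, step where it occurred).

    `readings` is an iterable of (step, phase, allocated_gib) tuples,
    where phase is a free-form label such as "train" or "eval".
    """
    peaks = {}
    for step, phase, gib in readings:
        if phase not in peaks or gib > peaks[phase][0]:
            peaks[phase] = (gib, step)
    return peaks


def eval_spread(readings, phase="eval"):
    """Difference between the largest and smallest reading of one phase.

    A spread near zero means every evaluation round used about the
    same memory; a large spread matches the behaviour reported here.
    """
    values = [gib for _, p, gib in readings if p == phase]
    return max(values) - min(values) if values else 0.0
```

Readings could be collected by calling `read_gpu_memory_gib()` at the start and end of each evaluation round (for example from a trainer callback) and appending them with the current step; a large `eval_spread` on a fixed validation set would confirm that the rounds really do differ.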
How to Reproduce
# -------- model & data --------
model: Qwen3.5-9B
dataset: xxx
split_dataset_ratio: 0.05
tuner_type: full
# -------- precision / attention --------
torch_dtype: bfloat16
attn_impl: flash_attn
padding_free: true
packing: false
# -------- optimizer schedule --------
num_train_epochs: 3
per_device_train_batch_size: 1
per_device_eval_batch_size: 1
gradient_accumulation_steps: 2
gradient_checkpointing: true
learning_rate: 2.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
max_grad_norm: 1.0
# -------- sequence length --------
max_length: 65536
truncation_strategy: delete
# -------- loss / agent specifics --------
agent_template: qwen3_5
loss_scale: ignore_empty_think
add_non_thinking_prefix: true
use_liger_kernel: true
# -------- multimodal heads (Qwen3.5 is a VL arch even though our data is text) --------
freeze_vit: true
freeze_aligner: true
freeze_llm: false
# -------- distributed --------
deepspeed: zero3_offload
# -------- checkpointing / logging --------
save_strategy: steps
save_steps: 30
# save_total_limit: 3
eval_strategy: steps
eval_steps: 10
use_logits_to_keep: true
logging_steps: 1
add_version: true
report_to:
- tensorboard
- swanlab
# -------- io --------
dataset_num_proc: 8
dataloader_num_workers: 4
output_dir: xxx
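For context on how much headroom a single long sample can consume with this config, here is a back-of-envelope sketch of the size of one tensor that scales with sequence length, such as a full logits tensor. It is an illustration under stated assumptions, not a diagnosis: `HYPOTHETICAL_VOCAB` is a placeholder value rather than the model's actual vocabulary size, and whether evaluation materializes such a tensor in this setup is not established here.

```python
MAX_LENGTH = 65536            # max_length from the config above
BF16_BYTES = 2                # torch_dtype: bfloat16
HYPOTHETICAL_VOCAB = 150_000  # placeholder, NOT the model's real vocab size


def tensor_gib(seq_len, width, bytes_per_elem):
    """Size in GiB of a dense [seq_len, width] tensor."""
    return seq_len * width * bytes_per_elem / 1024 ** 3


def logits_gib(seq_len, vocab_size=HYPOTHETICAL_VOCAB,
               bytes_per_elem=BF16_BYTES):
    """Size in GiB of a [seq_len, vocab_size] logits tensor for one sample."""
    return tensor_gib(seq_len, vocab_size, bytes_per_elem)


if __name__ == "__main__":
    # Compare a short sample with one at the configured maximum length.
    for seq_len in (4096, 16384, MAX_LENGTH):
        print(f"seq_len={seq_len:>6}: {logits_gib(seq_len):.2f} GiB")
```

Because the size grows linearly with `seq_len`, a fixed validation set with per_device_eval_batch_size: 1 can still produce very different per-sample footprints if its samples vary widely in length.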
Additional Information
No response