Abnormal GPU memory fluctuation during SFT training #9495

@kascas

Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

Image

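When running SFT training with ms-swift, I observe that GPU memory allocated is unstable over the course of training. During the validation passes that run between training steps, memory usage changes abruptly, and this can even lead to OOM. The validation set is produced by split_dataset_ratio, so it should be identical on every evaluation.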
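
To make the fluctuation easier to quantify, the peak allocated memory could be recorded separately for training and validation phases. The following is a minimal sketch, not part of ms-swift: `PhasePeakLog` and `_cuda_peak_gib` are hypothetical helper names, and wiring `mark()` into the training loop (for example from a `transformers` `TrainerCallback` in `on_step_end` and `on_evaluate`) is assumed and left out. It only relies on `torch.cuda.max_memory_allocated()` and `torch.cuda.reset_peak_memory_stats()`.

```python
from typing import Callable, Dict, List


def _cuda_peak_gib() -> float:
    """Peak allocated CUDA memory since the last reset, in GiB.

    Returns 0.0 when torch or CUDA is unavailable, so the helper can be
    imported on a machine without a GPU.
    """
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    peak = torch.cuda.max_memory_allocated() / 2**30
    torch.cuda.reset_peak_memory_stats()  # start a fresh window for the next phase
    return peak


class PhasePeakLog:
    """Hypothetical helper: collects one peak-memory reading per finished phase."""

    def __init__(self, read_peak: Callable[[], float] = _cuda_peak_gib):
        self.read_peak = read_peak
        self.peaks: Dict[str, List[float]] = {"train": [], "eval": []}

    def mark(self, phase: str) -> float:
        """Call when a phase ends; stores the peak observed since the previous mark."""
        peak = self.read_peak()
        self.peaks[phase].append(peak)
        return peak

    def spread(self, phase: str) -> float:
        """Difference between the largest and smallest peak seen for a phase."""
        values = self.peaks[phase]
        return max(values) - min(values) if values else 0.0
```

If the validation set really is identical on every evaluation, `spread("eval")` would be expected to stay small; a large value would confirm that the peaks differ from one validation pass to the next.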

How to Reproduce

  • ms-swift version: 4.2.3
  • Training configuration:
# -------- model & data --------
model: Qwen3.5-9B
dataset: xxx
split_dataset_ratio: 0.05

tuner_type: full

# -------- precision / attention --------
torch_dtype: bfloat16
attn_impl: flash_attn
padding_free: true
packing: false

# -------- optimizer schedule --------
num_train_epochs: 3
per_device_train_batch_size: 1
per_device_eval_batch_size: 1
gradient_accumulation_steps: 2
gradient_checkpointing: true
learning_rate: 2.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.1
max_grad_norm: 1.0

# -------- sequence length --------
max_length: 65536
truncation_strategy: delete

# -------- loss / agent specifics --------
agent_template: qwen3_5
loss_scale: ignore_empty_think
add_non_thinking_prefix: true
use_liger_kernel: true

# -------- multimodal heads (Qwen3.5 is a VL arch even though our data is text) --------
freeze_vit: true
freeze_aligner: true
freeze_llm: false

# -------- distributed --------
deepspeed: zero3_offload

# -------- checkpointing / logging --------
save_strategy: steps
save_steps: 30
# save_total_limit: 3
eval_strategy: steps
eval_steps: 10
use_logits_to_keep: true
logging_steps: 1
add_version: true
report_to:
  - tensorboard
  - swanlab

# -------- io --------
dataset_num_proc: 8
dataloader_num_workers: 4
output_dir: xxx
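
As a rough aid for reading this configuration, the sketch below estimates how much memory a full logits tensor would take for a single sample at different sequence lengths. It is a back-of-envelope calculation under stated assumptions, not a diagnosis: it is not confirmed that full logits are materialized during validation in this setup, `ASSUMED_VOCAB_SIZE` is a hypothetical placeholder (not the real Qwen3.5 vocabulary size), and float32 logits are assumed.

```python
def logits_gib(seq_len: int, vocab_size: int, bytes_per_value: int = 4) -> float:
    """Size in GiB of a [seq_len, vocab_size] logits tensor for one sample."""
    return seq_len * vocab_size * bytes_per_value / 2**30


# Hypothetical value for illustration only; substitute the model's real vocabulary size.
ASSUMED_VOCAB_SIZE = 150_000

if __name__ == "__main__":
    # max_length in the config above is 65536; shorter samples are shown for contrast.
    for seq_len in (4096, 16384, 65536):
        size = logits_gib(seq_len, ASSUMED_VOCAB_SIZE)
        print(f"seq_len={seq_len:>6}: ~{size:.1f} GiB of float32 logits")
```

Because the estimate scales linearly with sequence length, with `per_device_eval_batch_size: 1` and `max_length: 65536` the per-sample footprint could differ by an order of magnitude between short and long validation samples, if the logits are in fact materialized.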

Additional Information

No response
