FT for GLM45#4583

Open
Xing-lil wants to merge 3 commits into PaddlePaddle:develop from Xing-lil:FT_glm45
@Xing-lil Xing-lil commented Jun 1, 2026

Collaborator

Before submitting

  • Lint code. If there are lint issues, please format the code first.

    # Install and register `pre-commit` in the project folder
    pip install pre-commit && pre-commit install

    # Process previous code files separately
    pre-commit run --files XXXX.py

  • Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

Others

PR changes

Others

Description

FT for GLM45

paddle-bot Bot commented Jun 1, 2026

Thanks for your contribution!

Paddle-CI-Bot commented Jun 1, 2026

PaddleFormers Log Analysis

Run #26741020194 · Attempt 1

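Log analysis report

| Pipeline | Issue label | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| Integration test (H20, single card) | GPU dropped | CI maintainers should check whether the GPU is mounted correctly on the H20 single-card machine. paddlefleet_ops.__init__ calls paddle.cuda.get_device_capability() at module level, and Paddle detected a CPU Place, which indicates that CUDA device initialization failed. | Error code |
| Integration test (H20, multi-card) | Loss Diff | The PR changes introduce a precision change. CI passes automatically once a precision approver approves the PR. | Error code |
| Integration test (A100) | Exit code 250 + Loss Diff | GLM4.5 pt/sft hit exit code -6 (SIGABRT, training process crashed) and every model shows a precision diff, so a precision approver must approve. The lora deviation is large (Max abs diff=0.476) and needs separate investigation. | Error code |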

Failed test cases:

[H20 single card]
- GLM4.5 single-card (glm45_pt_single_card.sh)           → GPU dropped, CUDA device unavailable during paddlefleet_ops import
- Qwen3-30B-A3B single-card (qwen3_single_card.sh)       → Same as above, same root cause
- Qwen3-vl-8k-single-card (qwen3vl_sft_single_card.sh)   → Same as above, same root cause

[H20 multi-card]
- GLM4.5 pt             (glm45_pt.sh)              → Loss Diff: Log=11.37636757 vs GT=11.37643719 (abs_diff=6.96e-05)
- GLM4.5 sft            (glm45_sft.sh)             → Loss Diff: Log=0.04137547  vs GT=0.04099750 (abs_diff=3.78e-04)
- GLM4.5 sft cp         (glm45_sft_cp.sh)          → Loss Diff: Log=0.03640614  vs GT=0.03707570 (abs_diff=6.70e-04)
- GLM4.5 lora           (glm45_lora.sh)            → Loss Diff: Log=0.04060261  vs GT=0.03884196 (abs_diff=1.76e-03)
- GLM4.5 dpo            (glm45_dpo.sh)             → Loss Diff: Log=0.60926580  vs GT=0.60890049 (abs_diff=3.65e-04)
- GLM4.5 EP4            (glm45_pt_ep4.sh)          → Loss Diff: Log=11.37565041 vs GT=11.37552071 (abs_diff=1.30e-04)
- GLM4.5 Grouped GEMM   (glm45_pt_grouped_gemm.sh) → Loss Diff: Log=11.37382889 vs GT=11.37383652 (abs_diff=7.63e-06)
- Qwen3 sft             (qwen.sh sft)              → Loss Diff: Log=9.80192375  vs GT=9.80183220 (abs_diff=9.16e-05)
✓ PASSED: GLM4.5 dpo lora, GLM4.5 FP8

[A100]
- GLM4.5 pt   (glm45_a100.sh pt)    → exit code -6 (SIGABRT) + Loss Diff: Log=11.9234848  vs GT=11.9233017
- GLM4.5 sft  (glm45_a100.sh sft)   → exit code -6 (SIGABRT) + Loss Diff: Log=0.02823288  vs GT=0.02788216
- GLM4.5 lora (glm45_a100.sh lora)  → Loss Diff: Log=6.752000  vs GT=7.228000 (abs_diff=0.476, largest deviation!)
- GLM4.5 dpo  (glm45_a100.sh dpo)   → Loss Diff: Log=0.74227977 vs GT=0.74248558
- Qwen3 pt    (qwen3_a100.sh pt)    → Loss Diff: Log=11.4033432  vs GT=11.40356922
- Qwen3 sft   (qwen3_a100.sh sft)   → Loss Diff: Log=9.83459663  vs GT=9.83432674
- Qwen3 lora  (qwen3_a100.sh lora)  → exit code -6 (SIGABRT) + Loss Diff: Log=9.50580788 vs GT=9.50577545
- Qwen3VL sft moe (qwen3vl_sft.sh moe a100) → Loss Diff: Log=12.33152199 vs GT=12.33158302
✓ PASSED: GLM4.5 dpo lora
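
The abs_diff figures in the lists above can be reproduced by hand. The following is a minimal sketch, assuming the check is a plain tolerance comparison of the logged loss against the ground-truth (GT) value; the actual CI script is not shown in this thread, and `loss_matches` is a hypothetical helper name. With the zero tolerance the report describes (rtol=0, atol=0), any nonzero difference fails.

```python
def loss_matches(log_loss: float, gt_loss: float,
                 rtol: float = 0.0, atol: float = 0.0) -> bool:
    """Hypothetical check: pass when |log - gt| <= atol + rtol * |gt|."""
    return abs(log_loss - gt_loss) <= atol + rtol * abs(gt_loss)


# (Log, GT) pairs copied from the H20 multi-card and A100 lists above.
cases = {
    "glm45_pt.sh": (11.37636757, 11.37643719),
    "glm45_lora.sh": (0.04060261, 0.03884196),
    "glm45_a100.sh lora": (6.752000, 7.228000),
}

for name, (log_loss, gt_loss) in cases.items():
    diff = abs(log_loss - gt_loss)
    print(f"{name}: abs_diff={diff:.2e} pass={loss_matches(log_loss, gt_loss)}")
```

Under this reading, even a difference in the seventh decimal place is reported as a Loss Diff, which is why the approval step exists.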

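Root cause analysis:

PR #4583 (FT for GLM45) adds training logic for the new GLM4.5 model to the codebase, which leads to three classes of failure:

  1. H20 single card, GPU dropped: paddlefleet_ops unconditionally calls paddle.cuda.get_device_capability() at module import, but the CUDA device on the H20 single-card CI machine was not recognized correctly (Place(cpu) was returned). This is a known CI machine environment issue and is not directly related to the PR code.
  2. Loss Diff on all platforms: the PR modifies the GLM4.5 training/SFT/LoRA/DPO paths, so the Step-10 loss of every related case deviates slightly from the baseline GT (the precision tolerance is rtol=0, atol=0, i.e. zero tolerance). check_precision_approval.sh detects the precision change and requires an approver to step in.
  3. A100 exit code -6 (SIGABRT): on A100, the training processes of glm45_pt, glm45_sft, and qwen3_lora terminated with -6 (SIGABRT). Training did complete (***** train metrics ***** is present, so check_log_for_exitcode returns 0), but the LAUNCH layer captured exit code -6 from rank 0. Together with the abnormally large abs_diff=0.476 of the lora case, this suggests a numerical stability problem or a paddle/paddlefleet compatibility problem for this model configuration on A100.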
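
The table labels the A100 failure "exit code 250" while the case list shows -6. A likely reading, which the report itself does not state, is that both describe the same event: Python's subprocess convention reports death by signal N as -N, and the same status wraps to 256 - N when stored in an unsigned byte. A minimal sketch under that assumption (the helper names are hypothetical, and signal numbers are those of a POSIX system):

```python
import signal


def describe_exit(returncode: int) -> str:
    """Describe a return code in the subprocess convention (negative = killed by signal)."""
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def as_unsigned_byte(returncode: int) -> int:
    """Wrap a return code into the 0..255 range, as an 8-bit exit status would store it."""
    return returncode % 256


print(describe_exit(-6))
print(as_unsigned_byte(-6))
```

SIGABRT is typically raised by abort(), for example from a failed assertion or an unhandled C++ exception during teardown, which would be consistent with training metrics being written before the process died.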

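Suggested fixes:

  1. H20 single card, GPU dropped: CI maintainers should check whether the GPU is mounted correctly inside the formers-fleet-CI-4583-integration-test-single-card-* containers (whether nvidia-smi works). If this is a transient machine fault, a rerun is enough.

  2. Loss Diff precision approval: one of the approvers below must approve PR FT for GLM45 #4583, after which the CI script check_precision_approval.sh passes automatically:

    • Group 1 (at least 1 person): @XieYunshen @From00 @risemeup1 @tianlef
    • Group 2 (at least 1 person): @lugimzzz @zjjlivein @tianlef
    • Group 3 (at least 1 person): @tianlef @swgu98

    After approval, rerun the failed pipelines. Loss Diff errors will then pass automatically (provided the loss values themselves are reasonable).

  3. A100 lora precision deviation too large (abs_diff=0.476): this deviation far exceeds that of the other cases (usually < 1e-3). Check whether the tests/config/ci/glm45_lora_a100.yaml configuration or the GLM4.5 lora implementation has a weight initialization, learning rate, or data order problem on A100. Reproduce manually on A100 first; once reproduced, either update the GT baseline or fix the implementation and regenerate the GT.

  4. A100 exit code -6 (SIGABRT): confirm that the paddlefleet version on the A100 CI machines is compatible with the new GLM4.5 code. If the -6 is an occasional race, rerun to verify. If it reproduces consistently, inspect the backtrace in paddleformers_dist_log/workerlog.0 (the workerlog contents are not uploaded in the current CI log).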
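
For the first suggestion, the nvidia-smi check can be scripted so it runs before any framework import. This is a minimal sketch, not part of the CI tooling: `gpu_visible` is a hypothetical helper, and it assumes `nvidia-smi -L` prints one line starting with "GPU " per visible device.

```python
import shutil
import subprocess


def gpu_visible(timeout_s: float = 10.0) -> bool:
    """Return True if `nvidia-smi -L` succeeds and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Driver tools are not on PATH, so the container has no usable GPU.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU " in result.stdout


if __name__ == "__main__":
    print("GPU visible:", gpu_visible())
```

Running a probe like this at the start of a job would separate a machine-level fault from a failure in the code under test.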


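🔍 Accuracy record: click the 😊 icon at the bottom of this comment and choose 👍 (accurate) or 👎 (incorrect); the response is recorded automatically in the CI monitoring system

🔄 Updated automatically after each re-run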
