test cuda13 #4564

Open
zjjlivein wants to merge 3 commits into PaddlePaddle:develop from zjjlivein:test_cuda13.2
@zjjlivein

Collaborator

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --files XXXX.py
  • Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Description

@paddle-bot

paddle-bot Bot commented May 28, 2026

Thanks for your contribution!

@Paddle-CI-Bot

Paddle-CI-Bot commented May 28, 2026

PaddleFormers Log Analysis

Run #26740925748 · Attempt 1

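Log Analysis Report

| Pipeline | Issue label | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| Integration test (H20, single card) | GPU dropped | CI maintainers should check GPU visibility on the H20 machine and confirm that the CUDA device is mounted correctly, then rerun | Error code |
| Integration test (A100) | Loss Diff + precision approval | Loss alignment requires an approve from one of XieYunshen / From00 / risemeup1 / tianlef, one of lugimzzz / zjjlivein / tianlef, and one of tianlef / swgu98; the PR can be merged only after all three groups have approved | Error code |
| Integration test (H20, multi-card) | GPU dropped | Same as H20 single-card: paddlefleet_ops/__init__.py calls paddle.cuda.get_device_capability() at import time, but the CUDA device is not ready; confirm GPU visibility on the machine, then rerun | Error code |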
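
The GPU visibility check suggested for the H20 machines could be scripted as a preflight step before a rerun. The following is a minimal sketch, not the check the CI actually runs: the function name `visible_gpu_count` is hypothetical, and it assumes `nvidia-smi` is the tool used to list devices on the runner (`nvidia-smi -L` prints one `GPU <index>: ...` line per device).

```python
import os
import shutil
import subprocess


def visible_gpu_count(nvidia_smi: str = "nvidia-smi") -> int:
    """Return how many GPUs nvidia-smi lists, or 0 if none can be seen.

    Hypothetical helper for illustration; it is not part of the CI scripts.
    """
    # nvidia-smi ignores CUDA_VISIBLE_DEVICES, but an empty value hides every
    # device from CUDA programs, so treat it as "no GPU visible".
    if os.environ.get("CUDA_VISIBLE_DEVICES") == "":
        return 0
    # A missing binary usually means the driver is not installed or not mounted.
    if shutil.which(nvidia_smi) is None:
        return 0
    try:
        result = subprocess.run(
            [nvidia_smi, "-L"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return 0
    if result.returncode != 0:
        return 0
    # Count the device lines, which start with "GPU ".
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))
```

A job could call `visible_gpu_count()` first and stop with a clear "no GPU visible" message when it returns 0, rather than failing later inside an import.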

Failed test cases:

# Integration test (H20, single card) — job 78782312398
- Integration test (GLM4.5 single-card)
- Integration test (Qwen3-30B-A3B single-card)
- Qwen3-vl-8k-single-card

# Integration test (A100) — job 78782312404
- GLM4.5 pre-train         (exit 250 + Loss Diff: Log=11.9234848 vs GT=11.9233017 + precision approval not passed)
- GLM4.5 sft               (precision approval not passed)
- GLM4.5 lora              (precision approval not passed)
- GLM4.5 dpo               (exit code 1)
- Qwen pre-train           (exit code 1)
- Qwen sft                 (exit code 1)
- Qwen lora                (exit code 1)
- Qwen vl moe              (precision approval not passed)

# Integration test (H20, multi-card) — job 78782312405
- GLM4.5 pre-train         (ValueError: cuda.get_device_properties)
- GLM4.5 sft
- GLM4.5 sft cp
- GLM4.5 lora
- GLM4.5 dpo
- GLM4.5 dpo_lora
- GLM4.5 pre-train (EP4)
- GLM4.5 pre-train (FP8)
- GLM4.5 pre-train (Grouped GEMM)
- Qwen pre-train
- Qwen sft
- Qwen lora
- Qwen vl sft
- Qwen vl lora
- Qwen vl moe
- Qwen3-vl-8k-fsdp

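Root cause analysis:

PR #4564 (branch test_cuda13.2, commit d414cee) involves precision-related changes and produced three kinds of failure:

  1. GPU dropped on the H20 machines (single / multi-card): paddlefleet_ops/__init__.py line 55 calls paddle.cuda.get_device_capability() at import time, but at that point paddle.device returned Place(cpu) rather than a CUDA device. This indicates that the GPU devices on the H20 machine were not registered correctly during this CI run. It is the known "GPU dropped" problem and is unrelated to the PR code itself.

  2. A100 Loss Diff: the GLM4.5 pre-train loss at Step 10 (11.9234848) differs from the GT (11.9233017) by an absolute 1.83e-4, which exceeds the zero-tolerance requirement of atol=0 and triggers the precision approval flow. This PR introduces cuda13.2-related changes, which may slightly affect the CUDA kernel precision path.

  3. Precision approval not passed: once the Loss Diff check fired, check_precision_approval.sh found that PR test cuda13 #4564 has no reviewer approve yet. None of the three approver groups (XieYunshen/From00/risemeup1/tianlef, lugimzzz/zjjlivein/tianlef, tianlef/swgu98) has submitted an approved review, so CI terminated with exit 6.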
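
The Loss Diff figure can be reproduced with a short tolerance check. This is a sketch under stated assumptions: the report gives only the two loss values and atol=0, so the function name `loss_within_tolerance` and the use of `math.isclose` are illustrative and not taken from the CI scripts.

```python
import math


def loss_within_tolerance(loss: float, gt: float, atol: float = 0.0) -> bool:
    """Return True when |loss - gt| <= atol.

    Hypothetical helper: with atol=0.0 (the value named in the report),
    only an exact match passes.
    """
    return math.isclose(loss, gt, rel_tol=0.0, abs_tol=atol)


# Values reported for GLM4.5 pre-train at Step 10.
log_loss = 11.9234848
gt_loss = 11.9233017

abs_diff = abs(log_loss - gt_loss)  # about 1.83e-4
passes_zero_tolerance = loss_within_tolerance(log_loss, gt_loss)  # atol=0
```

With atol=0 any nonzero difference fails the check, which is why a deviation this small is still enough to trigger the approval flow.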


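Suggested fixes:

  1. GPU dropped on H20: CI maintainers should check the GPU mount status of machine yqlcc01-bbc-yqonlinea-com-1567435, confirm that the CUDA device is visible, and then re-trigger the two jobs H20 single-card and H20 multi-card.

  2. A100 Loss Diff + precision approval

    • At least one person from each of the following three groups must submit an Approved review on PR test cuda13 #4564:
      • Group 1: XieYunshen / From00 / risemeup1 / tianlef
      • Group 2: lugimzzz / zjjlivein / tianlef
      • Group 3: tianlef / swgu98
    • If the Loss Diff is normal precision fluctuation under the cuda13.2 environment, also update the GT baseline file glm45_pt_multi_card_a100_gt_loss.txt (and the gt files of any other affected models), and ask the precision approvers to confirm and approve.
    • If the precision change is not acceptable, investigate the kernel path differences introduced by cuda13.2, align the loss, and resubmit.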
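
The three-group approval rule can be expressed as a small set check. The sketch below is illustrative only: the internals of check_precision_approval.sh are not shown in the report, so the function name `missing_groups` is hypothetical, and the code assumes that one reviewer's approve counts toward every group that reviewer belongs to.

```python
# Approver groups as listed in the report.
APPROVER_GROUPS = [
    {"XieYunshen", "From00", "risemeup1", "tianlef"},
    {"lugimzzz", "zjjlivein", "tianlef"},
    {"tianlef", "swgu98"},
]


def missing_groups(approved_by, groups=APPROVER_GROUPS):
    """Return the 1-based numbers of groups that have no approving member.

    Hypothetical helper; assumes a single approve counts for every group
    the reviewer is a member of.
    """
    approved = set(approved_by)
    return [
        index
        for index, group in enumerate(groups, start=1)
        if not group & approved  # empty intersection: nobody in this group approved
    ]


no_approvals = missing_groups([])                      # state described in the report
two_approvals = missing_groups(["From00", "swgu98"])   # group 2 still open
```

An empty result means every group is covered. Under the stated assumption, an approve from tianlef alone would cover all three groups, because that reviewer is listed in each of them.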

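🔍 Accuracy feedback: click the 😊 icon at the bottom of this comment and choose 👍 (accurate) or 👎 (incorrect); the response is recorded automatically in the CI monitoring system

🔄 Updated automatically after each re-run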


2 participants