test cuda13 by zjjlivein · Pull Request #4564 · PaddlePaddle/PaddleFormers

zjjlivein · 2026-05-28T12:34:51Z

Before submitting

Lint code. If there are lint issues, please format the code first.

# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --files XXXX.py

Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Description

paddle-bot · 2026-05-28T12:34:57Z

Thanks for your contribution!

Paddle-CI-Bot · 2026-05-28T19:36:46Z

PaddleFormers Log Analysis

Run #26740925748 · Attempt 1

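Log Analysis Report

Pipeline	Issue label	Suggested fix	Log snippet
Integration test (H20, single card)	GPU dropped	CI maintainers should check GPU visibility on the H20 machine, confirm that the CUDA device is mounted correctly, then rerun	Error code
Integration test (A100)	Loss Diff + precision approval	Loss alignment requires an approval from one of XieYunshen / From00 / risemeup1 / tianlef, one of lugimzzz / zjjlivein / tianlef, and one of tianlef / swgu98; the PR can be merged only after all three groups have approved	Error code
Integration test (H20, multi-card)	GPU dropped	Same as H20 single-card: `paddlefleet_ops/__init__.py` calls `paddle.cuda.get_device_capability()` at import time, but the CUDA device is not ready; confirm GPU visibility on the machine, then rerun	Error code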
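
The GPU visibility check suggested in the table could be scripted before a rerun. The following is only a minimal sketch, not the check used by this CI: it assumes `nvidia-smi` is on `PATH` on the runner, and the function name `count_visible_gpus` is hypothetical. It reports what the driver sees, which does not by itself guarantee that Paddle will select a CUDA place.

```python
import shutil
import subprocess


def count_visible_gpus(timeout_s: float = 10.0) -> int:
    """Return the number of GPUs listed by `nvidia-smi -L`, or 0 on any failure."""
    if shutil.which("nvidia-smi") is None:
        return 0  # driver tools are not installed or not on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return 0  # the tool could not be started or it hung
    if result.returncode != 0:
        return 0  # the driver reported an error
    # Each visible device is printed on its own line starting with "GPU <index>:".
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))


if __name__ == "__main__":
    print(f"visible GPUs: {count_visible_gpus()}")
```

A result of 0 on an H20 runner would be consistent with the "GPU dropped" label; a non-zero result would suggest looking at the framework or container configuration instead.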

Failed test cases:

# Integration test (H20, single card), job 78782312398
- Integration test (GLM4.5 single-card)
- Integration test (Qwen3-30B-A3B single-card)
- Qwen3-vl-8k-single-card

# Integration test (A100), job 78782312404
- GLM4.5 pre-train         (exit 250 + Loss Diff: Log=11.9234848 vs GT=11.9233017 + precision approval not passed)
- GLM4.5 sft               (precision approval not passed)
- GLM4.5 lora              (precision approval not passed)
- GLM4.5 dpo               (exit code 1)
- Qwen pre-train           (exit code 1)
- Qwen sft                 (exit code 1)
- Qwen lora                (exit code 1)
- Qwen vl moe              (precision approval not passed)

# Integration test (H20, multi-card), job 78782312405
- GLM4.5 pre-train         (ValueError: cuda.get_device_properties)
- GLM4.5 sft
- GLM4.5 sft cp
- GLM4.5 lora
- GLM4.5 dpo
- GLM4.5 dpo_lora
- GLM4.5 pre-train (EP4)
- GLM4.5 pre-train (FP8)
- GLM4.5 pre-train (Grouped GEMM)
- Qwen pre-train
- Qwen sft
- Qwen lora
- Qwen vl sft
- Qwen vl lora
- Qwen vl moe
- Qwen3-vl-8k-fsdp

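Root cause analysis:

PR #4564 (branch test_cuda13.2, commit d414cee) involves precision-related changes, and this run failed in three ways:

GPU dropped on the H20 machines (single / multi-card): line 55 of paddlefleet_ops/__init__.py calls paddle.cuda.get_device_capability() at import time, but at that point paddle.device returned Place(cpu) rather than a CUDA device. This indicates that the GPU devices on the H20 machine were not registered properly during this CI run. It is the known "GPU dropped" problem and is unrelated to the PR code itself.
A100 Loss Diff: the GLM4.5 pre-train loss at step 10 (11.9234848) differs from the GT value (11.9233017) by 1.83e-4 in absolute terms, which exceeds the zero-tolerance requirement (atol=0) and triggers the precision approval process. The PR introduces cuda13.2-related changes, which may slightly affect the precision path of CUDA kernels.
Precision approval not passed: after the Loss Diff check fired, check_precision_approval.sh found that PR #4564 (test cuda13) had no reviewer approval yet. None of the three approver groups (XieYunshen/From00/risemeup1/tianlef, lugimzzz/zjjlivein/tianlef, tianlef/swgu98) had submitted an approved review, so CI terminated with exit 6.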
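
The Loss Diff arithmetic can be reproduced in a few lines. This is a sketch under stated assumptions, not the comparison code that the CI actually runs: the helper name `loss_matches` is hypothetical, and only the two loss values and the `atol=0` setting come from the report.

```python
import math


def loss_matches(actual: float, expected: float, atol: float = 0.0) -> bool:
    """Return True when |actual - expected| <= atol (exact equality when atol is 0)."""
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=atol)


log_loss = 11.9234848  # step-10 loss from this run, as quoted in the report
gt_loss = 11.9233017   # ground-truth baseline, as quoted in the report

abs_diff = abs(log_loss - gt_loss)
print(f"absolute difference: {abs_diff:.3e}")
print("passes with atol=0:", loss_matches(log_loss, gt_loss))
print("passes with atol=2e-4:", loss_matches(log_loss, gt_loss, atol=2e-4))
```

With a zero tolerance, any difference at all fails the check, so a change of environment such as a new CUDA version can trip it even when the drift is small. The `atol=2e-4` line is only there to illustrate how a non-zero tolerance would behave; the report does not propose relaxing the tolerance.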

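Suggested fixes:

H20 GPU dropped: CI maintainers should check the GPU mount state of machine yqlcc01-bbc-yqonlinea-com-1567435, confirm that the CUDA device is visible, and then re-trigger the H20 single-card and H20 multi-card jobs.
A100 Loss Diff + precision approval:
- At least one person from each of the following three groups must submit an Approved review on PR #4564 (test cuda13):
  - Group 1: XieYunshen / From00 / risemeup1 / tianlef
  - Group 2: lugimzzz / zjjlivein / tianlef
  - Group 3: tianlef / swgu98
- If the Loss Diff is normal precision fluctuation under the cuda13.2 environment, also update the GT baseline file glm45_pt_multi_card_a100_gt_loss.txt (and the GT files of any other affected models), and ask the precision approvers to confirm and approve.
- If the precision change is not acceptable, investigate the kernel-path differences introduced by cuda13.2, align the loss, and resubmit.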
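
The three-group approval rule can be expressed as a small set check. This is an illustrative sketch only: the real logic lives in check_precision_approval.sh, whose contents are not shown here, and the names `APPROVAL_GROUPS` and `missing_groups` are hypothetical. It also assumes that one reviewer who belongs to several groups counts for each of them, which the report does not confirm.

```python
# Group membership copied from the report; group labels are illustrative.
APPROVAL_GROUPS = {
    "group 1": {"XieYunshen", "From00", "risemeup1", "tianlef"},
    "group 2": {"lugimzzz", "zjjlivein", "tianlef"},
    "group 3": {"tianlef", "swgu98"},
}


def missing_groups(approvers, groups=APPROVAL_GROUPS):
    """Return the names of groups that have no member among the approvers."""
    approved = set(approvers)
    return [name for name, members in groups.items() if not (members & approved)]


if __name__ == "__main__":
    # No approvals yet: every group is still missing.
    print(missing_groups([]))
    # One approver from group 1 and one from group 3: group 2 is still missing.
    print(missing_groups(["From00", "swgu98"]))
```

Under the stated assumption, a single approval from tianlef would satisfy all three groups, because that name appears in each list.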

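🔍 Accuracy feedback: click the 😊 icon at the bottom of this comment and choose 👍 (accurate) or 👎 (inaccurate); the response is recorded automatically in the CI monitoring system.

🔄 Updated automatically after each re-run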

test cuda13

ac8db04

paddle-bot Bot added the contributor label May 28, 2026

zjjlivein added 2 commits May 29, 2026 11:33

update

9a50087

update

d414cee

zjjlivein wants to merge 3 commits into PaddlePaddle:develop from zjjlivein:test_cuda13.2

2 participants
