test: verify approval CI passes for non-protected files #4577

Open
liuhao2638 wants to merge 1 commit into PaddlePaddle:develop from liuhao2638:test-approval-ci-noblock
Conversation

@liuhao2638

Contributor

Test PR modifying paddleformers/trainer/trainer.py (not a protected file). Approval CI should pass without From00 approval.

@paddle-bot

paddle-bot Bot commented May 29, 2026

Thanks for your contribution!

@Paddle-CI-Bot

PaddleFormers Log Analysis

Run #26642368471 · Attempt 1

Log Analysis Report

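| Pipeline | Issue label | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| Fleet Model Test · Integration test (H20, single card) | Loss Diff + precision not approved | Ask at least one reviewer from each of the groups XieYunshen/From00/risemeup1/tianlef, lugimzzz/zjjlivein/tianlef, and tianlef/swgu98 to approve PR #4577 | Error code |
| Fleet Model Test · Integration test (A100) | Loss Diff + precision not approved | Same as above: at least one reviewer from each of the three groups must approve | Error code |
| Fleet Model Test · Integration test (H20, multi-card) | Loss Diff + precision not approved | Same as above: at least one reviewer from each of the three groups must approve | Error code |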

Failed test cases:

# Job 78505516033 — Integration test (H20, single card)
- Integration test (GLM4.5 single-card)
    Check Failed: 7/10 steps mismatched, Max abs diff=3.243e-05
    → check_precision_approval.sh: APPROVALS=FALSE, exit 1
- Integration test (Qwen3-30B-A3B single-card)
    Check Failed: 7/10 steps mismatched, Max abs diff=0.00028706
    → check_precision_approval.sh: APPROVALS=FALSE, exit 1

# Job 78505516138 — Integration test (A100)
- GLM4.5 pre-train
    Check Failed: Max abs diff=0.0001831, Max rel diff=1.54e-05
- GLM4.5 sft
    Check Failed: Max abs diff=0.00035072, Max rel diff=0.01257865
- GLM4.5 lora  ← largest diff
    Check Failed: Log Loss=6.752 vs GT=7.228, Max abs diff=0.476, Max rel diff=0.0659
- GLM4.5 dpo
    Check Failed: Max abs diff=0.00020581, Max rel diff=0.00027719
- Qwen pre-train
    Check Failed: Max abs diff=0.00022602, Max rel diff=1.98e-05
- Qwen sft
    Check Failed: Max abs diff=0.00026989, Max rel diff=2.74e-05
- Qwen lora
    Check Failed: Max abs diff=3.243e-05, Max rel diff=3.41e-06
- Qwen vl moe
    Check Failed: Max abs diff=6.103e-05, Max rel diff=4.95e-06

# Job 78505516240 — Integration test (H20, multi-card)
- GLM4.5 pre-train
    Check Failed: Log Loss=11.37636757 vs GT=11.37643719, Max abs diff=6.962e-05
- GLM4.5 sft
    Check Failed: Max abs diff=0.00037797, Max rel diff=0.00921934
- GLM4.5 sft cp
    Check Failed: Max abs diff=0.00066956, Max rel diff=0.01805927
- GLM4.5 lora
    Check Failed: Max abs diff=0.00176065, Max rel diff=0.04532856
- GLM4.5 dpo
    Check Failed: Max abs diff=0.00036531, Max rel diff=0.00059995
- GLM4.5 pre-train (EP4)
    Check Failed: Max abs diff=0.0001297, Max rel diff=1.14e-05
- GLM4.5 pre-train (Grouped GEMM)
    Check Failed: Max abs diff=7.63e-06, Max rel diff=6.71e-07
- Qwen sft
    Check Failed: Max abs diff=9.155e-05, Max rel diff=9.34e-06
- Qwen lora
    Check Failed: Max abs diff=6.866e-05, Max rel diff=7.30e-06
- Qwen vl moe
    Check Failed: Max abs diff=1.908e-05, Max rel diff=1.54e-06
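
The figures above are consistent with a per-step comparison of logged loss against a stored baseline, where the relative diff is the absolute diff divided by the baseline value (for GLM4.5 lora on A100, 0.476 / 7.228 ≈ 0.0659). The following is a minimal sketch of such a check, not the CI's actual script: the function name `compare_losses`, the result keys, and the default tolerance of exact equality are all assumptions made for illustration.

```python
from typing import Dict, Sequence, Union


def compare_losses(
    logged: Sequence[float],
    ground_truth: Sequence[float],
    tol: float = 0.0,  # assumed tolerance; the real threshold is not stated in the report
) -> Dict[str, Union[int, float, bool]]:
    """Compare per-step losses against a baseline (hypothetical helper)."""
    if len(logged) != len(ground_truth):
        raise ValueError("loss sequences must have the same length")

    max_abs = 0.0
    max_rel = 0.0
    mismatched = 0
    for value, reference in zip(logged, ground_truth):
        abs_diff = abs(value - reference)
        # Relative diff is measured against the baseline value.
        if reference != 0:
            rel_diff = abs_diff / abs(reference)
        else:
            rel_diff = 0.0 if abs_diff == 0 else float("inf")
        max_abs = max(max_abs, abs_diff)
        max_rel = max(max_rel, rel_diff)
        if abs_diff > tol:
            mismatched += 1

    return {
        "steps": len(logged),
        "mismatched": mismatched,
        "max_abs_diff": max_abs,
        "max_rel_diff": max_rel,
        "passed": mismatched == 0,
    }


# Single-step example using the GLM4.5 lora (A100) values from the report.
result = compare_losses([6.752], [7.228])
```

With a nonzero `tol`, small floating-point drift such as the 1e-05 level diffs above would pass while the lora case would still fail; which behavior the real check uses is not shown in the log.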

Root cause analysis:

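PR #4577 modifies paddleformers/trainer/trainer.py (the main training loop). Although the PR description says it tests Approval CI on a "non-protected file", trainer.py directly affects the forward and gradient-update path. As a result, the loss output of every model (GLM4.5, Qwen3) on all three hardware setups (H20 single card, A100, H20 multi-card) shows floating-point differences from the ground truth stored in BOS. This triggers the precision approval gate in check_precision_approval.sh. Because none of the designated reviewers has approved the PR yet, all three group checks return APPROVALS=FALSE and the pipeline terminates with exit 1.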
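
The gate described here can be sketched as a check that every reviewer group contains at least one approver. This is an illustrative sketch only, not the contents of check_precision_approval.sh: the function name `approvals_satisfied` is hypothetical, and the group membership is copied from this report. Note that tianlef appears in all three groups, so in this sketch a single approval from tianlef satisfies every group; whether the real script counts it that way is an assumption.

```python
from typing import Iterable, List, Set

# Reviewer groups as listed in this report.
REVIEWER_GROUPS: List[Set[str]] = [
    {"XieYunshen", "From00", "risemeup1", "tianlef"},
    {"lugimzzz", "zjjlivein", "tianlef"},
    {"tianlef", "swgu98"},
]


def approvals_satisfied(
    approvers: Iterable[str],
    groups: List[Set[str]] = REVIEWER_GROUPS,
) -> bool:
    """Return True only if each group has at least one approver (hypothetical helper)."""
    approved = set(approvers)
    # A non-empty intersection means the group has at least one approval.
    return all(approved & group for group in groups)


# No approvals yet, as on this PR: the gate fails.
gate_open = approvals_satisfied([])
```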


Suggested fixes:

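  1. Obtain precision approval: on PR #4577, request a GitHub Approve from at least one reviewer in each of the following three groups:

    • Group 1 (at least one required): @XieYunshen, @From00, @risemeup1, @tianlef
    • Group 2 (at least one required): @lugimzzz, @zjjlivein, @tianlef
    • Group 3 (at least one required): @tianlef, @swgu98
  2. Investigate the unusually large diff in GLM4.5 lora (A100): this case reports Log Loss=6.752 vs GT=7.228, abs diff=0.476, about three orders of magnitude larger than the other cases. Confirm whether the trainer.py change has an unintended effect on the lora parameter-update logic, and review the full training-script diff for this case before approving.

  3. Update the GT loss baseline (if the change is expected): if the precision change is intentional, then after approval update the corresponding *_gt_loss.txt files under https://xly-devops.cdn.bcebos.com/PaddleFleet/precision/PaddleFormers_latest/ so that subsequent CI runs align with the new baseline.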
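
To make the outlier in item 2 concrete, the A100 diffs can be ranked against their median. This is a rough sketch under stated assumptions: the helper `flag_outliers` and the threshold of two orders of magnitude are hypothetical choices for illustration, and the values are the Max abs diff figures from job 78505516138 in this report.

```python
import math
import statistics
from typing import Dict, List

# Max abs diff per case, copied from the A100 job in this report.
A100_ABS_DIFFS: Dict[str, float] = {
    "GLM4.5 pre-train": 0.0001831,
    "GLM4.5 sft": 0.00035072,
    "GLM4.5 lora": 0.476,
    "GLM4.5 dpo": 0.00020581,
    "Qwen pre-train": 0.00022602,
    "Qwen sft": 0.00026989,
    "Qwen lora": 3.243e-05,
    "Qwen vl moe": 6.103e-05,
}


def flag_outliers(diffs: Dict[str, float], orders: float = 2.0) -> List[str]:
    """Return cases whose diff exceeds the median by `orders` orders of magnitude."""
    median = statistics.median(diffs.values())
    if median <= 0:
        raise ValueError("median diff must be positive to compare magnitudes")
    return [
        name
        for name, diff in diffs.items()
        if diff > 0 and math.log10(diff / median) >= orders
    ]


flagged = flag_outliers(A100_ABS_DIFFS)
```

On these values only GLM4.5 lora is flagged: its diff is more than three orders of magnitude above the median of roughly 2.2e-04.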


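🔍 Accuracy feedback: click the 😊 icon at the bottom of this comment and choose 👍 (accurate) or 👎 (incorrect). Feedback is recorded automatically in the CI monitoring system. Thank you ♥️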

🔄 Updated automatically after each re-run
