[Benchmark] Add GLM-4.5-Air 128K PT config #4565

Open
liym27 wants to merge 1 commit into PaddlePaddle:develop from liym27:add-glm45-air-128k-yaml

@liym27 (Collaborator) commented May 28, 2026

Add benchmark config for GLM-4.5-Air (106B-A12B MoE, 46 layers) at 128K context length under PaddleFleet PT stage.

Parallelism: TP=8, PP=4, EP=8, DP=1 (32 cards)
PP layer distribution: [10, 12, 12, 12]
(num_empty_layers_add_in_head=2, num_empty_layers_add_in_tail=0)
Sequence length: 131072
Global batch size: 128
Learning rate: 5.0e-5 with cosine decay
Router aux loss: 1.0e-4
Checkpoint format: flex_checkpoint
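
The figures above can be cross-checked with a few lines of arithmetic. The sketch below is not part of the PR; it assumes that the empty head layers pad the first pipeline stage, that the empty tail layers pad the last one, and that the card count is TP × PP × DP with EP reusing those cards rather than adding to them (which matches the 32 cards stated above). All constant and function names are illustrative, not PaddleFleet identifiers.

```python
# Values copied from the PR description; the names themselves are illustrative.
NUM_LAYERS = 46
TP, PP, EP, DP = 8, 4, 8, 1
PP_LAYERS = [10, 12, 12, 12]
EMPTY_HEAD, EMPTY_TAIL = 2, 0
SEQ_LEN = 131072
GLOBAL_BATCH = 128


def check_config():
    """Check that the stated numbers agree with each other."""
    # One entry per pipeline stage, covering every real layer exactly once.
    assert len(PP_LAYERS) == PP
    assert sum(PP_LAYERS) == NUM_LAYERS

    # Assumption: with the empty layers added, every stage holds the same
    # number of layer slots.
    padded = NUM_LAYERS + EMPTY_HEAD + EMPTY_TAIL
    assert padded % PP == 0
    slots_per_stage = padded // PP

    # Assumption: head padding sits in the first stage, tail padding in the last.
    expected = [slots_per_stage] * PP
    expected[0] -= EMPTY_HEAD
    expected[-1] -= EMPTY_TAIL
    assert expected == PP_LAYERS

    return {
        # Assumption: EP does not multiply the card count.
        "cards": TP * PP * DP,
        "slots_per_stage": slots_per_stage,
        "tokens_per_step": SEQ_LEN * GLOBAL_BATCH,
    }


if __name__ == "__main__":
    print(check_config())
```

Under these assumptions the 46 real layers plus 2 empty head layers give 48 slots, or 12 per stage, which is why the first stage holds only 10 real layers.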

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Description

paddle-bot Bot commented May 28, 2026

Thanks for your contribution!

@liym27 liym27 force-pushed the add-glm45-air-128k-yaml branch 2 times, most recently from bf14abc to b04025d Compare May 28, 2026 13:46
@liym27 liym27 force-pushed the add-glm45-air-128k-yaml branch from b04025d to 27f9169 Compare May 28, 2026 13:52
@Paddle-CI-Bot

PaddleFormers Log Analysis

Run #26606051666 · Attempt 1

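Log analysis report

| Pipeline | Issue tag | Suggested fix | Log snippet |
| --- | --- | --- | --- |
| Fleet Model Test / GLM4.5 sft | Blocking queue crash | A DataLoader worker raised an exception, which killed the blocking queue (exit code 241). Check the data pipeline configuration in glm45_sft.sh, or rerun to verify whether the failure is intermittent. | Error code |
| Fleet Model Test / GLM4.5 lora | Missing checkpoint file | config.json is missing from the glm_full_pp_ckpts directory. The sft step exited abnormally, so the checkpoint was never written; lora depends on that checkpoint and failed as a cascade. | Error code |
| Fleet Model Test / GLM4.5 dpo | Missing checkpoint file | Same as lora: run_dpo calls AutoConfig.from_pretrained(/workspace/glm45_fleet/checkpoints/glm_full_pp_ckpts) and cannot find config.json. The root cause is likewise that sft did not produce a checkpoint. | Error code |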

Failed test cases:

Fleet Model Test - Integration test (H20, multi-card)
  ├── GLM4.5 sft         [FAILED] exit code 241 → SystemError: Blocking queue is killed
  ├── GLM4.5 lora        [FAILED] exit code 1   → FileNotFoundError: config.json not found in glm_full_pp_ckpts
  └── GLM4.5 dpo         [FAILED] exit code 1   → FileNotFoundError: config.json not found in glm_full_pp_ckpts

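Root cause analysis:

PR #4565 (add-glm45-air-128k-yaml) adds a GLM-4.5-Air 128K benchmark config, but this CI run is the basic GLM4.5 integration test. The actual root cause is that, after step 9 of GLM4.5 sft training completed, the DataLoader's blocking queue was killed by a worker exception (exit code 241). As a result, ***** train metrics ***** was never printed and the checkpoint /workspace/glm45_fleet/checkpoints/glm_full_pp_ckpts was never written. The subsequent lora and dpo steps both depend on that checkpoint for AutoConfig.from_pretrained, so they failed in a cascade with FileNotFoundError. The failure has little direct connection to the PR's change (which only adds a yaml config) and is more likely the intermittent Blocking queue is killed issue seen under high concurrency.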
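
The distinction between a root-cause failure and a cascading one can be sketched as a small dependency check. This is an illustration only, not the CI bot's actual logic; the `STEPS` table and the `classify` helper are hypothetical, with exit codes and dependencies taken from the report above.

```python
# Hypothetical step records: exit code and upstream dependencies per step.
STEPS = {
    "sft": {"exit_code": 241, "depends_on": []},
    "lora": {"exit_code": 1, "depends_on": ["sft"]},
    "dpo": {"exit_code": 1, "depends_on": ["sft"]},
}


def classify(steps):
    """Label each step as passed, root cause, or cascade."""
    labels = {}
    for name, info in steps.items():
        if info["exit_code"] == 0:
            labels[name] = "passed"
        elif any(steps[dep]["exit_code"] != 0 for dep in info["depends_on"]):
            # A failed upstream step explains this failure.
            labels[name] = "cascade"
        else:
            # Failed with no failed upstream step: look here first.
            labels[name] = "root cause"
    return labels


if __name__ == "__main__":
    print(classify(STEPS))
```

With the values from this run, only sft is labelled a root cause, so it is the only step that needs direct investigation.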

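Suggested fixes:

  1. First rerun the current CI job to confirm whether Blocking queue is killed (exit code 241) is intermittent (reference rule: this error is a known intermittent issue under high concurrency).
  2. If the rerun still reproduces it, inspect the DataLoader worker configuration in glm45_sft.sh: check num_workers, persistent_workers, and the dataset reading path for anomalies, and locate the original exception raised by the worker (the Blocking queue error is only what the upper layer catches; the real error is in the worker process).
  3. The lora / dpo failures are a cascade from sft and will recover automatically once sft is fixed; no separate handling is needed.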
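
Because lora and dpo fail only when the sft checkpoint is missing, a pre-flight check could surface the real problem earlier. The following is a minimal sketch using only the Python standard library; `checkpoint_ready` and `demo` are hypothetical helpers, not part of PaddleFormers or the CI scripts, and the only requirement taken from the report is that config.json must exist in the checkpoint directory.

```python
import tempfile
from pathlib import Path


def checkpoint_ready(ckpt_dir):
    """Return True if ckpt_dir contains a config.json file."""
    return (Path(ckpt_dir) / "config.json").is_file()


def demo():
    """Show the check on an empty directory, then after config.json is written."""
    with tempfile.TemporaryDirectory() as tmp:
        before = checkpoint_ready(tmp)
        (Path(tmp) / "config.json").write_text("{}")
        after = checkpoint_ready(tmp)
    return before, after


if __name__ == "__main__":
    print(demo())
```

A downstream step could call such a check on the glm_full_pp_ckpts directory and report that the upstream sft step produced no checkpoint, rather than failing later with FileNotFoundError.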

🔄 Updated automatically after each re-run
