ci(bfcl): add GLM-5.2-FP8 nightly leg (TP=8 sequential, whole node) #1834
Actionable comments posted: 1
Inline comments:
In @.github/workflows/nightly-bfcl.yml:
- Around line 177-188: In the glm-5.2 configuration block, update the vllm_extra
field to add the --trust-remote-code flag alongside the existing
--kv-cache-dtype fp8_e4m3 argument, as this flag is required by vLLM v0.23.0+ to
properly load GLM-5.2's specialized model architecture from the Hugging Face
hub. Additionally, replace the smg_tool value glm47_moe with the generic glm
parser identifier, which is the current standard in SMG that uniformly handles
GLM-4.7 and GLM-5.x versions.
GLM-5.2-FP8 (~744 GB FP8 weights) does not fit a TP=4 half-node, so it runs as the first sequential whole-node leg: arm A on all 8 GPUs, torn down, then arm B. Uses vLLM's GLM-5.2 recipe (glm47/glm45 parsers, --kv-cache-dtype fp8_e4m3); the SMG arm uses glm47_moe + glm45 passed explicitly (the org-prefixed served name won't auto-detect). Also bump the job timeout to 360m (sequential runs two whole-node arms serially) and move build-wheel to k8s-runner-cpu (CPU-only compile).

Signed-off-by: key4ng <rukeyang@gmail.com>
GLM-5.2 needs --trust-remote-code to load its custom architecture, matching the other large MoE legs (deepseek-v4, minimax-m2.7, kimi-k2.6).

Signed-off-by: key4ng <rukeyang@gmail.com>
Force-pushed from 78c7a12 to 18c23d0.
Description
Problem
The nightly BFCL A/B (SMG parsing frontend vs. pure vLLM) had no GLM-5 family leg. GLM-5.2 ships an `<arg_key>`/`<arg_value>` tool-call format and `<think>` reasoning, so it's a useful parser-correctness check, but its FP8 checkpoint is far larger than the existing Blackwell legs and cannot be slotted in as-is.

Solution
Add a `glm-5.2` leg to the nightly BFCL matrix.

Why it can't reuse the concurrent half-node pattern: GLM-5.2-FP8 is ~744 GB of weights (753B-param MoE). A B200 has ~180–192 GB, so 4× B200 (720–768 GB) cannot hold the weights with any working room: sharded evenly, that is ~186 GB per GPU, which exceeds a 180 GB card outright and leaves only ~6 GB on a 192 GB card for KV cache and activations. It does not fit a TP=4 half-node, and vLLM's official recipe requires 8× B200 / TP=8. The existing Blackwell legs (DeepSeek-V4-Flash, MiniMax-M2.7, Kimi-K2.6-int4) all fit TP=4 and run two arms concurrently on GPUs 0-3 + 4-7; GLM-5.2 needs the whole node per arm.
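The per-GPU arithmetic behind this can be sketched in a few lines of Python. It is a rough model that assumes the weights shard evenly across TP ranks and ignores CUDA context, NCCL buffers, and activation memory (all of which only shrink the headroom further); the 744 GB and 180/192 GB figures are the ones quoted above.

```python
WEIGHTS_GB = 744.0  # GLM-5.2-FP8 checkpoint size quoted above


def headroom_per_gpu_gb(weights_gb: float, tp: int, gpu_mem_gb: float) -> float:
    """Per-GPU memory left after loading an evenly sharded checkpoint.

    Negative means the weights alone do not fit. Runtime overheads are ignored.
    """
    return gpu_mem_gb - weights_gb / tp


for tp in (4, 8):
    for gpu_mem in (180.0, 192.0):  # nominal B200 capacity range quoted above
        h = headroom_per_gpu_gb(WEIGHTS_GB, tp, gpu_mem)
        print(f"TP={tp} @ {gpu_mem:.0f} GB/GPU: {h:+.1f} GB headroom per GPU")
```

TP=4 prints -6.0 and +6.0 GB per GPU, while TP=8 prints +87.0 and +99.0 GB, which is the gap that leaves room for a KV cache.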
So this is the first leg to use the workflow's existing `sequential` `arm_mode`: arm A (pure vLLM) on all 8 GPUs → score → tear down → arm B (SMG→vLLM gRPC) on all 8 → score → diff. The `run_ab.py` sequential path (`--score-arm`/`--diff-baseline`/`--diff-candidate`) was already implemented, so no script changes are needed.

Parsers follow vLLM's GLM-5.2 recipe: `glm47`/`glm45` for the vLLM arm, `glm47_moe`/`glm45` for SMG (passed explicitly because the org-prefixed served name `zai-org/GLM-5.2-FP8` won't match SMG's `glm-5*` auto-detect). `--kv-cache-dtype fp8_e4m3` is set per the B200 recipe; MTP speculative decoding is omitted (it is perf-only, and the A/B isolates parsing).

Changes
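The new leg's vLLM-arm settings correspond roughly to the `vllm serve` launch sketched below. The `Leg` dataclass and `vllm_arm_argv` helper are hypothetical stand-ins for the workflow's YAML matrix entry and launch step, not the actual workflow code. The CLI flags are standard vLLM server options; `--enable-auto-tool-choice` is an assumption (vLLM only honors `tool_choice="auto"` with it), and `--trust-remote-code` reflects the later commit that added it.

```python
from __future__ import annotations

import shlex
from dataclasses import dataclass, field


@dataclass
class Leg:
    """Hypothetical in-memory view of one nightly BFCL matrix entry."""

    model: str
    tp: int
    vllm_tool_parser: str
    reasoning_parser: str
    vllm_extra: list[str] = field(default_factory=list)


def vllm_arm_argv(leg: Leg) -> list[str]:
    """Build the pure-vLLM (arm A) server command for one leg."""
    argv = [
        "vllm", "serve", leg.model,
        "--tensor-parallel-size", str(leg.tp),
        # Assumption: auto tool choice is enabled so the tool-call parser is used.
        "--enable-auto-tool-choice",
        "--tool-call-parser", leg.vllm_tool_parser,
        "--reasoning-parser", leg.reasoning_parser,
    ]
    # Leg-specific flags are appended verbatim, as a matrix `vllm_extra` field might be.
    return argv + leg.vllm_extra


glm = Leg(
    model="zai-org/GLM-5.2-FP8",
    tp=8,  # whole node: one arm owns all 8 GPUs
    vllm_tool_parser="glm47",
    reasoning_parser="glm45",
    vllm_extra=["--kv-cache-dtype", "fp8_e4m3", "--trust-remote-code"],
)

print(shlex.join(vllm_arm_argv(glm)))
```

`shlex.join` yields a copy-pasteable command. The SMG arm would instead hand `glm47_moe`/`glm45` to SMG, which this sketch does not model.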
- Add a `glm-5.2` leg (TP=8, `arm_mode: sequential`, whole 8-GPU node, `glm47`/`glm47_moe` + `glm45`, `--kv-cache-dtype fp8_e4m3`, `startup_timeout: 3000`) and list it in the `only` dispatch input.
- Raise `timeout-minutes` from 240 → 360 (sequential runs two whole-node arms serially, ~2× the concurrent legs). This is a ceiling only; fast legs finish early.
- Move `build-wheel` to `k8s-runner-cpu` (CPU-only maturin compile, no GPU needed).

Test Plan
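The `only=glm-5.2` selection and whole-node checks in this plan can be sketched as a small filter over the matrix. Everything here is hypothetical: the dict keys (`name`, `tp`, `arm_mode`, `gpu_a`, `gpu_b`), the `"concurrent"` mode label, the `"0-7"` GPU-range strings, and the rule that a blank `only` selects every leg are illustrative assumptions, not the workflow's actual schema or logic.

```python
from __future__ import annotations


def select_legs(matrix: list[dict], only: str) -> list[dict]:
    """Return the legs named in a comma-separated `only` input, in matrix order.

    Assumed semantics: a blank `only` keeps every leg; unknown names are an error.
    """
    wanted = {name.strip() for name in only.split(",") if name.strip()}
    if not wanted:
        return list(matrix)
    unknown = wanted - {leg["name"] for leg in matrix}
    if unknown:
        raise ValueError(f"unknown leg(s): {sorted(unknown)}")
    return [leg for leg in matrix if leg["name"] in wanted]


# Hypothetical matrix: three concurrent half-node legs plus the new sequential one.
HALF_NODE = {"tp": 4, "arm_mode": "concurrent", "gpu_a": "0-3", "gpu_b": "4-7"}
MATRIX = [
    {"name": "deepseek-v4", **HALF_NODE},
    {"name": "minimax-m2.7", **HALF_NODE},
    {"name": "kimi-k2.6", **HALF_NODE},
    {"name": "glm-5.2", "tp": 8, "arm_mode": "sequential", "gpu_a": "0-7"},
]

selected = select_legs(MATRIX, "glm-5.2")
assert [leg["name"] for leg in selected] == ["glm-5.2"]

# A sequential leg should span the whole node: TP matches the size of gpu_a.
leg = selected[0]
first, last = (int(x) for x in leg["gpu_a"].split("-"))
assert leg["arm_mode"] == "sequential"
assert leg["tp"] == last - first + 1 == 8
```

Raising on unknown names, rather than silently selecting nothing, keeps a typo in the dispatch input from producing an empty (and vacuously green) run.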
- `python -c "import yaml; yaml.safe_load(...)"`: workflow parses.
- `glm-5.2` with TP=8 / `sequential` / correct parsers / whole-node `gpu_a`; `only=glm-5.2` selects exactly that leg.
- `actionlint`: clean apart from the pre-existing custom self-hosted runner-label warnings.
- `run_ab.py` already implements the sequential `--score-arm`/`--diff-baseline`/`--diff-candidate` flags the leg invokes.

Full validation runs on the nightly schedule / `workflow_dispatch` (requires the Blackwell runner + staged weights).

Checklist
- `cargo +nightly fmt` passes (no Rust changes)
- `cargo clippy --all-targets --all-features -- -D warnings` passes (no Rust changes)

Summary by CodeRabbit
- Added a `glm-5.2` sequential leg and related runtime configuration.
- Increased the `bfcl-ab` job timeout from 240 to 360 minutes to support longer-running scenarios.