Daily Cross-Workflow Summary — 2026-06-28
Snapshot: 2026-06-28 02:42 UTC · Only completed runs counted in trends · Auto-updated every 30 min
TL;DR
🔴 RED · ~18 active clusters · 6 🆕 today (R226–R231) · ~12 carrying over · 4 review-ready fixes still open (none merged since yesterday) · scout did not run (cron Mon/Thu)
👉 Today's ask: R225 is now 2 days unfixed and still breaks DeepSeek-V3.2 MTP on both nightlies. Its breaker #29413 (9214b933) is merged with no revert or guard in flight (#29499 is an adjacent optimization, not a fix), so revert it or ROCm-guard it. R222 (ROCm Conv3D CUDA-only) remains the largest pr-test-amd breaker (~16 job-blocks across 5 runs, still no fix). Land the 4 review-ready fixes that did not move yesterday: #29376 (R214), #27141+#29391 (R195), #28889 (R192), #27757 (R2). New today: a never-passed diffusion update_weights_from_disk 500 (R226, both PR workflows) and two latest-run-only 8-GPU pr-test-amd regressions (R227, R228). Together with carry-over R219, which also failed only in the latest run, they need a rerun to separate node flake from regression. Both release-docker workflows are ✅ green. pr-test-amd-rocm720 again has ≈0 clean signal (cron self-cancel + HF-429 cascade).
Workflow status

| Workflow | Latest run | ✅ | ❌ | Trend (completed real failures) | Δ vs yesterday |
|---|---|---|---|---|---|
| nightly-test-amd | Jun-27 28297034445 | 0 | ~8 real (rest HF-infra) | 10·10·10·~6·~8 | +R229 NEW |
| nightly-test-amd-rocm720 | Jun-27 28296988041 | 0 | ~10 real (rest HF-infra) | 12·11·~5·~10 | +R229,R230 NEW |
| release-docker-amd-nightly | Jun-27 (latest) | ✅ | 0 | 0·0·0 | 0 |
| release-docker-amd-rocm720-nightly | Jun-27 (latest) | ✅ | 0 | 0·0·0 | 0 |
| amd-aiter-scout | none (last Jun-25 28199192232) | n/a | n/a | n/a | no run (not Mon/Thu) |
| pr-test-amd | rolling Jun-27→28, latest 28306914695 | 0 | R222+R192+R214+R226/run | worsening (+R227/R228/R219 latest) | +R226,R227,R228 NEW |
| pr-test-amd-rocm720 | Jun-27 28297015673 | 0 | ≈0 clean (cron-cancel + HF-429) | ≈0 real | +R231 confirmed |

Notes: (1) amd-aiter-scout did not run (cron Mon/Thu); R221/R223 carry over dormant, no fresh data. (2) Both nightlies' Jun-27 runs are still dominated by HF weight-download hangs / 429 (infra), not code. (3) pr-test-amd-rocm720 run 28297015673 is again ≈0 clean signal: two crons share one cancel-in-progress group + an HF-429 fast-fail cascade cancels most downstream jobs.
🆕 NEW clusters today
R226 · 🆕 · Diffusion update_weights_from_disk → HTTP 500 "Inplace update to inference tensor outside InferenceMode" — pr-test-amd + pr-test-amd-rocm720
- Status: NEW 2026-06-28; never-passed (new test file `test_update_weights_from_disk.py`). Appears in multimodal-gen 1-GPU shard 3 across 3 pr-test-amd runs and pr-test-amd-rocm720. A secondary fixture bug (perturbed-VAE clone missing `transformer` dir → setup ERRORs on FLUX.2) rides in the same job.
- Top hypothesis: `[LOW]` server-side weight-apply path performs an in-place write on an inference-mode tensor (and a shape mismatch on FLUX.2/Qwen-Image), so every `update_weights_from_disk` request returns 500. Disconfirming: never-passed ⇒ could be a brand-new test exercising an unimplemented diffusion path rather than a regression. In-flight fix: ❌ none found.
- Suggested triage: confirm whether this test was newly registered (never green anywhere) vs. regressed; if new, treat as feature-gap on the diffusion weight-update endpoint and route to the diffusion owners; fix the perturbed-VAE fixture to materialize a `transformer` dir.

| Workflow | Job (shard) | Test File | Test Function | Error | Log |
|---|---|---|---|---|---|
| pr-test-amd | multimodal-gen-1gpu (3) | `test_update_weights_from_disk.py` | `test_update_weights_specific_modules[Qwen-Image]` (+4) | 500 "Inplace update to inference tensor" | Log |
| pr-test-amd | multimodal-gen-1gpu (3) | `test_update_weights_from_disk.py` | `test_update_weights_from_disk_default[Qwen-Image]` (+ FLUX.2 setup ERRORs) | 500 / fixture `No weights dir for transformer` | Log |
| pr-test-amd-rocm720 | multimodal-gen-1gpu (3) | `test_update_weights_from_disk.py` | `TestUpdateWeightsFromDisk.test_update_weights_specific_modules[Qwen-Image]` (+ offload variants) | assert 500 == 200 | Log |

R229 · 🆕 · Kimi-K2.6 8-GPU eval TIMEOUT from slow weight load (3300s+ load exhausts 3600s budget) — both nightlies
- Status: NEW 2026-06-28 (startup-bound TIMEOUT, not accuracy); regression since ~Jun 24-25. Disconfirming vs infra: this is a deterministic load-time-exceeds-budget, not a transient 429 — borderline between "model too big for budget" and a load-path slowdown.
- Top hypothesis: `[LOW]` weight-load wall-clock for Kimi-K2.6 (8-way) now exceeds the per-file 3600s budget (load ~3303-3359s observed). In-flight fix: ❌ none (rocm720 per-job cited #24076/#29178/#28905 as candidates, none confirmed).
- Suggested triage: bump the per-file timeout for Kimi-K2.6 or pre-cache weights on the runner; profile `load_model` to see if a recent loader change slowed it.

| Workflow | Job (shard) | Test File | Test Function | Error | Log |
|---|---|---|---|---|---|
| nightly-test-amd | nightly-8-gpu-kimi-k26 | `test_kimi_k26_eval_amd.py` | N/A (eval runner) | TIMEOUT 3600s → exit 255 (load 3303s) | Log |
| nightly-test-amd-rocm720 | nightly-8-gpu-kimi-k26-rocm720 | `test_kimi_k26_eval_amd.py` | `test_kimi_k26_gsm8k_accuracy` | TIMEOUT 3600s (load 3359s) | Log |

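
To make the R229 budget arithmetic concrete, here is a minimal sketch that computes how much of the 3600s per-file budget is left once the observed weight load finishes. The helper name `load_headroom` is hypothetical and only illustrates the calculation; the load times and the budget are the ones reported in the table above.

```python
def load_headroom(load_s: float, budget_s: float = 3600.0) -> tuple[float, float]:
    """Return (seconds left after the load, fraction of the budget spent loading)."""
    return budget_s - load_s, load_s / budget_s


# Observed load times from the two nightly jobs (seconds).
observed = [("nightly-test-amd", 3303), ("nightly-test-amd-rocm720", 3359)]

for workflow, load_s in observed:
    left_s, used = load_headroom(load_s)
    print(f"{workflow}: {left_s:.0f}s left for the eval, {used:.1%} of budget spent loading")
```

With 297s and 241s remaining respectively, the eval itself has only a few minutes to run, which is consistent with a startup-bound TIMEOUT rather than an accuracy failure.
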
R230 · 🆕 · DeepSeek-V4-Pro server SIGKILL (-9) during 8-way fp8 weight load (MI35x ROCm 7.2) — nightly-rocm720
- Status: NEW 2026-06-28 (first failure for these two files today; flaky). Both offline + online retry exit -9. In-flight fix: ❌ none.
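- Top hypothesis: `[LOW]` OOM / SIGKILL during 8-way fp8 weight load (host or device memory pressure on MI35x). Disconfirming: first-seen today ⇒ may be a one-off node memory issue; needs a rerun.
- Affected tests: `test_deepseek_v4_pro_fp4.py` (`TestDeepseekV4ProFp4.setUpClass`) and `test_deepseek_v4_pro_fp4_cp.py` (`TestDeepseekV4ProFp4CPInterleave.setUpClass`).
- Suggested triage: rerun once; if it recurs, capture dmesg/OOM-killer logs and peak host RAM during load.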
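
One way to capture peak host RAM for a rerun is sketched below, using only the Python standard library on a Unix runner. The wrapper name `run_and_peak_rss_kib` is hypothetical, and the command shown is a stand-in for the real server launch, which this report does not specify. Note the assumptions: `ru_maxrss` is reported in KiB on Linux, it is the maximum over any single waited-for child rather than a sum across the 8 ranks, and it says nothing about device memory.

```python
import resource
import subprocess
import sys


def run_and_peak_rss_kib(cmd: list[str]) -> tuple[int, int]:
    """Run cmd to completion and return (exit code, peak child RSS in KiB on Linux).

    A negative exit code means the child died from a signal, so -9 is SIGKILL.
    """
    proc = subprocess.run(cmd)
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode, peak_kib


# Stand-in workload (assumption): allocate 64 MiB, then exit cleanly.
code, peak_kib = run_and_peak_rss_kib(
    [sys.executable, "-c", "buf = bytearray(64 * 1024 * 1024)"]
)
print(f"exit={code} peak_rss={peak_kib / 1024:.0f} MiB")
```

If the wrapper reports exit -9 together with a peak close to the node's physical RAM, that supports the host-OOM reading; a low peak would point toward device memory or another killer.
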
R227 / R228 / R231 · 🆕 · latest-run-only pr-test regressions (need rerun to separate flake from regression)
- R227 `[LOW]`: DeepSeek-R1-MXFP4 8-GPU MTP prefill GPU memory access fault → watchdog → server killed, pr-test-amd latest run only. May share a root with R225 (spec/MTP path). Suggested triage: rerun; if persistent, bisect the MTP/spec-decode window.
- R228 `[LOW]`: Qwen3-Coder-Next 8-GPU decode hang → scheduler watchdog 300s → SIGQUIT → connection refused, pr-test-amd latest run only. Rerun to rule out node flake.
- R231 `[LOW]`: torch.compile InductorError (`AssertionError` in post-grad `decompose_triton_kernel_wrapper_functional` / layernorm `forward_hip`) on ROCm diffusion T2V/denoising, pr-test-amd-rocm720 (was "minor-new" yesterday, now confirmed in a 2nd run). Suggested triage: a clean non-colliding rerun; if it persists, it's a real ROCm `torch.compile` gap, not infra.

| ID | Workflow | Job (shard) | Test File | Test Function | Error | Log |
|---|---|---|---|---|---|---|
| R227 | pr-test-amd | stage-c-8gpu-mi35x (0) | `test_deepseek_r1_mxfp4_8gpu.py` | `TestDeepseekR1MXFP4MTP.test_a_gsm8k` | GPU mem fault → watchdog → killed | Log |
| R228 | pr-test-amd | stage-c-8gpu-mi35x (1) | `test_qwen3_coder_next_8gpu.py` | `TestQwen3CoderNext.test_bs_1_speed` | decode hang → SIGQUIT → conn refused | Log |
| R231 | pr-test-amd-rocm720 | multimodal-gen-2gpu (1) | `test_server_2_gpu.py` | `test_diffusion_generation[wan2_2_t2v_a14b_2gpu …]` | InductorError AssertionError | Log |

Carry-over active clusters (still red)
R225 · AssertionError "All of them must not be None" in DSA eager draft-extend (dsa_backend.py:721) — both nightlies, DeepSeek-V3.2 MTP
- Status: 2 days persistent (since Jun-27); 4 jobs across both nightlies (MI35x perf + accuracy MTP). AMD/ROCm only (NVIDIA shielded by the CUDA-graph draft-extend path enabled in the same commit).
- Top hypothesis: `[HIGH]` breaker #29413 (`9214b933`, merged Jun-27 06:53) CUDA-gates the new draft-extend graph consumer (`_is_cuda or _is_musa`) while leaving the AMD eager `init_forward_metadata` assert (lines 717-722) requiring the now-nulled CPU seq-len mirror. Disconfirming: end-to-end nulling of `extend_*_cpu` inferred, not traced. In-flight fix: ❌ none. #29413 is merged; #29499 (open) is a DSA replay optimization, not a revert or guard.
- Affected tests (both nightlies): `test_deepseek_v32_mtp_perf_mi35x.py` (`test_bench_one_batch`, `dsa_backend.py:721` → exit -9) and `test_deepseek_v32_mtp_eval_mi35x.py` (`TestDeepseekV32TPMTP.setUpClass`).
- Suggested triage: revert `9214b933` on a branch + rerun `test_deepseek_v32_mtp_perf_mi35x.py`; if confirmed, derive `extend_*_cpu` from GPU tensors in the eager `is_draft_extend_v2` branch or force `needs_cpu_seq_lens=True` on ROCm. Ping the #29413 author.
R222 · ROCm RuntimeError "causal Conv3D cat/pad fusion is only available on CUDA" (Wan/diffusion VAE) — pr-test-amd (largest) + rocm720
- Status: every pr-test-amd run since #29281 (merged Jun-26); ~16 job-blocks across 5 runs (28297966351, 28289294173, 28282108241, 28273478888, 28306914695) + rocm720 (83838610250 I2V/mova). Hits all Wan2.1/2.2 T2V/I2V + mova variants on 1-GPU and 2-GPU shards.
- Top hypothesis: `[MEDIUM]` #29281 added a CUDA-only fused causal-Conv3D fast path in WanVAE decode with no ROCm/Triton fallback. In-flight fix: ❌ none (no Conv3D-guard PR open; the Conv3D search returned only unrelated diffusion PRs).
- Suggested triage: guard the fused path behind `is_cuda` with an eager fallback, or revert #29281; rerun `test_server_2_gpu.py::test_diffusion_generation[wan2_2_t2v_a14b_2gpu]`.
Representative rows (all shards share the same top frame): pr-test-amd 2gpu (1) test_server_2_gpu.py::test_diffusion_generation[wan2_2_i2v_a14b_2gpu …]; 1gpu (0) test_server_1_gpu.py::[wan2_1_t2v_1.3b_teacache_enabled …]; rocm720 2gpu (1) [mova_360p_tp2 / wan2_1_i2v_14b_480P/720P_2gpu].
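
The guard suggested for R222 could look roughly like the sketch below. It is framework-free on purpose: `make_causal_conv3d`, `fused_stub`, and `eager_stub` are hypothetical names, not functions from the sglang code base, and the stubs only mimic the observed behaviour. In a real PyTorch build the `on_cuda` flag would have to exclude ROCm explicitly, for example by checking that `torch.version.hip` is `None`, because `torch.cuda.is_available()` also returns true on ROCm builds.

```python
from typing import Callable, List

Kernel = Callable[[List[float]], List[float]]


def make_causal_conv3d(on_cuda: bool, fused: Kernel, eager: Kernel) -> Kernel:
    """Select the fused kernel only on CUDA; use the eager path everywhere else."""
    return fused if on_cuda else eager


def fused_stub(x: List[float]) -> List[float]:
    # Stand-in for a CUDA-only fast path: fails the same way the ROCm jobs do.
    raise RuntimeError("causal Conv3D cat/pad fusion is only available on CUDA")


def eager_stub(x: List[float]) -> List[float]:
    # Stand-in for the portable eager implementation.
    return list(x)


# On a ROCm runner the flag is False, so the fused path is never called.
conv = make_causal_conv3d(on_cuda=False, fused=fused_stub, eager=eager_stub)
print(conv([1.0, 2.0, 3.0]))
```

Choosing the path once at construction time keeps the per-call cost at zero and makes the fallback easy to cover with a unit test that passes `on_cuda=False`.
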

| ID | Cluster | Where (latest) | Status | In-flight fix |
|---|---|---|---|---|
| R192 | FLUX.2 modelopt-FP8 `torch._scaled_mm` HIPBLAS_STATUS_NOT_SUPPORTED | pr-test-amd 2gpu (1) (`test_server_2_gpu.py::[flux2_modelopt_fp8_tp2_t2i]`); ~4 runs | never-passed | ✅ #28889 open; land |
| R214 | `TokenizedGenerateReqInput` missing `input_embeds` (TypeError) | pr-test-amd stage-b-1gpu (6) + rocm720 stage-b (6) (`test_type_based_dispatcher.py`) | recurring since #29214 | ✅ #29376 open; unblock & land |
| R195 | Mamba `extra_buffer needs CUDA/MUSA/NPU (FLA)` on ROCm | nightly qwen35 83838643968, mi35x-qwen35 83838643970; rocm720 83838515323, 83838515348 | persistent ≥Jun-19 | ✅ #27141+#29391 open; land |
| R19 | Qwen3-235B-MXFP4 HIP `hipErrorCapturedEvent` capture abort | nightly 83838643963; rocm720 83838515337 | never-passed ≥May-27 | ❌ none (per-job: #27650/#23581 candidates) |
| R2 | Mistral/Mixtral GSM8K below threshold (chat-eval) | rocm720 83838515256 (Mistral-7B 0.361) | never-passed ≥Jun-13 | ✅ #27757 open; land |
| R211 | DeepSeek-R1 HiCache MI35x: GPU mem fault during gsm8k prefill | nightly 83838643952 | never-passed ≥Jun-20 | ❌ none |
| R196 | VLM DP-encoder mem fault (write to read-only page) | nightly 4-gpu 83838643955 (`test_encoder_dp.py::test_vlm_mmmu_benchmark`) | flaky/model-dependent | ⚠️ #18721 stale |
| R6 | Qwen3-30B-A3B MoE: GPU mem fault (MI35x) | rocm720 83838515311 | recurring (4/5; last pass Jun-23) | ❌ none |
| R210 | Qwen3.5 triton-DCP GSM8K 0.556<0.90 | nightly mi35x-qwen35 83838643970 (`test_qwen3p5_triton_dcp.py`) | never-passed | ⚠️ #29230 DNM |
| R219 | DeepSeek-V3.2 (basic) 8-GPU HSA out-of-resources decode abort | pr-test-amd stage-c-8gpu (1) (`test_deepseek_v32_basic.py::test_a_gsm8k`) | latest-run flake | ❌ none |

Known stable / dormant clusters (no action today)

| ID | Cluster | Where | Status | Fix |
|---|---|---|---|---|
| R1 | VLM MMMU accuracy below threshold | nightly (today masked by MMMU dataset/429 timeouts) | never-passed ≥Jun-13 | ❌ none |
| R155 | DeepSeek-V3.2 (basic) MI35x GSM8K below threshold | rocm720 (today masked by xet download timeout) | never-passed on rocm720 | ⚠️ partial #25559/#29050 |
| R213 | MiniMax-M2.7 GSM8K borderline | nightly | borderline/flaky | ❌ none |
| R220 | Embeddings-API latency threshold | pr-test-amd stage-b-1gpu-large | flake (not seen today) | ❌ none |
| R221 | aiter-caused GPU Hang (exit 134) ROCm 7.2 LoRA | scout only; no run today | dormant | ❌ none |
| R223 | aiter-caused DSV4-Pro-MTP connection-refused | scout only; no run today | dormant | ❌ none |
| R212/R224 | DSV3.2-MTP perf hang / eval borderline | superseded by R225 on MTP jobs | dormant | ❌ none |

Infrastructure / orchestration noise (not test failures)
- HF weight-download hangs / `429 Too Many Requests` / Xet `xet_get` stalls dominate both nightlies on Jun-27: nightly-amd (DeepSeek-R1-MXFP4-tp2, GLM-5.1-mxfp4, gpt-oss-120b tokenizer filelock, perf-vlm Qwen3-VL-30B 429, MMMU dataset timeout) and nightly-rocm720 (DSR1-mxfp4-tp4, DSV3-0324, gpt-oss-120b, Grok-2, Qwen3-235B, DSV3.2 xet timeout, VLM 429). Partial fix #23400 open.
- pr-test-amd-rocm720 cron self-cancel + HF-429 cascade (run 28297015673): two crons share one `cancel-in-progress` group; HF-429 on stage-a/multimodal warmup → fast-fail cancels most downstream jobs. ≈0 clean pytest signal.
- pr-test-amd diffusion port-5555 `--strict-ports` cascade: HF-download timeout on first diffusion test leaks scheduler port 5555 → cascades the rest of the 1-GPU shard (cosmos3/wan/lingbot/qwen-image). Many `multimodal-gen-1gpu` rows.
- ROCm VRAM-not-clear / zombie KFD pre-flight gate: nightly glm5-mxfp4 83838644036, rocm720 hicache 83838515342. Node reboot required.
- mori build / git-clone network fail: pr-test-amd 83799696814 (corrupt `libabsl_base.so` invalid ELF).
- Kimi-K2-MXFP4 BCG watchdog -9 (pr-test-amd 83799697008): flaky (1/6) MoE weight-load watchdog timeout.
Workflow drill-down (per-workflow view)
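nightly-test-amd · Jun-27 [28297034445](https://github.com/sgl-project/sglang/actions/runs/28297034445) · ~8 real (rest HF-infra)

| Test File | Test Function |
|---|---|
| `test_deepseek_v32_mtp_perf_mi35x.py` | `test_bench_one_batch` |
| `test_deepseek_v32_mtp_eval_mi35x.py` | `setUpClass` |
| `test_qwen35_eval_amd.py` | `setUpClass` |
| `test_qwen35_eval_mi35x.py` | `test_lm_eval` |
| `test_qwen3p5_triton_dcp.py` | `test_a_gsm8k` |
| `test_qwen3_instruct_mxfp4.py` | `setUpClass` |
| `test_deepseek_r1_hicache_mi35x.py` | `test_gsm8k` |
| `test_encoder_dp.py` | `test_vlm_mmmu_benchmark` |
| `test_kimi_k26_eval_amd.py` | N/A (eval runner) |

nightly-test-amd-rocm720 · Jun-27 [28296988041](https://github.com/sgl-project/sglang/actions/runs/28296988041) · ~10 real (rest HF-infra)

| Test File | Test Function |
|---|---|
| `test_deepseek_v32_mtp_perf_mi35x.py` | `test_bench_one_batch` |
| `test_deepseek_v32_mtp_eval_mi35x.py` | `setUpClass` |
| `test_qwen35_eval_amd.py` | `setUpClass` |
| `test_qwen35_eval_mi35x.py` | `test_lm_eval` |
| `test_qwen3_moe_eval_mi35x.py` | `test_qwen3_moe_accuracy` |
| `test_qwen3_instruct_mxfp4.py` | `setUpClass` |
| `test_gsm8k_eval_amd.py` | `test_gsm8k_all_models` (Mistral) |
| `test_kimi_k26_eval_amd.py` | `test_kimi_k26_gsm8k_accuracy` |
| `test_deepseek_v4_pro_fp4.py` / `_cp.py` | `setUpClass` |
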
pr-test-amd · rolling Jun-27→28 · latest [28306914695](https://github.com/sgl-project/sglang/actions/runs/28306914695)

| Job (shard) | Test File | Test Function | Cluster | Error |
|---|---|---|---|---|
| multimodal-gen-2gpu (1) ×~16 blocks/5 runs | `test_server_{1,2}_gpu.py` | `test_diffusion_generation[wan2_*]` | R222 | causal Conv3D CUDA-only |
| multimodal-gen-2gpu (1) | `test_server_2_gpu.py` | `[flux2_modelopt_fp8_tp2_t2i]` | R192 | HIPBLAS_STATUS_NOT_SUPPORTED |
| multimodal-gen-1gpu (3) | `test_update_weights_from_disk.py` | `test_update_weights_specific_modules` | R226🆕 | 500 inplace-on-inference tensor |
| stage-b-1gpu-small (6) | `test_type_based_dispatcher.py` | `test_type_dispatcher_e2e_performance` | R214 | TypeError (input_embeds) |
| stage-c-8gpu-mi35x (0) | `test_deepseek_r1_mxfp4_8gpu.py` | `test_a_gsm8k` | R227🆕 | GPU mem fault (MTP) |
| stage-c-8gpu-mi35x (1) | `test_qwen3_coder_next_8gpu.py` | `test_bs_1_speed` | R228🆕 | decode hang → SIGQUIT |
| stage-c-8gpu (1) | `test_deepseek_v32_basic.py` | `test_a_gsm8k` | R219 | HSA out-of-resources |
| (diffusion port-5555 cascades, mori build, kimi-mxfp4 watchdog) | various | n/a | infra | downloads / cascades |

pr-test-amd-rocm720 · Jun-27 [28297015673](https://github.com/sgl-project/sglang/actions/runs/28297015673) · ≈0 clean signal (cron-cancel + HF-429)
Real signal buried under cron cancel-in-progress self-cancellation + HF-429 fast-fail cascade: R226 (test_update_weights_from_disk.py 500, 1gpu (3) 83838610262), R214 (test_type_based_dispatcher.py, stage-b (6) 83838610271), R231 (InductorError, 2gpu (1) 83838610250 + 1gpu (0) 83838610266), R222 (causal Conv3D I2V/mova, same 2gpu job). Plus a perf-threshold miss (stage-b-1gpu-large (0) 83838610248) and a test_start_profile_2 watchdog/CUDA-graph-replay stall (stage-b (10) 83838610294) — both inconclusive without a clean rerun. Needs a non-colliding rerun for usable signal.
How this report is generated
- Only `status == "completed"` runs counted in trends. Both nightlies' Jun-27 runs treated as completed. Both release-docker workflows ✅ green; amd-aiter-scout did not run (cron Mon/Thu).
- 🆕 NEW today: R226 (diffusion `update_weights` 500), R227 (DSR1-MXFP4 8-GPU MTP mem fault), R228 (Qwen3-Coder-Next 8-GPU hang), R229 (Kimi-K2.6 weight-load timeout, both nightlies), R230 (DSV4-Pro fp8 -9, rocm720), R231 (ROCm `torch.compile` InductorError, confirmed from yesterday's minor-new).
- Carrying over: R225 now 2 days unfixed (breaker #29413 merged, no revert in flight); R222/R195/R214/R192/R2/R19/R211/R196/R6/R210/R219.
- In-flight fixes unchanged since yesterday (none merged): #29376 (R214), #27141+#29391 (R195), #28889 (R192), #27757 (R2).
- Confidence: `FACT/HIGH/MEDIUM/LOW/SPECULATION`. The bot does not assign Priority; engineers decide from cluster size + persistence + fix availability.
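
The "completed runs only" rule for the Trend column can be sketched as below. This is an illustration under stated assumptions, not the bot's actual code: the `trend` helper and the `real_failures` field are hypothetical, the sample records are made up for the example, and only the `status` field mirrors what the GitHub Actions workflow-runs API returns.

```python
from typing import Iterable, Mapping


def trend(runs: Iterable[Mapping], window: int = 5) -> str:
    """Join failure counts of the most recent completed runs, oldest first."""
    completed = [r for r in runs if r.get("status") == "completed"]
    # ISO-8601 date strings sort chronologically as plain strings.
    completed.sort(key=lambda r: r["created_at"])
    return "·".join(str(r["real_failures"]) for r in completed[-window:])


# Made-up sample records (assumption); the in-progress run must be skipped.
sample_runs = [
    {"created_at": "2026-06-24", "status": "completed", "real_failures": 10},
    {"created_at": "2026-06-25", "status": "completed", "real_failures": 10},
    {"created_at": "2026-06-26", "status": "in_progress", "real_failures": 0},
    {"created_at": "2026-06-27", "status": "completed", "real_failures": 8},
]
print(trend(sample_runs))
```

Filtering before slicing matters: an in-progress or cancelled run would otherwise show up as a misleading low count in the trend string.
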
Generated by amd-bot using Claude Code CLI · last updated 2026-06-28 02:42 UTC
CI Monitor — 2026-06-28
Repo: sgl-project/sglang
Monitored Workflows:
nightly-test-amd.yml
nightly-test-amd-rocm720.yml
release-docker-amd-nightly.yml
release-docker-amd-rocm720-nightly.yml
amd-aiter-scout.yml
pr-test-amd.yml
pr-test-amd-rocm720.yml
Per-workflow failure reports are appended as comments below; the Daily Cross-Workflow Summary is rendered above this section.