
[CI Monitor] Daily Report - 2026-06-28 #118

Description

@amd-bot

Daily Cross-Workflow Summary — 2026-06-28

Snapshot: 2026-06-28 02:42 UTC · Only completed runs counted in trends · Auto-updated every 30 min

TL;DR

🔴 RED · ~18 active clusters · 6 🆕 today (R226–R231) · ~12 carrying over · 4 review-ready fixes still open (none merged since yesterday) · scout did not run (cron Mon/Thu)
👉 Today's ask: R225 is now 2 days unfixed and still breaks DeepSeek-V3.2 MTP on BOTH nightlies — its breaker #29413 (9214b933) is merged with no revert/guard in flight (#29499 is an adjacent optimization, not a fix). Revert or ROCm-guard it. R222 (ROCm Conv3D CUDA-only) remains the largest pr-test-amd breaker (~16 job-blocks across 5 runs, still no fix). Land the 4 review-ready fixes that did NOT move yesterday: #29376 (R214), #27141+#29391 (R195), #28889 (R192), #27757 (R2). New today: a never-passed diffusion update_weights_from_disk 500 (R226, both PR workflows) + 3 latest-run-only 8-GPU pr-test-amd regressions (R227/R228/R219) needing a rerun to separate node-flake from regression. Both release-docker workflows ✅ green. pr-test-amd-rocm720 again ≈0 clean signal (cron self-cancel + HF-429 cascade).

Workflow status

Workflow Latest run Trend (completed real failures) Δ vs yesterday
nightly-test-amd Jun-27 28297034445 0 ~8 real (rest HF-infra) 10·10·10·~6·~8 +R229 NEW
nightly-test-amd-rocm720 Jun-27 28296988041 0 ~10 real (rest HF-infra) 12·11·~5·~10 +R229,R230 NEW
release-docker-amd-nightly Jun-27 (latest) 0 0·0·0 0
release-docker-amd-rocm720-nightly Jun-27 (latest) 0 0·0·0 0
amd-aiter-scout none (last Jun-25 28199192232) no run (not Mon/Thu)
pr-test-amd rolling Jun-27→28, latest 28306914695 0 R222+R192+R214+R226/run worsening (+R227/R228/R219 latest) +R226,R227,R228 NEW
pr-test-amd-rocm720 Jun-27 28297015673 0 ≈0 clean (cron-cancel + HF-429) ≈0 real +R231 confirmed

Notes: (1) amd-aiter-scout did not run (cron Mon/Thu); R221/R223 carry over dormant, no fresh data. (2) Both nightlies' Jun-27 runs are still dominated by HF weight-download hangs / 429 (infra), not code. (3) pr-test-amd-rocm720 run 28297015673 is again ≈0 clean signal: two crons share one cancel-in-progress group + an HF-429 fast-fail cascade cancels most downstream jobs.

🆕 NEW clusters today

R226 · 🆕 · Diffusion update_weights_from_disk → HTTP 500 "Inplace update to inference tensor outside InferenceMode" — pr-test-amd + pr-test-amd-rocm720

  • Status: NEW 2026-06-28; never-passed (new test file test_update_weights_from_disk.py). Appears in multimodal-gen 1-GPU shard 3 across 3 pr-test-amd runs and pr-test-amd-rocm720. A secondary fixture bug (perturbed-VAE clone missing transformer dir → setup ERRORs on FLUX.2) rides in the same job.
  • Top hypothesis: [LOW] server-side weight-apply path performs an in-place write on an inference-mode tensor (and a shape mismatch on FLUX.2/Qwen-Image), so every update_weights_from_disk request returns 500. Disconfirming: never-passed ⇒ could be a brand-new test exercising an unimplemented diffusion path rather than a regression. In-flight fix: ❌ none found.
  • Suggested triage: confirm whether this test was newly registered (never green anywhere) vs. regressed; if new, treat as feature-gap on the diffusion weight-update endpoint and route to the diffusion owners; fix the perturbed-VAE fixture to materialize a transformer dir.

| Workflow | Job (shard) | Test File | Test Function | Error | Log |
|---|---|---|---|---|---|
| pr-test-amd | multimodal-gen-1gpu (3) | test_update_weights_from_disk.py | test_update_weights_specific_modules[Qwen-Image] (+4) | 500 "Inplace update to inference tensor" | Log |
| pr-test-amd | multimodal-gen-1gpu (3) | test_update_weights_from_disk.py | test_update_weights_from_disk_default[Qwen-Image] (+ FLUX.2 setup ERRORs) | 500 / fixture No weights dir for transformer | Log |
| pr-test-amd-rocm720 | multimodal-gen-1gpu (3) | test_update_weights_from_disk.py | TestUpdateWeightsFromDisk.test_update_weights_specific_modules[Qwen-Image] (+ offload variants) | assert 500 == 200 | Log |

R229 · 🆕 · Kimi-K2.6 8-GPU eval TIMEOUT from slow weight load (3300s+ load exhausts 3600s budget) — both nightlies

  • Status: NEW 2026-06-28 (startup-bound TIMEOUT, not accuracy); regression since ~Jun 24-25. Disconfirming vs infra: this is a deterministic load-time-exceeds-budget, not a transient 429 — borderline between "model too big for budget" and a load-path slowdown.
  • Top hypothesis: [LOW] weight-load wall-clock for Kimi-K2.6 (8-way) now exceeds the per-file 3600s budget (load ~3303-3359s observed). In-flight fix: ❌ none (rocm720 per-job cited #24076/#29178/#28905 as candidates, none confirmed).
  • Suggested triage: bump the per-file timeout for Kimi-K2.6 OR pre-cache weights on the runner; profile load_model to see if a recent loader change slowed it.

| Workflow | Job (shard) | Test File | Test Function | Error | Log |
|---|---|---|---|---|---|
| nightly-test-amd | nightly-8-gpu-kimi-k26 | test_kimi_k26_eval_amd.py | N/A (eval runner) | TIMEOUT 3600s → exit 255 (load 3303s) | Log |
| nightly-test-amd-rocm720 | nightly-8-gpu-kimi-k26-rocm720 | test_kimi_k26_eval_amd.py | test_kimi_k26_gsm8k_accuracy | TIMEOUT 3600s (load 3359s) | Log |
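
The R229 triage rests on simple arithmetic: how much of the 3600s per-file budget is left once weights are loaded. Below is a minimal sketch using the two load times reported above. The function names, the 600s eval allowance, and the 10% margin are illustrative assumptions, not values from the CI configuration.

```python
def headroom_s(budget_s: int, load_s: int) -> int:
    """Seconds left for the eval itself after weight load completes."""
    return budget_s - load_s


def min_budget_s(load_s: int, eval_s: int, margin_pct: int = 10) -> int:
    """Smallest budget covering load + eval plus a safety margin.

    eval_s and margin_pct are hypothetical knobs; the report does not
    state how long the eval phase takes.
    """
    return (load_s + eval_s) * (100 + margin_pct) // 100


BUDGET_S = 3600  # per-file budget quoted in the report
observed_loads = {
    "nightly-test-amd": 3303,          # load time from the table above
    "nightly-test-amd-rocm720": 3359,  # load time from the table above
}

for workflow, load_s in observed_loads.items():
    print(f"{workflow}: {headroom_s(BUDGET_S, load_s)}s left after load")
```

With under five minutes of headroom in both runs, any eval of realistic length hits the timeout, which is consistent with treating this as startup-bound rather than an accuracy failure.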

R230 · 🆕 · DeepSeek-V4-Pro server SIGKILL (-9) during 8-way fp8 weight load (MI35x ROCm 7.2) — nightly-rocm720

  • Status: NEW 2026-06-28 (first failure for these two files today; flaky). Both offline + online retry exit -9. In-flight fix: ❌ none.
  • Top hypothesis: [LOW] OOM / SIGKILL during 8-way fp8 weight load (host or device memory pressure on MI35x). Disconfirming: first-seen today ⇒ may be a one-off node memory issue; needs a rerun.
  • Suggested triage: rerun once; if it recurs, capture dmesg/OOM-killer logs and peak host RAM during load.

| Workflow | Job (shard) | Test File | Test Function | Error | Log |
|---|---|---|---|---|---|
| nightly-test-amd-rocm720 | nightly-8-gpu-mi35x-deepseek-v4-pro-rocm720 | test_deepseek_v4_pro_fp4.py | TestDeepseekV4ProFp4.setUpClass | Server exit -9 (SIGKILL on load) | Log |
| nightly-test-amd-rocm720 | nightly-8-gpu-mi35x-deepseek-v4-pro-rocm720 | test_deepseek_v4_pro_fp4_cp.py | TestDeepseekV4ProFp4CPInterleave.setUpClass | Server exit -9 (offline+online -9) | Log |
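
Exit statuses such as -9, -6, 134 and 255 recur throughout this report. The sketch below decodes them under two common conventions: Python's `subprocess` reports a signal-terminated child as a negative return code, and shells report it as 128 plus the signal number. The helper name is hypothetical, and whether the CI harness follows these conventions for every job is an assumption.

```python
import signal


def describe_exit(code: int) -> str:
    """Map a process exit status to a readable label.

    Negative values follow Python's subprocess convention (killed by
    signal -code); values above 128 follow the shell convention
    (128 + signal number). Anything else is reported as a plain exit code.
    """
    if code == 0:
        return "ok"
    if code < 0:
        signum = -code
    elif code > 128:
        signum = code - 128
    else:
        signum = None
    if signum is not None:
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            pass  # not a valid signal number on this platform
    return f"exit {code}"


for code in (-9, -6, 134, 255):
    print(code, "->", describe_exit(code))
```

A SIGKILL result alone does not distinguish the kernel OOM killer from a watchdog or a manual kill, which is why the triage step asks for dmesg output as well.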

R227 / R228 / R231 · 🆕 · latest-run-only pr-test regressions (need rerun to separate flake from regression)

  • R227 [LOW] DeepSeek-R1-MXFP4 8-GPU MTP prefill GPU memory access fault → watchdog → server killed, pr-test-amd latest run only. May share a root cause with R225 (spec/MTP path). Suggested triage: rerun; if persistent, bisect the MTP/spec-decode window.
  • R228 [LOW] Qwen3-Coder-Next 8-GPU decode hang → scheduler watchdog 300s → SIGQUIT → connection refused, pr-test-amd latest run only. Suggested triage: rerun to rule out a node flake.
  • R231 [LOW] torch.compile InductorError (AssertionError in post-grad decompose_triton_kernel_wrapper_functional / layernorm forward_hip) on ROCm diffusion T2V/denoising, pr-test-amd-rocm720 (was "minor-new" yesterday, now confirmed in a 2nd run). Suggested triage: a clean non-colliding rerun; if it persists, it is a real ROCm torch.compile gap, not infra.

| ID | Workflow | Job (shard) | Test File | Test Function | Error | Log |
|---|---|---|---|---|---|---|
| R227 | pr-test-amd | stage-c-8gpu-mi35x (0) | test_deepseek_r1_mxfp4_8gpu.py | TestDeepseekR1MXFP4MTP.test_a_gsm8k | GPU mem fault → watchdog → killed | Log |
| R228 | pr-test-amd | stage-c-8gpu-mi35x (1) | test_qwen3_coder_next_8gpu.py | TestQwen3CoderNext.test_bs_1_speed | decode hang → SIGQUIT → conn refused | Log |
| R231 | pr-test-amd-rocm720 | multimodal-gen-2gpu (1) | test_server_2_gpu.py | test_diffusion_generation[wan2_2_t2v_a14b_2gpu …] | InductorError AssertionError | Log |
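
The status labels used in this report (never-passed, recurring, flaky, latest-run-only) can be read as a function of a test's pass/fail history, which is why a single rerun is enough to move R227/R228 out of the ambiguous bucket. The following is a rough sketch of that reading. The function name and the majority threshold separating "recurring" from "flaky" are assumptions for illustration, not the bot's actual rules.

```python
def classify(history: list[bool]) -> str:
    """Label a test from its results, ordered oldest to newest.

    True means the run passed, False means it failed.
    """
    if not history:
        return "no-data"
    fails = history.count(False)
    if fails == 0:
        return "green"
    if fails == len(history):
        return "never-passed"
    # Exactly one failure, and it is the newest run: cannot yet tell
    # a node flake from a fresh regression without a rerun.
    if fails == 1 and not history[-1]:
        return "latest-run-only"
    # Hypothetical cut-off: failing in more than half the runs.
    return "recurring" if fails * 2 > len(history) else "flaky"


print(classify([True, True, True, False]))          # latest-run-only
print(classify([True, False, False, False, False])) # recurring (4/5)
print(classify([True, True, False, True, True, True]))  # flaky (1/6)
```

Appending one more result to a latest-run-only history resolves it: a pass moves it to flaky, a second failure moves it toward recurring.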

Carry-over active clusters (still red)

R225 · AssertionError "All of them must not be None" in DSA eager draft-extend (dsa_backend.py:721) — both nightlies, DeepSeek-V3.2 MTP

  • Status: 2 days persistent (since Jun-27); 4 jobs across both nightlies (MI35x perf + accuracy MTP). AMD/ROCm only (NVIDIA shielded by the CUDA-graph draft-extend path enabled in the same commit).
  • Top hypothesis: [HIGH] breaker #29413 (9214b933, merged Jun-27 06:53) CUDA-gates the new draft-extend graph consumer (_is_cuda or _is_musa) while leaving the AMD eager init_forward_metadata assert (lines 717-722) requiring the now-nulled CPU seq-len mirror. Disconfirming: end-to-end nulling of extend_*_cpu inferred, not traced. In-flight fix: ❌ none. #29413 is merged; #29499 (open) is a DSA replay optimization, NOT a revert/guard.
  • Suggested triage: revert 9214b933 on a branch + rerun test_deepseek_v32_mtp_perf_mi35x.py; if confirmed, derive extend_*_cpu from GPU tensors in the eager is_draft_extend_v2 branch or force needs_cpu_seq_lens=True on ROCm. Ping the #29413 author.

| Workflow | Job (shard) | Test File | Test Function | Error | Log |
|---|---|---|---|---|---|
| nightly-test-amd | nightly-perf-8gpu-mi35x-dsv32-mtp | test_deepseek_v32_mtp_perf_mi35x.py | test_bench_one_batch | dsa_backend.py:721 → exit -9 | Log |
| nightly-test-amd | nightly-acc-8gpu-mi35x-dsv32-mtp | test_deepseek_v32_mtp_eval_mi35x.py | TestDeepseekV32TPMTP.setUpClass | same | Log |
| nightly-test-amd-rocm720 | nightly-perf-8gpu-mi35x-dsv32-mtp-rocm720 | test_deepseek_v32_mtp_perf_mi35x.py | test_bench_one_batch | same | Log |
| nightly-test-amd-rocm720 | nightly-acc-8gpu-mi35x-dsv32-mtp-rocm720 | test_deepseek_v32_mtp_eval_mi35x.py | TestDeepseekV32TPMTP.setUpClass | same | Log |

R222 · ROCm RuntimeError "causal Conv3D cat/pad fusion is only available on CUDA" (Wan/diffusion VAE) — pr-test-amd (largest) + rocm720

  • Status: every pr-test-amd run since #29281 (merged Jun-26); ~16 job-blocks across 5 runs (28297966351, 28289294173, 28282108241, 28273478888, 28306914695) + rocm720 (83838610250 I2V/mova). Hits all Wan2.1/2.2 T2V/I2V + mova variants on 1-GPU and 2-GPU shards.
  • Top hypothesis: [MEDIUM] #29281 added a CUDA-only fused causal-Conv3D fast path in WanVAE decode with no ROCm/Triton fallback. In-flight fix: ❌ none (no Conv3D-guard PR open; the Conv3D search returned only unrelated diffusion PRs).
  • Suggested triage: guard the fused path behind is_cuda with an eager fallback, or revert #29281; rerun test_server_2_gpu.py::test_diffusion_generation[wan2_2_t2v_a14b_2gpu].

Representative rows (all shards share the same top frame): pr-test-amd 2gpu (1) test_server_2_gpu.py::test_diffusion_generation[wan2_2_i2v_a14b_2gpu …]; 1gpu (0) test_server_1_gpu.py::[wan2_1_t2v_1.3b_teacache_enabled …]; rocm720 2gpu (1) [mova_360p_tp2 / wan2_1_i2v_14b_480P/720P_2gpu].

| ID | Cluster | Where (latest) | Status | In-flight fix |
|---|---|---|---|---|
| R192 | FLUX.2 modelopt-FP8 torch._scaled_mm HIPBLAS_STATUS_NOT_SUPPORTED | pr-test-amd 2gpu (1) (test_server_2_gpu.py::[flux2_modelopt_fp8_tp2_t2i]); ~4 runs | never-passed | #28889 open — land |
| R214 | TokenizedGenerateReqInput missing input_embeds (TypeError) | pr-test-amd stage-b-1gpu (6) + rocm720 stage-b (6) (test_type_based_dispatcher.py) | recurring since #29214 | #29376 open — unblock & land |
| R195 | Mamba extra_buffer needs CUDA/MUSA/NPU (FLA) on ROCm | nightly qwen35 83838643968, mi35x-qwen35 83838643970; rocm720 83838515323, 83838515348 | persistent ≥Jun-19 | #27141+#29391 open — land |
| R19 | Qwen3-235B-MXFP4 HIP hipErrorCapturedEvent capture abort | nightly 83838643963; rocm720 83838515337 | never-passed ≥May-27 | ❌ none (per-job: #27650/#23581 candidates) |
| R2 | Mistral/Mixtral GSM8K below threshold (chat-eval) | rocm720 83838515256 (Mistral-7B 0.361) | never-passed ≥Jun-13 | #27757 open — land |
| R211 | DeepSeek-R1 HiCache MI35x — GPU mem fault during gsm8k prefill | nightly 83838643952 | never-passed ≥Jun-20 | ❌ none |
| R196 | VLM DP-encoder mem fault (write to read-only page) | nightly 4-gpu 83838643955 (test_encoder_dp.py::test_vlm_mmmu_benchmark) | flaky/model-dependent | ⚠️ #18721 stale |
| R6 | Qwen3-30B-A3B MoE — GPU mem fault (MI35x) | rocm720 83838515311 | recurring (4/5; last pass Jun-23) | ❌ none |
| R210 | Qwen3.5 triton-DCP GSM8K 0.556<0.90 | nightly mi35x-qwen35 83838643970 (test_qwen3p5_triton_dcp.py) | never-passed | ⚠️ #29230 DNM |
| R219 | DeepSeek-V3.2 (basic) 8-GPU HSA out-of-resources decode abort | pr-test-amd stage-c-8gpu (1) (test_deepseek_v32_basic.py::test_a_gsm8k) | latest-run flake | ❌ none |

Known stable / dormant clusters (no action today)

| ID | Cluster | Where | Status | Fix |
|---|---|---|---|---|
| R1 | VLM MMMU accuracy below threshold | nightly (today masked by MMMU dataset/429 timeouts) | never-passed ≥Jun-13 | ❌ none |
| R155 | DeepSeek-V3.2 (basic) MI35x GSM8K below threshold | rocm720 (today masked by xet download timeout) | never-passed on rocm720 | ⚠️ partial #25559/#29050 |
| R213 | MiniMax-M2.7 GSM8K borderline | nightly | borderline/flaky | ❌ none |
| R220 | Embeddings-API latency threshold | pr-test-amd stage-b-1gpu-large | flake (not seen today) | ❌ none |
| R221 | aiter-caused GPU Hang (exit 134) ROCm 7.2 LoRA | scout only — no run today | dormant | ❌ none |
| R223 | aiter-caused DSV4-Pro-MTP connection-refused | scout only — no run today | dormant | ❌ none |
| R212/R224 | DSV3.2-MTP perf hang / eval borderline | superseded by R225 on MTP jobs | dormant | ❌ none |

Infrastructure / orchestration noise (not test failures)
  • HF weight-download hangs / 429 Too Many Requests / Xet xet_get stalls: dominate both nightlies Jun-27 — nightly-amd (DeepSeek-R1-MXFP4-tp2, GLM-5.1-mxfp4, gpt-oss-120b tokenizer filelock, perf-vlm Qwen3-VL-30B 429, MMMU dataset timeout) and nightly-rocm720 (DSR1-mxfp4-tp4, DSV3-0324, gpt-oss-120b, Grok-2, Qwen3-235B, DSV3.2 xet timeout, VLM 429). Partial fix #23400 open.
  • pr-test-amd-rocm720 cron self-cancel + HF-429 cascade: run 28297015673 — two crons share one cancel-in-progress group; HF-429 on stage-a/multimodal warmup → fast-fail cancels most downstream jobs. ≈0 clean pytest signal.
  • pr-test-amd diffusion port-5555 --strict-ports cascade: HF-download timeout on first diffusion test leaks scheduler port 5555 → cascades the rest of the 1-GPU shard (cosmos3/wan/lingbot/qwen-image). Many multimodal-gen-1gpu rows.
  • ROCm VRAM-not-clear / zombie KFD pre-flight gate: nightly glm5-mxfp4 83838644036, rocm720 hicache 83838515342. Node reboot required.
  • mori build / git-clone network fail: pr-test-amd 83799696814 (corrupt libabsl_base.so invalid ELF).
  • Kimi-K2-MXFP4 BCG watchdog -9 (pr-test-amd 83799697008): flaky (1/6) MoE weight-load watchdog timeout.
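
The port-5555 cascade listed above could in principle be contained by a pre-flight check that fails fast when the scheduler port is still held by a leaked process. Below is a minimal sketch, assuming the scheduler listens on a local TCP port; the function name is hypothetical and this is not the existing --strict-ports implementation.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: something else still holds the port.
            return False
        return True


# Port number taken from the report; the result depends on the machine.
print("port 5555 free:", is_port_free(5555))
```

A test fixture could call such a check before launching the next server and skip with an explicit "port leaked by previous test" message, so that one download timeout is reported once rather than as a run of unrelated-looking failures.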

Workflow drill-down (per-workflow view)

nightly-test-amd · Jun-27 [28297034445](https://github.com/sgl-project/sglang/actions/runs/28297034445) · ~8 real (rest HF-infra)

| Job (shard) | Test File | Test Function | Cluster | Error |
|---|---|---|---|---|
| nightly-perf-8gpu-mi35x-dsv32-mtp | test_deepseek_v32_mtp_perf_mi35x.py | test_bench_one_batch | R225 | DSA assert → exit -9 |
| nightly-acc-8gpu-mi35x-dsv32-mtp | test_deepseek_v32_mtp_eval_mi35x.py | setUpClass | R225 | DSA assert |
| nightly-8-gpu-qwen35 | test_qwen35_eval_amd.py | setUpClass | R195 | extra_buffer assert |
| nightly-8-gpu-mi35x-qwen35 | test_qwen35_eval_mi35x.py | test_lm_eval | R195 | extra_buffer assert |
| nightly-8-gpu-mi35x-qwen35 | test_qwen3p5_triton_dcp.py | test_a_gsm8k | R210 | gsm8k 0.556<0.90 |
| nightly-8-gpu-mi35x-qwen3-235b-mxfp4 | test_qwen3_instruct_mxfp4.py | setUpClass | R19 | HIP capture -6 |
| nightly-8-gpu-mi35x-deepseek-r1-hicache | test_deepseek_r1_hicache_mi35x.py | test_gsm8k | R211 | GPU mem fault |
| nightly-4-gpu | test_encoder_dp.py | test_vlm_mmmu_benchmark | R196 | write to read-only page -9 |
| nightly-8-gpu-kimi-k26 | test_kimi_k26_eval_amd.py | N/A | R229🆕 | TIMEOUT 3600s (load 3303s) |
| (dsr1-mxfp4-tp2, glm51-mxfp4, gpt-oss-120b, perf-vlm 429, mmmu, glm5-mxfp4 VRAM gate) | various | | infra | HF download / 429 / VRAM gate |

nightly-test-amd-rocm720 · Jun-27 [28296988041](https://github.com/sgl-project/sglang/actions/runs/28296988041) · ~10 real (rest HF-infra)

| Job (shard) | Test File | Test Function | Cluster | Error |
|---|---|---|---|---|
| nightly-perf-8gpu-mi35x-dsv32-mtp-rocm720 | test_deepseek_v32_mtp_perf_mi35x.py | test_bench_one_batch | R225 | DSA assert |
| nightly-acc-8gpu-mi35x-dsv32-mtp-rocm720 | test_deepseek_v32_mtp_eval_mi35x.py | setUpClass | R225 | DSA assert |
| nightly-8-gpu-qwen35-rocm720 | test_qwen35_eval_amd.py | setUpClass | R195 | extra_buffer assert |
| nightly-8-gpu-mi35x-qwen35-rocm720 | test_qwen35_eval_mi35x.py | test_lm_eval | R195 | extra_buffer assert |
| nightly-acc-8gpu-mi35x-rocm720 | test_qwen3_moe_eval_mi35x.py | test_qwen3_moe_accuracy | R6 | GPU mem fault -6 |
| nightly-8-gpu-mi35x-qwen3-235b-mxfp4-rocm720 | test_qwen3_instruct_mxfp4.py | setUpClass | R19 | hipErrorCapturedEvent |
| nightly-accuracy-2-gpu-rocm720 | test_gsm8k_eval_amd.py | test_gsm8k_all_models (Mistral) | R2 | gsm8k 0.361 |
| nightly-8-gpu-kimi-k26-rocm720 | test_kimi_k26_eval_amd.py | test_kimi_k26_gsm8k_accuracy | R229🆕 | TIMEOUT (load 3359s) |
| nightly-8-gpu-mi35x-deepseek-v4-pro-rocm720 | test_deepseek_v4_pro_fp4.py / _cp.py | setUpClass | R230🆕 | server exit -9 (load) |
| (dsr1-mxfp4-tp4, dsv3.1, dsv3.2 xet, grok2, qwen3-235b, gpt-oss, vlm-429, hicache VRAM gate) | various | | infra | HF download / 429 / VRAM gate |

pr-test-amd · rolling Jun-27→28 · latest [28306914695](https://github.com/sgl-project/sglang/actions/runs/28306914695)

| Job (shard) | Test File | Test Function | Cluster | Error |
|---|---|---|---|---|
| multimodal-gen-2gpu (1) ×~16 blocks/5 runs | test_server_{1,2}_gpu.py | test_diffusion_generation[wan2_*] | R222 | causal Conv3D CUDA-only |
| multimodal-gen-2gpu (1) | test_server_2_gpu.py | [flux2_modelopt_fp8_tp2_t2i] | R192 | HIPBLAS_STATUS_NOT_SUPPORTED |
| multimodal-gen-1gpu (3) | test_update_weights_from_disk.py | test_update_weights_specific_modules | R226🆕 | 500 inplace-on-inference tensor |
| stage-b-1gpu-small (6) | test_type_based_dispatcher.py | test_type_dispatcher_e2e_performance | R214 | TypeError (input_embeds) |
| stage-c-8gpu-mi35x (0) | test_deepseek_r1_mxfp4_8gpu.py | test_a_gsm8k | R227🆕 | GPU mem fault (MTP) |
| stage-c-8gpu-mi35x (1) | test_qwen3_coder_next_8gpu.py | test_bs_1_speed | R228🆕 | decode hang → SIGQUIT |
| stage-c-8gpu (1) | test_deepseek_v32_basic.py | test_a_gsm8k | R219 | HSA out-of-resources |
| (diffusion port-5555 cascades, mori build, kimi-mxfp4 watchdog) | various | | infra | downloads / cascades |

pr-test-amd-rocm720 · Jun-27 [28297015673](https://github.com/sgl-project/sglang/actions/runs/28297015673) · ≈0 clean signal (cron-cancel + HF-429)

Real signal buried under cron cancel-in-progress self-cancellation + HF-429 fast-fail cascade: R226 (test_update_weights_from_disk.py 500, 1gpu (3) 83838610262), R214 (test_type_based_dispatcher.py, stage-b (6) 83838610271), R231 (InductorError, 2gpu (1) 83838610250 + 1gpu (0) 83838610266), R222 (causal Conv3D I2V/mova, same 2gpu job). Plus a perf-threshold miss (stage-b-1gpu-large (0) 83838610248) and a test_start_profile_2 watchdog/CUDA-graph-replay stall (stage-b (10) 83838610294) — both inconclusive without a clean rerun. Needs a non-colliding rerun for usable signal.

How this report is generated

  • Only status == "completed" runs counted in trends. Both nightlies' Jun-27 runs treated as completed. Both release-docker workflows ✅ green; amd-aiter-scout did not run (cron Mon/Thu).
  • 🆕 NEW today: R226 (diffusion update_weights 500), R227 (DSR1-MXFP4 8-GPU MTP mem fault), R228 (Qwen3-Coder-Next 8-GPU hang), R229 (Kimi-K2.6 weight-load timeout, both nightlies), R230 (DSV4-Pro fp8 -9, rocm720), R231 (ROCm torch.compile InductorError, confirmed from yesterday's minor-new).
  • Carrying over: R225 now 2 days unfixed (breaker #29413 merged, no revert in flight); R222/R195/R214/R192/R2/R19/R211/R196/R6/R210/R219.
  • In-flight fixes unchanged since yesterday (none merged): #29376 (R214), #27141+#29391 (R195), #28889 (R192), #27757 (R2).
  • Confidence: FACT/HIGH/MEDIUM/LOW/SPECULATION. Bot does NOT assign Priority — engineers decide from cluster size + persistence + fix availability.
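
The "only completed runs" rule in the first bullet can be sketched as a filter over run records followed by a join into the dotted trend string used in the Workflow status table. This is an illustration, not the bot's code: `status` mirrors the field of the same name on GitHub Actions run objects, while `real_failures`, the function name, and the sample records are assumptions.

```python
def real_failure_trend(runs: list[dict], window: int = 5) -> str:
    """Build a trend string from runs ordered oldest to newest.

    Runs that are still queued or in progress are skipped so a
    half-finished run cannot distort the trend.
    """
    completed = [r for r in runs if r["status"] == "completed"]
    recent = completed[-window:]
    return "·".join(str(r["real_failures"]) for r in recent)


# Illustrative records; counts are kept as strings so approximate
# values such as "~6" survive unchanged.
sample_runs = [
    {"status": "completed", "real_failures": "10"},
    {"status": "completed", "real_failures": "10"},
    {"status": "completed", "real_failures": "10"},
    {"status": "completed", "real_failures": "~6"},
    {"status": "completed", "real_failures": "~8"},
    {"status": "in_progress", "real_failures": "0"},
]

print(real_failure_trend(sample_runs))  # 10·10·10·~6·~8
```

Dropping the in-progress record matters here: counted naively, it would append a misleading 0 to the end of the trend.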

Generated by amd-bot · last updated 2026-06-28 02:42 UTC


Generated by amd-bot using Claude Code CLI (last updated: 2026-06-28 02:42 UTC)


CI Monitor — 2026-06-28

Repo: sgl-project/sglang

Monitored Workflows:

  • nightly-test-amd.yml
  • nightly-test-amd-rocm720.yml
  • release-docker-amd-nightly.yml
  • release-docker-amd-rocm720-nightly.yml
  • amd-aiter-scout.yml
  • pr-test-amd.yml
  • pr-test-amd-rocm720.yml

Per-workflow failure reports are appended as comments below; the Daily Cross-Workflow Summary is rendered above this section.
