|
| 1 | +# Video-MME CI: Post-Rebase Investigation Notes |
| 2 | + |
| 3 | +This note is the end-of-investigation handoff for the Video-MME CI work that |
| 4 | +landed in stage-7 of `.github/workflows/test-qwen3-omni-ci.yaml` plus the |
| 5 | +Talker-ON stage-8 that was **deliberately deferred**. It captures the full |
| 6 | +chain of decisions, the intermediate probes, and the walls that were hit so |
| 7 | +the next contributor can pick up from an informed starting point instead of |
| 8 | +retreading the same experiments. |
| 9 | + |
| 10 | +Scope of this doc: |
| 11 | + |
| 12 | +* Task 2: post-rebase recalibration of the thinker-only Video-MME CI |
| 13 | + (`tests/test_model/test_qwen3_omni_videomme_ci.py`). |
| 14 | +* Task 3: Talker-ON Video-MME TTS consistency CI — **deferred**, root cause |
| 15 | + located but not fixed within this PR. |
| 16 | + |
| 17 | +The thinker-only full-set reference lives in a sibling PR and is outside |
| 18 | +the scope of these notes. |
| 19 | + |
| 20 | + |
| 21 | +## Task 2 — Thinker-only CI, 50-sample subset @ concurrency=4 |
| 22 | + |
| 23 | +### The rebase event |
| 24 | + |
| 25 | +An earlier snapshot of this branch (pre-upstream-rebase) was calibrated against |
| 26 | +a main that predated three merged PRs: |
| 27 | + |
| 28 | +* `#318` — hardware-aware `mem_fraction_static` defaults for omni AR stages |
| 29 | +* `#319` — talker pipeline micro-batching |
| 30 | +* `#330` — thinker input-length check with parameter passing |
| 31 | + |
| 32 | +Rebasing the branch pulled those three in. Task 5's `--encoder-mem-reserve` |
| 33 | +(PR `#339`) also landed on main in that same window; the CI fixture now pins |
| 34 | +`--thinker-max-seq-len 32768` and `--encoder-mem-reserve 0.20` via CLI, so the |
| 35 | +*effective* server configuration post-rebase is bit-identical to the |
| 36 | +pre-rebase one. |
| 37 | + |
| 38 | +### Two 5-run calibrations, same fixture |
| 39 | + |
| 40 | +Each data point comes from a fresh `pytest` invocation of |
| 41 | +`test_videomme_accuracy_and_speed` — the `server_process` fixture starts and |
| 42 | +stops its own server, so every run sees a pristine GPU with no accumulated |
| 43 | +fragmentation. 5 back-to-back H200 runs pre-rebase, 5 more post-rebase: |
| 44 | + |
| 45 | +| window | acc set | correct/50 | failed | tput_qps range | tok_per_s_agg range | lat_mean_s range | |
| 46 | +| --- | --- | --- | --- | --- | --- | --- | |
| 47 | +| pre-rebase | {0.60, 0.60, 0.60, 0.60, 0.62} | {30,30,30,30,31} | 0 all | [0.078, 0.085] | [2.3, 2.6] | [46.5, 50.3] | |
| 48 | +| post-rebase | {0.62, 0.54, 0.58, 0.62, 0.58} | {31,27,29,31,29} | 0 all | [0.084, 0.087] | [2.5, 2.6] | [45.3, 47.1] | |
| 49 | + |
| 50 | +### What moved, and why |
| 51 | + |
| 52 | +* **Speed tightened across the board.** `tput_qps` came up, `tok_per_s_agg` |
| 53 | + came up, `lat_mean_s` came down. Attributable to `#319` (talker |
| 54 | + micro-batching kicks in even when the Talker is disabled because the |
| 55 | + thinker's scheduler shares some of the same micro-batch plumbing), plus |
| 56 | + `#318` (hardware-aware defaults let SGLang pick a slightly larger KV |
| 57 | + budget for the same inputs). The worst-of-5 speed numbers post-rebase |
| 58 | + are the new P95 feed into `apply_slack(0.75, 1.25)`. |
| 59 | +* **Accuracy spread widened.** Pre-rebase clustered at 0.60-0.62 (3-of-5 |
| 60 | + identical); post-rebase spans 0.54-0.62. Diffing per-sample correctness |
| 61 | + between run 1 (0.62) and run 2 (0.54): exactly 4 samples flipped from |
| 62 | + *correct* to *wrong*, with no `failed` requests on either run — i.e. |
| 63 | + the model's *text answer* for those 4 questions disagrees with itself |
| 64 | + between two back-to-back invocations of an otherwise bit-identical |
| 65 | + configuration. |
| 66 | + |
| 67 | +### Determinism caveats |
| 68 | + |
| 69 | +`random_seed=123` is set in `build_sglang_server_args` and sampling |
| 70 | +`temperature=0.0` is configured at the bench layer. Neither fully |
| 71 | +determinizes the thinker on H200 Hopper: |
| 72 | + |
| 73 | +* FA3 attention kernels do small amounts of non-deterministic reduction |
| 74 | + across the batch dimension even at `temperature=0`. |
| 75 | +* The MoE expert routing in Qwen3-Omni-30B uses top-k + bias; tie-breaks |
| 76 | + in that routing are non-deterministic on a multi-batch forward when |
| 77 | + two experts score within floating-point noise of each other. |
| 78 | +* `torchcodec` video frame decoding uses pthread work-stealing; the |
| 79 | + frame sub-sampling for a given timestamp can pick neighbour frames |
| 80 | + depending on worker scheduling. |
| 81 | + |
| 82 | +The pre-rebase calibration happened to land in a tight cluster; it was |
| 83 | +not a *promise* of 0.60 as a floor, it was a lucky sample. Post-rebase |
| 84 | +widens the real distribution enough that a strict 0.60 floor would flake |
| 85 | +~2 runs out of 5. |
| 86 | + |
| 87 | +### Decision |
| 88 | + |
| 89 | +`VIDEOMME_MIN_ACCURACY` dropped from `0.60` to `0.54` — worst-of-5 with |
| 90 | +no slack. The 5-run calibration data and the rationale for the delta is |
| 91 | +inlined in the test file's top-of-file comment; the commit message |
| 92 | +(`7be3339` on `Jayon02/issue-253-ci`) spells out the numeric before/after |
| 93 | +for each threshold. Any PR that loses a correct answer below that floor |
| 94 | +on a cold run fails the test. This is the same *shape* of threshold as |
| 95 | +the pre-rebase branch (worst-of-5, no slack on accuracy); only the |
| 96 | +concrete value moved, and only because the underlying non-determinism |
| 97 | +band moved. |
| 98 | + |
| 99 | + |
| 100 | +## Task 3 — Talker-ON Video-MME TTS consistency CI |
| 101 | + |
| 102 | +### The target |
| 103 | + |
| 104 | +Mirror `test_qwen3_omni_mmmu_tts_consistency_ci.py` / |
| 105 | +`test_qwen3_omni_mmsu_tts_consistency_ci.py` for Video-MME: launch the |
| 106 | +9-stage speech server (Talker ON), feed a handful of Video-MME samples |
| 107 | +through at `concurrency=4`, and assert: |
| 108 | + |
| 109 | +1. text accuracy (A/B/C/D MC answer matches ground truth), |
| 110 | +2. audio WER between the text output and the ASR transcript of the |
| 111 | + Talker's audio, |
| 112 | +3. failed-request count (zero-tolerance, like the sibling TTS CIs), |
| 113 | +4. per-concurrency speed thresholds via `apply_slack(0.75, 1.25)`. |
| 114 | + |
| 115 | +### 6 probes on H200, all on `Qwen3-Omni-30B-A3B-Instruct` |
| 116 | + |
| 117 | +Every probe ran `pytest` against a fresh |
| 118 | +`examples/run_qwen3_omni_speech_server.py` launch. None produced a |
| 119 | +single successful sample. |
| 120 | + |
| 121 | +| # | Config delta | What broke | |
| 122 | +| --- | --- | --- | |
| 123 | +| 1 | c=4, 50 samples, speech launcher's default `thinker_max_seq_len=8192` | Thinker input-length guard rejects a 9573-token Video-MME prompt; the rejection cascades through the pipeline relay and every in-flight request dies. | |
| 124 | +| 2 | c=4, 5 samples, `--thinker-max-seq-len 32768`, `--thinker-mem-fraction-static 0.55`, `--talker-mem-fraction-static 0.30` | First sample's Talker forward trips a CUDA "illegal memory access" inside FA3. CUDA context poisoned; all other samples fail. | |
| 125 | +| 3 | c=1, otherwise same as #2 | First sample's Talker forward trips `IndexKernel.cu:111 "-sizes[i] <= index && index < sizes[i] index out of bounds"` device-side assert. Still inside the Talker's prompt-state reconstruction path. | |
| 126 | +| 4 | #2 + `CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1` to pin the kernel source | The real origin surfaces as `_deps/repo-flash-attention-src/hopper/flash_fwd_launch_template.h:200: CUDA error: an illegal memory access was encountered`. So it's FA3 on Hopper mishandling the Talker's attention pattern for long-prompt inputs. | |
| 127 | +| 5 | #4 + `--talker-attention-backend triton` (newly added flag, also covers `mm_attention_backend`) | FA3 failure goes away. Failure *moves* to `IndexKernel.cu:111` on the first sample, within the Talker's prompt-state reconstruction. So the FA3 crash in #4 was one bug; there's at least one more CUDA `index_select` downstream that also OOBs on Video-MME prompts. | |
| 128 | +| 6 | #5 + patched `_load_prompt_token_embeddings` to bypass `torch.unique(sorted=False, return_inverse=True) + unique_rows.index_select(0, inverse)` with a direct per-token `stack` | Still `IndexKernel.cu:111` on the first sample. Whichever `index_select` is firing the assert, it is **not** the one in `_load_prompt_token_embeddings`. The next-most-likely culprit is the codec-embedding or projection path further into `_reconstruct_prompt_states` / `build_prefill_input`, but `CUDA_LAUNCH_BLOCKING=1` did not produce a Python-level traceback before the assert — the failing `index_select` runs inside a detached forward pass (likely `_talker_model.forward`). | |
| 129 | + |
| 130 | +### Why each dead-end did not save us |
| 131 | + |
| 132 | +* **FA3 → Triton (probe 5).** Fixed the specific FA3 crash but only |
| 133 | + moved the failure by one CUDA call. That told us the Talker path has |
| 134 | + at least two independent CUDA-level bugs for long-prompt inputs, not |
| 135 | + one; swapping an attention backend isn't sufficient. |
| 136 | +* **Embedding-lookup patch (probe 6).** `_load_prompt_token_embeddings` |
| 137 | + was our leading suspect because `torch.unique(sorted=False, |
| 138 | + return_inverse=True)` is a known foot-gun (inverse indices don't |
| 139 | + always match the returned unique order across PyTorch versions). |
| 140 | + Patching it out preserved the same failure signature, ruling it out. |
| 141 | +* **Concurrency c=4 → c=1 (probe 3).** The Talker already serializes on |
| 142 | + `code_predictor` / `code2wav` GPU access, so MMMU / MMSU TTS |
| 143 | + consistency tests use c=1 for a reason. Dropping to c=1 here did not |
| 144 | + paper over the bug, confirming it is not a race / oversubscription |
| 145 | + issue. |
| 146 | +* **Sample count 50 → 5 (probes 1→2).** Small sample count only affects |
| 147 | + *what* fails; the first sample still fails. Budget on samples alone |
| 148 | + cannot recover a path that cannot produce one success. |
| 149 | + |
| 150 | +### What MMMU / MMSU do that Video-MME cannot |
| 151 | + |
| 152 | +Both sibling TTS CIs exercise the identical Talker code path — same |
| 153 | +`_load_prompt_token_embeddings`, same |
| 154 | +`_reconstruct_prompt_states`, same `build_prefill_input`, same |
| 155 | +`codec_embed_fn`. The prompt length is the only meaningful difference: |
| 156 | + |
| 157 | +* MMMU image-QA prompts: ~300-1500 thinker tokens. |
| 158 | +* MMSU audio-QA prompts: ~500-1500 thinker tokens. |
| 159 | +* Video-MME prompts: 2000-9000 thinker tokens, driven by dense |
| 160 | + per-frame vision placeholder tokens from 32-64 sampled frames per |
| 161 | + clip. |
| 162 | + |
| 163 | +Both sibling CIs pass cleanly. The Talker's bug therefore only fires |
| 164 | +above some prompt-length threshold that sits between ~1500 and ~2000 |
| 165 | +thinker tokens. |
| 166 | + |
| 167 | +### Why we did not ship a weakened Task-3 CI |
| 168 | + |
| 169 | +Threshold calibration requires at least one end-to-end successful run |
| 170 | +to anchor the worst-of-N P95 bands. We have *zero* successful Talker-ON |
| 171 | +Video-MME runs. Two obvious weakenings were rejected: |
| 172 | + |
| 173 | +* **Truncating the Video-MME prompt to the MMMU length regime** defeats |
| 174 | + the purpose — the whole point of a Video-MME Talker CI is to exercise |
| 175 | + the Talker on *realistic* video prompts. A shortened variant would |
| 176 | + test nothing that MMMU / MMSU don't already test. |
| 177 | +* **Lowering accuracy / WER / speed thresholds until the failing run |
| 178 | + passes** produces a CI that reports green for a broken path. That is |
| 179 | + a regression, not a gate: the next genuine Talker regression would |
| 180 | + slip through silently because the existing "success criteria" already |
| 181 | + tolerate total failure. |
| 182 | + |
| 183 | +### Partial work kept |
| 184 | + |
| 185 | +Two changes in this PR are retained as net-positive even with the CI |
| 186 | +deferred: |
| 187 | + |
| 188 | +* `examples/run_qwen3_omni_speech_server.py` exposes |
| 189 | + `--thinker-max-seq-len`. The thinker-only launcher has had this flag |
| 190 | + for a while; the speech launcher was the outlier. Long-prompt |
| 191 | + workloads — including a future Talker-ON Video-MME CI, once the |
| 192 | + Talker bug is fixed — need a way to raise the Thinker context above |
| 193 | + the factory default without editing the config. |
| 194 | +* `examples/run_qwen3_omni_speech_server.py` exposes |
| 195 | + `--talker-attention-backend`. Pins the Talker stage's SGLang |
| 196 | + attention backend (and the matching `mm_attention_backend`) |
| 197 | + independently of the Thinker. It did **not** fix the Video-MME |
| 198 | + regression — probe 5 above shows the failure just moved — but the |
| 199 | + flag is the right shape for any future diagnostic work on the Talker |
| 200 | + path and is what produced the final piece of evidence that the bug |
| 201 | + is not attention-kernel-specific. |
| 202 | + |
| 203 | +The override-accumulator shape inside the speech launcher mirrors the |
| 204 | +thinker-only launcher so the next speech-launcher CLI flag drops in |
| 205 | +cleanly. |
| 206 | + |
| 207 | +### Unblocking criteria |
| 208 | + |
| 209 | +Any one of these clears the way for stage-8 to land: |
| 210 | + |
| 211 | +1. An upstream fix for the Talker `index_select` assert on long |
| 212 | + Video-MME prompts. Reproducer: run |
| 213 | + `tests/test_model/test_qwen3_omni_videomme_tts_consistency_ci.py` |
| 214 | + (from the probe branch) against a speech server launched with |
| 215 | + `--thinker-max-seq-len 32768`. With `CUDA_LAUNCH_BLOCKING=1` the |
| 216 | + assert appears at `IndexKernel.cu:111` on the first sample; the |
| 217 | + failing `index_select` call-site has not yet been isolated to a |
| 218 | + Python frame. |
| 219 | +2. A Talker-side input validator that clamps or rejects token IDs |
| 220 | + outside `codec_vocab_size` before any `codec_embed_fn(...)` or |
| 221 | + forward call, with a clear error instead of a silent OOB. |
| 222 | +3. An explicitly-truncated Video-MME subset (e.g., only "short" |
| 223 | + duration, aggressive frame subsampling) that empirically stays |
| 224 | + under the Talker's failing prompt length, with its own calibration |
| 225 | + run on H200 and its own threshold set documented as "Talker-ON |
| 226 | + subset" rather than "Video-MME". |
| 227 | + |
| 228 | +### Probe artifacts |
| 229 | + |
| 230 | +Every probe's server log was captured under |
| 231 | +`/tmp/t3_v{N}/basetemp/server_logs0/server.log` on the H200 host on |
| 232 | +2026-04-24. These are not committed; rerun the probe to regenerate. |
| 233 | + |
| 234 | + |
| 235 | +## Cross-reference |
| 236 | + |
| 237 | +* Task 2 commit: `7be3339` on `Jayon02/issue-253-ci` (`[CI] Add stage-7 |
| 238 | + videomme (thinker-only, Talker OFF)`) — threshold rationale with |
| 239 | + explicit before/after deltas. |
| 240 | +* Task 3 commit (this one): `[Docs] Defer stage-8 Video-MME TTS |
| 241 | + consistency CI; expose --thinker-max-seq-len and |
| 242 | + --talker-attention-backend on the speech server`. |
| 243 | +* Related PRs merged into main in the post-snapshot window: |
| 244 | + `#318`, `#319`, `#330`, `#339`. |
| 245 | +* Overarching CI-coverage tracking: issue `#253`. |
0 commit comments