Commit 77811cb
Fix profiling summary pipeline on GPU traces (marin-community#3362)
## Summary

Fixes marin-community#3345. The profiling summary pipeline produced empty `hot_ops`, `communication_ops`, `gap_before_ops`, and `step_time` on GPU traces because of three independent mismatches between the ingestion code (written for TPU traces) and the actual GPU/NCCL trace format.

| Cause | Before | After |
|-------|--------|-------|
| Thread filter mismatch | 6 functions gate on `"XLA Ops"` — GPU uses `Stream #N(...)` | New `_is_device_op_event()` matches both |
| Comm op naming | `_COMM_PATTERNS` misses `ncclDevKernel_AllGather_RING_LL` etc. | Added `nccl`, `allgather`, `allreduce`, `reducescatter` |
| No step markers | TPU uses `"Steps"` thread with numeric names — GPU has neither | `StepTraceAnnotation` in trainer + host-side `step_num` fallback in ingest |

## Changes

### `ingest.py` — thread and op recognition

<details><summary>Cause 1: Replace 6 hardcoded thread checks with <code>_is_device_op_event()</code></summary>

The old code filtered device ops with:

```python
if event.thread_name not in {"XLA Ops", "Async XLA Ops"}:
    continue
```

GPU traces use stream-based thread names like `Stream #0(compute)`, `Stream #1(nccl)`, so every device op was silently dropped. New predicate:

```python
_DEVICE_OP_THREAD_NAMES = frozenset({"XLA Ops", "Async XLA Ops"})

def _is_device_op_thread(thread_name: str | None) -> bool:
    if thread_name is None:
        return False
    if thread_name in _DEVICE_OP_THREAD_NAMES:
        return True
    if thread_name.startswith("Stream #"):
        return True
    return False

def _is_device_op_event(event: _CompleteTraceEvent) -> bool:
    return _is_device_event(event) and _is_device_op_thread(event.thread_name)
```

Updated call sites: `_summarize_hot_ops`, `_summarize_communication`, `_summarize_pre_op_gaps`, `_summarize_hierarchical_regions`, `_summarize_gap_region_contexts`, `_preferred_region_path_by_op`.
</details>

<details><summary>Cause 2: NCCL collective classification</summary>

Added to `_COMM_PATTERNS`: `"nccl"`, `"allgather"`, `"allreduce"`, `"reducescatter"`. Updated `_collective_kind()` to normalize unseparated NCCL names (e.g. `ncclDevKernel_AllGather_RING_LL` → `"all-gather"`). Updated the `collective` regex in `semantics.py` to match the same patterns for family classification.

</details>

<details><summary>Cause 3: Host-side step markers</summary>

**Capture side** (`trainer.py`): Wrapped the compiled step body (the `_maybe_save_jaxpr` call) in `jax.profiler.StepTraceAnnotation("train", step_num=int(state.step))`. The annotation is scoped to the compiled step only — hooks, logging, and tracker calls happen outside — so the measured interval matches TPU device-side `"Steps"` semantics.

**Ingest side** (`ingest.py`): Added `step_num: int | None` to `_CompleteTraceEvent`. When the TPU-style `"Steps"` thread produces no results, the pipeline falls back to host-side events filtered to `name == "train"` on `/host:*` processes:

```python
if not per_step:
    for event in events:
        if event.step_num is None:
            continue
        if event.name != "train":
            continue
        if not event.process_name or not event.process_name.startswith("/host:"):
            continue
        per_step[event.step_num].append(event.dur)
```

The fallback only fires when no TPU-style step markers exist, so existing TPU behavior is unchanged.

</details>

### `semantics.py` — collective family regex

Extended the `collective` family pattern to match NCCL naming (`nccl`, `allgather`, `allreduce`, `reducescatter`).

### `trainer.py` — step annotation

Wrapped the compiled step body in `jax.profiler.StepTraceAnnotation("train", step_num=...)` inside `train_step()`. Scoped narrowly to exclude hooks/logging so GPU step timing is comparable to TPU device-side timing.
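The two recognition changes compose in a small way that is easy to sanity-check. The sketch below is a simplified, standalone restatement of the new predicate and classifier (the real versions live in `ingest.py`; the `"other"` fallthrough here stands in for the remaining branches):

```python
_DEVICE_OP_THREAD_NAMES = frozenset({"XLA Ops", "Async XLA Ops"})

def is_device_op_thread(thread_name):
    # TPU traces use fixed thread names; GPU traces use "Stream #N(...)".
    if thread_name is None:
        return False
    return thread_name in _DEVICE_OP_THREAD_NAMES or thread_name.startswith("Stream #")

def collective_kind(name):
    # Normalize both XLA-style ("all-gather") and NCCL-style ("AllGather") names.
    lowered = name.lower()
    if "all-reduce" in lowered or "allreduce" in lowered or "psum" in lowered:
        return "all-reduce"
    if "all-gather" in lowered or "all_gather" in lowered or "allgather" in lowered:
        return "all-gather"
    if "reduce-scatter" in lowered or "reducescatter" in lowered:
        return "reduce-scatter"
    return "other"

assert is_device_op_thread("Stream #1(nccl)")
assert not is_device_op_thread("python3")
assert collective_kind("ncclDevKernel_AllGather_RING_LL") == "all-gather"
```

Note that the NCCL patterns are substring checks on the lowercased kernel name, so no tokenization of the `_`-separated kernel name is needed.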
## Test plan

### Unit tests

All 12 tests pass (11 existing TPU + 1 new GPU):

```
tests/profiling/test_profile_summary.py 12 passed
```

**New: `test_gpu_stream_threads_and_nccl_ops`** — synthetic GPU trace with `Stream #N` threads, NCCL kernel names, and host-side `step_num` events. Asserts:

- `step_time.all_steps.count == 3` (host-side fallback works)
- `hot_ops` contains `fusion.1` (stream threads recognized)
- `communication_ops` contains `all-gather` and `reduce-scatter` (NCCL classified)
- `gap_before_ops` is non-empty (gap analysis on stream threads)

### Pre-merge canary runs (end-to-end)

Both canary workflows were triggered on this branch via `workflow_dispatch`. Both passed.

| Canary | Workflow run | Result |
|--------|-------------|--------|
| **TPU** (v5p-8, Qwen3 30M, 1B tokens) | [#22788717179](https://github.com/marin-community/marin/actions/runs/22788717179) | **Passed** — no regression |
| **GPU** (8xH100 CW, Llama 150M, 1B tokens) | [#22788717705](https://github.com/marin-community/marin/actions/runs/22788717705) | **Passed** — all fields now populated |

W&B runs: [`canary-tpu-22788717179-1`](https://wandb.ai/marin-community/marin/runs/canary-tpu-22788717179-1), [`canary-gpu-22788717705-1`](https://wandb.ai/marin-community/marin/runs/canary-gpu-22788717705-1)

<details><summary>GPU canary results — before vs after</summary>

Profile summaries downloaded from W&B and re-summarized locally with the branch code.
#### GPU: all three causes fixed

| Metric | Before (marin-community#3345) | After (this PR) |
|--------|---------------|-----------------|
| `hot_ops` | **0** | **25** |
| `communication_ops` | **0** | **4 collective types, 1,208 events** |
| `gap_before_ops` | **0** | **238** |
| `step_time.all_steps.count` | **0** | **6** (median 303,642 us) |
| `time_breakdown.communication` | 0.04% (misclassified) | **1.08%** |
| `time_breakdown.compute` | 22% (inflated by NCCL) | **16.0%** |

Top 5 hot ops (GPU):

| Op | Exclusive duration (us) | Count |
|----|------------------------|-------|
| `sm90_xmma_gemm_f32f32_tf32f32_f32_nt_n_...cublas` | 1,730,500 | 768 |
| `input_scatter_fusion_1` | 1,576,241 | 48 |
| `loop_multiply_fusion_6` | 1,095,015 | 768 |
| `sm90_xmma_gemm_f32f32_tf32f32_f32_tn_n_...cublas` | 1,020,830 | 768 |
| `nvjet_tss_192x192_64x3_1x2_h_bz_coopB_NNN` | 965,529 | 768 |

Communication ops (GPU):

| Collective | Count | Total duration (us) |
|-----------|-------|-------------------|
| `all-reduce` | 104 | 594,030 |
| `reduce-scatter` | 336 | 115,259 |
| `all-gather` | 672 | 63,331 |
| `send-recv` | 96 | 22,926 |

Top 3 pre-op gaps (GPU):

| Op | Total gap (us) | Count |
|----|---------------|-------|
| `MemcpyH2D` | 15,567,349 | 720 |
| `ncclDevKernel_ReduceScatter_Sum_bf16_RING_LL(...)` | 10,286,880 | 336 |
| `MemcpyD2H` | 6,191,228 | 142 |

Step timing (GPU):

| Stat | Value (us) |
|------|-----------|
| count | 6 |
| min | 284,241 |
| median | 303,642 |
| mean | 303,200 |
| max | 324,602 |
| p90 | 314,282 |

</details>

<details><summary>TPU canary results — no regression</summary>

| Metric | After (this PR) |
|--------|-----------------|
| `hot_ops` | **25** (fusion.*, copy.*, reshape.*) |
| `communication_ops` | **4 types** (all-reduce: 520, all-gather: 2,040, all-to-all: 80, async-collective: 6,240) |
| `gap_before_ops` | **461** |
| `step_time.all_steps.count` | **60** (via TPU-native `"Steps"` thread — host-side fallback did not fire) |

Top 5 hot ops (TPU):

| Op | Exclusive duration (us) | Count |
|----|------------------------|-------|
| `fusion.492` | 2,858,184 | 640 |
| `copy.319` | 1,143,337 | 640 |
| `fusion.483` | 1,016,882 | 640 |
| `reshape.436` | 836,632 | 640 |
| `fusion.481` | 828,741 | 640 |

Time breakdown (TPU):

| Category | Share |
|----------|-------|
| Compute | 44.6% |
| Stall | 50.6% |
| Host | 4.5% |
| Communication | 0.3% |

</details>

<details><summary>How to reproduce / deeply inspect</summary>

#### Re-summarize from W&B artifacts

```bash
# GPU canary
uv run python -m marin.profiling.cli summarize \
  --run-target "canary-gpu-22788717705-1" \
  --entity marin-community --project marin

# TPU canary
uv run python -m marin.profiling.cli summarize \
  --run-target "canary-tpu-22788717179-1" \
  --entity marin-community --project marin
```

#### Verify step annotation scope

The `StepTraceAnnotation` is scoped to the compiled step body only. To confirm hooks aren't included in the measured interval, look at the raw trace in the W&B artifact:

- Find `name="train"` events with `step_num` in args on the `/host:CPU` process
- Their `dur` should be consistent across steps (no periodic spikes on eval/checkpoint hook steps)
- The GPU canary shows a tight step-time distribution (min 284k, max 325k us, ~14% spread), consistent with no hook contamination

#### Verify the TPU fallback did not fire

The TPU canary has `step_time.all_steps.count = 60` with a median of ~1,327 us — these are device-side step markers from the `"Steps"` thread (microsecond-scale durations). The GPU canary has `step_time.all_steps.count = 6` with a median of ~303,642 us — these are host-side `StepTraceAnnotation` events (millisecond-scale). The two paths produce measurements at different scales, as expected, confirming the fallback only fires on GPU.

</details>

## Not addressed

- **Trace truncation** (marin-community#3345, secondary): Both canary traces hit exactly 1,000,000 complete events (`suspected_truncation: true`). This is orthogonal — it affects data volume but not the format mismatches fixed here. Worth a separate PR to filter host threads at capture time or increase the cap.
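The host-side fallback described above is small enough to sketch end-to-end. This is a standalone restatement (names simplified; the real logic lives in `_summarize_step_times`, which operates on `_CompleteTraceEvent` objects rather than dicts):

```python
from collections import defaultdict

def step_durations_from_host_events(events):
    """Group host-side StepTraceAnnotation durations by step number.

    `events` are dicts with keys name, dur, process_name, step_num —
    a simplified stand-in for the parsed trace events in ingest.py.
    """
    per_step = defaultdict(list)
    for event in events:
        if event.get("step_num") is None:
            continue
        if event.get("name") != "train":
            continue
        process = event.get("process_name")
        # Restrict to host processes so device-side events carrying
        # step_num are not averaged into the step time.
        if not process or not process.startswith("/host:"):
            continue
        per_step[event["step_num"]].append(event["dur"])
    return dict(per_step)

events = [
    {"name": "train", "dur": 500.0, "process_name": "/host:CPU", "step_num": 0},
    {"name": "train", "dur": 400.0, "process_name": "/host:CPU", "step_num": 1},
    {"name": "fusion.1", "dur": 100.0, "process_name": "/device:GPU:0", "step_num": None},
]
per_step = step_durations_from_host_events(events)
```

The device-side `fusion.1` event above is filtered out on both the `step_num` and process checks, leaving one duration list per host-annotated step.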
1 parent 3c9e438 commit 77811cb

File tree: 4 files changed, +125 −29 lines changed

### `lib/levanter/src/levanter/trainer.py`

Lines changed: 9 additions & 7 deletions
```diff
@@ -482,13 +482,15 @@ def train_step(self, state: S, *batch: X, **batch_kwargs) -> StepInfo[S]:
         hooks_this_time = any(state.step % h.every == 0 for h in self.hooks.jit_hooks)
 
         with capture_time() as step_time:
-            if hooks_this_time:
-                result = self._maybe_save_jaxpr("train_step", self._jit_train_step_fn, state, batch, batch_kwargs)
-                # force the loss so timing numbers are accurate. laziness isn't going to help here (i think?)
-            else:
-                result = self._maybe_save_jaxpr(
-                    "train_step_hooks", self._jit_train_step_fn_no_hook, state, batch, batch_kwargs
-                )
+            # Annotation scoped to the compiled step only (not hooks/logging below) so
+            # that GPU host-side step_num timing matches TPU device-side "Steps" semantics.
+            with jax.profiler.StepTraceAnnotation("train", step_num=int(state.step)):
+                if hooks_this_time:
+                    result = self._maybe_save_jaxpr("train_step", self._jit_train_step_fn, state, batch, batch_kwargs)
+                else:
+                    result = self._maybe_save_jaxpr(
+                        "train_step_hooks", self._jit_train_step_fn_no_hook, state, batch, batch_kwargs
+                    )
 
         loss = result.loss.item()
```

### `lib/marin/src/marin/profiling/ingest.py`

Lines changed: 61 additions & 21 deletions
```diff
@@ -63,8 +63,15 @@
     "psum",
     "send",
     "recv",
+    # GPU/NCCL-style (no separators)
+    "nccl",
+    "allgather",
+    "allreduce",
+    "reducescatter",
 )
 
+_DEVICE_OP_THREAD_NAMES = frozenset({"XLA Ops", "Async XLA Ops"})
+
 _STALL_PATTERN = re.compile(
     r"wait|barrier|dependency-wait|donation holds|semaphore|acquire|idle|blocked|sleep", re.IGNORECASE
 )
@@ -140,6 +147,7 @@ class _CompleteTraceEvent:
     run_id: str | None
     process_name: str | None
     thread_name: str | None
+    step_num: int | None
 
 
 @dataclass
@@ -550,6 +558,7 @@ def _parse_complete_events(
                 run_id=_string_like_arg(event.get("args"), "run_id"),
                 process_name=process_names.get(pid),
                 thread_name=thread_names.get((pid, tid)),
+                step_num=_int_like_arg(event.get("args"), "step_num"),
             )
         )
 
@@ -652,6 +661,8 @@ def _make_trace_provenance(events: list[_CompleteTraceEvent], *, trace_sha256: s
 
 def _summarize_step_times(events: list[_CompleteTraceEvent], *, warmup_steps: int) -> StepTimeSummary:
     per_step: dict[int, list[float]] = defaultdict(list)
+
+    # TPU path: device "Steps" thread with numeric event names.
     for event in events:
         if not _is_device_event(event):
             continue
@@ -663,6 +674,19 @@ def _summarize_step_times(events: list[_CompleteTraceEvent], *, warmup_steps: in
             continue
         per_step[step].append(event.dur)
 
+    # GPU fallback: host-side StepTraceAnnotation events (step_num in args).
+    # Filter to name="train" on /host:CPU to avoid averaging unrelated spans
+    # (e.g. device-side events that also carry step_num).
+    if not per_step:
+        for event in events:
+            if event.step_num is None:
+                continue
+            if event.name != "train":
+                continue
+            if not event.process_name or not event.process_name.startswith("/host:"):
+                continue
+            per_step[event.step_num].append(event.dur)
+
     averaged_steps: list[tuple[int, float]] = []
     for step, durations in per_step.items():
         if not durations:
@@ -823,9 +847,7 @@ def _summarize_hot_ops(
     aggregate: dict[str, dict[str, float | int | str | Counter[str] | list[float]]] = {}
 
     for event, exclusive_duration in zip(events, exclusive, strict=True):
-        if not _is_device_event(event):
-            continue
-        if event.thread_name not in {"XLA Ops", "Async XLA Ops"}:
+        if not _is_device_op_event(event):
             continue
 
         bucket = aggregate.setdefault(
@@ -972,12 +994,10 @@ def _summarize_communication(events: list[_CompleteTraceEvent], exclusive: list[
     aggregate: dict[str, tuple[int, float]] = {}
 
     for event, duration in zip(events, exclusive, strict=True):
-        if not _is_device_event(event):
+        if not _is_device_op_event(event):
             continue
         if not _is_communication_name(event.name):
             continue
-        if event.thread_name not in {"XLA Ops", "Async XLA Ops"}:
-            continue
 
         collective = _collective_kind(event.name)
         count, total = aggregate.get(collective, (0, 0.0))
@@ -1000,9 +1020,7 @@ def _summarize_pre_op_gaps(events: list[_CompleteTraceEvent], *, limit: int) ->
 
     by_track: dict[tuple[int, int], list[_CompleteTraceEvent]] = defaultdict(list)
     for event in events:
-        if not _is_device_event(event):
-            continue
-        if event.thread_name not in {"XLA Ops", "Async XLA Ops"}:
+        if not _is_device_op_event(event):
             continue
         by_track[(event.pid, event.tid)].append(event)
 
@@ -1060,9 +1078,7 @@ def _summarize_hierarchical_regions(
     aggregate: dict[str, dict[str, float | int]] = {}
 
     for event, exclusive_duration in zip(events, exclusive, strict=True):
-        if not _is_device_event(event):
-            continue
-        if event.thread_name not in {"XLA Ops", "Async XLA Ops"}:
+        if not _is_device_op_event(event):
             continue
 
         path_parts = _hierarchical_parts(event)
@@ -1140,9 +1156,7 @@ def _summarize_gap_region_contexts(events: list[_CompleteTraceEvent], *, limit:
 
     by_track: dict[tuple[int, int], list[_CompleteTraceEvent]] = defaultdict(list)
    for event in events:
-        if not _is_device_event(event):
-            continue
-        if event.thread_name not in {"XLA Ops", "Async XLA Ops"}:
+        if not _is_device_op_event(event):
             continue
         by_track[(event.pid, event.tid)].append(event)
 
@@ -1562,13 +1576,27 @@ def _is_device_event(event: _CompleteTraceEvent) -> bool:
     return bool(event.process_name and event.process_name.startswith("/device:"))
 
 
+def _is_device_op_thread(thread_name: str | None) -> bool:
+    if thread_name is None:
+        return False
+    if thread_name in _DEVICE_OP_THREAD_NAMES:
+        return True
+    if thread_name.startswith("Stream #"):
+        return True
+    return False
+
+
+def _is_device_op_event(event: _CompleteTraceEvent) -> bool:
+    return _is_device_event(event) and _is_device_op_thread(event.thread_name)
+
+
 def _collective_kind(name: str) -> str:
     lowered = name.lower()
-    if "all-reduce" in lowered or "psum" in lowered:
+    if "all-reduce" in lowered or "allreduce" in lowered or "psum" in lowered:
         return "all-reduce"
-    if "all-gather" in lowered or "all_gather" in lowered:
+    if "all-gather" in lowered or "all_gather" in lowered or "allgather" in lowered:
         return "all-gather"
-    if "reduce-scatter" in lowered:
+    if "reduce-scatter" in lowered or "reducescatter" in lowered:
         return "reduce-scatter"
     if "all-to-all" in lowered or "alltoall" in lowered:
         return "all-to-all"
@@ -1681,9 +1709,7 @@ def _preferred_region_path_by_op(events: list[_CompleteTraceEvent], *, max_depth
     counters: dict[str, dict[str, int]] = defaultdict(dict)
 
     for event in events:
-        if not _is_device_event(event):
-            continue
-        if event.thread_name not in {"XLA Ops", "Async XLA Ops"}:
+        if not _is_device_op_event(event):
             continue
         if not event.tf_op:
             continue
@@ -1814,3 +1840,17 @@ def _string_like_arg(args_value: Any, key: str) -> str | None:
     if isinstance(value, (int, float)):
         return str(value)
     return None
+
+
+def _int_like_arg(args_value: Any, key: str) -> int | None:
+    if not isinstance(args_value, dict):
+        return None
+    value = args_value.get(key)
+    if isinstance(value, int):
+        return value
+    if isinstance(value, str):
+        try:
+            return int(value)
+        except ValueError:
+            return None
+    return None
```
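The string branch in `_int_like_arg` matters because Chrome trace `args` frequently serialize numbers as strings (the new test writes `"step_num": "0"`). A standalone sketch of the same coercion, demonstrating the accepted and rejected shapes:

```python
def int_like_arg(args_value, key):
    # Mirror of ingest.py's _int_like_arg: accept ints and numeric strings,
    # reject everything else (missing key, non-dict args, non-numeric strings).
    if not isinstance(args_value, dict):
        return None
    value = args_value.get(key)
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            return None
    return None
```

Without this tolerance, a trace whose annotations round-trip through JSON as strings would silently produce no step markers at all.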

### `lib/marin/src/marin/profiling/semantics.py`

Lines changed: 2 additions & 1 deletion
```diff
@@ -19,7 +19,8 @@
     (
         "collective",
         re.compile(
-            r"all-reduce|all_gather|all-gather|reduce-scatter|all-to-all|alltoall|collective",
+            r"all-reduce|all_gather|all-gather|reduce-scatter|all-to-all|alltoall|collective"
+            r"|nccl|allgather|allreduce|reducescatter",
            re.IGNORECASE,
        ),
    ),
```
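The extended pattern can be checked directly against representative kernel names. The sketch below builds the regex exactly as it appears in the diff above and probes it with a GPU kernel name, a TPU-style op name, and a non-collective op:

```python
import re

# The "collective" family pattern from semantics.py after this change.
collective = re.compile(
    r"all-reduce|all_gather|all-gather|reduce-scatter|all-to-all|alltoall|collective"
    r"|nccl|allgather|allreduce|reducescatter",
    re.IGNORECASE,
)

assert collective.search("ncclDevKernel_AllGather_RING_LL")   # GPU/NCCL naming
assert collective.search("all-reduce-start.1")                # TPU/XLA naming
assert collective.search("fusion.42") is None                 # not a collective
```

Because `nccl` is one of the alternatives, every NCCL kernel falls into the collective family even if a new collective kind appears that the more specific alternatives miss.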

### `tests/profiling/test_profile_summary.py`

Lines changed: 53 additions & 0 deletions
```diff
@@ -338,6 +338,59 @@ def test_gap_marker_payload_resolution_does_not_cross_second_idle_gap(tmp_path:
     assert top_gap.marker_op == "iota.296"
 
 
+def test_gpu_stream_threads_and_nccl_ops(tmp_path: Path) -> None:
+    """GPU traces use 'Stream #N' threads for ops and NCCL naming for collectives.
+
+    Step markers come from host-side StepTraceAnnotation events (step_num in args)
+    rather than the TPU-style 'Steps' thread with numeric event names.
+    """
+    trace_path = tmp_path / "gpu_trace.json.gz"
+    payload = {
+        "displayTimeUnit": "ns",
+        "traceEvents": [
+            # GPU device process with stream-based threads (no "XLA Ops" thread).
+            {"ph": "M", "pid": 1, "name": "process_name", "args": {"name": "/device:GPU:0"}},
+            {"ph": "M", "pid": 1, "tid": 10, "name": "thread_name", "args": {"name": "Stream #0(compute)"}},
+            {"ph": "M", "pid": 1, "tid": 11, "name": "thread_name", "args": {"name": "Stream #1(nccl)"}},
+            # Host process with step annotations.
+            {"ph": "M", "pid": 2, "name": "process_name", "args": {"name": "/host:CPU"}},
+            {"ph": "M", "pid": 2, "tid": 1, "name": "thread_name", "args": {"name": "python3"}},
+            # Step annotations on host (as produced by jax.profiler.StepTraceAnnotation).
+            {"ph": "X", "pid": 2, "tid": 1, "name": "train", "ts": 0, "dur": 500, "args": {"step_num": "0"}},
+            {"ph": "X", "pid": 2, "tid": 1, "name": "train", "ts": 500, "dur": 400, "args": {"step_num": "1"}},
+            {"ph": "X", "pid": 2, "tid": 1, "name": "train", "ts": 900, "dur": 350, "args": {"step_num": "2"}},
+            # Compute ops on Stream #0.
+            {"ph": "X", "pid": 1, "tid": 10, "name": "fusion.1", "ts": 10, "dur": 100},
+            {"ph": "X", "pid": 1, "tid": 10, "name": "custom-call.2", "ts": 120, "dur": 80},
+            # NCCL collective on Stream #1.
+            {"ph": "X", "pid": 1, "tid": 11, "name": "ncclDevKernel_AllGather_RING_LL", "ts": 200, "dur": 50},
+            {"ph": "X", "pid": 1, "tid": 11, "name": "ncclDevKernel_ReduceScatter_RING_LL", "ts": 260, "dur": 40},
+        ],
+    }
+    with gzip.open(trace_path, "wt", encoding="utf-8") as handle:
+        json.dump(payload, handle)
+
+    summary = summarize_trace(trace_path, warmup_steps=1, hot_op_limit=10)
+
+    # Step markers detected via host-side step_num fallback.
+    assert summary.step_time.all_steps.count == 3
+    assert summary.step_time.steady_state_steps.count == 2
+
+    # Ops from Stream threads are recognized (not empty like the old code would produce).
+    assert len(summary.hot_ops) > 0
+    op_names = {op.name for op in summary.hot_ops}
+    assert "fusion.1" in op_names
+
+    # NCCL collectives are classified.
+    assert len(summary.communication_ops) > 0
+    collective_kinds = {op.collective for op in summary.communication_ops}
+    assert "all-gather" in collective_kinds
+    assert "reduce-scatter" in collective_kinds
+
+    # Gap analysis works on stream threads.
+    assert len(summary.gap_before_ops) > 0
+
+
 def _write_trace(path: Path, *, step_durations: list[float], softmax_duration: float) -> None:
     path.parent.mkdir(parents=True, exist_ok=True)
     payload = {
```
