137 changes: 137 additions & 0 deletions .claude/skills/ad-model-onboard/SKILL.md
@@ -303,6 +303,143 @@ GH_CONFIG_DIR=<path> gh pr view <PR_NUMBER> --json reviews,state

**Do NOT stop polling prematurely.** The loop must continue until the PR is approved or a clear termination signal is received. If polling has been running for an extended period (e.g., >2 hours) with no new activity, inform the user that you are still monitoring and ask if they want you to continue or stop.

## Sharding-aware IR model porting (`modeling_*_ir.py`)

Use this when porting an existing AutoDeploy custom model (`tensorrt_llm/_torch/auto_deploy/models/custom/modeling_*.py`) to explicit sharding hint ops in `modeling_*_ir.py` **in the same directory** (no separate `new_sharding/` tree). The exported FX graph must fully specify how the model should be sharded: the `apply_sharding_hints` transform combines hints with a runtime `DistConfig` for deterministic, node-local sharding.

**Argument reference:** Do not duplicate operator tables here. Refer to the custom op docstrings in `tensorrt_llm/_torch/auto_deploy/custom_ops/` for the complete argument reference (including sharding hints, `tp_mode`, `layer_type`, and which ops accept hints).

### Reference examples (study before porting)

| Original | IR / sharding-aware | Layer types |
|----------|---------------------|-------------|
| `modeling_nemotron_h.py` | `modeling_nemotron_h_ir.py` | Mamba SSM, MHA, SwiGLU MLP, MoE |
| `modeling_qwen3_5_moe.py` | `modeling_qwen3_5_moe_ir.py` | GatedDeltaNet, Gated MHA, SwiGLU MLP, MoE |
| `modeling_mistral.py` | `modeling_mistral_ir.py` | MHA, SwiGLU MLP (simplest) |
| `modeling_deepseek_v2.py` | `modeling_deepseek_v2_ir.py` | MLA, SwiGLU MLP, MoE |

### Step-by-step porting procedure

#### Step 1: Copy the source file

```bash
cp tensorrt_llm/_torch/auto_deploy/models/custom/modeling_foo.py \
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_foo_ir.py
```

#### Step 2: Update the module docstring and add imports

At the top of the IR file:

```python
import tensorrt_llm._torch.auto_deploy.custom_ops # noqa: F401 -- register all ops
```

Do **not** add global `SHARD_*` flags. Layer-level control uses the `layer_type` hint on each op and `shard_layers` in YAML.

#### Step 3: Replace linear projections

For every `self.proj(x)` or `nn.Linear` call, use `torch.ops.auto_deploy.torch_linear_simple` with explicit `tp_mode` and `layer_type`. Always set `tp_mode` unconditionally (no `if _s else "none"`). **Rules:**
- Opening projections (Q/K/V/gate/up/in_proj) → `"colwise"`.
- Closing projections (O/down/out_proj) → `"rowwise"`.
- Tiny outputs (e.g. `shared_expert_gate` with output dim 1) → `"none"`.
- MLA latent projections (q_a, kv_a) → `"none"`.
- Fused weights that are split later: pass `output_sizes=[...]`.
- GQA: use `tp_min_local_shape=self.head_dim` on the K/V colwise lines.
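The rules above can be sketched as a small lookup. This helper is hypothetical and exists only to summarize the table; the real code passes `tp_mode` literals directly at each `torch_linear_simple` call site:

```python
# Illustration of the tp_mode rules only; this helper is NOT an AutoDeploy API.
def pick_tp_mode(proj_name: str) -> str:
    opening = {"q_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "in_proj"}
    closing = {"o_proj", "down_proj", "out_proj"}
    # Tiny outputs and MLA latent projections stay replicated.
    replicated = {"shared_expert_gate", "q_a_proj", "kv_a_proj"}

    if proj_name in replicated:
        return "none"
    if proj_name in opening:
        return "colwise"
    if proj_name in closing:
        return "rowwise"
    return "none"  # unknown projections default to replicated

print(pick_tp_mode("gate_proj"))  # colwise
print(pick_tp_mode("down_proj"))  # rowwise
```

Remember that the hint must still be set unconditionally at each call site; the lookup only restates the rules.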

#### Step 4: Replace split / chunk after fused colwise projections

Use `torch.ops.auto_deploy.split_with_sizes` with `shardable` / `layer_type` where sizes scale with TP.

#### Step 5: Replace view / reshape with concrete head counts

During `torch.export`, every `-1` in a reshape is baked into a concrete value; after TP sharding those concrete values no longer match the local tensor and the reshape fails. Any reshape whose dimension is a head count that scales with TP must use `torch.ops.auto_deploy.view` with `tp_scaled_dim` set appropriately. Safe cases: flat-to-2D, or `[B,S,-1]` when the input is already correctly sharded.
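The failure mode can be shown with plain shape arithmetic (no torch; the numbers are illustrative):

```python
num_heads, head_dim, tp = 32, 128, 8

# torch.export bakes a reshape like (B, S, -1, head_dim) into (B, S, 32, 128):
exported_heads = num_heads  # the -1 became a concrete 32 at export time

# After TP sharding, each rank holds only a slice of the heads:
local_heads = num_heads // tp  # 4 heads per rank

# The baked-in 32 no longer matches the sharded tensor, so the view fails;
# auto_deploy.view with tp_scaled_dim rescales that dimension by 1/tp instead.
assert exported_heads != local_heads
print(exported_heads, local_heads)  # 32 4
```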

#### Step 6: Insert `all_reduce`

After every rowwise projection, add `torch.ops.auto_deploy.all_reduce(..., layer_type=...)`. **Parallel branch rule:** when branches merge by addition, use a **single** `all_reduce` after the sum (e.g. MoE routed + shared expert; parallel attention + MLP residual branches).
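Why one reduction suffices: all-reduce is a per-rank sum, and summation distributes over the branch addition. A scalar simulation (list entries stand for ranks):

```python
# Per-rank partial results of two parallel rowwise branches (4 ranks).
routed_partials = [1.0, 2.0, 3.0, 4.0]   # e.g. routed MoE experts
shared_partials = [0.5, 0.5, 0.5, 0.5]   # e.g. shared expert

all_reduce = sum  # scalar stand-in for a sum all_reduce over ranks

two_reduces = all_reduce(routed_partials) + all_reduce(shared_partials)
one_reduce = all_reduce(r + s for r, s in zip(routed_partials, shared_partials))

# Same value, but the fused version costs one collective instead of two.
assert two_reduces == one_reduce
print(one_reduce)  # 12.0
```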

#### Step 7: Special ops (Conv1d, SSM, GatedDeltaNet, gated RMSNorm)

Add sharding hints on `torch_causal_conv1d`, `torch_ssm`, `torch_gated_delta_rule`, `torch_rmsnorm_gated` per docstrings—typically `shardable` / `output_sizes` / `tp_mode` as required.

#### Step 8: MoE

Pass `layer_type="moe"` into `torch_moe`; `apply_sharding_hints` handles EP/TP.

#### Step 9: Register the IR model

1. Bottom of the IR file: `AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)` (same pattern as Phase 4).
2. Add a **side-effect import** in `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` (e.g. `from . import modeling_foo_ir # noqa: F401`) and extend `__all__` if you export symbols. Without this import, worker processes may not load your class and `apply_sharding_hints` can report **0 nodes processed**. Do **not** use a separate `register_sharded_models.py` indirection.

#### Step 10: YAML — composable registry pattern

Prefer the model registry (`examples/auto_deploy/model_registry/models.yaml`) and **compose** shared fragments under `examples/auto_deploy/model_registry/configs/`, same as other models: list `dashboard_default.yaml`, the right `world_size_N.yaml`, then a dedicated fragment (e.g. `enable_sharder_ir.yaml`) that holds IR sharding transforms. That fragment should disable legacy sharding passes and enable hint-driven sharding. Registry fragments are deep-merged in `yaml_extra` order (see `DynamicYamlMixInForSettings` in `tensorrt_llm/_torch/auto_deploy/utils/_config.py`); place transform keys under `transforms:` so they merge with `dashboard_default.yaml`. Standalone experiment YAMLs for `build_and_run_ad` may wrap the same fields under a top-level `args:` block matching `LlmArgs`.

Example transform block:

```yaml
# Typical contents for enable_sharder_ir.yaml (registry composable fragment)
transforms:
  export_to_gm:
    num_moe_experts_for_export: 2 # often required when expert count is large (>64)
  detect_sharding:
    stage: sharding
    enabled: false
  sharding_transform_executor:
    stage: sharding
    enabled: false
  apply_sharding_hints:
    stage: sharding
    enabled: true
    run_shape_prop: true
    allreduce_strategy: NCCL
    # shard_layers: ['mha', 'mlp'] # optional selective sharding
  gather_logits_before_lm_head:
    enabled: true
```

Use `world_size: 8` when validating TP head-divisibility. Optional `shard_layers` limits which `layer_type` hints are processed; unset means shard all shardable nodes.
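A quick divisibility check (plain Python, head counts illustrative) for picking validation world sizes:

```python
def compatible_world_sizes(num_kv_heads: int, candidates=(1, 2, 4, 8)) -> list:
    # Colwise TP sharding needs the (KV) head count divisible by world_size;
    # tp_min_local_shape additionally floors the per-rank share for GQA.
    return [ws for ws in candidates if num_kv_heads % ws == 0]

print(compatible_world_sizes(8))  # [1, 2, 4, 8]
print(compatible_world_sizes(6))  # [1, 2]
```

If a target `world_size` is not in the list, document the incompatibility rather than patching core sharding code (see Step 11, item 3).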

#### Step 11: Validate

Do not report success until a run completes successfully.

1. Prefer `python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --use-registry` after adding/updating the registry entry and composable YAMLs (Phase 8–9 style).
2. `apply_sharding_hints` logs should show **`N nodes processed` with N > 0**.
3. If validation fails with infrastructure limits (e.g. head count not divisible by `world_size`), document the assert and compatible sizes; do not “fix” core `sharding.py` / custom op schemas without owner review.
4. If blocked by missing infrastructure support, rename artifacts to `broken_modeling_*_ir.py` / broken YAML and file a short error report for humans (do not silently patch core transforms).

**Layer type strings** (for `layer_type` / `shard_layers`): use `"mha"`, `"mla"`, `"mlp"`, `"moe"`, `"ssm"`, `"delta"`, or `"unknown"` (default; skipped when `shard_layers` is set). Match the conventions used in `apply_sharding_hints` and project enums.

### Layer-specific sharding patterns

**MHA (standard or gated):** `layer_type="mha"`: q/k/v colwise (GQA: `tp_min_local_shape`), `view` with `tp_scaled_dim` for head dim, o rowwise + `all_reduce`. Fused Q+gate interleaved per head: colwise without `output_sizes`; contiguous Q|K|V fused blocks need `output_sizes`.
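The interleaved-vs-contiguous distinction in miniature (pure Python; list entries stand for output rows of a fused weight):

```python
tp = 2

# Contiguous Q|K|V fused layout: a naive half-split gives rank 0 only Q rows.
contiguous = ["q0", "q1", "q2", "q3", "k0", "k1", "v0", "v1"]
naive_rank0 = contiguous[: len(contiguous) // tp]
assert naive_rank0 == ["q0", "q1", "q2", "q3"]  # wrong: rank 0 got no K/V

# output_sizes=[4, 2, 2] lets the sharder split each segment independently.
segments = [["q0", "q1", "q2", "q3"], ["k0", "k1"], ["v0", "v1"]]
rank0 = [seg[: len(seg) // tp] for seg in segments]
print(rank0)  # [['q0', 'q1'], ['k0'], ['v0']]

# Interleaved per-head layout (e.g. fused Q+gate): whole-head groups are
# already contiguous, so a plain colwise split keeps heads intact.
interleaved = ["q_h0", "g_h0", "q_h1", "g_h1"]
print(interleaved[: len(interleaved) // tp])  # ['q_h0', 'g_h0']
```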

**SwiGLU MLP:** `layer_type="mlp"`: gate/up colwise, down rowwise + `all_reduce`.

**Mamba / SSM:** `layer_type="ssm"`: in_proj colwise + `output_sizes`, splits shardable, conv1d shardable + `output_sizes`, views, `torch_ssm` shardable, norm gated colwise if weight scales, out rowwise + `all_reduce`.

**GatedDeltaNet:** `layer_type="delta"`: in_proj_qkv with `output_sizes`, other in_projs colwise, conv1d/splits/views as above, `torch_gated_delta_rule` shardable, out rowwise + `all_reduce`.

**MoE + shared expert:** `layer_type="moe"`: router replicated; one `all_reduce` after `routed + shared`, not two.

**MLA (DeepSeek):** `layer_type="mla"`: keep `torch_mla` intact with `shardable=True`—do **not** decompose into separate linears + `torch_attention` (introduces bad `expand`/`view` with concrete head counts). q_a/kv_a latent: `tp_mode="none"`; q_b colwise; `o_proj` rowwise + `all_reduce`.

### Common pitfalls (sharding IR)

1. **Missing `auto_deploy::view` for head reshapes** — concrete shapes from export break after sharding.
2. **Sharding tiny projections** — dim-1 gates: `tp_mode="none"`.
3. **Double `all_reduce` in MoE** — one merge-point reduction for routed + shared.
4. **Cross-layer parameter contamination** — in `_apply_hint_*` handlers using `get_source_nodes()`, restrict with `allowed_ops` so residual links do not pull weights from other layers.
5. **Missing `num_moe_experts_for_export`** for very large expert counts — export can hang.
6. **Decomposing ops that absorb weights** (e.g. `torch_mla`) — use `shardable` + handler instead of splitting into plain linears.
7. **Interleaved vs contiguous fused weights** — interleaved per-head groups: colwise only; contiguous Q|K|V blocks: require `output_sizes`.
8. **Omitting `layer_type` when using `shard_layers`** — `"unknown"` nodes are skipped; set hints explicitly on sharding-aware ops.
9. **`layer_type` on non-hint ops** — do **not** pass `layer_type` to ops that are not designed for sharding hints (e.g. `torch_attention`, `torch_l2norm`, `torch_rope_*`); extra positional args break calls. Confirm in `custom_ops/` docstrings which ops accept hints.
10. **Conditional hint values** — no `if _s else "none"`; use unconditional hints and rely on `shard_layers` / transform config.

### Sharding IR validation checklist (human review)

- `world_size=1`: unsharded path; hints should not break correctness.
- `world_size=2` and `8`: shape checks and coherent output.
- `apply_sharding_hints` node count vs expectation.
- Optional: `shard_layers: ['moe']` to verify selective sharding.

## Key Gotchas
- **Canonical ops first:** Always use `torch.ops.auto_deploy.torch_*` canonical ops whenever one exists for the operation. This is how AD knows what to optimize. Writing manual attention, MoE, RoPE, or normalization in plain PyTorch instead of using the canonical op will prevent AD transforms from working.
- **No `repeat_interleave`:** AD attention ops handle GQA natively. Never repeat K/V heads manually.
139 changes: 115 additions & 24 deletions .claude/skills/ci-failure-retrieval/SKILL.md
@@ -10,56 +10,136 @@ metadata:

**Input:** a PR number or a request to check CI failures. **Auth requirement:** requires corporate network access to resolve the Jenkins base URL. **Output:** a summary of failed tests with error details, and optionally full stdout/stderr for specific failures.

## Important: SSL and Authentication

The Jenkins server presents a certificate that fails standard verification, so certificate checks must be disabled. Use an `ssl` context bypass in Python or the `-sk` flags with curl:
```python
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
```
The `curl -s` approach often returns HTML login pages; prefer the Python `urllib` approach with SSL bypass.

## Phase 0 — Get the Latest CI Run Info

First, determine the latest CI run commit, build number, and high-level pass/fail counts:

The CI bot (`tensorrt-cicd`) posts comments with links to the Jenkins build:
```bash
source ~/utils/github/set_github_token.sh

PR_NUM=<pr_number>

# Get the latest CI bot comment (contains build number and commit)
gh api "repos/NVIDIA/TensorRT-LLM/issues/${PR_NUM}/comments" --paginate --jq \
'[.[] | select(.user.login == "tensorrt-cicd") | select(.body | test("L0_MergeRequest_PR"))] | last | .body'

# Get the PR HEAD commit and its blossom-ci status (high-level pass/fail counts)
HEAD_SHA=$(gh api "repos/NVIDIA/TensorRT-LLM/pulls/${PR_NUM}" --jq '.head.sha')
gh api "repos/NVIDIA/TensorRT-LLM/commits/${HEAD_SHA}/statuses" --jq \
'[.[] | select(.context == "blossom-ci")] | first | {state, description}'
```

The `description` field shows aggregate counts like `"23969 passed, 1 failed, 8962 skipped"`.
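If you want the counts programmatically, the description string parses with a simple regex (format inferred from the example above; adjust if the bot changes its wording):

```python
import re

description = "23969 passed, 1 failed, 8962 skipped"  # from the blossom-ci status
match = re.match(r"(\d+) passed, (\d+) failed, (\d+) skipped", description)
passed, failed, skipped = (int(g) for g in match.groups())
print(passed, failed, skipped)  # 23969 1 8962
```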

## Phase 1 — Get the Jenkins Build Number

Extract the `L0_MergeRequest_PR` build number from the CI bot comment:
```bash
BUILD_NUM=$(gh api "repos/NVIDIA/TensorRT-LLM/issues/${PR_NUM}/comments" --paginate --jq \
'[.[] | select(.user.login == "tensorrt-cicd") | select(.body | test("L0_MergeRequest_PR"))] | last | .body' \
| grep -oP 'L0_MergeRequest_PR/\K\d+')
```

## Phase 1.5 — Check Pipeline Stage Failures (before diving into test details)

Many CI failures are **infrastructure-level** (Slurm node issues, pipeline aborts, resource exhaustion) where no test code executes at all. Always check the pipeline stages first:

```python
import json, ssl, urllib.request

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

JENKINS_BASE = "https://prod.blsm.nvidia.com/sw-tensorrt-top-1/job/LLM/job/main/job/L0_MergeRequest_PR"
BUILD_NUM = <build_number>

# Get pipeline stage overview
url = f"{JENKINS_BASE}/{BUILD_NUM}/wfapi/describe"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
data = json.loads(resp.read())

print(f"Pipeline status: {data.get('status')}")
for stage in data.get('stages', []):
status = stage.get('status', '')
if status not in ('SUCCESS', 'SKIPPED', 'NOT_EXECUTED'):
name = stage.get('name', '')
print(f" [{status}] {name}")
if 'error' in stage:
print(f" Error: {stage['error']}")
```

## Phase 1.6 — Read Console Log Analysis (Most Valuable for Infrastructure Failures)

The Jenkins console log contains a **CI failure analysis summary** with sections like `## Recommended Actions` and `## Infrastructure Notes`. This is the single most valuable source for understanding infrastructure failures:

```python
url = f"{JENKINS_BASE}/{BUILD_NUM}/consoleText"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
text = resp.read().decode('utf-8', errors='replace')

# Extract failure-related lines from the end of the log
for line in text[-8000:].split('\n'):
lo = line.lower()
if any(kw in lo for kw in ['fail', 'error', 'abort', 'likely cause',
'recommended action', 'infrastructure',
'no test code', 'stage result']):
print(line.strip()[:300])
```

Key sections to look for in the console log:
- **`Failing job`** / **`Failed stage`**: which Jenkins sub-job and stage failed
- **`Likely cause`**: automated root cause analysis (Slurm issues, pipeline timeouts, etc.)
- **`No test code was executed`**: confirms infrastructure-only failure (no code fix needed)
- **`Recommended Actions`**: whether to re-trigger CI or investigate code changes

## Phase 2 — Query the Jenkins testReport API for Test Failures

Only proceed here if Phase 1.5/1.6 indicate actual test failures (not infrastructure issues):

```python
url = f"{JENKINS_BASE}/{BUILD_NUM}/testReport/api/json"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
data = json.loads(resp.read())

print(f'Summary: {data["passCount"]} passed, {data["failCount"]} failed, {data["skipCount"]} skipped')

failed = []
for suite in data.get('suites', []):
for case in suite.get('cases', []):
if case.get('status') in ('FAILED', 'REGRESSION'):
failed.append(case)

if not failed:
print('No test failures in testReport!')
else:
print(f'Failed tests ({len(failed)}):')
for f in failed:
print(f' - {f["className"]}.{f["name"]}')
err = (f.get('errorDetails') or '')[:200]
if err:
print(f' Error: {err}')
```

## Phase 3 — Get Full stdout/stderr for a Specific Test Failure

The `errorStackTrace` can be incomplete when errors originate from subprocesses. Reusing `data` from the Phase 2 query, fetch `stdout` and `stderr` for the specific test case to find the real error:
```python
for suite in data.get('suites', []):
for case in suite.get('cases', []):
if case.get('status') in ('FAILED', 'REGRESSION'):
name = f'{case["className"]}.{case["name"]}'
if '<search_term>' in name:
print(f'=== {name} ===')
print('--- Error ---')
                # (lines collapsed in the PR diff view: @@ -71,7 +151,6 @@)
print('--- Stderr (last 3000 chars) ---')
print((case.get('stderr') or '')[-3000:])
break
```

## Available Fields per Failed Test Case (Jenkins testReport API)
@@ -82,8 +161,20 @@ for suite in data.get('suites', []):
- `errorStackTrace`: full stack trace (may be incomplete for subprocess errors)
- `stdout`, `stderr`: full test output (can be large, check these when stack trace is insufficient)

## Common Failure Patterns

| Pattern | Diagnosis | Action |
|---------|-----------|--------|
| `No test code was executed` + Slurm errors | Infrastructure: Slurm node resource exhaustion | Re-trigger CI |
| `ABORTED` stage + `Downstream job did not succeed` | Cascading failure from fail-fast policy | Fix root cause stage, re-trigger |
| `newosproc` / `errno=11` / `fork/exec` | Kernel process table exhaustion on login node | Wait and re-trigger |
| `testReport: 0 failed` but `blossom-ci: N failed` | Stage-level failures, not test failures | Check Phase 1.5/1.6 |
| `testReport: N failed` with real test names | Actual test code failures | Investigate test errors in Phase 3 |

## Anti-Patterns

- Do not guess Jenkins URLs; always resolve dynamically via the internal shortcut.
- Do not guess Jenkins URLs; always use the known base `https://prod.blsm.nvidia.com/sw-tensorrt-top-1/job/LLM/job/main/job/L0_MergeRequest_PR`.
- Do not use `curl -s` for Jenkins API; it returns HTML login pages. Use Python `urllib` with SSL bypass.
- Do not jump to testReport (Phase 2) before checking pipeline stages (Phase 1.5) — many failures are infrastructure-only with zero test failures.
- Do not stop at `errorStackTrace` if it mentions generic wrapper failures like `Process exited with status 1`; check `stdout` and `stderr` for the real error.
- Do not fetch all test cases when looking for a specific failure; use the `<search_term>` filter in Phase 3.
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -211,6 +211,7 @@ docs/source/performance/perf-benchmarking.md @NVIDIA/trtllm-bench-reviewers

## TensorRT-LLM LLM Disaggregated
/examples/disaggregated @NVIDIA/trt-llm-disagg-devs @NVIDIA/trt-llm-doc-owners
/examples/disaggregated/slurm/benchmark @NVIDIA/trt-llm-disagg-devs @NVIDIA/trtllm-bench-reviewers
/tensorrt_llm/disaggregated_params.py @NVIDIA/trt-llm-disagg-devs
/tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py @NVIDIA/trt-llm-disagg-devs
/cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp @NVIDIA/trt-llm-disagg-devs