137 changes: 137 additions & 0 deletions .claude/skills/ad-model-onboard/SKILL.md
@@ -303,6 +303,143 @@ GH_CONFIG_DIR=<path> gh pr view <PR_NUMBER> --json reviews,state

**Do NOT stop polling prematurely.** The loop must continue until the PR is approved or a clear termination signal is received. If polling has been running for an extended period (e.g., >2 hours) with no new activity, inform the user that you are still monitoring and ask if they want you to continue or stop.

## Sharding-aware IR model porting (`modeling_*_ir.py`)

Use this when porting an existing AutoDeploy custom model (`tensorrt_llm/_torch/auto_deploy/models/custom/modeling_*.py`) to explicit sharding hint ops in `modeling_*_ir.py` **in the same directory** (no separate `new_sharding/` tree). The exported FX graph must fully specify how the model should be sharded: the `apply_sharding_hints` transform combines hints with a runtime `DistConfig` for deterministic, node-local sharding.

**Argument reference:** Do not duplicate operator tables here. Refer to the custom op docstrings in `tensorrt_llm/_torch/auto_deploy/custom_ops/` for the complete argument reference (including sharding hints, `tp_mode`, `layer_type`, and which ops accept hints).

### Reference examples (study before porting)

| Original | IR / sharding-aware | Layer types |
|----------|---------------------|-------------|
| `modeling_nemotron_h.py` | `modeling_nemotron_h_ir.py` | Mamba SSM, MHA, SwiGLU MLP, MoE |
| `modeling_qwen3_5_moe.py` | `modeling_qwen3_5_moe_ir.py` | GatedDeltaNet, Gated MHA, SwiGLU MLP, MoE |
| `modeling_mistral.py` | `modeling_mistral_ir.py` | MHA, SwiGLU MLP (simplest) |
| `modeling_deepseek_v2.py` | `modeling_deepseek_v2_ir.py` | MLA, SwiGLU MLP, MoE |

### Step-by-step porting procedure

#### Step 1: Copy the source file

```bash
cp tensorrt_llm/_torch/auto_deploy/models/custom/modeling_foo.py \
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_foo_ir.py
```

#### Step 2: Update the module docstring and add imports

At the top of the IR file:

```python
import tensorrt_llm._torch.auto_deploy.custom_ops # noqa: F401 -- register all ops
```

Do **not** add global `SHARD_*` flags. Layer-level control uses the `layer_type` hint on each op and `shard_layers` in YAML.

#### Step 3: Replace linear projections

For every `self.proj(x)` or `nn.Linear` call, use `torch.ops.auto_deploy.torch_linear_simple` with explicit `tp_mode` and `layer_type`. Always set `tp_mode` unconditionally (no `if _s else "none"`). **Rules:**
- Opening projections (Q/K/V/gate/up/in_proj) → `"colwise"`.
- Closing projections (O/down/out_proj) → `"rowwise"`.
- Tiny outputs (e.g. `shared_expert_gate` with output dim 1) → `"none"`.
- MLA latent projections (q_a, kv_a) → `"none"`.
- Fused weights that are split later: pass `output_sizes=[...]`.
- GQA: use `tp_min_local_shape=self.head_dim` on the K/V colwise lines.
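The rules above can be sketched as a small lookup. This helper is hypothetical and exists only to summarize the table; the real code passes `tp_mode` literals directly at each `torch_linear_simple` call site:

```python
# Illustration of the tp_mode rules only; this helper is NOT an AutoDeploy API.
def pick_tp_mode(proj_name: str) -> str:
    opening = {"q_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "in_proj"}
    closing = {"o_proj", "down_proj", "out_proj"}
    # Tiny outputs and MLA latent projections stay replicated.
    replicated = {"shared_expert_gate", "q_a_proj", "kv_a_proj"}

    if proj_name in replicated:
        return "none"
    if proj_name in opening:
        return "colwise"
    if proj_name in closing:
        return "rowwise"
    return "none"  # unknown projections default to replicated

print(pick_tp_mode("gate_proj"))  # colwise
print(pick_tp_mode("down_proj"))  # rowwise
```

Remember that the hint must still be set unconditionally at each call site; the lookup only restates the rules.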

#### Step 4: Replace split / chunk after fused colwise projections

Use `torch.ops.auto_deploy.split_with_sizes` with `shardable` / `layer_type` where sizes scale with TP.

#### Step 5: Replace view / reshape with concrete head counts

During `torch.export`, every `-1` in a reshape is baked into a concrete value; after TP sharding those concrete values no longer match the local tensor and the reshape fails. Any reshape whose dimension is a head count that scales with TP must use `torch.ops.auto_deploy.view` with `tp_scaled_dim` set appropriately. Safe cases: flat-to-2D, or `[B,S,-1]` when the input is already correctly sharded.
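The failure mode can be shown with plain shape arithmetic (no torch; the numbers are illustrative):

```python
num_heads, head_dim, tp = 32, 128, 8

# torch.export bakes a reshape like (B, S, -1, head_dim) into (B, S, 32, 128):
exported_heads = num_heads  # the -1 became a concrete 32 at export time

# After TP sharding, each rank holds only a slice of the heads:
local_heads = num_heads // tp  # 4 heads per rank

# The baked-in 32 no longer matches the sharded tensor, so the view fails;
# auto_deploy.view with tp_scaled_dim rescales that dimension by 1/tp instead.
assert exported_heads != local_heads
print(exported_heads, local_heads)  # 32 4
```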

#### Step 6: Insert `all_reduce`

After every rowwise projection, add `torch.ops.auto_deploy.all_reduce(..., layer_type=...)`. **Parallel branch rule:** when branches merge by addition, use a **single** `all_reduce` after the sum (e.g. MoE routed + shared expert; parallel attention + MLP residual branches).
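Why one reduction suffices: all-reduce is a per-rank sum, and summation distributes over the branch addition. A scalar simulation (list entries stand for ranks):

```python
# Per-rank partial results of two parallel rowwise branches (4 ranks).
routed_partials = [1.0, 2.0, 3.0, 4.0]   # e.g. routed MoE experts
shared_partials = [0.5, 0.5, 0.5, 0.5]   # e.g. shared expert

all_reduce = sum  # scalar stand-in for a sum all_reduce over ranks

two_reduces = all_reduce(routed_partials) + all_reduce(shared_partials)
one_reduce = all_reduce(r + s for r, s in zip(routed_partials, shared_partials))

# Same value, but the fused version costs one collective instead of two.
assert two_reduces == one_reduce
print(one_reduce)  # 12.0
```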

#### Step 7: Special ops (Conv1d, SSM, GatedDeltaNet, gated RMSNorm)

Add sharding hints on `torch_causal_conv1d`, `torch_ssm`, `torch_gated_delta_rule`, `torch_rmsnorm_gated` per docstrings—typically `shardable` / `output_sizes` / `tp_mode` as required.

#### Step 8: MoE

Pass `layer_type="moe"` into `torch_moe`; `apply_sharding_hints` handles EP/TP.

#### Step 9: Register the IR model

1. Bottom of the IR file: `AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)` (same pattern as Phase 4).
2. Add a **side-effect import** in `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` (e.g. `from . import modeling_foo_ir # noqa: F401`) and extend `__all__` if you export symbols. Without this import, worker processes may not load your class and `apply_sharding_hints` can report **0 nodes processed**. Do **not** use a separate `register_sharded_models.py` indirection.

#### Step 10: YAML — composable registry pattern

Prefer the model registry (`examples/auto_deploy/model_registry/models.yaml`) and **compose** shared fragments under `examples/auto_deploy/model_registry/configs/`, same as other models: list `dashboard_default.yaml`, the right `world_size_N.yaml`, then a dedicated fragment (e.g. `enable_sharder_ir.yaml`) that holds IR sharding transforms. That fragment should disable legacy sharding passes and enable hint-driven sharding. Registry fragments are deep-merged in `yaml_extra` order (see `DynamicYamlMixInForSettings` in `tensorrt_llm/_torch/auto_deploy/utils/_config.py`); place transform keys under `transforms:` so they merge with `dashboard_default.yaml`. Standalone experiment YAMLs for `build_and_run_ad` may wrap the same fields under a top-level `args:` block matching `LlmArgs`.

Example transform block:

```yaml
# Typical contents for enable_sharder_ir.yaml (registry composable fragment)
transforms:
  export_to_gm:
    num_moe_experts_for_export: 2 # often required when expert count is large (>64)
  detect_sharding:
    stage: sharding
    enabled: false
  sharding_transform_executor:
    stage: sharding
    enabled: false
  apply_sharding_hints:
    stage: sharding
    enabled: true
    run_shape_prop: true
    allreduce_strategy: NCCL
    # shard_layers: ['mha', 'mlp'] # optional selective sharding
  gather_logits_before_lm_head:
    enabled: true
```

Use `world_size: 8` when validating TP head-divisibility. Optional `shard_layers` limits which `layer_type` hints are processed; unset means shard all shardable nodes.
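A quick divisibility check (plain Python, head counts illustrative) for picking validation world sizes:

```python
def compatible_world_sizes(num_kv_heads: int, candidates=(1, 2, 4, 8)) -> list:
    # Colwise TP sharding needs the (KV) head count divisible by world_size;
    # tp_min_local_shape additionally floors the per-rank share for GQA.
    return [ws for ws in candidates if num_kv_heads % ws == 0]

print(compatible_world_sizes(8))  # [1, 2, 4, 8]
print(compatible_world_sizes(6))  # [1, 2]
```

If a target `world_size` is not in the list, document the incompatibility rather than patching core sharding code (see Step 11, item 3).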

#### Step 11: Validate

Do not report success until a run completes successfully.

1. Prefer `python examples/auto_deploy/build_and_run_ad.py --model <MODEL-ID> --use-registry` after adding/updating the registry entry and composable YAMLs (Phase 8–9 style).
2. `apply_sharding_hints` logs should show **`N nodes processed` with N > 0**.
3. If validation fails with infrastructure limits (e.g. head count not divisible by `world_size`), document the assert and compatible sizes; do not “fix” core `sharding.py` / custom op schemas without owner review.
4. If blocked by missing infrastructure support, rename artifacts to `broken_modeling_*_ir.py` / broken YAML and file a short error report for humans (do not silently patch core transforms).

**Layer type strings** (for `layer_type` / `shard_layers`): use `"mha"`, `"mla"`, `"mlp"`, `"moe"`, `"ssm"`, `"delta"`, or `"unknown"` (default; skipped when `shard_layers` is set). Match the conventions used in `apply_sharding_hints` and project enums.

### Layer-specific sharding patterns

**MHA (standard or gated):** `layer_type="mha"`: q/k/v colwise (GQA: `tp_min_local_shape`), `view` with `tp_scaled_dim` for head dim, o rowwise + `all_reduce`. Fused Q+gate interleaved per head: colwise without `output_sizes`; contiguous Q|K|V fused blocks need `output_sizes`.
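The interleaved-vs-contiguous distinction in miniature (pure Python; list entries stand for output rows of a fused weight):

```python
tp = 2

# Contiguous Q|K|V fused layout: a naive half-split gives rank 0 only Q rows.
contiguous = ["q0", "q1", "q2", "q3", "k0", "k1", "v0", "v1"]
naive_rank0 = contiguous[: len(contiguous) // tp]
assert naive_rank0 == ["q0", "q1", "q2", "q3"]  # wrong: rank 0 got no K/V

# output_sizes=[4, 2, 2] lets the sharder split each segment independently.
segments = [["q0", "q1", "q2", "q3"], ["k0", "k1"], ["v0", "v1"]]
rank0 = [seg[: len(seg) // tp] for seg in segments]
print(rank0)  # [['q0', 'q1'], ['k0'], ['v0']]

# Interleaved per-head layout (e.g. fused Q+gate): whole-head groups are
# already contiguous, so a plain colwise split keeps heads intact.
interleaved = ["q_h0", "g_h0", "q_h1", "g_h1"]
print(interleaved[: len(interleaved) // tp])  # ['q_h0', 'g_h0']
```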

**SwiGLU MLP:** `layer_type="mlp"`: gate/up colwise, down rowwise + `all_reduce`.

**Mamba / SSM:** `layer_type="ssm"`: in_proj colwise + `output_sizes`, splits shardable, conv1d shardable + `output_sizes`, views, `torch_ssm` shardable, norm gated colwise if weight scales, out rowwise + `all_reduce`.

**GatedDeltaNet:** `layer_type="delta"`: in_proj_qkv with `output_sizes`, other in_projs colwise, conv1d/splits/views as above, `torch_gated_delta_rule` shardable, out rowwise + `all_reduce`.

**MoE + shared expert:** `layer_type="moe"`: router replicated; one `all_reduce` after `routed + shared`, not two.

**MLA (DeepSeek):** `layer_type="mla"`: keep `torch_mla` intact with `shardable=True`—do **not** decompose into separate linears + `torch_attention` (introduces bad `expand`/`view` with concrete head counts). q_a/kv_a latent: `tp_mode="none"`; q_b colwise; `o_proj` rowwise + `all_reduce`.

### Common pitfalls (sharding IR)

1. **Missing `auto_deploy::view` for head reshapes** — concrete shapes from export break after sharding.
2. **Sharding tiny projections** — dim-1 gates: `tp_mode="none"`.
3. **Double `all_reduce` in MoE** — one merge-point reduction for routed + shared.
4. **Cross-layer parameter contamination** — in `_apply_hint_*` handlers using `get_source_nodes()`, restrict with `allowed_ops` so residual links do not pull weights from other layers.
5. **Missing `num_moe_experts_for_export`** for very large expert counts — export can hang.
6. **Decomposing ops that absorb weights** (e.g. `torch_mla`) — use `shardable` + handler instead of splitting into plain linears.
7. **Interleaved vs contiguous fused weights** — interleaved per-head groups: colwise only; contiguous Q|K|V blocks: require `output_sizes`.
8. **Omitting `layer_type` when using `shard_layers`** — `"unknown"` nodes are skipped; set hints explicitly on sharding-aware ops.
9. **`layer_type` on non-hint ops** — do **not** pass `layer_type` to ops that are not designed for sharding hints (e.g. `torch_attention`, `torch_l2norm`, `torch_rope_*`); extra positional args break calls. Confirm in `custom_ops/` docstrings which ops accept hints.
10. **Conditional hint values** — no `if _s else "none"`; use unconditional hints and rely on `shard_layers` / transform config.

### Sharding IR validation checklist (human review)

- `world_size=1`: unsharded path; hints should not break correctness.
- `world_size=2` and `8`: shape checks and coherent output.
- `apply_sharding_hints` node count vs expectation.
- Optional: `shard_layers: ['moe']` to verify selective sharding.

## Key Gotchas
- **Canonical ops first:** Always use `torch.ops.auto_deploy.torch_*` canonical ops whenever one exists for the operation. This is how AD knows what to optimize. Writing manual attention, MoE, RoPE, or normalization in plain PyTorch instead of using the canonical op will prevent AD transforms from working.
- **No `repeat_interleave`:** AD attention ops handle GQA natively. Never repeat K/V heads manually.
139 changes: 115 additions & 24 deletions .claude/skills/ci-failure-retrieval/SKILL.md
@@ -10,56 +10,136 @@ metadata:

**Input:** a PR number or a request to check CI failures. **Auth requirement:** requires corporate network access to resolve the Jenkins base URL. **Output:** a summary of failed tests with error details, and optionally full stdout/stderr for specific failures.

## Important: SSL and Authentication

The Jenkins server presents a certificate that fails standard verification, so certificate checks must be disabled. Use an `ssl` context bypass in Python or the `-sk` flags with curl:
```python
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
```
The `curl -s` approach often returns HTML login pages; prefer the Python `urllib` approach with SSL bypass.

## Phase 0 — Get the Latest CI Run Info

First, determine the latest CI run commit, build number, and high-level pass/fail counts:

The CI bot (`tensorrt-cicd`) posts comments with links to the Jenkins build:
```bash
source ~/utils/github/set_github_token.sh

PR_NUM=<pr_number>

# Get the latest CI bot comment (contains build number and commit)
gh api "repos/NVIDIA/TensorRT-LLM/issues/${PR_NUM}/comments" --paginate --jq \
'[.[] | select(.user.login == "tensorrt-cicd") | select(.body | test("L0_MergeRequest_PR"))] | last | .body'

# Get the PR HEAD commit and its blossom-ci status (high-level pass/fail counts)
HEAD_SHA=$(gh api "repos/NVIDIA/TensorRT-LLM/pulls/${PR_NUM}" --jq '.head.sha')
gh api "repos/NVIDIA/TensorRT-LLM/commits/${HEAD_SHA}/statuses" --jq \
'[.[] | select(.context == "blossom-ci")] | first | {state, description}'
```

The `description` field shows aggregate counts like `"23969 passed, 1 failed, 8962 skipped"`.
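If you want the counts programmatically, the description string parses with a simple regex (format inferred from the example above; adjust if the bot changes its wording):

```python
import re

description = "23969 passed, 1 failed, 8962 skipped"  # from the blossom-ci status
match = re.match(r"(\d+) passed, (\d+) failed, (\d+) skipped", description)
passed, failed, skipped = (int(g) for g in match.groups())
print(passed, failed, skipped)  # 23969 1 8962
```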

## Phase 1 — Get the Jenkins Build Number

Extract the `L0_MergeRequest_PR` build number from the CI bot comment:
```bash
BUILD_NUM=$(gh api "repos/NVIDIA/TensorRT-LLM/issues/${PR_NUM}/comments" --paginate --jq \
'[.[] | select(.user.login == "tensorrt-cicd") | select(.body | test("L0_MergeRequest_PR"))] | last | .body' \
| grep -oP 'L0_MergeRequest_PR/\K\d+')
```

## Phase 1.5 — Check Pipeline Stage Failures (before diving into test details)

Many CI failures are **infrastructure-level** (Slurm node issues, pipeline aborts, resource exhaustion) where no test code executes at all. Always check the pipeline stages first:

```python
import json, ssl, urllib.request

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

JENKINS_BASE = "https://prod.blsm.nvidia.com/sw-tensorrt-top-1/job/LLM/job/main/job/L0_MergeRequest_PR"
BUILD_NUM = <build_number>

# Get pipeline stage overview
url = f"{JENKINS_BASE}/{BUILD_NUM}/wfapi/describe"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
data = json.loads(resp.read())

print(f"Pipeline status: {data.get('status')}")
for stage in data.get('stages', []):
status = stage.get('status', '')
if status not in ('SUCCESS', 'SKIPPED', 'NOT_EXECUTED'):
name = stage.get('name', '')
print(f" [{status}] {name}")
if 'error' in stage:
print(f" Error: {stage['error']}")
```

## Phase 1.6 — Read Console Log Analysis (Most Valuable for Infrastructure Failures)

The Jenkins console log contains a **CI failure analysis summary** with sections like `## Recommended Actions` and `## Infrastructure Notes`. This is the single most valuable source for understanding infrastructure failures:

```python
url = f"{JENKINS_BASE}/{BUILD_NUM}/consoleText"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
text = resp.read().decode('utf-8', errors='replace')

# Extract failure-related lines from the end of the log
for line in text[-8000:].split('\n'):
lo = line.lower()
if any(kw in lo for kw in ['fail', 'error', 'abort', 'likely cause',
'recommended action', 'infrastructure',
'no test code', 'stage result']):
print(line.strip()[:300])
```

Key sections to look for in the console log:
- **`Failing job`** / **`Failed stage`**: which Jenkins sub-job and stage failed
- **`Likely cause`**: automated root cause analysis (Slurm issues, pipeline timeouts, etc.)
- **`No test code was executed`**: confirms infrastructure-only failure (no code fix needed)
- **`Recommended Actions`**: whether to re-trigger CI or investigate code changes

## Phase 2 — Query the Jenkins testReport API for Test Failures

Only proceed here if Phase 1.5/1.6 indicate actual test failures (not infrastructure issues):

```python
url = f"{JENKINS_BASE}/{BUILD_NUM}/testReport/api/json"
resp = urllib.request.urlopen(urllib.request.Request(url), context=ctx, timeout=30)
data = json.loads(resp.read())

print(f'Summary: {data["passCount"]} passed, {data["failCount"]} failed, {data["skipCount"]} skipped')

failed = []
for suite in data.get('suites', []):
for case in suite.get('cases', []):
if case.get('status') in ('FAILED', 'REGRESSION'):
failed.append(case)

if not failed:
print('No test failures in testReport!')
else:
print(f'Failed tests ({len(failed)}):')
for f in failed:
print(f' - {f["className"]}.{f["name"]}')
err = (f.get('errorDetails') or '')[:200]
if err:
print(f' Error: {err}')
```

## Phase 3 — Get Full stdout/stderr for a Specific Test Failure

The `errorStackTrace` can be incomplete when errors originate from subprocesses. Reusing `data` from the Phase 2 query, fetch `stdout` and `stderr` for the specific test case to find the real error:
```python
for suite in data.get('suites', []):
for case in suite.get('cases', []):
if case.get('status') in ('FAILED', 'REGRESSION'):
name = f'{case["className"]}.{case["name"]}'
if '<search_term>' in name:
print(f'=== {name} ===')
print('--- Error ---')
                # (lines collapsed in the PR diff view: @@ -71,7 +151,6 @@)
print('--- Stderr (last 3000 chars) ---')
print((case.get('stderr') or '')[-3000:])
break
```

## Available Fields per Failed Test Case (Jenkins testReport API)
@@ -82,8 +161,20 @@ for suite in data.get('suites', []):
- `errorStackTrace`: full stack trace (may be incomplete for subprocess errors)
- `stdout`, `stderr`: full test output (can be large, check these when stack trace is insufficient)

## Common Failure Patterns

| Pattern | Diagnosis | Action |
|---------|-----------|--------|
| `No test code was executed` + Slurm errors | Infrastructure: Slurm node resource exhaustion | Re-trigger CI |
| `ABORTED` stage + `Downstream job did not succeed` | Cascading failure from fail-fast policy | Fix root cause stage, re-trigger |
| `newosproc` / `errno=11` / `fork/exec` | Kernel process table exhaustion on login node | Wait and re-trigger |
| `testReport: 0 failed` but `blossom-ci: N failed` | Stage-level failures, not test failures | Check Phase 1.5/1.6 |
| `testReport: N failed` with real test names | Actual test code failures | Investigate test errors in Phase 3 |

## Anti-Patterns

- Do not guess Jenkins URLs; always resolve dynamically via the internal shortcut.
- Do not guess Jenkins URLs; always use the known base `https://prod.blsm.nvidia.com/sw-tensorrt-top-1/job/LLM/job/main/job/L0_MergeRequest_PR`.
- Do not use `curl -s` for Jenkins API; it returns HTML login pages. Use Python `urllib` with SSL bypass.
- Do not jump to testReport (Phase 2) before checking pipeline stages (Phase 1.5) — many failures are infrastructure-only with zero test failures.
- Do not stop at `errorStackTrace` if it mentions generic wrapper failures like `Process exited with status 1`; check `stdout` and `stderr` for the real error.
- Do not fetch all test cases when looking for a specific failure; use the `<search_term>` filter in Phase 3.
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -211,6 +211,7 @@ docs/source/performance/perf-benchmarking.md @NVIDIA/trtllm-bench-reviewers

## TensorRT-LLM LLM Disaggregated
/examples/disaggregated @NVIDIA/trt-llm-disagg-devs @NVIDIA/trt-llm-doc-owners
/examples/disaggregated/slurm/benchmark @NVIDIA/trt-llm-disagg-devs @NVIDIA/trtllm-bench-reviewers
/tensorrt_llm/disaggregated_params.py @NVIDIA/trt-llm-disagg-devs
/tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py @NVIDIA/trt-llm-disagg-devs
/cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp @NVIDIA/trt-llm-disagg-devs