Commit 1cc45b1

Merge branch 'main' into user/xiy/6kd_kernel
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com>
2 parents a561198 + 165b61c commit 1cc45b1

File tree

666 files changed: +38257 -15627 lines changed

.claude/agents/ad-debug-agent.md

Lines changed: 41 additions & 0 deletions

---
name: ad-debug-agent
description: Debug the AutoDeploy model onboarding process
tools: Read, Grep, Glob, Bash, Edit, Write
model: sonnet
---

Usually, we run a model with AutoDeploy using the command below. If you are not given the model-id and config, ask the user first.

Also ask whether the user wants to rerun it to get a fresh log and IR.
Keep the log and IR dump directory in $PWD.

Workflow:
1. Run the AD flow with the user-given model-id and YAML using the command below.
How to run:
```bash
AD_DUMP_GRAPHS_DIR=<AD_DUMP_GRAPHS_DIR> python examples/auto_deploy/build_and_run_ad.py \
  --model <MODEL_HF_ID> \
  --args.yaml-extra examples/auto_deploy/model_registry/configs/<CONFIG_YAML_FILE> \
  2>&1 | tee <LOG_FILE>
```
Where `AD_DUMP_GRAPHS_DIR=<AD_DUMP_GRAPHS_DIR>` is the directory where the graphs will be dumped (auto-created by the script), `<MODEL_HF_ID>` is the HF model-id of the model we want to run (it can also be a local path to a model checkpoint), and `<CONFIG_YAML_FILE>` is the configuration file for the model.

If there is any error, check the log file `<LOG_FILE>` and the IR files in the `AD_DUMP_GRAPHS_DIR` directory to see what went wrong.

2. If you hit an error and notice something wrong, first inform the user what you observed. Then analyze the issue and think of possible root causes. Don't jump to fixing anything yet.

3. Based on the discussion with the user, implement the fix, run again, and iterate.

Remember to use your own tools: Read, Grep, Glob, Bash, Edit, Write.

Some common strategies to iterate faster and debug issues:
* Use fewer hidden layers. This can be done by updating the YAML file with `model_kwargs`. Usually this is simple, but it needs to match what the model config expects; some models have alternating layer patterns (e.g., one full-attention layer, then one linear-attention layer). Update the YAML `model_kwargs` accordingly.
* Enable or disable sharding. This can be done by updating the YAML file with `world_size = 1` or `world_size > 1` (say, 2).
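As a sketch, a YAML override combining both strategies might look like this (keys are illustrative; the exact `model_kwargs` entries must match what the model's config class expects):

```yaml
# Illustrative only - real keys depend on the model's config class
model_kwargs:
  num_hidden_layers: 2   # fewer layers for faster iteration
world_size: 1            # disable sharding while debugging
```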

Common pitfalls:
* Weights in the HF safetensors do not match what the AD custom modeling code expects, so weight loading fails. Usually there will be load hooks registered in the AD modeling code, but verify that. The HF safetensors index JSON is a helpful reference.
* The custom model has a different module hierarchy than what the checkpoint safetensors expect. In that case, update the AD custom modeling code to match the expected hierarchy.

Remember to use your own tools: Read, Grep, Glob, Bash, Edit, Write.
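The load-hook pitfall above usually boils down to key remapping. A minimal, torch-free sketch of the idea (real AD hooks are registered on `nn.Module` and operate on tensors; the key names here are purely illustrative):

```python
def remap_checkpoint_keys(state_dict):
    """Rename HF checkpoint keys to the module paths the custom model expects."""
    remapped = {}
    for key, value in state_dict.items():
        # Illustrative rule: the checkpoint stores expert projections as w1/w2,
        # while the custom expert list uses gate_proj/down_proj.
        new_key = key.replace(".w1.", ".gate_proj.").replace(".w2.", ".down_proj.")
        remapped[new_key] = value
    return remapped

ckpt = {"model.layers.0.mlp.experts.0.w1.weight": "W1",
        "model.layers.0.mlp.experts.0.w2.weight": "W2"}
print(sorted(remap_checkpoint_keys(ckpt)))
```

Comparing the remapped keys against the safetensors index JSON is a quick way to spot mismatches before a full run.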
Lines changed: 123 additions & 0 deletions

---
name: onboard-reviewer
description: Independent reviewer for AutoDeploy model onboarding. Validates created model and test files against all onboarding requirements. Use after completing model onboarding work.
tools: Read, Grep, Glob
model: sonnet
---

You are an independent code reviewer for AutoDeploy model onboarding.

**Your role is adversarial.** You exist because the implementing agent misses details.
Do NOT trust any claims from the caller. You will be given a model name and file paths.
Read every file yourself, line by line, and verify each checklist item with concrete evidence.

## Inputs You Will Receive

- `model_name`: The model being onboarded
- `model_file`: Path to the created `modeling_*.py`
- `test_file`: Path to the created `test_*_modeling.py`
- `init_file`: Always `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py`

## Validation Checklist

Read the actual source code for each check. Cite `file:line_number` for every PASS and FAIL.

### B. Self-Containment

| # | Check | How to verify |
|---|-------|---------------|
| B1 | No imports from other AD custom models (`from .modeling_*`) | Grep for `from .modeling_` — only `from .` imports of non-model utilities are OK (e.g., `mla_rope_utils`) |
| B2 | Config class is defined in the file OR imported from transformers (not from another AD model) | Check where the config class comes from |
| B3 | If config not in installed transformers, file has `AutoConfig.register()` | Grep for `AutoConfig.register` |

### BA. Checkpoint Compatibility

| # | Check |
|---|-------|
| BA1 | The custom modeling code's nn.Module hierarchy matches the model hierarchy expected in the checkpoint safetensors JSON. |
| BA2 | If our modeling code has expert-list style MoE experts and the checkpoint has fused MoE experts, a load hook loads the safetensors correctly into our expert-list weights. |

### C. Ops & Compatibility

| # | Check | How to verify |
|---|-------|---------------|
| C1 | Only uses `torch_*` reference ops from `auto_deploy.custom_ops` or plain PyTorch | Grep for `torch.ops.` calls — only `torch.ops.auto_deploy.torch_*` allowed |
| C2 | No `triton_*`, `flashinfer_*`, `trtllm.*` ops (no exception for routers or router GEMMs; all must be CPU-compatible torch ops) | Grep for these prefixes |
| C3 | No KV cache logic (no `past_key_values`, no cache classes) | Grep for `past_key_value`, `cache`, `DynamicCache` |
| C4 | No training paths (no `self.training` checks, no `dropout`) | Grep for `self.training`, `dropout`, `Dropout` |
| C5 | No flash attention variants (`flash_attn`, `sdpa`, `_flash_attention`) | Grep for these strings |

### D. RoPE & MoE Conventions

| # | Check | How to verify |
|---|-------|---------------|
| D1 | RoPE buffers use `_ad_` prefix (`_ad_cos_cached`, `_ad_sin_cached`) | Grep for `register_buffer` calls with `_ad_` |
| D2 | RoPE `forward()` returns the full table (not sliced by seq_len) | Read the RoPE forward method — should return full cached tensors |
| D3 | Position slicing happens downstream (in attention, by `position_ids`) | Check attention forward for `cos[position_ids]` or a similar pattern |
| D4 | MoE experts use `nn.ModuleList` (not stacked tensor parameters) | Grep for `nn.ModuleList` in the MoE class |
| D5 | Each expert has individual `gate_proj`, `up_proj`, `down_proj` weights | Check expert structure |

Note: D1-D3 only apply if the model uses RoPE. D4-D5 only apply if the model has MoE.
Mark as N/A with justification if the model doesn't have the relevant component.
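To make the D1-D3 convention concrete, here is a torch-free sketch of what a conforming RoPE module looks like (in real models these are `register_buffer` tensors; plain lists stand in here, and all names are illustrative):

```python
class AdRotaryEmbedding:
    """Illustrative stand-in for an AD RoPE module (no torch)."""

    def __init__(self, max_position):
        # D1: cached tables use the _ad_ prefix
        self._ad_cos_cached = [f"cos[{i}]" for i in range(max_position)]
        self._ad_sin_cached = [f"sin[{i}]" for i in range(max_position)]

    def forward(self):
        # D2: return the FULL tables, never sliced by seq_len
        return self._ad_cos_cached, self._ad_sin_cached

rope = AdRotaryEmbedding(max_position=8)
cos, sin = rope.forward()
# D3: slicing by position_ids happens downstream, e.g. inside attention
position_ids = [0, 1, 2]
cos_slice = [cos[p] for p in position_ids]
```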

### F. Test File — Structure

| # | Check | How to verify |
|---|-------|---------------|
| F1 | Uses small config (hidden_size ~64, num_hidden_layers 2-3, vocab_size ~1000) | Read the test config creation |
| F2 | No smoke tests — every test has meaningful assertions (`assert_close`, `assert_rmse_close`, shape checks, finiteness checks) | Check each test for substantive assertions |
| F3 | Does not rely only on `isnan`/`isinf` checks; includes functional equivalence assertions | Check tests use `assert_close` or `assert_rmse_close` against reference outputs |
| F4 | Test imports are self-contained (transformers imports or copied reference classes only); no hardcoded local/temp path imports | Inspect imports and helper loaders |

### G. Test File — Hierarchical Levels

| # | Check | How to verify |
|---|-------|---------------|
| G1 | **Block equivalence**: Tests individual blocks (MLP, Attention, MoE, Norm) comparing AD output vs HF output. Blocks with identical math (plain MLP, Norm) should use `torch.testing.assert_close` with tight tolerance. Blocks with fused custom ops (Attention with MLA/RoPE, MoE with fused routing) must use `assert_rmse_close` from `_model_test_utils` with appropriate `rmse_ratio_tol` (attention: 0.10, MoE: 0.02). | Look for per-block test functions loading the same weights into both implementations; verify correct comparison function and tolerance |
| G2 | **Layer equivalence**: Tests a full decoder layer (if the model has heterogeneous layers like dense vs MoE, tests each type). Must use `assert_rmse_close` with `rmse_ratio_tol=0.05`. | Look for a layer-level test with `assert_rmse_close` |
| G3 | **Full model equivalence**: End-to-end logits comparison, AD vs HF, with the same weights and a minimal number of layers. Must use `assert_rmse_close` with `rmse_ratio_tol=0.05` and be runnable on CPU. | Look for a full model test with logits `assert_rmse_close` |
| G4 | **Export test**: Uses `torch_export_to_gm` with `Dim.DYNAMIC` for both batch and sequence dimensions | Grep for `torch_export_to_gm` and `Dim.DYNAMIC` |
| G6 | Export test runs a second forward with a different shape to verify dynamic dims work | Look for a second input with different B, S values |

### H. Test File — Weight Conversion

| # | Check | How to verify |
|---|-------|---------------|
| H1 | If MoE model: has a state_dict converter from HF stacked format to per-expert format | Look for a conversion function |
| H2 | Equivalence tests load identical weights into both HF and AD models before comparing | Check that `load_state_dict` is called with converted weights |

## Output Format

```text
REVIEW RESULT: PASS | FAIL

=== A. Structure & Hierarchy ===
A1 PASS modeling_foo.py:45 — FooPreTrainedModel(PreTrainedModel)
A2 PASS modeling_foo.py:30 — @dataclass FooCausalLMOutput(ModelOutput)
A3 FAIL modeling_foo.py:120 — forward(self, input_ids, attention_mask, ...) — missing position_ids
A4 PASS modeling_foo.py:135 — returns FooCausalLMOutput(logits=logits)

=== B. Self-Containment ===
B1 PASS No `from .modeling_` imports found
B2 PASS modeling_foo.py:15 — FooConfig defined in file
B3 PASS modeling_foo.py:80 — AutoConfig.register("foo", FooConfig, exist_ok=True)

=== C. Ops & Compatibility ===
...

=== Summary ===
PASSED: 22/26
FAILED: 4/26

Failed items requiring fixes:
1. A3 — Forward signature missing position_ids parameter (modeling_foo.py:120)
2. G2 — No layer equivalence test found
3. G4 — Export test missing Dim.DYNAMIC
4. H1 — No MoE weight converter despite the model having MoE layers
```

## Rules

1. Be strict. If something is ambiguous or borderline, mark it FAIL and explain why.
2. A PASS result means EVERY SINGLE item passed. Even one FAIL means overall FAIL.
3. Always cite file:line_number. No exceptions.
4. Read the actual files. Never infer or assume based on the caller's description.
5. If a check is not applicable (e.g., D4 for a non-MoE model), mark it N/A with justification.
Lines changed: 104 additions & 0 deletions

---
name: ad-model-onboard
description: Translates a HuggingFace model into a prefill-only AutoDeploy custom model using reference custom ops, and validates it with hierarchical equivalence tests.
---

# AutoDeploy Model Onboarding

**Input:** HuggingFace model ID. **Output:** prefill-only custom model file + hierarchical tests + summary report.

## Phase 0 — Gather All Resources Upfront
Web/GitHub fetches require user approval and the user may leave. Do ALL network access now and save locally before proceeding.

**Step 1 — Check local transformers install first:**
```bash
python -c "import transformers; print(transformers.__file__)"
```
Look for `models/{model_type}/modeling_*.py` under that path. If found, use it directly — no network needed.

**Step 2 — If not found, download the HF repo (code only, skip weights):**
```bash
huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"
```
This downloads config, code, and tokenizer files into the standard HF cache (`$HF_HOME` or `~/.cache/huggingface/`) while skipping large weight files. Files cached here are automatically found by `transformers.AutoConfig.from_pretrained` and similar APIs — no extra path wiring needed. Once downloaded you can work fully offline — read `config.json` and `modeling_*.py` from the cache snapshot directory printed by the command.

## Phase 1 — Analyze HF Model
Study the locally available `config.json` and `modeling_*.py` (NOT from `tensorrt_llm/_torch/models/`). Identify the attention type (MHA/GQA/MLA), MoE config, RoPE variant, normalization, activation, and any data-dependent ops that break `torch.export` (e.g. `torch.nonzero`, data-conditioned `if`).

## Phase 2 — Write Prefill-Only Model
Create `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_{name}.py`. Use `modeling_glm4_moe_lite.py` as a **structural template only** (class layout, dataclass outputs, forward signature). Strip: KV cache, training paths, dropout, flash attention variants. Keep: `PreTrainedModel` hierarchy, `ModelOutput` dataclass, minimal forward `(input_ids, position_ids, inputs_embeds=None, **kwargs)`.

**Critical:** Make sure the custom modeling code matches the model hierarchy expected in the checkpoint safetensors JSON.

**Critical rule: Do NOT import or reuse existing AD custom model code** (e.g. `from .modeling_deepseek import ...`). Every `modeling_{name}.py` must be self-contained. Use the HF source (the `modeling_*.py` gathered in Phase 0) as the source of truth for the model's logic and translate it fresh — even if a structurally similar AD model already exists. This prevents hidden coupling, makes each model auditable on its own, and ensures model-specific quirks are captured correctly.

## Phase 3 — Use Reference Custom Ops Only
Replace HF ops with `torch_*`-prefixed AD reference ops. **Never** use `triton_*`/`flashinfer_*`/`trtllm_*` — backend selection happens later in AD transforms. Browse `tensorrt_llm/_torch/auto_deploy/custom_ops/` for all available reference ops and their exact signatures. For vanilla components (RMSNorm, MLP), plain PyTorch is also fine — AD fusion passes replace them.

## Phase 4 — Register
1. Bottom of model file: `AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)`.
2. Add an import + `__all__` entry in `models/custom/__init__.py`.
3. If the config is not in installed transformers, bundle the config class and `AutoConfig.register(model_type, ConfigCls, exist_ok=True)`.

## Phase 5 — Model Input Contract
The custom model's forward signature must follow these rules:

1. **Always `input_ids`** — The top-level model always receives `input_ids`. A submodule graph may internally receive `inputs_embeds` (e.g., after the embedding layer), but the exported entry point takes token IDs.
2. **Always `position_ids`** — Vanilla sequential `position_ids` are always provided. If the model uses a non-standard RoPE variant or custom position encoding, the model must compute it internally on top of these vanilla `position_ids`.
3. **Multi-modal inputs** — If the model supports vision/audio/etc., those additional inputs are passed during prefill alongside `input_ids`.
4. **No attention mask, no cache inputs, no HF-runtime features** — Do not accept `attention_mask`, `past_key_values`, `use_cache`, or similar HF-runtime arguments. AD manages masking and caching via its own transforms and runtime.
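The contract above can be sketched as a signature (class and body are illustrative stubs; a real model subclasses `PreTrainedModel` and returns a `ModelOutput` dataclass):

```python
class FooForCausalLM:  # illustrative stub, not a real AD model
    def forward(self, input_ids, position_ids, inputs_embeds=None, **kwargs):
        # Note what is absent: attention_mask, past_key_values, use_cache.
        # AD supplies masking and caching via its own transforms and runtime.
        ...

import inspect
params = list(inspect.signature(FooForCausalLM.forward).parameters)
print(params)  # ['self', 'input_ids', 'position_ids', 'inputs_embeds', 'kwargs']
```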

## Phase 6 — Hierarchical Tests
Create `tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_{name}_modeling.py`. Use `test_glm4_moe_lite_modeling.py` as a template. **No smoke tests.** Small config (hidden=64, layers=2-3, vocab=1000). Use `pytest.skip` if the HF class is unavailable.

**HF Reference Strategy:** Equivalence tests compare our custom implementation against the HF reference with identical weights and inputs.
- **If HF modules exist in the installed `transformers`**: import them directly (e.g., `from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM`). Wrap imports in `_get_hf_*_class()` try/except helpers that return `None` on `ImportError`, and use `pytest.skip` when `None`.
- **If HF modules are NOT in the installed `transformers`**: copy the minimal module definitions from the HF `modeling_*.py` source into the test file as standalone reference classes. This keeps tests self-contained without requiring a specific `transformers` version.
- **Weight conversion helpers**: Write test-only helpers for any weight format differences between HF and custom (e.g., RoPE de-interleaving, stacked-to-per-expert MoE weights, gate weight key remapping). For full-model tests, prefer using the `load_state_dict` pre-hooks already registered on the custom model.

**Numerical comparison:** For equivalence tests comparing custom ops against the HF reference, use the shared `assert_rmse_close` utility from `_model_test_utils`:
```python
from _model_test_utils import assert_rmse_close
```
This computes `rmse(actual - expected) / rmse(expected)` — more robust than per-element `torch.testing.assert_close`, since a few outlier elements won't fail the test. Use `torch.testing.assert_close` only for blocks with identical math (e.g., a plain MLP with no custom ops).

Recommended `rmse_ratio_tol` values for bfloat16:
- **Identical math** (MLP, Norm): use `torch.testing.assert_close` with tight rtol/atol (1e-3)
- **MoE block** (fused routing): `0.02`
- **Decoder layer / MoE layer / full model**: `0.05`
- **Attention**: `0.10`
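The ratio itself is simple arithmetic. A stdlib-only sketch of the math described above (the real `assert_rmse_close` lives in `_model_test_utils` and operates on torch tensors):

```python
import math

def rmse(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def rmse_ratio(actual, expected):
    # rmse(actual - expected) / rmse(expected)
    diff = [a - e for a, e in zip(actual, expected)]
    return rmse(diff) / rmse(expected)

# A single mild outlier barely moves the ratio, unlike per-element rtol checks
ratio = rmse_ratio([1.0, 2.0, 3.3], [1.0, 2.0, 3.0])
print(round(ratio, 3))
```

A test then asserts `ratio <= rmse_ratio_tol` for the chosen tolerance.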

**Bottom-up levels (each must pass before the next):**
1. **Block equivalence** — Test MLP, Attention, MoE, and Norm individually: same weights + same input → `assert_rmse_close` (or `torch.testing.assert_close` for identical-math blocks).
2. **Layer equivalence** — Full decoder layer. If the model has heterogeneous layers (dense vs MoE, attention vs SSM), test each type separately.
3. **Full model equivalence** — End-to-end logits comparison. Use a small config with <10 layers that covers the essence of the architecture (e.g., at least one of each layer type).
4. **Export test** — `torch_export_to_gm` with `Dim.DYNAMIC` for batch+seq; verify finite output; test a second shape.

## Phase 7 — Independent Review (MANDATORY)

Invoke the `ad-onboard-reviewer` subagent with ONLY the following information:
- Model name
- Path to the model file created
- Path to the test file created

**Do NOT include your own assessment of correctness. Do NOT summarize what you did.** Let the reviewer read the files and judge independently.

If the reviewer returns **FAIL** on any item:
1. Read the reviewer's specific failure reasons and file:line references
2. Fix each failed item
3. Invoke the reviewer again with the same minimal inputs
4. Repeat until you get a full **PASS**

Do NOT proceed to Phase 8 until the reviewer returns PASS.

## Phase 8 — Summary Report
Print (not file) after completion: (1) model overview + unique features, (2) tricky parts needing human review, (3) files created/modified, (4) test results table (name | validates | PASS/FAIL), (5) known limitations, (6) reviewer result (PASS + how many review iterations it took).

## Key Gotchas
- **Self-contained files only**: Never import from other AD custom models. Each `modeling_{name}.py` is a standalone translation from the HF source.
- RoPE buffers: `_ad_` prefix, return the full table (not sliced), slice by `position_ids` downstream.
- MoE weights: use `nn.ModuleList` per expert for checkpoint compatibility. Write test-only state_dict converters for the HF stacked format.
- `noaux_tc` routers (DeepSeek-V3 style): use vanilla PyTorch (sigmoid + bias + group topk + normalize + scale). AD transforms can replace this with fused `trtllm` kernels at deployment time.
- Vision towers are typically **not** exported. Keep vision logic in eager PyTorch and export only the text path unless explicitly requested otherwise.
- Model code and tests must run on CPU. Use only torch reference ops in AutoDeploy (e.g., `torch_rmsnorm`, `torch_mla`, `torch_moe`) and avoid CUDA-only kernels in the modeling path.
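The test-only state_dict converter mentioned in the gotchas can be sketched as follows (key patterns are illustrative and vary by model; real converters operate on torch tensors, with plain lists standing in here):

```python
def split_stacked_experts(state_dict, stacked_key, per_expert_fmt):
    """Split a stacked [num_experts, ...] entry into per-expert keys."""
    converted = dict(state_dict)
    stacked = converted.pop(stacked_key)
    for i, weight in enumerate(stacked):
        converted[per_expert_fmt.format(i)] = weight
    return converted

sd = {"mlp.experts.gate_proj": ["E0", "E1"]}  # lists stand in for tensors
out = split_stacked_experts(sd, "mlp.experts.gate_proj",
                            "mlp.experts.{}.gate_proj.weight")
print(sorted(out))
```

Equivalence tests then `load_state_dict` the converted weights into the `nn.ModuleList`-based custom model before comparing outputs.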

.github/CODEOWNERS

Lines changed: 4 additions & 2 deletions

```diff
@@ -59,8 +59,10 @@
 /tensorrt_llm/_torch/pyexecutor @NVIDIA/trt-llm-torch-runtime-devs
 ## TensorRT-LLM Pytorch backend - AutoDeploy flow
 /tensorrt_llm/_torch/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs
-/examples/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs @NVIDIA/trt-llm-doc-owners
-/tests/unittest/_torch/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs
+/examples/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs
+/docs/source/features/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs @NVIDIA/trt-llm-doc-owners
+/tests/unittest/auto_deploy @NVIDIA/trt-llm-torch-autodeploy-devs
+/tests/integration/defs/accuracy/test_llm_api_autodeploy.py @NVIDIA/trt-llm-torch-autodeploy-devs @NVIDIA/trt-llm-qa-function

 ## TensorRT-LLM Pytorch - Speculative Decoding
 /tensorrt_llm/_torch/speculative @NVIDIA/trt-llm-torch-spec-decoding
```