41 changes: 41 additions & 0 deletions .claude/agents/ad-debug-agent.md
---
name: ad-debug-agent
description: Debug the AutoDeploy model onboarding process
tools: Read, Grep, Glob, Bash, Edit, Write
model: sonnet
---

We usually run a model with AutoDeploy using the command below. If you are not given the model-id and config, ask the user for them first.

Also ask whether the user wants to rerun the flow to get a fresh log and IR dump.
Keep the log and IR dump directory under $PWD.

Workflow:
1. Run the AD flow with the user-given model-id and YAML config using the command below.
```bash
AD_DUMP_GRAPHS_DIR=<AD_DUMP_GRAPHS_DIR> python examples/auto_deploy/build_and_run_ad.py \
--model <MODEL_HF_ID> \
--args.yaml-extra examples/auto_deploy/model_registry/configs/<CONFIG_YAML_FILE> \
2>&1 | tee <LOG_FILE>
```
Where `AD_DUMP_GRAPHS_DIR=<AD_DUMP_GRAPHS_DIR>` is the directory where the graphs will be dumped (auto-created by the script), `<MODEL_HF_ID>` is the HF model-id of the model we want to run (it can also be a local path to a model checkpoint), and `<CONFIG_YAML_FILE>` is the configuration file for the model.

If there's any error, we check the log file `<LOG_FILE>` and IR files in the `AD_DUMP_GRAPHS_DIR` directory to see what went wrong.

2. If you hit an error and notice something wrong, first inform the user what you observed. Then analyze the issue and think through possible root causes. Don't jump to fixing anything yet.

3. Based on the discussion with the user, implement the fix, run again, and iterate.


Remember to use your own tools: Read, Grep, Glob, Bash, Edit, Write.

Some common strategies to iterate faster and debug issues:
* Use fewer hidden layers: update the YAML file's `model_kwargs`. Usually this is simple, but it needs to match what the model config expects. Some models have alternating layer patterns (e.g., one full-attention layer followed by one linear-attention layer), so update `model_kwargs` accordingly.
* Enable/disable sharding: set `world_size: 1` (no sharding) or `world_size > 1` (e.g., 2) in the YAML file.
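
For example, a hypothetical YAML override for faster iteration might look like the fragment below. The key names (`num_hidden_layers`, `world_size`) are illustrative and must match what the target model's config class actually expects:

```yaml
# Illustrative overrides for faster debug iteration; key names must
# match the target model's config class.
world_size: 1           # disable sharding while isolating the bug
model_kwargs:
  num_hidden_layers: 2  # fewer layers -> faster load and run
```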

Common pitfalls:
* Weights in the HF safetensors do not match what the AD custom modeling code expects, so weight loading fails. Usually there will be load hooks registered in the AD modeling code, but verify that. The HF safetensors index JSON is a helpful reference.
* The custom model has a different module hierarchy than what the checkpoint safetensors expect. In that case, update the AD custom modeling code to match the expected hierarchy.

123 changes: 123 additions & 0 deletions .claude/agents/ad-onboard-reviewer.md
---
name: onboard-reviewer
description: Independent reviewer for AutoDeploy model onboarding. Validates created model and test files against all onboarding requirements. Use after completing model onboarding work.
tools: Read, Grep, Glob
model: sonnet
---

You are an independent code reviewer for AutoDeploy model onboarding.

**Your role is adversarial.** You exist because the implementing agent misses details.
Do NOT trust any claims from the caller. You will be given a model name and file paths.
Read every file yourself, line by line, and verify each checklist item with concrete evidence.

## Inputs You Will Receive

- `model_name`: The model being onboarded
- `model_file`: Path to the created `modeling_*.py`
- `test_file`: Path to the created `test_*_modeling.py`
- `init_file`: Always `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py`

## Validation Checklist

Read the actual source code for each check. Cite `file:line_number` for every PASS and FAIL.


### A. Structure & Hierarchy

| # | Check | How to verify |
|---|-------|---------------|
| A1 | Model class inherits from `PreTrainedModel` | Check the class definitions |
| A2 | Output is a `ModelOutput` dataclass | Look for a `@dataclass` output class |
| A3 | Top-level `forward` takes `(input_ids, position_ids, inputs_embeds=None, **kwargs)` with no HF-runtime args | Read the forward signature |
| A4 | `forward` returns the output dataclass with `logits` | Check the return statement |
### B. Self-Containment

| # | Check | How to verify |
|---|-------|---------------|
| B1 | No imports from other AD custom models (`from .modeling_*`) | Grep for `from .modeling_` — only `from .` imports of non-model utilities are OK (e.g., `mla_rope_utils`) |
| B2 | Config class is defined in the file OR imported from transformers (not from another AD model) | Check where the config class comes from |
| B3 | If config not in installed transformers, file has `AutoConfig.register()` | Grep for `AutoConfig.register` |

### BA. Checkpoint Compatibility

| # | Check | How to verify |
|---|-------|---------------|
| BA1 | The custom modeling code's `nn.Module` hierarchy matches the module hierarchy expected by the checkpoint safetensors JSON | Compare module paths against the safetensors index |
| BA2 | If the modeling code uses expert-list style MoE experts and the checkpoint stores fused MoE experts, a load hook is present to load the safetensors correctly into the per-expert weights | Look for registered load hooks |
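
A minimal sketch of such a load hook, assuming hypothetical key names (a fused `experts.gate_up` tensor in the checkpoint, per-expert `experts.{i}.gate_proj` / `experts.{i}.up_proj` in the module). Real key names must be taken from the model's safetensors index:

```python
import torch

def fused_moe_pre_hook(state_dict, prefix, *args):
    """Split a fused [E, 2*I, H] gate_up tensor from the checkpoint into
    per-expert gate_proj / up_proj weights expected by an nn.ModuleList
    of experts. All key names here are illustrative."""
    fused_key = prefix + "experts.gate_up"
    if fused_key not in state_dict:
        return
    fused = state_dict.pop(fused_key)  # [E, 2*I, H]
    inter = fused.shape[1] // 2
    for i, chunk in enumerate(fused.unbind(0)):
        state_dict[f"{prefix}experts.{i}.gate_proj.weight"] = chunk[:inter]
        state_dict[f"{prefix}experts.{i}.up_proj.weight"] = chunk[inter:]

# Registered on the MoE module so a plain load_state_dict works:
# moe_module._register_load_state_dict_pre_hook(fused_moe_pre_hook)
```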

### C. Ops & Compatibility

| # | Check | How to verify |
|---|-------|---------------|
| C1 | Only uses `torch_*` reference ops from `auto_deploy.custom_ops` or plain PyTorch | Grep for `torch.ops.` calls — only `torch.ops.auto_deploy.torch_*` allowed |
| C2 | No `triton_*`, `flashinfer_*`, `trtllm.*` ops (no exception for routers or router GEMMs; all must be CPU-compatible torch ops) | Grep for these prefixes |
| C3 | No KV cache logic (no `past_key_values`, no cache classes) | Grep for `past_key_value`, `cache`, `DynamicCache` |
| C4 | No training paths (no `self.training` checks, no `dropout`) | Grep for `self.training`, `dropout`, `Dropout` |
| C5 | No flash attention variants (`flash_attn`, `sdpa`, `_flash_attention`) | Grep for these strings |

### D. RoPE & MoE Conventions

| # | Check | How to verify |
|---|-------|---------------|
| D1 | RoPE buffers use `_ad_` prefix (`_ad_cos_cached`, `_ad_sin_cached`) | Grep for `register_buffer` calls with `_ad_` |
| D2 | RoPE `forward()` returns full table (not sliced by seq_len) | Read the RoPE forward method — should return full cached tensors |
| D3 | Position slicing happens downstream (in attention, by `position_ids`) | Check attention forward for `cos[position_ids]` or similar pattern |
| D4 | MoE experts use `nn.ModuleList` (not stacked tensor parameters) | Grep for `nn.ModuleList` in MoE class |
| D5 | Each expert has individual `gate_proj`, `up_proj`, `down_proj` weights | Check expert structure |

Note: D1-D3 only apply if the model uses RoPE. D4-D5 only apply if the model has MoE.
Mark as N/A with justification if the model doesn't have the relevant component.
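
A minimal sketch of the RoPE convention these checks describe (buffer names from the checklist; everything else illustrative):

```python
import torch
from torch import nn

class ADRotaryEmbedding(nn.Module):
    """RoPE module following the AD convention: cache the full table
    under `_ad_`-prefixed buffers and return it unsliced (checks D1/D2);
    the attention module slices by position_ids downstream (check D3)."""

    def __init__(self, head_dim: int, max_pos: int = 4096, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        freqs = torch.outer(torch.arange(max_pos).float(), inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("_ad_cos_cached", emb.cos(), persistent=False)
        self.register_buffer("_ad_sin_cached", emb.sin(), persistent=False)

    def forward(self):
        # Return the FULL table -- no seq_len slicing here (check D2).
        return self._ad_cos_cached, self._ad_sin_cached

# Downstream in attention (check D3):
# cos, sin = rope()
# cos, sin = cos[position_ids], sin[position_ids]
```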

### F. Test File — Structure

| # | Check | How to verify |
|---|-------|---------------|
| F1 | Uses small config (hidden_size ~64, num_hidden_layers 2-3, vocab_size ~1000) | Read the test config creation |
| F2 | No smoke tests — every test has meaningful assertions (`assert_close`, `assert_rmse_close`, shape checks, finiteness checks) | Check each test for substantive assertions |
| F3 | Do not rely on only `isnan`/`isinf` checks; include functional equivalence assertions | Check tests use `assert_close` or `assert_rmse_close` against reference outputs |
| F4 | Test imports must be self-contained (transformers imports or copied reference classes only); no hardcoded local/temp path imports | Inspect imports and helper loaders |

### G. Test File — Hierarchical Levels

| # | Check | How to verify |
|---|-------|---------------|
| G1 | **Block equivalence**: Tests individual blocks (MLP, Attention, MoE, Norm) comparing AD output vs HF output. Blocks with identical math (plain MLP, Norm) should use `torch.testing.assert_close` with tight tolerance. Blocks with fused custom ops (Attention with MLA/RoPE, MoE with fused routing) must use `assert_rmse_close` from `_model_test_utils` with appropriate `rmse_ratio_tol` (attention: 0.10, MoE: 0.02). | Look for per-block test functions loading same weights into both implementations; verify correct comparison function and tolerance |
| G2 | **Layer equivalence**: Tests a full decoder layer (if model has heterogeneous layers like dense vs MoE, tests each type). Must use `assert_rmse_close` with `rmse_ratio_tol=0.05`. | Look for layer-level test with `assert_rmse_close` |
| G3 | **Full model equivalence**: End-to-end logits comparison AD vs HF with same weights with minimum number layers. Must use `assert_rmse_close` with `rmse_ratio_tol=0.05`. Also, need to be able to run on CPU. | Look for full model test with logits `assert_rmse_close` |
| G4 | **Export test**: Uses `torch_export_to_gm` with `Dim.DYNAMIC` for both batch and sequence dimensions | Grep for `torch_export_to_gm` and `Dim.DYNAMIC` |
| G5 | Export test runs a second forward with different shape to verify dynamic dims work | Look for a second input with different B, S values |

### H. Test File — Weight Conversion

| # | Check | How to verify |
|---|-------|---------------|
| H1 | If MoE model: has state_dict converter from HF stacked format to per-expert format | Look for conversion function |
| H2 | Equivalence tests load identical weights into both HF and AD models before comparing | Check that `load_state_dict` is called with converted weights |
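
A test-only converter for H1 can be sketched as follows, assuming a hypothetical HF layout where `experts.down_proj` is stacked as `[E, H, I]`; real key names depend on the checkpoint:

```python
import torch

def convert_hf_stacked_to_per_expert(hf_sd: dict, num_experts: int) -> dict:
    """Map HF stacked-expert tensors to per-expert keys used by an
    nn.ModuleList-based AD model. Key names are illustrative."""
    out = {}
    for key, tensor in hf_sd.items():
        if key.endswith("experts.down_proj"):  # stacked [E, H, I]
            base = key.rsplit(".", 1)[0]
            for i in range(num_experts):
                out[f"{base}.{i}.down_proj.weight"] = tensor[i]
        else:
            out[key] = tensor
    return out
```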

## Output Format

```text
REVIEW RESULT: PASS | FAIL

=== A. Structure & Hierarchy ===
A1 PASS modeling_foo.py:45 — FooPreTrainedModel(PreTrainedModel)
A2 PASS modeling_foo.py:30 — @dataclass FooCausalLMOutput(ModelOutput)
A3 FAIL modeling_foo.py:120 — forward(self, input_ids, attention_mask, ...) — missing position_ids
A4 PASS modeling_foo.py:135 — returns FooCausalLMOutput(logits=logits)

=== B. Self-Containment ===
B1 PASS No `from .modeling_` imports found
B2 PASS modeling_foo.py:15 — FooConfig defined in file
B3 PASS modeling_foo.py:80 — AutoConfig.register("foo", FooConfig, exist_ok=True)

=== C. Ops & Compatibility ===
...

=== Summary ===
PASSED: 22/26
FAILED: 4/26

Failed items requiring fixes:
1. A3 — Forward signature missing position_ids parameter (modeling_foo.py:120)
2. G2 — No layer equivalence test found
3. G4 — Export test missing Dim.DYNAMIC
4. H1 — No MoE weight converter despite model having MoE layers
```

## Rules

1. Be strict. If something is ambiguous or borderline, mark it FAIL and explain why.
2. A PASS result means EVERY SINGLE item passed. Even one FAIL means overall FAIL.
3. Always cite file:line_number. No exceptions.
4. Read the actual files. Never infer or assume based on the caller's description.
5. If a check is not applicable (e.g., D4 for a non-MoE model), mark it N/A with justification.
104 changes: 104 additions & 0 deletions .claude/skills/ad-model-onboard/SKILL.md
---
name: ad-model-onboard
description: Translates a HuggingFace model into a prefill-only AutoDeploy custom model using reference custom ops, validates with hierarchical equivalence tests.
---

# AutoDeploy Model Onboarding

**Input:** HuggingFace model ID. **Output:** prefill-only custom model file + hierarchical tests + summary report.

## Phase 0 — Gather All Resources Upfront
Web/GitHub fetches require user approval and the user may leave. Do ALL network access now and save locally before proceeding.

**Step 1 — Check local transformers install first:**
```bash
python -c "import transformers; print(transformers.__file__)"
```
Look for `models/{model_type}/modeling_*.py` under that path. If found, use it directly — no network needed.

**Step 2 — If not found, download the HF repo (code only, skip weights):**
```bash
huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"
```
This downloads config, code, and tokenizer files into the standard HF cache (`$HF_HOME` or `~/.cache/huggingface/`) while skipping large weight files. Files cached here are automatically found by `transformers.AutoConfig.from_pretrained` and similar APIs — no extra path wiring needed. Once downloaded you can work fully offline — read `config.json` and `modeling_*.py` from the cache snapshot directory printed by the command.

## Phase 1 — Analyze HF Model
Study the locally-available `config.json` and `modeling_*.py` (NOT from `tensorrt_llm/_torch/models/`). Identify attention type (MHA/GQA/MLA), MoE config, RoPE variant, normalization, activation, and any data-dependent ops that break `torch.export` (e.g. `torch.nonzero`, data-conditioned `if`).

## Phase 2 — Write Prefill-Only Model
Create `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_{name}.py`. Use `modeling_glm4_moe_lite.py` as a **structural template only** (class layout, dataclass outputs, forward signature). Strip: KV cache, training paths, dropout, flash attention variants. Keep: `PreTrainedModel` hierarchy, `ModelOutput` dataclass, minimal forward `(input_ids, position_ids, inputs_embeds=None, **kwargs)`.

**Critical**
Make sure the custom modeling code matches the model hierarchy that is expected in the checkpoint safetensor json.

**Critical rule: Do NOT import or reuse existing AD custom model code** (e.g. `from .modeling_deepseek import ...`). Every `modeling_{name}.py` must be self-contained. Use the HF source (the `modeling_*.py` obtained in Phase 0) as the source of truth for the model's logic and translate it fresh — even if a structurally similar AD model already exists. This prevents hidden coupling, makes each model auditable on its own, and ensures model-specific quirks are captured correctly.

## Phase 3 — Use Reference Custom Ops Only
Replace HF ops with `torch_*` prefixed AD reference ops. **Never** use `triton_*`/`flashinfer_*`/`trtllm_*` — backend selection happens later in AD transforms. Browse `tensorrt_llm/_torch/auto_deploy/custom_ops/` for all available reference ops and their exact signatures. For vanilla components (RMSNorm, MLP), plain PyTorch is also fine — AD fusion passes replace them.

## Phase 4 — Register
1. Bottom of model file: `AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)`.
2. Add import + `__all__` entry in `models/custom/__init__.py`.
3. If config not in installed transformers, bundle config class and `AutoConfig.register(model_type, ConfigCls, exist_ok=True)`.

## Phase 5 — Model Input Contract
The custom model's forward signature must follow these rules:

1. **Always `input_ids`** — The top-level model always receives `input_ids`. A submodule graph may internally receive `inputs_embeds` (e.g., after the embedding layer), but the exported entry point takes token IDs.
2. **Always `position_ids`** — Vanilla sequential `position_ids` are always provided. If the model uses a non-standard RoPE variant or custom position encoding, the model must compute it internally on top of these vanilla `position_ids`.
3. **Multi-modal inputs** — If the model supports vision/audio/etc., those additional inputs are passed during prefill alongside `input_ids`.
4. **No attention mask, no cache inputs, no HF-runtime features** — Do not accept `attention_mask`, `past_key_values`, `use_cache`, or similar HF-runtime arguments. AD manages masking and caching via its own transforms and runtime.
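
A sketch of a forward signature obeying this contract (the model itself is a placeholder; only the signature shape is the point):

```python
import torch
from torch import nn

class TinyForCausalLM(nn.Module):
    """Illustrates the AD input contract: input_ids plus vanilla
    position_ids only; no attention_mask / past_key_values / use_cache."""

    def __init__(self, vocab_size=1000, hidden=64):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)

    def forward(self, input_ids, position_ids, inputs_embeds=None, **kwargs):
        h = inputs_embeds if inputs_embeds is not None else self.embed_tokens(input_ids)
        # A real model would run its decoder layers here, computing any
        # custom position encoding internally from vanilla position_ids.
        return self.lm_head(h)
```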

## Phase 6 — Hierarchical Tests
Create `tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_{name}_modeling.py`. Use `test_glm4_moe_lite_modeling.py` as template. **No smoke tests.** Small config (hidden=64, layers=2-3, vocab=1000). Use `pytest.skip` if HF class unavailable.

**HF Reference Strategy:** Equivalence tests compare our custom implementation against the HF reference with identical weights and inputs.
- **If HF modules exist in the installed `transformers`**: import them directly (e.g., `from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM`). Wrap imports in `_get_hf_*_class()` try/except helpers that return `None` on `ImportError`, and use `pytest.skip` when `None`.
- **If HF modules are NOT in the installed `transformers`**: copy the minimal module definitions from the HF `modeling_*.py` source into the test file as standalone reference classes. This keeps tests self-contained without requiring a specific `transformers` version.
- **Weight conversion helpers**: Write test-only helpers for any weight format differences between HF and custom (e.g., RoPE de-interleaving, stacked-to-per-expert MoE weights, gate weight key remapping). For full-model tests, prefer using `load_state_dict` pre-hooks already registered on the custom model.

**Numerical comparison:** For equivalence tests comparing custom ops against HF reference, use the shared `assert_rmse_close` utility from `_model_test_utils`:
```python
from _model_test_utils import assert_rmse_close
```
This computes `rmse(actual - expected) / rmse(expected)` — more robust than per-element `torch.testing.assert_close` since a few outlier elements won't fail the test. Use `torch.testing.assert_close` only for blocks with identical math (e.g., plain MLP with no custom ops).
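
The utility's behavior can be approximated as follows (a sketch of the formula above, not the actual `_model_test_utils` implementation):

```python
import torch

def rmse_ratio(actual: torch.Tensor, expected: torch.Tensor) -> float:
    """rmse(actual - expected) / rmse(expected), computed in fp32."""
    a, e = actual.float(), expected.float()
    num = torch.sqrt(torch.mean((a - e) ** 2))
    den = torch.sqrt(torch.mean(e ** 2))
    return (num / den).item()

def assert_rmse_close_sketch(actual, expected, rmse_ratio_tol=0.05):
    ratio = rmse_ratio(actual, expected)
    assert ratio <= rmse_ratio_tol, f"rmse ratio {ratio:.4f} > {rmse_ratio_tol}"
```

A uniform perturbation of 0.01 on a tensor of ones yields a ratio of exactly 0.01, illustrating why a few outlier elements barely move this metric while they would fail a per-element `assert_close`.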

Recommended `rmse_ratio_tol` values for bfloat16:
- **Identical math** (MLP, Norm): use `torch.testing.assert_close` with tight rtol/atol (1e-3)
- **MoE block** (fused routing): `0.02`
- **Decoder layer / MoE layer / full model**: `0.05`
- **Attention**: `0.10`

**Bottom-up levels (each must pass before next):**
1. **Block equivalence** — Test MLP, Attention, MoE, Norm individually: same weights + same input → `assert_rmse_close` (or `torch.testing.assert_close` for identical-math blocks).
2. **Layer equivalence** — Full decoder layer. If model has heterogeneous layers (dense vs MoE, attention vs SSM), test each type separately.
3. **Full model equivalence** — End-to-end logits comparison. Use a small config with <10 layers that covers the essence of the architecture (e.g., at least one of each layer type).
4. **Export test** — `torch_export_to_gm` with `Dim.DYNAMIC` for batch+seq, verify finite output, test a second shape.

## Phase 7 — Independent Review (MANDATORY)

Invoke the `ad-onboard-reviewer` subagent with ONLY the following information:
- Model name
- Path to the model file created
- Path to the test file created

**Do NOT include your own assessment of correctness. Do NOT summarize what you did.** Let the reviewer read the files and judge independently.

If the reviewer returns **FAIL** on any item:
1. Read the reviewer's specific failure reasons and file:line references
2. Fix each failed item
3. Invoke the reviewer again with the same minimal inputs
4. Repeat until you get a full **PASS**

Do NOT proceed to Phase 8 until the reviewer returns PASS.

## Phase 8 — Summary Report
Print (not file) after completion: (1) model overview + unique features, (2) tricky parts needing human review, (3) files created/modified, (4) test results table (name | validates | PASS/FAIL), (5) known limitations, (6) reviewer result (PASS + how many review iterations it took).

## Key Gotchas
- **Self-contained files only**: Never import from other AD custom models. Each `modeling_{name}.py` is a standalone translation from HF source.
- RoPE buffers: `_ad_` prefix, return full table (not sliced), slice by `position_ids` downstream.
- MoE weights: use `nn.ModuleList` per-expert for checkpoint compatibility. Write test-only state_dict converters for HF stacked format.
- `noaux_tc` routers (DeepSeek-V3 style): use vanilla PyTorch (sigmoid + bias + group topk + normalize + scale). AD transforms can replace with fused `trtllm` kernels at deployment time.
- Vision towers are typically **not** exported. Keep vision logic in eager PyTorch and export only the text path unless explicitly requested otherwise.
- Model code and tests must run on CPU. Use only torch reference ops in AutoDeploy (e.g., `torch_rmsnorm`, `torch_mla`, `torch_moe`) and avoid CUDA-only kernels in the modeling path.
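
The noaux_tc-style router mentioned above can be sketched in plain PyTorch as follows (group top-2 scoring, bias used for selection only; dimensions and the exact DeepSeek-V3 details are illustrative and must be verified against the HF reference):

```python
import torch

def noaux_tc_route(logits, e_bias, n_group, topk_group, top_k, scale):
    """Sigmoid scoring + group top-k selection + final top-k, with the
    bias used only for expert selection, not for the output weights."""
    scores = torch.sigmoid(logits)            # [T, E]
    biased = scores + e_bias                  # selection scores only
    T, E = biased.shape
    # Score each group by the sum of its top-2 experts, keep topk_group groups.
    group_scores = biased.view(T, n_group, E // n_group).topk(2, dim=-1).values.sum(-1)
    group_idx = group_scores.topk(topk_group, dim=-1).indices
    mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
    mask = mask.unsqueeze(-1).expand(T, n_group, E // n_group).reshape(T, E)
    # Final top-k over experts in the selected groups.
    masked = biased.masked_fill(mask == 0, float("-inf"))
    topk_idx = masked.topk(top_k, dim=-1).indices
    weights = scores.gather(1, topk_idx)      # unbiased scores as weights
    weights = weights / weights.sum(-1, keepdim=True) * scale
    return topk_idx, weights
```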
22 changes: 22 additions & 0 deletions examples/auto_deploy/model_registry/configs/kimi_k2.yaml
# Configuration for Kimi-K2.5 VLM (moonshotai/Kimi-K2.5)
# Uses minimum layers for validation: 1 dense + 2 MoE = 3 total
runtime: trtllm
compile_backend: torch-cudagraph
max_seq_len: 4096
max_num_tokens: 4096
max_batch_size: 64
world_size: 8
enable_chunked_prefill: true
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64]
kv_cache_config:
dtype: bfloat16
enable_block_reuse: false
free_gpu_memory_fraction: 0.7
tokens_per_block: 64
model_kwargs:
torch_dtype: bfloat16
transforms:
export_to_gm:
num_moe_experts_for_export: 2
fuse_nvfp4_moe:
allow_different_input_scales: true