---
name: ad-model-onboard
description: Translates a HuggingFace model into a prefill-only AutoDeploy custom model using reference custom ops and validates it with hierarchical equivalence tests.
---

# AutoDeploy Model Onboarding

**Input:** HuggingFace model ID. **Output:** prefill-only custom model file + hierarchical tests + summary report.

## Phase 0 — Gather All Resources Upfront
Web/GitHub fetches require user approval and the user may leave. Do ALL network access now and save locally before proceeding.

**Step 1 — Check local transformers install first:**
```bash
python -c "import transformers; print(transformers.__file__)"
```
Look for `models/{model_type}/modeling_*.py` under that path. If found, use it directly — no network needed.

**Step 2 — If not found, download the HF repo (code only, skip weights):**
```bash
huggingface-cli download {org}/{model} --exclude "*.safetensors" "*.bin" "*.pt" "*.gguf"
```
This downloads config, code, and tokenizer files into the standard HF cache (`$HF_HOME` or `~/.cache/huggingface/`) while skipping large weight files. Files cached here are automatically found by `transformers.AutoConfig.from_pretrained` and similar APIs — no extra path wiring needed. Once downloaded you can work fully offline — read `config.json` and `modeling_*.py` from the cache snapshot directory printed by the command.
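The cache snapshot can also be located without any HF libraries. A minimal sketch, assuming the standard `hub/models--{org}--{name}/snapshots/{revision}` cache layout (the helper name is hypothetical):

```python
import glob
import os
from typing import Optional


def find_cached_snapshot(repo_id: str, cache_dir: Optional[str] = None) -> str:
    """Return a cached snapshot directory for a downloaded repo.

    Assumes the standard HF cache layout:
    {cache}/hub/models--{org}--{name}/snapshots/{revision}/
    """
    cache = cache_dir or os.environ.get("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
    repo_dir = os.path.join(cache, "hub", "models--" + repo_id.replace("/", "--"))
    snapshots = sorted(glob.glob(os.path.join(repo_dir, "snapshots", "*")))
    if not snapshots:
        raise FileNotFoundError(f"no cached snapshot for {repo_id}; run the download step first")
    return snapshots[-1]
```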

## Phase 1 — Analyze HF Model
Study the locally-available `config.json` and `modeling_*.py` (NOT from `tensorrt_llm/_torch/models/`). Identify attention type (MHA/GQA/MLA), MoE config, RoPE variant, normalization, activation, and any data-dependent ops that break `torch.export` (e.g. `torch.nonzero`, data-conditioned `if`).
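Much of that analysis starts from `config.json`. A rough classifier sketch, assuming the field names (`num_key_value_heads`, `kv_lora_rank`, ...) used by common HF config families; actual names vary per model, so always confirm against the modeling source:

```python
def classify_attention(config: dict) -> str:
    """Rough attention-type classifier over common HF config fields.

    Field names vary across model families; treat this as a starting
    point for the analysis, not a rule.
    """
    if "kv_lora_rank" in config or "q_lora_rank" in config:
        return "MLA"  # latent-attention (DeepSeek-style) configs expose lora-rank fields
    n_heads = config.get("num_attention_heads")
    n_kv = config.get("num_key_value_heads", n_heads)
    if n_heads and n_kv and n_kv < n_heads:
        return "GQA"  # fewer KV heads than query heads
    return "MHA"
```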

## Phase 2 — Write Prefill-Only Model
Create `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_{name}.py`. Use `modeling_glm4_moe_lite.py` as a **structural template only** (class layout, dataclass outputs, forward signature). Strip: KV cache, training paths, dropout, flash attention variants. Keep: `PreTrainedModel` hierarchy, `ModelOutput` dataclass, minimal forward `(input_ids, position_ids, inputs_embeds=None, **kwargs)`.
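The forward contract can be sketched as follows. This is a minimal illustration with a hypothetical `TinyPrefillModel` on plain `nn.Module`; the real file keeps the `PreTrainedModel` hierarchy, a `ModelOutput` dataclass, and a full decoder stack:

```python
import torch
import torch.nn as nn


class TinyPrefillModel(nn.Module):
    """Hypothetical minimal model showing the prefill-only forward contract.

    The real file subclasses PreTrainedModel and returns a ModelOutput
    dataclass; this sketch only illustrates the signature and data flow.
    """

    def __init__(self, vocab_size: int = 1000, hidden: int = 64):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.norm = nn.LayerNorm(hidden)  # stand-in for the model's real norm
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)

    def forward(self, input_ids, position_ids, inputs_embeds=None, **kwargs):
        # No attention_mask, past_key_values, or use_cache: AD supplies those later.
        hidden_states = inputs_embeds if inputs_embeds is not None else self.embed_tokens(input_ids)
        # position_ids would feed RoPE inside the (omitted) decoder layer stack.
        return self.lm_head(self.norm(hidden_states))
```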

**Critical:** the custom model's module hierarchy must produce parameter names that match those expected by the checkpoint's safetensors index (`model.safetensors.index.json`), otherwise weight loading will fail.

**Critical rule: Do NOT import or reuse existing AD custom model code** (e.g. `from .modeling_deepseek import ...`). Every `modeling_{name}.py` must be self-contained. Use the HF source `modeling_*.py` gathered in Phase 0 as the source of truth for the model's logic and translate it fresh — even if a structurally similar AD model already exists. This prevents hidden coupling, makes each model auditable on its own, and ensures model-specific quirks are captured correctly.

## Phase 3 — Use Reference Custom Ops Only
Replace HF ops with `torch_*` prefixed AD reference ops. **Never** use `triton_*`/`flashinfer_*`/`trtllm_*` — backend selection happens later in AD transforms. Browse `tensorrt_llm/_torch/auto_deploy/custom_ops/` for all available reference ops and their exact signatures. For vanilla components (RMSNorm, MLP), plain PyTorch is also fine — AD fusion passes replace them.
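As an example of the "plain PyTorch is fine" case, a vanilla RMSNorm that fusion passes can later pattern-match. This is a sketch mirroring the common HF formulation, not AD's exact reference op:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Vanilla RMSNorm in plain PyTorch, mirroring the common HF formulation."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        input_dtype = x.dtype
        x = x.float()  # compute in fp32 for numerical stability
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x.to(input_dtype)
```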

## Phase 4 — Register
1. Bottom of model file: `AutoModelForCausalLMFactory.register_custom_model_cls("ConfigClassName", ForCausalLM)`.
2. Add import + `__all__` entry in `models/custom/__init__.py`.
3. If config not in installed transformers, bundle config class and `AutoConfig.register(model_type, ConfigCls, exist_ok=True)`.

## Phase 5 — Model Input Contract
The custom model's forward signature must follow these rules:

1. **Always `input_ids`** — The top-level model always receives `input_ids`. A submodule graph may internally receive `inputs_embeds` (e.g., after the embedding layer), but the exported entry point takes token IDs.
2. **Always `position_ids`** — Vanilla sequential `position_ids` are always provided. If the model uses a non-standard RoPE variant or custom position encoding, the model must compute it internally on top of these vanilla `position_ids`.
3. **Multi-modal inputs** — If the model supports vision/audio/etc., those additional inputs are passed during prefill alongside `input_ids`.
4. **No attention mask, no cache inputs, no HF-runtime features** — Do not accept `attention_mask`, `past_key_values`, `use_cache`, or similar HF-runtime arguments. AD manages masking and caching via its own transforms and runtime.
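For concreteness, the vanilla inputs described above look like this (shapes are hypothetical):

```python
import torch

batch_size, seq_len = 2, 6  # hypothetical prefill shapes
input_ids = torch.randint(0, 1000, (batch_size, seq_len))
# vanilla sequential positions, one row per sequence in the batch
position_ids = torch.arange(seq_len).unsqueeze(0).expand(batch_size, -1)
```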

## Phase 6 — Hierarchical Tests
Create `tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_{name}_modeling.py`. Use `test_glm4_moe_lite_modeling.py` as template. **No smoke tests.** Small config (hidden=64, layers=2-3, vocab=1000). Use `pytest.skip` if HF class unavailable.

**HF Reference Strategy:** Equivalence tests compare our custom implementation against the HF reference with identical weights and inputs.
- **If HF modules exist in the installed `transformers`**: import them directly (e.g., `from transformers.models.deepseek_v3.modeling_deepseek_v3 import DeepseekV3ForCausalLM`). Wrap imports in `_get_hf_*_class()` try/except helpers that return `None` on `ImportError`, and use `pytest.skip` when `None`.
- **If HF modules are NOT in the installed `transformers`**: copy the minimal module definitions from the HF `modeling_*.py` source into the test file as standalone reference classes. This keeps tests self-contained without requiring a specific `transformers` version.
- **Weight conversion helpers**: Write test-only helpers for any weight format differences between HF and custom (e.g., RoPE de-interleaving, stacked-to-per-expert MoE weights, gate weight key remapping). For full-model tests, prefer using `load_state_dict` pre-hooks already registered on the custom model.
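A test-only converter for the stacked-to-per-expert MoE case might look like this sketch; all key names here are hypothetical and must be matched to the real HF and custom-model state dicts:

```python
import torch


def split_stacked_experts(hf_sd: dict, prefix: str, num_experts: int) -> dict:
    """Test-only sketch: split stacked HF MoE weights [num_experts, out, in]
    into per-expert nn.ModuleList keys. All key names here are hypothetical;
    match them to the actual HF and custom-model state dicts."""
    converted = {}
    for proj in ("gate_proj", "up_proj", "down_proj"):
        key = f"{prefix}.experts.{proj}"  # hypothetical stacked key name
        if key not in hf_sd:
            continue
        stacked = hf_sd.pop(key)
        for e in range(num_experts):
            converted[f"{prefix}.experts.{e}.{proj}.weight"] = stacked[e]
    converted.update(hf_sd)  # pass every other weight through unchanged
    return converted
```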

**Numerical comparison:** For equivalence tests comparing custom ops against HF reference, use the shared `assert_rmse_close` utility from `_model_test_utils`:
```python
from _model_test_utils import assert_rmse_close
```
This computes `rmse(actual - expected) / rmse(expected)` — more robust than per-element `torch.testing.assert_close` since a few outlier elements won't fail the test. Use `torch.testing.assert_close` only for blocks with identical math (e.g., plain MLP with no custom ops).
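For intuition, the metric is assumed to behave like the sketch below; the authoritative definition lives in `_model_test_utils`:

```python
import torch


def rmse_ratio(actual: torch.Tensor, expected: torch.Tensor) -> float:
    """Assumed metric behind assert_rmse_close: relative RMSE of the error."""
    rmse = lambda t: t.float().pow(2).mean().sqrt()
    return (rmse(actual - expected) / rmse(expected)).item()
```

A uniform 1% error yields a ratio of 0.01, while a single outlier element barely moves it, which is why it tolerates bfloat16 noise better than per-element checks.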

Recommended `rmse_ratio_tol` values for bfloat16:
- **Identical math** (MLP, Norm): use `torch.testing.assert_close` with tight rtol/atol (1e-3)
- **MoE block** (fused routing): `0.02`
- **Decoder layer / MoE layer / full model**: `0.05`
- **Attention**: `0.10`

**Bottom-up levels (each must pass before next):**
1. **Block equivalence** — Test MLP, Attention, MoE, Norm individually: same weights + same input → `assert_rmse_close` (or `torch.testing.assert_close` for identical-math blocks).
2. **Layer equivalence** — Full decoder layer. If model has heterogeneous layers (dense vs MoE, attention vs SSM), test each type separately.
3. **Full model equivalence** — End-to-end logits comparison. Use a small config with <10 layers that covers the essence of the architecture (e.g., at least one of each layer type).
4. **Export test** — `torch_export_to_gm` with `Dim.DYNAMIC` for batch+seq, verify finite output, test a second shape.

## Phase 7 — Independent Review (MANDATORY)

Invoke the `ad-onboard-reviewer` subagent with ONLY the following information:
- Model name
- Path to the model file created
- Path to the test file created

**Do NOT include your own assessment of correctness. Do NOT summarize what you did.** Let the reviewer read the files and judge independently.

If the reviewer returns **FAIL** on any item:
1. Read the reviewer's specific failure reasons and file:line references
2. Fix each failed item
3. Invoke the reviewer again with the same minimal inputs
4. Repeat until you get a full **PASS**

Do NOT proceed to Phase 8 until the reviewer returns PASS.

## Phase 8 — Summary Report
Print the report to the console (do not write it to a file) after completion: (1) model overview + unique features, (2) tricky parts needing human review, (3) files created/modified, (4) test results table (name | validates | PASS/FAIL), (5) known limitations, (6) reviewer result (PASS + number of review iterations required).

## Key Gotchas
- **Self-contained files only**: Never import from other AD custom models. Each `modeling_{name}.py` is a standalone translation from HF source.
- RoPE buffers: `_ad_` prefix, return full table (not sliced), slice by `position_ids` downstream.
- MoE weights: use `nn.ModuleList` per-expert for checkpoint compatibility. Write test-only state_dict converters for HF stacked format.
- `noaux_tc` routers (DeepSeek-V3 style): use vanilla PyTorch (sigmoid + bias + group topk + normalize + scale). AD transforms can replace with fused `trtllm` kernels at deployment time.
- Vision towers are typically **not** exported. Keep vision logic in eager PyTorch and export only the text path unless explicitly requested otherwise.
- Model code and tests must run on CPU. Use only torch reference ops in AutoDeploy (e.g., `torch_rmsnorm`, `torch_mla`, `torch_moe`) and avoid CUDA-only kernels in the modeling path.
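The `noaux_tc` router gotcha above can be sketched in plain PyTorch. Shapes and the exact selection rule are illustrative, so check the HF source for the model at hand:

```python
import torch


def noaux_topk_route(logits, e_bias, n_group, topk_group, top_k, scale):
    """Sketch of a DeepSeek-V3-style no-aux router in plain PyTorch.

    logits: [tokens, experts]; e_bias: [experts] correction bias that
    influences which experts are selected but not the returned weights.
    """
    scores = torch.sigmoid(logits)
    sel = scores + e_bias  # bias shifts selection only
    tokens, experts = sel.shape
    # score each group by the sum of its top-2 experts, keep topk_group groups
    group_scores = sel.view(tokens, n_group, -1).topk(2, dim=-1).values.sum(-1)
    group_mask = torch.zeros_like(group_scores)
    group_mask.scatter_(1, group_scores.topk(topk_group, dim=-1).indices, 1.0)
    keep = (
        group_mask.bool()
        .unsqueeze(-1)
        .expand(tokens, n_group, experts // n_group)
        .reshape(tokens, experts)
    )
    # pick top_k experts among the surviving groups
    topk_idx = sel.masked_fill(~keep, float("-inf")).topk(top_k, dim=-1).indices
    # weights come from the unbiased scores, normalized and scaled
    topk_w = scores.gather(1, topk_idx)
    topk_w = topk_w / topk_w.sum(-1, keepdim=True) * scale
    return topk_idx, topk_w
```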