Merged
175 changes: 145 additions & 30 deletions CODING_GUIDELINES.md
@@ -345,33 +345,22 @@ char const * const errStr = getErrorStr(status);
----

## Python Coding Guidelines
Code should adhere to [PEP 8](https://peps.python.org/pep-0008/), unless otherwise noted.

#### Python Standard
1. The code developed for TensorRT-LLM should conform to Python 3.8+.

#### Formatting

1. Indent code with 4 spaces. Do not use tabs.
2. Code formatting is largely handled by the automatic tooling. Do not override it unless it substantially improves readability.
3. Note that we have "legacy" files and "new" files formatted by different toolchains (see <pyproject.toml>), which results in somewhat different formatting between the two classes of files. Most notably, legacy files are 80 characters wide while new files are 100.

#### Imports
1. The linter will have opinions on import ordering. Please follow them.
2. Do not use wildcard imports.
3. Despite the prohibition on wildcard imports, keep `__all__` updated so the public interface stays clearly documented.

#### Naming

@@ -385,26 +374,29 @@ foo.SomeClass()
3. Functions and Methods
- snake_case: `def my_awesome_function():`

4. Local Variables or Mutable Global Variables
- snake_case: `my_variable = ...`
- Single-letter variables may also be uppercase, e.g. `N`, `T`.
- Variables should not start with a number, but if you must, prefix with `k`, e.g. `k_99th_percentile = ...`

5. Constants (any scope)
- UPPER\_SNAKE\_CASE: `MY_CONSTANT = ...`

Variables and functions not part of a class’s or module’s public interface should be prefixed with an underscore. Double underscores are permitted only if necessary to avoid name conflicts with inherited classes, and even then you should pursue alternatives.
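Putting the naming rules together in one hypothetical sketch:

```python
MAX_RETRIES = 3  # constant, any scope: UPPER_SNAKE_CASE

class RequestScheduler:  # class: PascalCase
    def __init__(self) -> None:
        self._pending: list[int] = []  # underscore prefix: not part of the public interface

    def submit_request(self, request_id: int) -> None:  # method: snake_case
        self._pending.append(request_id)

    def pending_count(self) -> int:
        return len(self._pending)

N = 8  # single-letter variables may be uppercase
k_99th_percentile = 0.99  # "99th_percentile" would start with a digit, hence the k prefix
```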

##### Identifier Guidelines
1. Avoid shadowing variables declared in an outer scope.
2. Initialize all externally visible members of a class in the constructor.
3. For variables referencing “container”-type objects that may live on the host or on a GPU (e.g. a Tensor), consider appending a `_host` or `_device`/`_cuda` suffix when the location is ambiguous, particularly if copies of the data exist in both locations.
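The suffix convention might look like the following sketch, where plain lists stand in for tensors so the example needs no GPU (the function is hypothetical):

```python
def build_position_table(seq_len: int) -> tuple[list[int], list[int]]:
    # Both copies of the data exist, so the suffixes disambiguate the location.
    positions_host = list(range(seq_len))    # lives in CPU memory
    positions_device = list(positions_host)  # the copy that would live on the GPU
    return positions_host, positions_device
```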

#### Comments

1. For interfaces that may be used outside a file, prefer docstrings over comments.
2. Reserve comments for code within a function, or for interfaces that are local to a file.
3. Avoid overcommenting. Reserve comments for things that genuinely need explaining, or for breaking long sections of code into functional parts; in the latter case, consider extracting helper functions instead.
4. For arguments to functions in the public interface to a file, documentation of Tensor-like arguments should include the expected dimensions, e.g. `[batch, seq_len, hdim]`, and the allowed dtype options if dtype is constrained.
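For instance, a hypothetical public function with a Tensor-like argument might be documented as:

```python
def apply_rotary_embedding(query, position_ids):
    """Applies rotary position embeddings to the query tensor.

    Args:
        query: Query tensor of shape [batch, seq_len, hdim].
            Allowed dtypes: float16, bfloat16.
        position_ids: Position indices of shape [batch, seq_len], dtype int32.

    Returns:
        A tensor of shape [batch, seq_len, hdim] with the same dtype as `query`.
    """
    raise NotImplementedError  # body omitted; the docstring is the point here
```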

#### Pydantic Guidelines

When defining any user-facing configuration classes (particularly `LlmArgs` or any class used in its fields), **always** use Pydantic classes rather than dataclasses or vanilla classes.
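A minimal sketch of such a class (the name and fields are illustrative, not the real `LlmArgs`):

```python
from pydantic import BaseModel, Field

class CacheConfig(BaseModel):
    """User-facing configuration; Pydantic validates and documents each field."""

    enable_block_reuse: bool = Field(
        default=True, description="Reuse KV-cache blocks across requests."
    )
    free_gpu_memory_fraction: float = Field(default=0.9, ge=0.0, le=1.0)
    tokens_per_block: int = 32
```

Unlike a dataclass, an out-of-range value such as `CacheConfig(free_gpu_memory_fraction=2.0)` is rejected at construction time with a `ValidationError` instead of being silently accepted.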

@@ -445,7 +437,7 @@
##### Classes and Functions
Use the [Google style](https://google.github.io/styleguide/pyguide.html), which can be parsed by Sphinx.

##### Attributes and Variables
Attributes and variables can be documented inline. Attribute docstrings will be rendered under the docstring for the class. For example:
```python
class MyClass:
    x = 1
    """<type>: Description of 'x'"""

y = 2
"""<type>: Description of 'y'"""
```

However, attribute docstrings are relatively rare and not expected. Externally called functions should have docstrings, and their arguments should be documented. Class initializer arguments especially should be documented.


#### Avoid Reflection
Avoid using reflection when functionality can be easily achieved without reflection.

@@ -524,6 +519,126 @@

```python
else:
    f.read()
```

Except in exceptional circumstances, use the built-in exception types; see [the built-in exceptions reference](https://docs.python.org/3/library/exceptions.html) for guidance on which type to use when. Use exceptions for error handling, not return values. And despite the example above, prefer `isinstance()` to duck typing where possible.
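Both points can be sketched with a small hypothetical parser:

```python
def parse_fraction(value: str) -> float:
    """Raises built-in exception types instead of returning error codes."""
    if not isinstance(value, str):  # prefer isinstance() to duck typing
        raise TypeError(f"value must be str, got {type(value).__name__}")
    fraction = float(value)  # float() itself raises ValueError on malformed input
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"expected a value in [0, 1], got {fraction}")
    return fraction
```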

#### Static Typing

1. Static type checking at pre-commit time is opt-in by submodule PICs. This is highly recommended because static type checking eliminates an entire class of bugs and makes your code more readable and maintainable overall.
2. The presubmit system currently uses mypy. However, many developers use pyright variants in their editors, so the code also has some `#pyright:` annotations. As we don’t currently enforce pyright, maintaining these is best effort. But if you notice they are broken, please fix them.
3. Do not use `typing.Any` if you can avoid it. Similarly, avoid bypassing the type checker with `# type: ignore` annotations.
4. Always annotate functions. Make the return type `None` if the function does not return anything (if you leave it empty, the type checker will infer the return type as `Any`).
5. Annotate class members and other variables when necessary. Always annotate `dataclass` and `NamedTuple` members.

```py
class Foo:
    def __init__(self, x: int) -> None:
        self.x = x  # inferred as int, no extra annotation required
        self.y: Optional[int] = None  # annotation required to prevent NoneType from being inferred
```

6. Prefer using the built-in types `list`, `dict`, and `tuple` to the legacy `typing.List`, `typing.Dict`, and `typing.Tuple`. Similarly, use the `|` syntax instead of `typing.Union`.

```py
# Instead of
def foo(x: List[int], y: Union[int, float]) -> None:
    pass

# Do:
def foo(x: list[int], y: int | float) -> None:
    pass
```

7. Prefer specifying argument types in `Callable`s.

```py
# Type checks, but not the best style
def foo(c: Callable[..., int]) -> None:
    c(42)

# Best practice.
def foo(c: Callable[[int], int]) -> None:
    c(42)
```

8. Don’t annotate variables where it is obvious/not necessary.

```py
x: int = 42 # Not required
```

9. Prefer `Literal` to `str` when a fixed set of values is expected.

```py
# Works:
def f(backend: str = "pytorch") -> None: pass

# But this is preferred:
def f(backend: Literal["pytorch", "tensorrt"] = "pytorch") -> None: pass
```

10. Use `@overload` when a return type depends on an input type. If the return type can be expressed using the input type, you can alternatively use a `TypeVar`.

```py
@overload
def foo(a: str) -> int:
    pass

@overload
def foo(a: float) -> float:
    pass

def foo(a: str | float) -> int | float:
    if isinstance(a, str):
        return 42
    return 42.0

def bar(a: float) -> None: pass

bar(foo(1.0))  # This will type check thanks to @overload

# In this example, the return type can be expressed as
T = TypeVar("T")
def baz(x: T) -> dict[str, T]:
    return {"key": x}
```

11. Use a bounded TypeVar only when the type parameter appears in both input and return positions to preserve specific type information; if it appears only in the parameters, use the bound type directly.

```py
class Foo:
    def f(self) -> None: pass

class Bar(Foo): pass

# Instead of:
# T = TypeVar("T", bound=Foo)
# def func(x: T) -> None:
#     x.f()

# We can just do:
def func(x: Foo) -> None:
    x.f()

# Here, using a bound type var is actually useful. We prevent
# func2 from losing type information.
# def func2(x: Foo) -> Foo:
#     return x
# x = func2(Bar())  # Return type is Foo

T = TypeVar("T", bound=Foo)
def func2(x: T) -> T:
    return x
x = func2(Bar())  # Return type is Bar
```

12. Use typing.Protocol for duck typing. Prefer it when
* You need an interface that third-party or unrelated classes can satisfy without inheriting from a base class.
* You want to type-check that an object has specific methods/attributes without coupling to a class hierarchy.

Do not use Protocol when a shared base class or ABC already exists and implementations naturally inherit from it — use the ABC directly. Also do not use it when you only need a union of concrete types — use Union or a type alias instead.

Note that TypeVars can also be bound to `Protocol`s. Use this feature to specify the expected interface for an argument to a generic function if duck typing is desired.
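A sketch combining a `Protocol` with a bound `TypeVar` (all names hypothetical):

```python
from typing import Protocol, TypeVar

class SupportsReset(Protocol):
    """Structural interface: any class with a matching reset() satisfies it."""

    def reset(self) -> None: ...

T = TypeVar("T", bound=SupportsReset)

def reset_all(items: list[T]) -> list[T]:
    # Generic over the concrete element type, so callers keep their
    # specific type rather than being widened to SupportsReset.
    for item in items:
        item.reset()
    return items

class Counter:  # satisfies SupportsReset without inheriting from it
    def __init__(self) -> None:
        self.value = 3

    def reset(self) -> None:
        self.value = 0
```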

## Documentation Guidelines

#### CLI Options in Documentation
23 changes: 21 additions & 2 deletions examples/auto_deploy/model_registry/configs/qwen3.5_moe_35b.yaml
@@ -4,17 +4,18 @@ attn_backend: trtllm
max_seq_len: 8192
max_num_tokens: 4096
max_batch_size: 512
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
enable_chunked_prefill: true
model_factory: Qwen3_5MoeForConditionalGeneration
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
  tokens_per_block: 32
model_kwargs:
  torch_dtype: bfloat16
transforms:
  initialize_mrope_delta_cache:
    enabled: true
  export_to_gm:
    num_moe_experts_for_export: 2
  fuse_gemms_mixed_children:
@@ -23,6 +24,24 @@
    allreduce_strategy: SYMM_MEM
    shard_all_unprocessed: true
    simple_shard_filter: "lm_head"
    sharding_dims: ['tp', 'ep', 'bmm']
    # use only manual config for TP sharding
    sharding_source: ['manual']
    manual_config:
      tp_plan:
        # GDN layer
        "in_proj_qkv": "delta"
        # attention layer
        "q_proj": "colwise"
        "k_proj": "colwise"
        "v_proj": "colwise"
        "o_proj": "rowwise"
        # replicating shared experts (keep them commented out)
        # "shared_expert_gate_proj": "colwise"
        # "shared_expert_up_proj": "colwise"
        # "shared_expert_down_proj": "rowwise"
        # gating layer should be replicated as well
        # "gate": "gather"
  multi_stream_moe:
    stage: compile
    enabled: true
27 changes: 23 additions & 4 deletions examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml
@@ -4,17 +4,18 @@ attn_backend: trtllm
max_seq_len: 262144
max_num_tokens: 8192
max_batch_size: 32
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
enable_chunked_prefill: true
model_factory: Qwen3_5MoeForConditionalGeneration
kv_cache_config:
  enable_block_reuse: false
  free_gpu_memory_fraction: 0.8
  tokens_per_block: 32
model_kwargs:
  torch_dtype: bfloat16
transforms:
  initialize_mrope_delta_cache:
    enabled: true
  export_to_gm:
    num_moe_experts_for_export: 2
  fuse_gemms_mixed_children:
@@ -23,6 +24,24 @@
    allreduce_strategy: SYMM_MEM
    shard_all_unprocessed: true
    simple_shard_filter: "lm_head"
    sharding_dims: ['tp', 'ep', 'bmm']
    # use only manual config for TP sharding
    sharding_source: ['manual']
    manual_config:
      tp_plan:
        # GDN layer
        "in_proj_qkv": "delta"
        # attention layer
        "q_proj": "colwise"
        "k_proj": "colwise"
        "v_proj": "colwise"
        "o_proj": "rowwise"
        # replicating shared experts (keep them commented out)
        # "shared_expert_gate_proj": "colwise"
        # "shared_expert_up_proj": "colwise"
        # "shared_expert_down_proj": "rowwise"
        # gating layer should be replicated as well
        # "gate": "gather"
  multi_stream_moe:
    stage: compile
    enabled: true
4 changes: 2 additions & 2 deletions examples/auto_deploy/model_registry/models.yaml
@@ -215,9 +215,9 @@ models:
    yaml_extra: ['dashboard_default.yaml', 'world_size_4.yaml']
  # --- Qwen3.5 MoE (Feb 2026) ---
  - name: Qwen/Qwen3.5-35B-A3B
    yaml_extra: ['dashboard_default.yaml', 'world_size_2.yaml', 'qwen3.5_moe_35b.yaml']
  - name: Qwen/Qwen3.5-397B-A17B
    yaml_extra: ['dashboard_default.yaml', 'world_size_8.yaml', 'qwen3.5_moe_400b.yaml']
  # --- GLM-5 (Feb 2026) ---
  - name: zai-org/GLM-5
    yaml_extra: ['dashboard_default.yaml', 'world_size_8.yaml']
@@ -11,6 +11,7 @@

import copy # noqa: I001
import gc
from collections import abc
from typing import Any, Callable, Dict, List, Optional, Set, Tuple

import torch
@@ -608,6 +609,14 @@ def forward(self, *args, **kwargs) -> Any:
result = self.piecewise(*args, num_tokens=bucket, **kwargs)
ADPiecewiseRunner.set_current_num_tokens(None)
if bucket > num_tokens:
    # HF ModelOutput iterates over field names (e.g. "logits"), not
    # tensor values. Normalize to the payload tuple before slicing.
    if hasattr(result, "to_tuple"):
        result = result.to_tuple()
    elif isinstance(result, abc.Mapping):
        result = tuple(result.values())
    else:
        result = tuple(result)
    result = tuple(r[:, :num_tokens] if r.ndim >= 2 else r for r in result)
return result

4 changes: 4 additions & 0 deletions tensorrt_llm/_torch/auto_deploy/config/default.yaml
@@ -249,6 +249,10 @@ transforms:
  insert_cached_residual_add:
    stage: cache_init
    backend: cached_residual_add
  initialize_mrope_delta_cache:
    stage: cache_init
    run_per_gm: false
    enabled: false
  initialize_cache:
    stage: cache_init
    expect_mem_change: true
7 changes: 6 additions & 1 deletion tensorrt_llm/_torch/auto_deploy/llm.py
@@ -69,6 +69,7 @@ def __call__(
# for example, this might be the case when invoking AD via trtllm-serve
elif "multi_modal_data" in inputs:
images = inputs["multi_modal_data"]["image"]
do_rescale = True
if images is not None and isinstance(images[0], torch.Tensor):
# The default multimodal input loader will normalize images to [0, 1] when the requested
# format is "pt" (pytorch tensors), but not for "pil" (PIL images).
@@ -127,7 +128,11 @@ def _validate_args_for_torch_backend(self, kwargs: dict) -> None:
pass

def _create_input_processor(self) -> ADInputProcessor:
processor = self.factory.init_processor()
base = ADInputProcessor(self.tokenizer, processor)
if hasattr(self.factory, "init_input_processor"):
    return self.factory.init_input_processor(base)
return base

def _prefetch_model(self):
"""Prefetch the model for the LLM."""