[TRTLLM-10617][feat] LTX-2 Model Support #12009

yibinl-nvidia wants to merge 42 commits into NVIDIA:main
Conversation
/bot run --disable-fail-fast

/bot kill

Force-pushed 5e9cd00 to 5b76802

PR_Github #38147 [ run ] triggered by Bot. Commit:

PR_Github #38148 [ kill ] triggered by Bot. Commit:

PR_Github #38147 [ run ] completed with state

PR_Github #38148 [ kill ] completed with state

/bot run --disable-fail-fast

PR_Github #38152 [ run ] triggered by Bot. Commit:
📝 Walkthrough

This pull request adds comprehensive LTX-2 text-to-video and image-to-video generation support to TensorRT-LLM, including multi-modal transformer architecture, audio/video VAE decoders, a vocoder, checkpoint loading infrastructure, unified pipeline integration, and extensive test coverage.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Pipeline as LTX2Pipeline
    participant TextEnc as Text Encoder
    participant Connector as Embeddings1DConnector
    participant Transformer as LTXModel
    participant VideoVAE as Video VAE
    participant AudioVAE as Audio VAE
    participant Vocoder
    User->>Pipeline: Text prompt (+ optional image)
    activate Pipeline
    Pipeline->>TextEnc: Encode text
    activate TextEnc
    TextEnc-->>Pipeline: Caption embeddings
    deactivate TextEnc
    Pipeline->>Connector: Process caption embeddings
    activate Connector
    Connector-->>Pipeline: Enhanced context
    deactivate Connector
    Pipeline->>Transformer: Video/Audio Modality + context
    activate Transformer
    Transformer->>Transformer: Multi-head self-attention (video/audio)
    Transformer->>Transformer: Cross-attention (text context)
    Transformer->>Transformer: AV cross-attention (video↔audio)
    Transformer->>Transformer: Feed-forward layers
    Transformer-->>Pipeline: Denoised video latents + audio latents
    deactivate Transformer
    Pipeline->>VideoVAE: Decode video latents
    activate VideoVAE
    VideoVAE-->>Pipeline: Video frames
    deactivate VideoVAE
    Pipeline->>AudioVAE: Decode audio latents
    activate AudioVAE
    AudioVAE->>Vocoder: Pass through vocoder
    activate Vocoder
    Vocoder-->>AudioVAE: Audio waveform
    deactivate Vocoder
    AudioVAE-->>Pipeline: Audio signal
    deactivate AudioVAE
    Pipeline-->>User: Video frames + audio
    deactivate Pipeline
```
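The stage order in the diagram above can be sketched as plain Python. Everything here is an illustrative stand-in (`encode_text`, `denoise`, and friends are hypothetical names, not the actual TensorRT-LLM API); the point is only the call order: text encoder → connector → transformer → video VAE → audio VAE/vocoder.

```python
# Illustrative sketch of the LTX-2 pipeline flow from the sequence diagram.
# All function names are hypothetical stand-ins, not the real TensorRT-LLM API.

def run_ltx2_pipeline(prompt, image=None):
    trace = []  # records the stage order, mirroring the diagram

    def encode_text(p):
        trace.append("text_encoder")
        return f"emb({p})"

    def connect(emb):
        trace.append("connector")
        return f"ctx({emb})"

    def denoise(ctx):
        trace.append("transformer")
        return "video_latents", "audio_latents"

    def decode_video(latents):
        trace.append("video_vae")
        return "frames"

    def decode_audio(latents):
        trace.append("audio_vae+vocoder")
        return "waveform"

    ctx = connect(encode_text(prompt))
    video_latents, audio_latents = denoise(ctx)
    frames = decode_video(video_latents)
    audio = decode_audio(audio_latents)
    return frames, audio, trace

frames, audio, trace = run_ltx2_pipeline("a cat surfing")
print(trace)
# Stages run in the same order as the diagram:
# text encoder -> connector -> transformer -> video VAE -> audio VAE/vocoder
```

Note that the transformer denoises video and audio latents jointly (with AV cross-attention), while the two decoders run independently afterwards.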
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 15
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/autotuner.py (1)
1304-1337: ⚠️ Potential issue | 🟠 Major

`skip_dynamic_tuning_buckets` still expands the profile grid.

This branch seeds `opt_shapes` with the mapped bucket, but Line 1327 always adds `actual_val` back. When `map_to_tuning_buckets()` changes the value, skip mode ends up profiling both shapes, and the Cartesian product can still blow up across multiple dynamic dims. That defeats the purpose of the flag and makes the docstring inaccurate. Please keep a single opt value in skip mode, or make cache lookup use the same non-bucketed keying rule.

🔧 Suggested direction
```diff
         for spec in tuning_config.dynamic_tensor_specs:
             assert callable(spec.gen_tuning_buckets) or isinstance(spec.gen_tuning_buckets, (list, tuple)), \
                 "The given dynamic dimension must provide a opt value generation function or a list of opt values"
+            add_actual_input_value = True
             if self.skip_dynamic_tuning_buckets:
                 # Still include the bucketed value of the actual shape so the
                 # cache key used during profiling (raw) aligns with the key
                 # used during inference (bucketed via map_to_tuning_buckets).
                 actual_val = base_profile.shapes[spec.input_idx][
                     spec.dim_idx].val
-                if spec.map_to_tuning_buckets is not None:
-                    opt_shapes = (spec.map_to_tuning_buckets(actual_val), )
-                else:
-                    opt_shapes = ()
+                opt_shapes = (
+                    (spec.map_to_tuning_buckets(actual_val), )
+                    if spec.map_to_tuning_buckets is not None else (actual_val, )
+                )
+                add_actual_input_value = False
             elif callable(spec.gen_tuning_buckets):
                 if tuning_config.tune_max_num_tokens is None:
                     # Use the current input size as the opt value
                     opt_shapes = spec.gen_tuning_buckets(
                         base_profile.shapes[spec.input_idx][spec.dim_idx].val)
@@
             # Add the current input value as one of the opt values
             opt_shapes = set(opt_shapes)
-            if tuning_config.tune_max_num_tokens is not None:
+            if add_actual_input_value and tuning_config.tune_max_num_tokens is not None:
                 opt_shapes.add(
                     min(
                         tuning_config.tune_max_num_tokens,
                         base_profile.shapes[spec.input_idx][spec.dim_idx].val,
                     ))
-            else:
+            elif add_actual_input_value:
                 opt_shapes.add(
                     base_profile.shapes[spec.input_idx][spec.dim_idx].val)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/autotuner.py` around lines 1304 - 1337, The skip_dynamic_tuning_buckets branch is seeding opt_shapes with the mapped bucket but later code (lines adding base_profile/shapes val) always re-adds the actual_val, causing two values and expanding the grid; change the skip_dynamic_tuning_buckets handling so opt_shapes is a single value only: when skip_dynamic_tuning_buckets is true, set opt_shapes to a one-item iterable containing spec.map_to_tuning_buckets(actual_val) if map_to_tuning_buckets is not None, otherwise the actual_val, and ensure the later logic that adds the current input value (the opt_shapes.add(...) block that references tuning_config.tune_max_num_tokens and base_profile.shapes[...] ) is skipped for this case so no second value is introduced (use the skip_dynamic_tuning_buckets flag to bypass that addition).
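The grid blow-up the reviewer describes is easy to demonstrate in isolation: the profile grid is the Cartesian product of the opt values kept per dynamic dimension, so re-adding the raw value next to its bucket doubles every factor. This is a standalone sketch with made-up numbers, not code from `autotuner.py`.

```python
from itertools import product

# The profile grid is the Cartesian product of opt values across dynamic dims.
# Keeping both the raw value and its bucket (2 per dim) grows it as 2**n_dims.
def num_profiles(opt_values_per_dim):
    return len(list(product(*opt_values_per_dim)))

single = [[128]] * 4        # skip mode done right: one opt value per dim
double = [[100, 128]] * 4   # raw value re-added next to its bucketed value

print(num_profiles(single))  # 1
print(num_profiles(double))  # 16
```

With four dynamic dims, the flag's intended single profile becomes sixteen; the suggested `add_actual_input_value` guard keeps it at one.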
🟡 Minor comments (19)
tensorrt_llm/_torch/visual_gen/parallelism.py-45-46 (1)
45-46: ⚠️ Potential issue | 🟡 Minor

Declare the new `None` contract explicitly.

This guard makes `None` a supported input, but the signature and docstring still advertise `model_config` as required. That leaves callers and type checkers with the wrong contract. Please either change the parameter to `Optional[DiffusionModelConfig]` and document the early return, or raise here instead of silently widening the API. As per coding guidelines, externally visible Python interfaces should be documented with docstrings.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/parallelism.py` around lines 45 - 46, Update the function that contains the guard checking "if model_config is None: return False, 1, None, 0" to make the API contract explicit: either change the parameter annotation to Optional[DiffusionModelConfig] and update the function's docstring to document the early-return behavior and returned tuple semantics, or instead raise a ValueError at that guard to keep model_config required; locate references to the parameter name model_config and the containing function (e.g., the function signature that declares model_config) and ensure the type annotation, docstring, and any callers are updated accordingly to match the chosen behavior.

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/perturbations.py-90-91 (1)
90-91: ⚠️ Potential issue | 🟡 Minor

The helper docstring does not match what the function returns.

The docstring says this only skips video self-attention, but the returned config also adds `SKIP_AUDIO_SELF_ATTN`. Please update the text so callers do not configure STG from incorrect docs.

Proposed fix
```diff
 def build_stg_perturbation_config(stg_blocks: list[int]) -> PerturbationConfig:
-    """Build a perturbation config that skips video self-attention at *stg_blocks*."""
+    """Build a perturbation config that skips video and audio self-attention at *stg_blocks*."""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/perturbations.py` around lines 90 - 91, Update the docstring for build_stg_perturbation_config to accurately describe the returned PerturbationConfig: state that it configures skipping video self-attention on the provided stg_blocks and also enables skipping audio self-attention (SKIP_AUDIO_SELF_ATTN) for those blocks, so callers know both video and audio self-attention are affected; mention the parameter stg_blocks and the returned PerturbationConfig to make intent and usage clear.

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/guiders.py-25-27 (1)
25-27: ⚠️ Potential issue | 🟡 Minor

The public defaults are internally inconsistent.

The docstring says the reference defaults are `stg_scale=1.0`, `modality_scale=3.0`, and `stg_blocks=[29]`, but the dataclass actually defaults to `0.0`, `1.0`, and `[]`. Please align the docs and the fields (or vice versa), because those values change which guidance branches run by default.

Also applies to: 31-34
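One way to keep docstring and runtime defaults from drifting is to encode the documented reference values directly as field defaults. The sketch below is hypothetical (the real `GuiderConfig` fields may differ); it just shows the documented values wired in, with `default_factory` for the mutable list.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: field defaults aligned with the documented reference
# values, so the docstring and the runtime behavior cannot drift apart.
@dataclass
class GuiderConfig:
    """Guidance settings.

    Reference defaults: stg_scale=1.0, modality_scale=3.0, stg_blocks=[29].
    """
    stg_scale: float = 1.0
    modality_scale: float = 3.0
    # Mutable default: must use default_factory, never a shared list literal.
    stg_blocks: list = field(default_factory=lambda: [29])

cfg = GuiderConfig()
print(cfg.stg_scale, cfg.modality_scale, cfg.stg_blocks)  # 1.0 3.0 [29]
```

If the current defaults (`0.0`, `1.0`, `[]`) are the intended behavior instead, the fix goes the other way: update the docstring and keep the fields.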
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/guiders.py` around lines 25 - 27, The docstring lists reference defaults (stg_scale=1.0, modality_scale=3.0, stg_blocks=[29]) but the dataclass fields stg_scale, modality_scale, and stg_blocks are set to 0.0, 1.0, and [] respectively; reconcile them by either updating the dataclass field defaults to match the docstring (set stg_scale=1.0, modality_scale=3.0, stg_blocks=[29]) or updating the docstring to reflect the current defaults, and apply the same fix to the other occurrences mentioned (lines 31-34) so the docs and runtime defaults are consistent and the guidance-branch behavior is deterministic.

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/guiders.py-1-1 (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Replace the EN DASH in the SPDX year range.

Ruff is already flagging Line 1 as `RUF003`, so this file will keep linting noisy until `2025–2026` uses a plain ASCII `-`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/guiders.py` at line 1, The SPDX header uses an en dash in the year range "2025–2026" which triggers RUF003; update the SPDX-FileCopyrightText line to replace the en dash with a plain ASCII hyphen so the year range reads "2025-2026" (match the exact SPDX header token "SPDX-FileCopyrightText" and the string "2025–2026" to locate and fix the character).

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/schedulers.py-49-57 (1)
49-57: ⚠️ Potential issue | 🟡 Minor

Potential division by zero when `terminal == 1.0`.

Line 54 computes `scale_factor = one_minus_z[-1] / (1.0 - terminal)`. If `terminal == 1.0`, this causes a division by zero error.

🛡️ Proposed fix to add guard
```diff
     # Stretch sigmas so final value matches terminal
     if stretch:
+        if terminal >= 1.0:
+            raise ValueError("terminal must be < 1.0 for stretch mode")
         non_zero_mask = sigmas != 0
         non_zero_sigmas = sigmas[non_zero_mask]
         one_minus_z = 1.0 - non_zero_sigmas
         scale_factor = one_minus_z[-1] / (1.0 - terminal)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/schedulers.py` around lines 49 - 57, The stretch branch in the scheduler manipulates sigmas and computes scale_factor = one_minus_z[-1] / (1.0 - terminal), which will divide by zero when terminal == 1.0; update the logic in the stretch block (around variables sigmas, non_zero_mask, one_minus_z, scale_factor, stretched) to guard the denominator: if (1.0 - terminal) is effectively zero (use a small epsilon) then skip stretching or set a safe fallback (e.g., set scale_factor = 1.0 or leave sigmas unchanged) to avoid the division-by-zero, otherwise compute scale_factor as before and apply the stretched assignment to sigmas[non_zero_mask].

tensorrt_llm/_torch/visual_gen/config.py-599-614 (1)
599-614: ⚠️ Potential issue | 🟡 Minor

Avoid silent exception swallowing with `try-except-pass`.

The bare `except Exception: pass` silently ignores all errors, making issues hard to debug. At minimum, log the exception at debug level.

♻️ Proposed fix
```diff
         try:
             with safetensors.torch.safe_open(str(sft_files[0]), framework="pt") as f:
                 meta = f.metadata()
                 if meta and "config" in meta:
                     config = json.loads(meta["config"])
                     if "quantization_config" in meta:
                         config["quantization_config"] = json.loads(meta["quantization_config"])
                     elif "_quantization_metadata" in meta:
                         qmeta = json.loads(meta["_quantization_metadata"])
                         converted = cls._convert_quantization_metadata(qmeta, list(f.keys()))
                         if converted:
                             config["quantization_config"] = converted
                     return config
-        except Exception:
-            pass
+        except (OSError, json.JSONDecodeError, KeyError) as e:
+            logger.debug(f"Failed to load safetensors config from {sft_files[0]}: {e}")
         return None
```

As per coding guidelines: "When using try-except blocks in Python, limit the except to the smallest set of errors possible."
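The pattern the fix suggests can be shown without `safetensors` at all: catch only the errors the parse step can actually raise, log them at debug level, and fall through to `None`. This is a minimal self-contained sketch of the idiom, not the project's code.

```python
import json
import logging

logger = logging.getLogger(__name__)

# Minimal sketch of the suggested idiom: narrow except clause plus a
# debug-level log, so failures are diagnosable but non-fatal.
def read_config(raw):
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError) as e:
        # TypeError covers non-string input; JSONDecodeError covers bad JSON.
        logger.debug("Failed to parse config: %s", e)
    return None

print(read_config('{"a": 1}'))  # {'a': 1}
print(read_config("not json"))  # None
print(read_config(None))        # None
```

A bare `except Exception` would also have hidden genuine bugs (say, a `NameError` inside the block); the narrow clause lets those propagate.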
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/config.py` around lines 599 - 614, The current try/except around safetensors.torch.safe_open silently swallows all exceptions; narrow the except to the expected error types (e.g., OSError, json.JSONDecodeError, ValueError, KeyError) and log the exception at debug level instead of passing so failures reading sft_files[0], parsing meta["config"], or converting via cls._convert_quantization_metadata are visible; use the module logger (e.g., logging.getLogger(__name__)) and call logger.debug or logger.exception with exc_info to include the stacktrace, then return None as before.

tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py-201-210 (1)
201-210: ⚠️ Potential issue | 🟡 Minor

Silent exception handling hides potential errors.

The bare `except Exception: pass` swallows all exceptions without logging, making debugging difficult. At minimum, log the exception or be more specific about expected exceptions.

🛡️ Suggested improvement
```diff
 def _read_safetensors_config(path: str) -> Optional[Dict[str, Any]]:
     """Read the ``config`` key from safetensors metadata header."""
     try:
         with safetensors.torch.safe_open(path, framework="pt") as f:
             meta = f.metadata()
             if meta and "config" in meta:
                 return json.loads(meta["config"])
-    except Exception:
-        pass
+    except (OSError, json.JSONDecodeError, KeyError) as e:
+        logger.debug(f"Could not read config from {path}: {e}")
     return None
```

As per coding guidelines: "When using try-except blocks in Python, limit the except to the smallest set of errors possible."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py` around lines 201 - 210, The function _read_safetensors_config currently swallows all errors with a bare except; change this to catch only expected exceptions (e.g., safetensors errors, OSError/FileNotFoundError, and json.JSONDecodeError) and log the failure instead of silently passing; add/ensure a module logger (logger = logging.getLogger(__name__)) and call logger.exception or logger.error with the path and error details when an exception is caught so callers can diagnose issues.

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/convolution.py-162-171 (1)
162-171: ⚠️ Potential issue | 🟡 Minor

Use `math.sqrt` instead of `torch.sqrt` for scalar operations.

`torch.sqrt` expects a tensor input, so `torch.sqrt(5)` is not valid here, and `kaiming_uniform_`'s `a` parameter expects a plain `float`. Other similar implementations in the codebase correctly use `math.sqrt(5)`. Apply the same approach to lines 167 and 170 for consistency.

🐛 Proposed fix
```diff
+import math
+
 def reset_parameters(self) -> None:
-    nn.init.kaiming_uniform_(self.weight1, a=torch.sqrt(5))
-    nn.init.kaiming_uniform_(self.weight2, a=torch.sqrt(5))
+    nn.init.kaiming_uniform_(self.weight1, a=math.sqrt(5))
+    nn.init.kaiming_uniform_(self.weight2, a=math.sqrt(5))
     if self.bias:
         fan_in1, _ = nn.init._calculate_fan_in_and_fan_out(self.weight1)
-        bound1 = 1 / torch.sqrt(fan_in1)
+        bound1 = 1 / math.sqrt(fan_in1)
         nn.init.uniform_(self.bias1, -bound1, bound1)
         fan_in2, _ = nn.init._calculate_fan_in_and_fan_out(self.weight2)
-        bound2 = 1 / torch.sqrt(fan_in2)
+        bound2 = 1 / math.sqrt(fan_in2)
         nn.init.uniform_(self.bias2, -bound2, bound2)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/convolution.py` around lines 162 - 171, The reset_parameters method uses torch.sqrt(5) which is incompatible with nn.init.kaiming_uniform_ and the scalar bounds; change those to use math.sqrt(5) (import math if not already) and similarly replace torch.sqrt(...) used when computing bound1/bound2 with math.sqrt(...) so that the a parameter and the uniform bounds are plain floats; update references in reset_parameters for self.weight1, self.weight2, self.bias1, and self.bias2 accordingly.

tensorrt_llm/_torch/visual_gen/models/ltx2/__init__.py-1-1 (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Use hyphen-minus instead of en-dash in copyright year range.

The year range uses an en-dash (`–`), which can cause encoding issues. Replace with hyphen-minus (`-`).

Proposed fix
```diff
-# SPDX-FileCopyrightText: Copyright (c) 2025–2026 Lightricks Ltd.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 Lightricks Ltd.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/__init__.py` at line 1, The copyright header in __init__.py uses an en-dash (–) in the year range; replace it with a standard ASCII hyphen-minus (-) so the line reads "2025-2026" (edit the top-of-file copyright string in tensorrt_llm/_torch/visual_gen/models/ltx2/__init__.py).

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/normalization.py-1-1 (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Use hyphen-minus instead of en-dash in copyright year range.

Same issue as other files: replace en-dash (`–`) with hyphen-minus (`-`).

Proposed fix
```diff
-# SPDX-FileCopyrightText: Copyright (c) 2025–2026 Lightricks Ltd.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 Lightricks Ltd.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/normalization.py` at line 1, Replace the en-dash character in the copyright header string "# SPDX-FileCopyrightText: Copyright (c) 2025–2026 Lightricks Ltd." with a standard hyphen-minus so it reads "2025-2026"; locate that header line (the SPDX copyright comment) in normalization.py and update the range separator from `–` to `-`.

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/ops.py-1-1 (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Use hyphen-minus instead of en-dash in copyright year range.

Same issue as other files: replace en-dash (`–`) with hyphen-minus (`-`).

Proposed fix
```diff
-# SPDX-FileCopyrightText: Copyright (c) 2025–2026 Lightricks Ltd.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 Lightricks Ltd.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/ops.py` at line 1, Replace the en-dash used in the copyright year range at the top of the file with an ASCII hyphen-minus; specifically update the header line that currently contains "Copyright (c) 2025–2026 Lightricks Ltd." to use "2025-2026" so the dash character is the standard hyphen-minus.

examples/visual_gen/README.md-288-315 (1)
288-315: ⚠️ Potential issue | 🟡 Minor

Fix table column count mismatches.
Multiple rows in the Common Arguments table have 5 columns instead of 6. The header defines:
`Argument | FLUX | WAN | LTX2 | Default | Description`, but several rows are missing a column value.

For example, line 290 has:
| `--model_path` | ✓ | ✓ | — | Path to model checkpoint directory |This is missing the
`Default` column value.

Example fix for line 290
```diff
-| `--model_path` | ✓ | ✓ | — | Path to model checkpoint directory |
+| `--model_path` | ✓ | ✓ | ✓ | — | Path to model checkpoint directory |
```

Similar fixes needed for lines 291-293, 297, 301-302, 309-315.
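The mismatch class flagged here is mechanically checkable: every data row of a markdown table should have as many cells as the header. A small hypothetical checker (the rows below are copied from the example above):

```python
# Count the cells in a markdown table row; rows with fewer cells than the
# header are the mismatches flagged in this comment.
def cell_count(row):
    return len(row.strip().strip("|").split("|"))

header = "| Argument | FLUX | WAN | LTX2 | Default | Description |"
bad    = "| `--model_path` | ✓ | ✓ | — | Path to model checkpoint directory |"
fixed  = "| `--model_path` | ✓ | ✓ | ✓ | — | Path to model checkpoint directory |"

print(cell_count(header), cell_count(bad), cell_count(fixed))  # 6 5 6
```

Running this over every row of the Common Arguments table would catch the remaining five-cell rows listed above.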
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/visual_gen/README.md` around lines 288 - 315, The table rows for arguments like `--model_path`, `--text_encoder_path`, `--prompt`, `--negative_prompt`, `--height/width/num_frames/frame_rate` entries, `--image`, `--image_cond_strength`, `--prompts_file`, `--output_dir`, `--disable_torch_compile`, `--enhance_prompt`, `--stg_scale`, `--modality_scale`, and `--rescale_scale` are missing the "Default" column causing a 5-column row; update each affected row (e.g., the rows containing `--model_path`, `--text_encoder_path`, `--prompt`, `--negative_prompt`, `--num_frames`, `--frame_rate`, `--image`, `--image_cond_strength`, `--prompts_file`, `--output_dir`, `--enhance_prompt`, `--stg_scale`, `--modality_scale`, `--rescale_scale`) to include a sixth cell between the LTX2 column and the Description column with the correct default value (use the appropriate default shown elsewhere in the file or a placeholder like `—`, `None`, or the numeric default such as `1024 / 720`, `81 / 121`, `24.0`, `1.0`, etc.) so every row matches the header `Argument | FLUX | WAN | LTX2 | Default | Description`.

examples/visual_gen/serve/README.md-65-68 (1)
65-68: ⚠️ Potential issue | 🟡 Minor

Remove LTX-2 from the image-generation example docs.

This section documents `POST /v1/images/generations`, but elsewhere in the same README LTX-2 is described as video generation with audio. Leaving it here sends users to the wrong example and endpoint.

📝 Proposed fix
```diff
-Demonstrates synchronous text-to-image generation using the OpenAI SDK. Supports FLUX.1, FLUX.2, and LTX-2.
+Demonstrates synchronous text-to-image generation using the OpenAI SDK. Supports FLUX.1 and FLUX.2.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/visual_gen/serve/README.md` around lines 65 - 68, The README's "Synchronous Image Generation (`sync_image_gen.py`)" section incorrectly lists LTX-2 as a supported model for the POST /v1/images/generations example; remove LTX-2 from the supported models list (leave FLUX.1 and FLUX.2) in that section and ensure the section text and any mentions of `sync_image_gen.py` only reference image models, not LTX-2/video models, so the example points to the correct endpoint.

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/ops.py-10-40 (1)
10-40: ⚠️ Potential issue | 🟡 Minor

Fail fast on unsupported tensor ranks.
If a caller passes anything other than 4D or 5D here, both helpers currently return the tensor unchanged. That hides misuse and makes the eventual shape failure much harder to diagnose.
🩹 Proposed fix
```diff
 def patchify(x: torch.Tensor, patch_size_hw: int, patch_size_t: int = 1) -> torch.Tensor:
     """Rearrange spatial patches into the channel dimension (inverse of :func:`unpatchify`)."""
     if patch_size_hw == 1 and patch_size_t == 1:
         return x
     if x.dim() == 4:
         x = rearrange(x, "b c (h q) (w r) -> b (c r q) h w", q=patch_size_hw, r=patch_size_hw)
     elif x.dim() == 5:
         x = rearrange(
             x,
             "b c (f p) (h q) (w r) -> b (c p r q) f h w",
             p=patch_size_t,
             q=patch_size_hw,
             r=patch_size_hw,
         )
+    else:
+        raise ValueError(f"patchify expects a 4D or 5D tensor, got {x.dim()}D")
     return x

 def unpatchify(x: torch.Tensor, patch_size_hw: int, patch_size_t: int = 1) -> torch.Tensor:
     if patch_size_hw == 1 and patch_size_t == 1:
         return x
     if x.dim() == 4:
         x = rearrange(x, "b (c r q) h w -> b c (h q) (w r)", q=patch_size_hw, r=patch_size_hw)
     elif x.dim() == 5:
         x = rearrange(
             x,
             "b (c p r q) f h w -> b c (f p) (h q) (w r)",
             p=patch_size_t,
             q=patch_size_hw,
             r=patch_size_hw,
         )
+    else:
+        raise ValueError(f"unpatchify expects a 4D or 5D tensor, got {x.dim()}D")
     return x
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/ops.py` around lines 10 - 40, Both patchify and unpatchify silently return tensors with unsupported ranks; change them to validate tensor rank after the trivial patch_size==1 short-circuit and raise a clear ValueError if x.dim() is not 4 or 5. Specifically, in patchify and unpatchify, after handling the patch_size == 1 case, check x.dim(); if it's neither 4 nor 5, raise an error that includes the function name (patchify/unpatchify), the received x.dim(), and the expected ranks (4 or 5) so callers fail fast with a helpful message.

examples/visual_gen/visual_gen_ltx2.py-1-12 (1)
1-12: ⚠️ Potential issue | 🟡 Minor

Missing SPDX copyright header.
Per coding guidelines, all Python source files should contain an NVIDIA copyright header. This example script is missing the standard Apache 2.0 license block.
📝 Suggested header
```diff
 #!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES.
+# SPDX-License-Identifier: Apache-2.0
 """LTX2 Text/Image-to-Video generation using TensorRT-LLM Visual Generation."""
```

As per coding guidelines: "All TensorRT-LLM source files should contain an NVIDIA copyright header."
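A header requirement like this is easy to enforce with a trivial check: after an optional shebang, the next few lines should carry the SPDX tags. This is a hypothetical sketch of such a check, not an existing project script.

```python
# Hypothetical checker: does a source file carry the SPDX header tags
# in its first lines (allowing for a shebang)?
def has_spdx_header(text):
    lines = text.splitlines()
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]  # skip the shebang line
    head = "\n".join(lines[:3])
    return "SPDX-FileCopyrightText" in head and "SPDX-License-Identifier" in head

good = (
    "#!/usr/bin/env python3\n"
    "# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES.\n"
    "# SPDX-License-Identifier: Apache-2.0\n"
    '"""LTX2 example."""\n'
)
bad = '#!/usr/bin/env python3\n"""Docstring only."""\n'

print(has_spdx_header(good), has_spdx_header(bad))  # True False
```

Wired into pre-commit, a check like this keeps the header from regressing across new example scripts.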
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/visual_gen/visual_gen_ltx2.py` around lines 1 - 12, Add the missing NVIDIA Apache-2.0 copyright header to the top of this Python script (after the existing shebang line) so it follows project guidelines; insert the standard SPDX-License-Identifier: Apache-2.0 and the full NVIDIA copyright/license block used across TensorRT-LLM files into visual_gen_ltx2.py (the module that imports VisualGen/VisualGenParams and sets logger level) ensuring the header appears before any imports or code.

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/video_vae.py-160-160 (1)
160-160: ⚠️ Potential issue | 🟡 Minor

Mutable default argument: use `None` instead of `[]`.

Mutable default arguments like `[]` are shared across all instances, which can lead to subtle bugs.

Suggested fix
```diff
 def __init__(
     self,
     convolution_dimensions: int = 3,
     in_channels: int = 3,
     out_channels: int = 128,
-    encoder_blocks: List[Tuple[str, int | dict]] = [],
+    encoder_blocks: List[Tuple[str, int | dict]] | None = None,
     patch_size: int = 4,
     norm_layer: NormLayerType = NormLayerType.PIXEL_NORM,
     causal: bool = True,
     timestep_conditioning: bool = False,
     encoder_spatial_padding_mode: PaddingModeType = PaddingModeType.ZEROS,
 ):
     super().__init__()
+    if encoder_blocks is None:
+        encoder_blocks = []
     self.patch_size = patch_size
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/video_vae.py` at line 160, The parameter encoder_blocks currently uses a mutable default list (encoder_blocks: List[Tuple[str, int | dict]] = []); change the signature to use None as the default (encoder_blocks: ... = None) and inside the constructor or function (where encoder_blocks is processed) set encoder_blocks = [] if encoder_blocks is None to avoid sharing the same list across instances; update any type checks or usages accordingly to treat None as "no blocks" and preserve existing behavior.

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/video_vae.py-334-334 (1)
334-334: ⚠️ Potential issue | 🟡 Minor

Same mutable default argument issue.

Suggested fix
```diff
 def __init__(
     self,
     convolution_dimensions: int = 3,
     in_channels: int = 128,
     out_channels: int = 3,
-    decoder_blocks: List[Tuple[str, int | dict]] = [],
+    decoder_blocks: List[Tuple[str, int | dict]] | None = None,
     patch_size: int = 4,
     ...
 ):
     super().__init__()
+    if decoder_blocks is None:
+        decoder_blocks = []
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/video_vae.py` at line 334, The parameter decoder_blocks currently uses a mutable default list (decoder_blocks: List[Tuple[str, int | dict]] = []), which can lead to shared state bugs; change its default to None and inside the function (or __init__) check if decoder_blocks is None and then assign an empty list (e.g., decoder_blocks = []), ensuring subsequent mutations are local; update any type hints or usages of decoder_blocks accordingly and keep the parameter name decoder_blocks to locate the change.

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/timestep_embedding.py-45-45 (1)
45-45: ⚠️ Potential issue | 🟡 Minor

`post_act_fn` parameter is declared but never used.

The `post_act_fn` parameter is accepted but never used to initialize `self.post_act`, which is always set to `None` (line 62). The forward method checks `self.post_act`, but it will always be `None`.

This appears to be dead code. Either implement the post-activation or remove the parameter.
Option 1: Remove unused parameter
```diff
 def __init__(
     self,
     in_channels: int,
     time_embed_dim: int,
     out_dim: int | None = None,
-    post_act_fn: str | None = None,
     cond_proj_dim: int | None = None,
     sample_proj_bias: bool = True,
     make_linear=None,
 ):
```

Option 2: Implement the functionality
```diff
 self.linear_2 = make_linear(time_embed_dim, time_embed_dim_out, bias=sample_proj_bias)
-self.post_act = None
+if post_act_fn == "silu":
+    self.post_act = torch.nn.SiLU()
+elif post_act_fn is not None:
+    raise ValueError(f"Unknown post_act_fn: {post_act_fn}")
+else:
+    self.post_act = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/timestep_embedding.py` at line 45, The constructor argument post_act_fn is accepted but never assigned to self.post_act (so self.post_act remains None) causing the forward path that checks self.post_act to be dead; fix by either removing the post_act_fn parameter and all uses of self.post_act in the class, or implement it by mapping post_act_fn to an activation callable and assign it to self.post_act in __init__ (e.g., support names like "gelu", "relu" or a passed callable) and then let forward call self.post_act(tensor) when present; update the __init__ signature and the forward method accordingly (look for post_act_fn, self.post_act, __init__ and forward in this file).

tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/tiling.py-177-193 (1)
177-193: ⚠️ Potential issue | 🟡 Minor

Validate mapper cardinality per axis before building the Cartesian product.

Right now only the final product sizes are compared via `zip(..., strict=True)`. A mapper that returns a different number of `output_slices` or `masks_1d` than input intervals can still pair the wrong tiles if the per-axis counts happen to multiply to the same total.

🧭 Possible fix
```diff
         starts = dimension_intervals.starts
         ends = dimension_intervals.ends
         input_slices = [slice(s, e) for s, e in zip(starts, ends, strict=True)]
         output_slices, masks_1d = mappers[axis_index](dimension_intervals)
+        if len(output_slices) != len(input_slices) or len(masks_1d) != len(input_slices):
+            raise ValueError(
+                f"Mapper for axis {axis_index} must return one output slice and mask per input interval"
+            )
         full_dim_input_slices.append(input_slices)
         full_dim_output_slices.append(output_slices)
         full_dim_masks_1d.append(masks_1d)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/tiling.py` around lines 177 - 193, The code builds Cartesian products of per-axis input/output slices and masks but only enforces total cardinality via zip(..., strict=True), which can mask per-axis mismatches; before calling itertools.product, iterate each axis (using intervals.dimension_intervals and mappers[axis_index]) and validate that for that axis the counts match (e.g., len(input_slices) == len(output_slices) == len(masks_1d)); if any axis mismatches, raise a clear ValueError including the axis index and the three lengths; only after all per-axis counts are verified, proceed to build tile_in_coords/tile_out_coords/tile_mask_1ds and create Tile(in_coords=..., out_coords=..., masks_1d=...).
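The failure mode and the guard can be demonstrated independently of the tiling code. The sketch below uses plain lists in place of slices and masks (all names are illustrative, not from `tiling.py`): per-axis counts are validated before `itertools.product`, so a mapper that drops or adds an interval raises instead of silently pairing the wrong tiles.

```python
from itertools import product

# Illustrative guard: verify each axis has one output and one mask per input
# interval *before* taking the Cartesian product across axes.
def build_tiles(per_axis_inputs, per_axis_outputs, per_axis_masks):
    for axis, (ins, outs, masks) in enumerate(
            zip(per_axis_inputs, per_axis_outputs, per_axis_masks)):
        if not (len(ins) == len(outs) == len(masks)):
            raise ValueError(
                f"axis {axis}: {len(ins)} inputs vs {len(outs)} outputs "
                f"vs {len(masks)} masks")
    # Pair entries per axis, then take the product across axes.
    per_axis_triples = [list(zip(i, o, m)) for i, o, m in
                        zip(per_axis_inputs, per_axis_outputs, per_axis_masks)]
    return list(product(*per_axis_triples))

# 2 intervals on axis 0, 3 on axis 1 -> 6 tiles
tiles = build_tiles([[0, 1], [0, 1, 2]], [[0, 1], [0, 1, 2]], [["m"] * 2, ["m"] * 3])
print(len(tiles))  # 6

# A mapper that drops an interval on axis 1 now fails loudly.
try:
    build_tiles([[0, 1], [0, 1, 2]], [[0, 1], [0, 1]], [["m"] * 2, ["m"] * 3])
except ValueError as e:
    print(e)  # axis 1: 3 inputs vs 2 outputs vs 3 masks
```

Without the per-axis check, a dropped interval on one axis and an extra one on another could multiply to the same total, and `zip(..., strict=True)` on the final products would not catch it.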
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 85afa3b5-cb20-4dbb-a8b8-e5f71c730a59
📒 Files selected for processing (65)
.pre-commit-config.yaml
LICENSE
examples/visual_gen/README.md
examples/visual_gen/serve/README.md
examples/visual_gen/serve/configs/ltx2.yml
examples/visual_gen/visual_gen_examples.sh
examples/visual_gen/visual_gen_ltx2.py
tensorrt_llm/_torch/autotuner.py
tensorrt_llm/_torch/visual_gen/checkpoints/weight_loader.py
tensorrt_llm/_torch/visual_gen/config.py
tensorrt_llm/_torch/visual_gen/executor.py
tensorrt_llm/_torch/visual_gen/models/__init__.py
tensorrt_llm/_torch/visual_gen/models/ltx2/NOTICE
tensorrt_llm/_torch/visual_gen/models/ltx2/__init__.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/__init__.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/adaln.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/attention.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/__init__.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/attention.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/audio_vae.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/causal_conv_2d.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/causality_axis.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/model_configurator.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/ops.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/resnet.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/upsample.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/vocoder.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/connector.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/diffusion_steps.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/guiders.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/modality.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/normalization.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/patchifier.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/perturbations.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/protocols.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/rope.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/scheduler_adapter.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/schedulers.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/text_projection.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/timestep_embedding.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/transformer_args.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/types.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/utils_ltx2.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/__init__.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/convolution.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/enums.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/model_configurator.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/normalization.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/ops.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/resnet.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/sampling.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/tiling.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/video_vae.py
tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py
tensorrt_llm/_torch/visual_gen/models/ltx2/transformer_ltx2.py
tensorrt_llm/_torch/visual_gen/parallelism.py
tensorrt_llm/_torch/visual_gen/pipeline.py
tensorrt_llm/_torch/visual_gen/pipeline_loader.py
tensorrt_llm/_torch/visual_gen/pipeline_registry.py
tensorrt_llm/_torch/visual_gen/quantization/loader.py
tensorrt_llm/_torch/visual_gen/teacache.py
tensorrt_llm/llmapi/visual_gen.py
tests/unittest/_torch/visual_gen/test_ltx2_attention.py
tests/unittest/_torch/visual_gen/test_ltx2_pipeline.py
tests/unittest/_torch/visual_gen/test_ltx2_transformer.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/__init__.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/audio_vae.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/model_configurator.py
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/tiling.py
PR_Github #38152 [ run ] completed with state
> Supports two layouts:
>
> * **Pipeline (diffusers)** -- ``model_index.json`` with component
Better to add links to these layouts for reference.
Agreed, we should have a reference to describe the LTX-2 format. I am thinking about two options:
- Add it inline (comments) in weight loading code.
- In README
Option 2 might be overkill because our README is already packed with information, but the format discussion could get pretty long if we move it into code comments. What do you think?
A sample layout reference
LTX-2 Specific Checkpoint Format
Similar to standard HF single safetensors:
- Single .safetensors file containing all weights
- Standard safetensors binary format
Key differences:
1. Embedded config in metadata — the safetensors header contains a "config" key with the full JSON config for all components (transformer, VAE, audio VAE,
vocoder). Standard HF models keep config in a separate config.json.
2. Non-standard weight key prefixes:
- Transformer: model.diffusion_model.* (not transformer.* or bare keys)
- Video VAE: vae.decoder.*
- Audio VAE: audio_vae.decoder.*
- Vocoder: vocoder.*
3. Multiple components in one file — the single checkpoint bundles the denoiser, video VAE, audio VAE, vocoder, and connectors together. Standard HF checkpoints
are typically one model per file.
4. Text encoder is separate — Gemma3 lives in its own directory and is loaded via the standard from_pretrained() path.
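The prefix routing implied by points 2-3 could be sketched as follows. The helper name and return convention are illustrative, not the actual weight-loader code; only the prefixes themselves come from the format description above.

```python
# Route checkpoint keys to components by the non-standard LTX-2 prefixes.
PREFIX_TO_COMPONENT = {
    "model.diffusion_model.": "transformer",
    "vae.decoder.": "video_vae",
    "audio_vae.decoder.": "audio_vae",
    "vocoder.": "vocoder",
}

def route_weight_key(key: str):
    """Return (component, stripped_key); component is None if unrecognized."""
    for prefix, component in PREFIX_TO_COMPONENT.items():
        if key.startswith(prefix):
            return component, key[len(prefix):]
    return None, key
```

Note that no prefix here is a prefix of another (`audio_vae.decoder.` does not start with `vae.decoder.`), so lookup order does not matter.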
Detection Logic in the TRT-LLM codebase
1. No model_index.json present → not diffusers
2. Safetensors metadata "config" key contains both "transformer" and "vae" → LTX2Pipeline
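A minimal sketch of this detection logic, assuming the function name is hypothetical and that in the real loader the metadata dict would come from the safetensors file header (e.g. via `safetensors.safe_open(...).metadata()`):

```python
import json

def classify_checkpoint(has_model_index: bool, st_metadata: dict) -> str:
    # 1. model_index.json present -> diffusers pipeline layout
    if has_model_index:
        return "diffusers"
    # 2. Embedded "config" key in the safetensors header metadata;
    #    LTX-2 bundles both the denoiser and VAE configs in one file.
    config = json.loads(st_metadata.get("config", "{}"))
    if "transformer" in config and "vae" in config:
        return "ltx2"
    return "unknown"
```

Keeping the decision pure (booleans and dicts in, string out) makes the detection testable without real checkpoint files.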
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/__init__.py
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
015f0f4 to
c9578a4
Compare
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores
Description
This PR implements the LTX-2 one-stage pipeline with optimizations.
Notes for reviewer:
LTX-2 code is under the Lightricks community license; all files under the ltx2 folder carry the license header.
Major Changes breakdown:
Test Coverage
See tests/ folder changes.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.