
[TRTLLM-10617][feat] LTX-2 Model Support#12009

Open
yibinl-nvidia wants to merge 42 commits into NVIDIA:main from yibinl-nvidia:dev-yibinl-ltx2-support

Conversation

@yibinl-nvidia
Collaborator

@yibinl-nvidia yibinl-nvidia commented Mar 8, 2026

Summary by CodeRabbit

  • New Features

    • Added support for LTX-2 model enabling text-to-video and image-to-video generation with audio output.
    • Introduced new generation parameters: STG guidance, modality scaling, rescale scaling, guidance skip, and enhanced prompts.
    • Added image conditioning strength control for more precise image-to-video generation.
  • Bug Fixes

    • Fixed optimization profile handling for dynamic tuning buckets.
  • Documentation

    • Updated visual generation documentation with LTX-2 configuration and usage examples.
  • Chores

    • Added LTX-2 Community License Agreement.
    • Improved checkpoint format support for monolithic and pipeline layouts.

Description

This PR implements the LTX-2 one-stage pipeline with TRT-LLM optimizations.

Notes for reviewer:

The LTX-2 code is under the Lightricks community license; all files under the ltx2 folder carry the license header.

Major changes breakdown:

  • ltx2_core: LTX-2 components ported from the LTX-2 repo and adapted for TRT-LLM optimizations.
  • pipeline_ltx2.py: the one-stage pipeline.
  • transformer_ltx2.py: the transformer component of the pipeline.

Test Coverage

See tests/ folder changes.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@yibinl-nvidia
Collaborator Author

/bot run --disable-fail-fast

@yibinl-nvidia
Collaborator Author

/bot kill

@yibinl-nvidia yibinl-nvidia force-pushed the dev-yibinl-ltx2-support branch from 5e9cd00 to 5b76802 on March 8, 2026 16:18
@tensorrt-cicd
Collaborator

PR_Github #38147 [ run ] triggered by Bot. Commit: 5b76802 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38148 [ kill ] triggered by Bot. Commit: 5b76802 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38147 [ run ] completed with state ABORTED. Commit: 5b76802

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38148 [ kill ] completed with state SUCCESS. Commit: 5b76802
Successfully killed previous jobs for commit 5b76802

Link to invocation

@yibinl-nvidia yibinl-nvidia marked this pull request as ready for review March 8, 2026 17:52
@yibinl-nvidia yibinl-nvidia requested review from a team as code owners March 8, 2026 17:52
@yibinl-nvidia
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38152 [ run ] triggered by Bot. Commit: 541d0bd Link to invocation

@coderabbitai
Contributor

coderabbitai bot commented Mar 8, 2026

📝 Walkthrough

Walkthrough

This pull request adds comprehensive LTX-2 text-to-video and image-to-video generation support to TensorRT-LLM, including multi-modal transformer architecture, audio/video VAE decoders, vocoder, checkpoint loading infrastructure, unified pipeline integration, and extensive test coverage.

Changes

  • Licensing & Configuration (.pre-commit-config.yaml, LICENSE, examples/visual_gen/serve/configs/ltx2.yml): Added LTX-2 Community License, extended codespell ignore list, introduced LTX-2 YAML configuration with text encoder and attention backend settings.
  • Documentation & Examples (examples/visual_gen/README.md, examples/visual_gen/serve/README.md, examples/visual_gen/visual_gen_examples.sh, examples/visual_gen/visual_gen_ltx2.py): Added LTX-2 model documentation including usage examples, CLI arguments, supported formats (mp4 with audio), and comprehensive example scripts for text-to-video and image-to-video workflows.
  • Core Visual Gen Infrastructure (tensorrt_llm/_torch/autotuner.py, tensorrt_llm/_torch/visual_gen/checkpoints/weight_loader.py, tensorrt_llm/_torch/visual_gen/config.py, tensorrt_llm/_torch/visual_gen/executor.py, tensorrt_llm/_torch/visual_gen/parallelism.py): Enhanced checkpoint loading to support both diffusers pipeline and monolithic safetensors layouts; extended DiffusionArgs with text encoder path; added configuration methods for quantization metadata and safetensors config loading; introduced multi-modal guidance parameters (stg_scale, modality_scale, etc.); added null-check guard in sequence parallelism setup.
  • LTX-2 Core Modules (tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/...): Implemented comprehensive LTX-2 core library including attention blocks with gating and RoPE support, transformer argument preprocessing, timestep/text embeddings, modality data structures, diffusion schedulers, RoPE positional embeddings, normalization layers, patchification utilities, and utility functions for diffusion operations.
  • LTX-2 Audio VAE (tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/...): Implemented complete audio decoder pipeline with latent denormalization, upsampling stages, optional attention blocks, vocoder integration, and per-channel statistics handling; added causal convolution support and causality axis enum for temporal causality control.
  • LTX-2 Video VAE (tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/...): Implemented video encoder/decoder with configurable resolution, patch handling, optional timestep conditioning, 3D convolutions with causal variants, spatial/temporal tiling for memory efficiency, and dimension-aware convolution utilities supporting standard and dual-path 3D convolutions.
  • LTX-2 Transformer & Pipeline (tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py, tensorrt_llm/_torch/visual_gen/models/ltx2/transformer_ltx2.py, tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/__init__.py, tensorrt_llm/_torch/visual_gen/models/ltx2/__init__.py, tensorrt_llm/_torch/visual_gen/models/ltx2/NOTICE): Added multi-modal transformer supporting video and/or audio with Ulysses sequence parallelism, AdaLN modulation, cross-attention, and FP8 quantization support; integrated complete end-to-end pipeline with weight loading, component initialization, multi-modal guidance (CFG, STG, modality), image conditioning, and latent-to-pixel decoding.
  • Pipeline & Inference Infrastructure (tensorrt_llm/_torch/visual_gen/pipeline.py, tensorrt_llm/_torch/visual_gen/pipeline_loader.py, tensorrt_llm/_torch/visual_gen/pipeline_registry.py, tensorrt_llm/_torch/visual_gen/quantization/loader.py, tensorrt_llm/_torch/visual_gen/teacache.py): Extended BasePipeline with load_transformer_weights method, integrated text encoder path propagation, enhanced pipeline detection via safetensors metadata, made DynamicLinearWeightLoader model_config optional with guarded defaults, and refined TeaCache docstring.
  • Public API Extensions (tensorrt_llm/llmapi/visual_gen.py): Expanded VisualGenParams with LTX-2 specific parameters (image_cond_strength, stg_scale, stg_blocks, modality_scale, rescale_scale, guidance_skip_step, enhance_prompt) and propagated them into DiffusionRequest.
  • Test Suite (tests/unittest/_torch/visual_gen/test_ltx2_attention.py, tests/unittest/_torch/visual_gen/test_ltx2_transformer.py, tests/unittest/_torch/visual_gen/test_ltx2_pipeline.py): Added comprehensive unit and integration tests validating LTX-2 attention backends (VANILLA/TRTLLM equivalence), transformer structure and forward passes (VideoOnly/AudioVideo variants), FP8 quantization correctness and memory efficiency, and end-to-end pipeline inference with multi-modal inputs.
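For a concrete reading of the VisualGenParams expansion listed above, the new parameter surface can be sketched as a small dataclass. Field names come from this PR's summary, but the types, defaults, and the `to_request` helper below are illustrative assumptions, not the actual TRT-LLM API.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class VisualGenParamsSketch:
    """Illustrative subset of the LTX-2 parameters; types/defaults assumed."""
    image_cond_strength: float = 1.0
    stg_scale: float = 0.0
    stg_blocks: List[int] = field(default_factory=list)
    modality_scale: float = 1.0
    rescale_scale: float = 0.0
    guidance_skip_step: int = 0
    enhance_prompt: bool = False

    def to_request(self) -> dict:
        # Stand-in for propagating every LTX-2 field into DiffusionRequest.
        return asdict(self)


params = VisualGenParamsSketch(stg_scale=1.0, stg_blocks=[29], modality_scale=3.0)
request = params.to_request()
```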

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Pipeline as LTX2Pipeline
    participant TextEnc as Text Encoder
    participant Connector as Embeddings1DConnector
    participant Transformer as LTXModel
    participant VideoVAE as Video VAE
    participant AudioVAE as Audio VAE
    participant Vocoder

    User->>Pipeline: Text prompt (+ optional image)
    activate Pipeline
    Pipeline->>TextEnc: Encode text
    activate TextEnc
    TextEnc-->>Pipeline: Caption embeddings
    deactivate TextEnc
    
    Pipeline->>Connector: Process caption embeddings
    activate Connector
    Connector-->>Pipeline: Enhanced context
    deactivate Connector
    
    Pipeline->>Transformer: Video/Audio Modality + context
    activate Transformer
    Transformer->>Transformer: Multi-head self-attention (video/audio)
    Transformer->>Transformer: Cross-attention (text context)
    Transformer->>Transformer: AV cross-attention (video↔audio)
    Transformer->>Transformer: Feed-forward layers
    Transformer-->>Pipeline: Denoised video latents + audio latents
    deactivate Transformer
    
    Pipeline->>VideoVAE: Decode video latents
    activate VideoVAE
    VideoVAE-->>Pipeline: Video frames
    deactivate VideoVAE
    
    Pipeline->>AudioVAE: Decode audio latents
    activate AudioVAE
    AudioVAE->>Vocoder: Pass through vocoder
    activate Vocoder
    Vocoder-->>AudioVAE: Audio waveform
    deactivate Vocoder
    AudioVAE-->>Pipeline: Audio signal
    deactivate AudioVAE
    
    Pipeline-->>User: Video frames + audio
    deactivate Pipeline
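The flow in the sequence diagram can be mirrored as a minimal stub, useful for following the data handoffs; every class and method name below is a hypothetical stand-in, not the real pipeline code.

```python
# Stub of the LTX-2 flow above: text encode -> denoise -> decode video + audio.
class StubTextEncoder:
    def encode(self, prompt):
        return f"emb({prompt})"  # caption embeddings


class StubTransformer:
    def denoise(self, context):
        # The real LTXModel runs self-, cross-, and AV cross-attention here.
        return {"video_latents": f"v<{context}>", "audio_latents": f"a<{context}>"}


class StubVideoVAE:
    def decode(self, latents):
        return f"frames[{latents}]"


class StubAudioVAE:
    def decode(self, latents):
        # The real audio path additionally passes through a vocoder.
        return f"waveform[{latents}]"


def run_pipeline(prompt):
    context = StubTextEncoder().encode(prompt)
    latents = StubTransformer().denoise(context)
    video = StubVideoVAE().decode(latents["video_latents"])
    audio = StubAudioVAE().decode(latents["audio_latents"])
    return video, audio


video, audio = run_pipeline("a cat")
```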

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/autotuner.py (1)

1304-1337: ⚠️ Potential issue | 🟠 Major

skip_dynamic_tuning_buckets still expands the profile grid.

This branch seeds opt_shapes with the mapped bucket, but Line 1327 always adds actual_val back. When map_to_tuning_buckets() changes the value, skip mode ends up profiling both shapes, and the cartesian product can still blow up across multiple dynamic dims. That defeats the purpose of the flag and makes the docstring inaccurate. Please keep a single opt value in skip mode, or make cache lookup use the same non-bucketed keying rule.

🔧 Suggested direction
         for spec in tuning_config.dynamic_tensor_specs:
             assert callable(spec.gen_tuning_buckets) or isinstance(spec.gen_tuning_buckets, (list, tuple)), \
                 "The given dynamic dimension must provide a opt value generation function or a list of opt values"
+            add_actual_input_value = True
             if self.skip_dynamic_tuning_buckets:
                 # Still include the bucketed value of the actual shape so the
                 # cache key used during profiling (raw) aligns with the key
                 # used during inference (bucketed via map_to_tuning_buckets).
                 actual_val = base_profile.shapes[spec.input_idx][
                     spec.dim_idx].val
-                if spec.map_to_tuning_buckets is not None:
-                    opt_shapes = (spec.map_to_tuning_buckets(actual_val), )
-                else:
-                    opt_shapes = ()
+                opt_shapes = (
+                    (spec.map_to_tuning_buckets(actual_val), )
+                    if spec.map_to_tuning_buckets is not None else (actual_val, )
+                )
+                add_actual_input_value = False
             elif callable(spec.gen_tuning_buckets):
                 if tuning_config.tune_max_num_tokens is None:
                     # Use the current input size as the opt value
                     opt_shapes = spec.gen_tuning_buckets(
                         base_profile.shapes[spec.input_idx][spec.dim_idx].val)
@@
             # Add the current input value as one of the opt values
             opt_shapes = set(opt_shapes)
-            if tuning_config.tune_max_num_tokens is not None:
+            if add_actual_input_value and tuning_config.tune_max_num_tokens is not None:
                 opt_shapes.add(
                     min(
                         tuning_config.tune_max_num_tokens,
                         base_profile.shapes[spec.input_idx][spec.dim_idx].val,
                     ))
-            else:
+            elif add_actual_input_value:
                 opt_shapes.add(
                     base_profile.shapes[spec.input_idx][spec.dim_idx].val)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/autotuner.py` around lines 1304 - 1337, The
skip_dynamic_tuning_buckets branch is seeding opt_shapes with the mapped bucket
but later code (lines adding base_profile/shapes val) always re-adds the
actual_val, causing two values and expanding the grid; change the
skip_dynamic_tuning_buckets handling so opt_shapes is a single value only: when
skip_dynamic_tuning_buckets is true, set opt_shapes to a one-item iterable
containing spec.map_to_tuning_buckets(actual_val) if map_to_tuning_buckets is
not None, otherwise the actual_val, and ensure the later logic that adds the
current input value (the opt_shapes.add(...) block that references
tuning_config.tune_max_num_tokens and base_profile.shapes[...] ) is skipped for
this case so no second value is introduced (use the skip_dynamic_tuning_buckets
flag to bypass that addition).
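The behavior this comment asks for — exactly one opt value when skip mode is on — can be illustrated standalone. The function and parameter names below are hypothetical simplifications, not the autotuner's actual API.

```python
def collect_opt_shapes(actual_val, skip_buckets, map_to_buckets=None, gen_buckets=None):
    """Sketch of the suggested fix: skip mode yields a single opt value."""
    if skip_buckets:
        # Keep one value only: the bucketed actual shape (raw if no mapper),
        # so the profile grid never expands in skip mode.
        mapped = map_to_buckets(actual_val) if map_to_buckets else actual_val
        return {mapped}
    # Normal mode: generated buckets plus the actual input value.
    opt = set(gen_buckets(actual_val) if gen_buckets else [])
    opt.add(actual_val)
    return opt


# With a bucketing function, skip mode profiles just the mapped shape.
shapes = collect_opt_shapes(100, skip_buckets=True, map_to_buckets=lambda v: 128)
```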
🟡 Minor comments (19)
tensorrt_llm/_torch/visual_gen/parallelism.py-45-46 (1)

45-46: ⚠️ Potential issue | 🟡 Minor

Declare the new None contract explicitly.

This guard makes None a supported input, but the signature and docstring still advertise model_config as required. That leaves callers and type checkers with the wrong contract. Please either change the parameter to Optional[DiffusionModelConfig] and document the early return, or raise here instead of silently widening the API. As per coding guidelines, externally visible Python interfaces should be documented with docstrings.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/parallelism.py` around lines 45 - 46, Update
the function that contains the guard checking "if model_config is None: return
False, 1, None, 0" to make the API contract explicit: either change the
parameter annotation to Optional[DiffusionModelConfig] and update the function's
docstring to document the early-return behavior and returned tuple semantics, or
instead raise a ValueError at that guard to keep model_config required; locate
references to the parameter name model_config and the containing function (e.g.,
the function signature that declares model_config) and ensure the type
annotation, docstring, and any callers are updated accordingly to match the
chosen behavior.
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/perturbations.py-90-91 (1)

90-91: ⚠️ Potential issue | 🟡 Minor

The helper docstring does not match what the function returns.

The docstring says this only skips video self-attention, but the returned config also adds SKIP_AUDIO_SELF_ATTN. Please update the text so callers do not configure STG from incorrect docs.

Proposed fix
 def build_stg_perturbation_config(stg_blocks: list[int]) -> PerturbationConfig:
-    """Build a perturbation config that skips video self-attention at *stg_blocks*."""
+    """Build a perturbation config that skips video and audio self-attention at *stg_blocks*."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/perturbations.py` around
lines 90 - 91, Update the docstring for build_stg_perturbation_config to
accurately describe the returned PerturbationConfig: state that it configures
skipping video self-attention on the provided stg_blocks and also enables
skipping audio self-attention (SKIP_AUDIO_SELF_ATTN) for those blocks, so
callers know both video and audio self-attention are affected; mention the
parameter stg_blocks and the returned PerturbationConfig to make intent and
usage clear.
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/guiders.py-25-27 (1)

25-27: ⚠️ Potential issue | 🟡 Minor

The public defaults are internally inconsistent.

The docstring says the reference defaults are stg_scale=1.0, modality_scale=3.0, and stg_blocks=[29], but the dataclass actually defaults to 0.0, 1.0, and []. Please align the docs and the fields (or vice versa), because those values change which guidance branches run by default.

Also applies to: 31-34

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/guiders.py` around lines
25 - 27, The docstring lists reference defaults (stg_scale=1.0,
modality_scale=3.0, stg_blocks=[29]) but the dataclass fields stg_scale,
modality_scale, and stg_blocks are set to 0.0, 1.0, and [] respectively;
reconcile them by either updating the dataclass field defaults to match the
docstring (set stg_scale=1.0, modality_scale=3.0, stg_blocks=[29]) or updating
the docstring to reflect the current defaults, and apply the same fix to the
other occurrences mentioned (lines 31–34) so the docs and runtime defaults are
consistent and the guidance-branch behavior is deterministic.
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/guiders.py-1-1 (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Replace the EN DASH in the SPDX year range.

Ruff already flags line 1 as RUF003, so this file will keep failing lint until the 2025–2026 year range uses a plain ASCII -.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/guiders.py` at line 1,
The SPDX header uses an en dash in the year range "2025–2026" which triggers
RUF003; update the SPDX-FileCopyrightText line to replace the en dash with a
plain ASCII hyphen so the year range reads "2025-2026" (match the exact SPDX
header token "SPDX-FileCopyrightText" and the string "2025–2026" to locate and
fix the character).
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/schedulers.py-49-57 (1)

49-57: ⚠️ Potential issue | 🟡 Minor

Potential division by zero when terminal == 1.0.

Line 54 computes scale_factor = one_minus_z[-1] / (1.0 - terminal). If terminal == 1.0, this causes a division by zero error.

🛡️ Proposed fix to add guard
         # Stretch sigmas so final value matches terminal
         if stretch:
+            if terminal >= 1.0:
+                raise ValueError("terminal must be < 1.0 for stretch mode")
             non_zero_mask = sigmas != 0
             non_zero_sigmas = sigmas[non_zero_mask]
             one_minus_z = 1.0 - non_zero_sigmas
             scale_factor = one_minus_z[-1] / (1.0 - terminal)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/schedulers.py` around
lines 49 - 57, The stretch branch in the scheduler manipulates sigmas and
computes scale_factor = one_minus_z[-1] / (1.0 - terminal), which will divide by
zero when terminal == 1.0; update the logic in the stretch block (around
variables sigmas, non_zero_mask, one_minus_z, scale_factor, stretched) to guard
the denominator: if (1.0 - terminal) is effectively zero (use a small epsilon)
then skip stretching or set a safe fallback (e.g., set scale_factor = 1.0 or
leave sigmas unchanged) to avoid the division-by-zero, otherwise compute
scale_factor as before and apply the stretched assignment to
sigmas[non_zero_mask].
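The guarded stretch can also be shown as a standalone, list-based sketch (no torch), mirroring the scheduler's scale_factor = (1 - last_nonzero_sigma) / (1 - terminal) logic; the function name and epsilon choice are illustrative.

```python
def stretch_sigmas(sigmas, terminal, eps=1e-8):
    """Stretch non-zero sigmas so the last non-zero value maps to `terminal`."""
    denom = 1.0 - terminal
    if abs(denom) < eps:
        # Guard the division the review flags: terminal == 1.0 is invalid.
        raise ValueError("terminal must be < 1.0 for stretch mode")
    non_zero = [s for s in sigmas if s != 0.0]
    scale = (1.0 - non_zero[-1]) / denom
    # sigma' = 1 - (1 - sigma) / scale; zeros are left untouched.
    return [0.0 if s == 0.0 else 1.0 - (1.0 - s) / scale for s in sigmas]


stretched = stretch_sigmas([1.0, 0.5, 0.2], terminal=0.0)
```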
tensorrt_llm/_torch/visual_gen/config.py-599-614 (1)

599-614: ⚠️ Potential issue | 🟡 Minor

Avoid silent exception swallowing with try-except-pass.

The bare except Exception: pass silently ignores all errors, making issues hard to debug. At minimum, log the exception at debug level.

♻️ Proposed fix
         try:
             with safetensors.torch.safe_open(str(sft_files[0]), framework="pt") as f:
                 meta = f.metadata()
                 if meta and "config" in meta:
                     config = json.loads(meta["config"])
                     if "quantization_config" in meta:
                         config["quantization_config"] = json.loads(meta["quantization_config"])
                     elif "_quantization_metadata" in meta:
                         qmeta = json.loads(meta["_quantization_metadata"])
                         converted = cls._convert_quantization_metadata(qmeta, list(f.keys()))
                         if converted:
                             config["quantization_config"] = converted
                     return config
-        except Exception:
-            pass
+        except (OSError, json.JSONDecodeError, KeyError) as e:
+            logger.debug(f"Failed to load safetensors config from {sft_files[0]}: {e}")
         return None

As per coding guidelines: "When using try-except blocks in Python, limit the except to the smallest set of errors possible."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/config.py` around lines 599 - 614, The current
try/except around safetensors.torch.safe_open silently swallows all exceptions;
narrow the except to the expected error types (e.g., OSError,
json.JSONDecodeError, ValueError, KeyError) and log the exception at debug level
instead of passing so failures reading sft_files[0], parsing meta["config"], or
converting via cls._convert_quantization_metadata are visible; use the module
logger (e.g., logging.getLogger(__name__)) and call logger.debug or
logger.exception with exc_info to include the stacktrace, then return None as
before.
tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py-201-210 (1)

201-210: ⚠️ Potential issue | 🟡 Minor

Silent exception handling hides potential errors.

The bare except Exception: pass swallows all exceptions without logging, making debugging difficult. At minimum, log the exception or be more specific about expected exceptions.

🛡️ Suggested improvement
 def _read_safetensors_config(path: str) -> Optional[Dict[str, Any]]:
     """Read the ``config`` key from safetensors metadata header."""
     try:
         with safetensors.torch.safe_open(path, framework="pt") as f:
             meta = f.metadata()
             if meta and "config" in meta:
                 return json.loads(meta["config"])
-    except Exception:
-        pass
+    except (OSError, json.JSONDecodeError, KeyError) as e:
+        logger.debug(f"Could not read config from {path}: {e}")
     return None

As per coding guidelines: "When using try-except blocks in Python, limit the except to the smallest set of errors possible."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py` around lines 201
- 210, The function _read_safetensors_config currently swallows all errors with
a bare except; change this to catch only expected exceptions (e.g., safetensors
errors, OSError/FileNotFoundError, and json.JSONDecodeError) and log the failure
instead of silently passing; add/ensure a module logger (logger =
logging.getLogger(__name__)) and call logger.exception or logger.error with the
path and error details when an exception is caught so callers can diagnose
issues.
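The narrow-except-with-logging pattern recommended here can be shown standalone. A plain JSON file stands in for the safetensors metadata header, and the function name is illustrative, not the PR's actual helper.

```python
import json
import logging

logger = logging.getLogger(__name__)


def read_config_metadata(path):
    """Read a 'config' key, catching only expected errors and logging them."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            meta = json.load(f)
        if meta and "config" in meta:
            return meta["config"]
    except (OSError, json.JSONDecodeError, KeyError) as e:
        # Log instead of silently swallowing, so failures stay diagnosable.
        logger.debug("Could not read config from %s: %s", path, e)
    return None


# Demo: a present config is returned; a missing file degrades to None.
with open("demo_config.json", "w", encoding="utf-8") as f:
    json.dump({"config": {"model": "ltx2"}}, f)
cfg = read_config_metadata("demo_config.json")
missing = read_config_metadata("no_such_file.json")
```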
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/convolution.py-162-171 (1)

162-171: ⚠️ Potential issue | 🟡 Minor

Use math.sqrt instead of torch.sqrt for scalar operations.

torch.sqrt(5) creates a 0-d tensor, which is incompatible with kaiming_uniform_'s a parameter (expects float). Other similar implementations in the codebase correctly use math.sqrt(5). Apply the same approach to lines 167 and 170 for consistency.

🐛 Proposed fix
+import math
+
     def reset_parameters(self) -> None:
-        nn.init.kaiming_uniform_(self.weight1, a=torch.sqrt(5))
-        nn.init.kaiming_uniform_(self.weight2, a=torch.sqrt(5))
+        nn.init.kaiming_uniform_(self.weight1, a=math.sqrt(5))
+        nn.init.kaiming_uniform_(self.weight2, a=math.sqrt(5))
         if self.bias:
             fan_in1, _ = nn.init._calculate_fan_in_and_fan_out(self.weight1)
-            bound1 = 1 / torch.sqrt(fan_in1)
+            bound1 = 1 / math.sqrt(fan_in1)
             nn.init.uniform_(self.bias1, -bound1, bound1)
             fan_in2, _ = nn.init._calculate_fan_in_and_fan_out(self.weight2)
-            bound2 = 1 / torch.sqrt(fan_in2)
+            bound2 = 1 / math.sqrt(fan_in2)
             nn.init.uniform_(self.bias2, -bound2, bound2)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/convolution.py`
around lines 162 - 171, The reset_parameters method uses torch.sqrt(5) which
returns a 0-d tensor and is incompatible with nn.init.kaiming_uniform_ and the
scalar bounds; change those to use math.sqrt(5) (import math if not already) and
similarly replace torch.sqrt(...) used when computing bound1/bound2 with
math.sqrt(...) so that the a parameter and the uniform bounds are plain floats;
update references in reset_parameters for self.weight1, self.weight2,
self.bias1, and self.bias2 accordingly.
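To see why plain floats are wanted here: for a = sqrt(5) (the nn.Linear default), kaiming_uniform_'s bound reduces to a closed form computable entirely with math.sqrt — gain = sqrt(2 / (1 + a^2)) and bound = gain * sqrt(3 / fan_in). The helper below is an illustrative sketch of that formula, not code from the PR.

```python
import math


def kaiming_bounds(fan_in, a=math.sqrt(5)):
    """Uniform bounds matching kaiming_uniform_ (leaky_relu gain) as floats."""
    gain = math.sqrt(2.0 / (1.0 + a * a))  # leaky_relu gain with slope a
    std = gain / math.sqrt(fan_in)
    bound = math.sqrt(3.0) * std           # weight init bound
    bias_bound = 1.0 / math.sqrt(fan_in)   # matches 1 / sqrt(fan_in) for bias
    return bound, bias_bound


# With a = sqrt(5) the weight bound simplifies to 1 / sqrt(fan_in), here 1/8.
w_bound, b_bound = kaiming_bounds(64)
```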
tensorrt_llm/_torch/visual_gen/models/ltx2/__init__.py-1-1 (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Use hyphen-minus instead of en-dash in copyright year range.

The year range uses an en-dash (–) which can cause encoding issues. Replace with hyphen-minus (-).

Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025–2026 Lightricks Ltd.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 Lightricks Ltd.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/__init__.py` at line 1, The
copyright header in __init__.py uses an en-dash (–) in the year range; replace
it with a standard ASCII hyphen-minus (-) so the line reads "2025-2026" (edit
the top-of-file copyright string in
tensorrt_llm/_torch/visual_gen/models/ltx2/__init__.py).
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/normalization.py-1-1 (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Use hyphen-minus instead of en-dash in copyright year range.

Same issue as other files: replace the en-dash (–) with a hyphen-minus (-).

Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025–2026 Lightricks Ltd.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 Lightricks Ltd.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/normalization.py`
at line 1, Replace the en-dash character in the copyright header string "#
SPDX-FileCopyrightText: Copyright (c) 2025–2026 Lightricks Ltd." with a standard
hyphen-minus so it reads "2025-2026"; locate that header line (the SPDX
copyright comment) in normalization.py and update the range separator from `–`
to `-`.
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/ops.py-1-1 (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Use hyphen-minus instead of en-dash in copyright year range.

Same issue as other files: replace the en-dash (–) with a hyphen-minus (-).

Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025–2026 Lightricks Ltd.
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 Lightricks Ltd.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/ops.py` at
line 1, Replace the en-dash used in the copyright year range at the top of the
file with a ASCII hyphen-minus; specifically update the header line that
currently contains "Copyright (c) 2025–2026 Lightricks Ltd." to use "2025-2026"
so the dash character is the standard hyphen-minus.
examples/visual_gen/README.md-288-315 (1)

288-315: ⚠️ Potential issue | 🟡 Minor

Fix table column count mismatches.

Multiple rows in the Common Arguments table have 5 columns instead of 6. The header defines: Argument | FLUX | WAN | LTX2 | Default | Description, but several rows are missing a column value.

For example, line 290 has:

| `--model_path` | ✓ | ✓ | — | Path to model checkpoint directory |

This is missing the Default column value.

Example fix for line 290
-| `--model_path` | ✓ | ✓ | — | Path to model checkpoint directory |
+| `--model_path` | ✓ | ✓ | ✓ | — | Path to model checkpoint directory |

Similar fixes needed for lines 291-293, 297, 301-302, 309-315.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/visual_gen/README.md` around lines 288 - 315, The table rows for
arguments like `--model_path`, `--text_encoder_path`, `--prompt`,
`--negative_prompt`, `--height/width/num_frames/frame_rate` entries, `--image`,
`--image_cond_strength`, `--prompts_file`, `--output_dir`,
`--disable_torch_compile`, `--enhance_prompt`, `--stg_scale`,
`--modality_scale`, and `--rescale_scale` are missing the "Default" column
causing a 5-column row; update each affected row (e.g., the rows containing
`--model_path`, `--text_encoder_path`, `--prompt`, `--negative_prompt`,
`--num_frames`, `--frame_rate`, `--image`, `--image_cond_strength`,
`--prompts_file`, `--output_dir`, `--enhance_prompt`, `--stg_scale`,
`--modality_scale`, `--rescale_scale`) to include a sixth cell between the LTX2
column and the Description column with the correct default value (use the
appropriate default shown elsewhere in the file or a placeholder like `—`,
`None`, or the numeric default such as `1024 / 720`, `81 / 121`, `24.0`, `1.0`,
etc.) so every row matches the header `Argument | FLUX | WAN | LTX2 | Default |
Description`.
examples/visual_gen/serve/README.md-65-68 (1)

65-68: ⚠️ Potential issue | 🟡 Minor

Remove LTX-2 from the image-generation example docs.

This section documents POST /v1/images/generations, but elsewhere in the same README LTX-2 is described as video generation with audio. Leaving it here sends users to the wrong example and endpoint.

📝 Proposed fix
-Demonstrates synchronous text-to-image generation using the OpenAI SDK. Supports FLUX.1, FLUX.2, and LTX-2.
+Demonstrates synchronous text-to-image generation using the OpenAI SDK. Supports FLUX.1 and FLUX.2.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/visual_gen/serve/README.md` around lines 65 - 68, The README's
"Synchronous Image Generation (`sync_image_gen.py`)" section incorrectly lists
LTX-2 as a supported model for the POST /v1/images/generations example; remove
LTX-2 from the supported models list (leave FLUX.1 and FLUX.2) in that section
and ensure the section text and any mentions of `sync_image_gen.py` only
reference image models, not LTX-2/video models, so the example points to the
correct endpoint.
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/ops.py-10-40 (1)

10-40: ⚠️ Potential issue | 🟡 Minor

Fail fast on unsupported tensor ranks.

If a caller passes anything other than 4D or 5D here, both helpers currently return the tensor unchanged. That hides misuse and makes the eventual shape failure much harder to diagnose.

🩹 Proposed fix
 def patchify(x: torch.Tensor, patch_size_hw: int, patch_size_t: int = 1) -> torch.Tensor:
     """Rearrange spatial patches into the channel dimension (inverse of :func:`unpatchify`)."""
     if patch_size_hw == 1 and patch_size_t == 1:
         return x
     if x.dim() == 4:
         x = rearrange(x, "b c (h q) (w r) -> b (c r q) h w", q=patch_size_hw, r=patch_size_hw)
     elif x.dim() == 5:
         x = rearrange(
             x,
             "b c (f p) (h q) (w r) -> b (c p r q) f h w",
             p=patch_size_t,
             q=patch_size_hw,
             r=patch_size_hw,
         )
+    else:
+        raise ValueError(f"patchify expects a 4D or 5D tensor, got {x.dim()}D")
     return x
 
 
 def unpatchify(x: torch.Tensor, patch_size_hw: int, patch_size_t: int = 1) -> torch.Tensor:
     if patch_size_hw == 1 and patch_size_t == 1:
         return x
     if x.dim() == 4:
         x = rearrange(x, "b (c r q) h w -> b c (h q) (w r)", q=patch_size_hw, r=patch_size_hw)
     elif x.dim() == 5:
         x = rearrange(
             x,
             "b (c p r q) f h w -> b c (f p) (h q) (w r)",
             p=patch_size_t,
             q=patch_size_hw,
             r=patch_size_hw,
         )
+    else:
+        raise ValueError(f"unpatchify expects a 4D or 5D tensor, got {x.dim()}D")
     return x
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/ops.py` around
lines 10 - 40, Both patchify and unpatchify silently return tensors with
unsupported ranks; change them to validate tensor rank after the trivial
patch_size==1 short-circuit and raise a clear ValueError if x.dim() is not 4 or
5. Specifically, in patchify and unpatchify, after handling the patch_size == 1
case, check x.dim(); if it's neither 4 nor 5, raise an error that includes the
function name (patchify/unpatchify), the received x.dim(), and the expected
ranks (4 or 5) so callers fail fast with a helpful message.
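The rank check in the proposed fix can be exercised in isolation. Below is a minimal NumPy sketch of the 4D branch (the einops pattern `b c (h q) (w r) -> b (c r q) h w`, rewritten with plain reshape/transpose so it runs without `einops` or `torch`); the function name and shapes mirror the finding but are illustrative only.

```python
import numpy as np

def patchify(x: np.ndarray, p: int) -> np.ndarray:
    # Illustrative 4D-only version with the fail-fast rank check the review
    # proposes; the real helper also handles 5D video tensors via einops.
    if x.ndim != 4:
        raise ValueError(f"patchify expects a 4D tensor, got {x.ndim}D")
    b, c, H, W = x.shape
    h, w = H // p, W // p
    x = x.reshape(b, c, h, p, w, p)    # split H -> (h q) and W -> (w r)
    x = x.transpose(0, 1, 5, 3, 2, 4)  # reorder to b c r q h w
    return x.reshape(b, c * p * p, h, w)

out = patchify(np.zeros((1, 3, 8, 8)), 2)
print(out.shape)  # (1, 12, 4, 4)
```

With the check in place, a 3D input raises immediately instead of flowing through unchanged and failing later at an unrelated shape assertion.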
examples/visual_gen/visual_gen_ltx2.py-1-12 (1)

1-12: ⚠️ Potential issue | 🟡 Minor

Missing SPDX copyright header.

Per coding guidelines, all Python source files should contain an NVIDIA copyright header. This example script is missing the standard Apache 2.0 license block.

📝 Suggested header
 #!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES.
+# SPDX-License-Identifier: Apache-2.0
 """LTX2 Text/Image-to-Video generation using TensorRT-LLM Visual Generation."""

As per coding guidelines: "All TensorRT-LLM source files should contain an NVIDIA copyright header."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/visual_gen/visual_gen_ltx2.py` around lines 1 - 12, Add the missing
NVIDIA Apache-2.0 copyright header to the top of this Python script (after the
existing shebang line) so it follows project guidelines; insert the standard
SPDX-License-Identifier: Apache-2.0 and the full NVIDIA copyright/license block
used across TensorRT-LLM files into visual_gen_ltx2.py (the module that imports
VisualGen/VisualGenParams and sets logger level) ensuring the header appears
before any imports or code.
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/video_vae.py-160-160 (1)

160-160: ⚠️ Potential issue | 🟡 Minor

Mutable default argument: use None instead of [].

Mutable default arguments like [] are shared across all instances, which can lead to subtle bugs.

Suggested fix
     def __init__(
         self,
         convolution_dimensions: int = 3,
         in_channels: int = 3,
         out_channels: int = 128,
-        encoder_blocks: List[Tuple[str, int | dict]] = [],
+        encoder_blocks: List[Tuple[str, int | dict]] | None = None,
         patch_size: int = 4,
         norm_layer: NormLayerType = NormLayerType.PIXEL_NORM,
         causal: bool = True,
         timestep_conditioning: bool = False,
         encoder_spatial_padding_mode: PaddingModeType = PaddingModeType.ZEROS,
     ):
         super().__init__()
+        if encoder_blocks is None:
+            encoder_blocks = []
         self.patch_size = patch_size
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/video_vae.py`
at line 160, The parameter encoder_blocks currently uses a mutable default list
(encoder_blocks: List[Tuple[str, int | dict]] = []); change the signature to use
None as the default (encoder_blocks: ... = None) and inside the constructor or
function (where encoder_blocks is processed) set encoder_blocks = [] if
encoder_blocks is None to avoid sharing the same list across instances; update
any type checks or usages accordingly to treat None as "no blocks" and preserve
existing behavior.
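The shared-default behavior this finding describes is easy to reproduce with a stdlib-only toy class (the class and block names here are hypothetical, not the actual encoder signature):

```python
class Encoder:
    def __init__(self, blocks=[]):       # bug: one list object shared by every call
        self.blocks = blocks

class SafeEncoder:
    def __init__(self, blocks=None):     # fix: fresh list per instance
        self.blocks = [] if blocks is None else blocks

a, b = Encoder(), Encoder()
a.blocks.append(("res_x", 2))
print(b.blocks)        # [('res_x', 2)] -- b sees a's mutation

c, d = SafeEncoder(), SafeEncoder()
c.blocks.append(("res_x", 2))
print(d.blocks)        # [] -- instances stay independent
```

The default expression is evaluated once at function definition time, which is why the None-sentinel pattern in the suggested fix is the idiomatic repair.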
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/video_vae.py-334-334 (1)

334-334: ⚠️ Potential issue | 🟡 Minor

Same mutable default argument issue.

Suggested fix
     def __init__(
         self,
         convolution_dimensions: int = 3,
         in_channels: int = 128,
         out_channels: int = 3,
-        decoder_blocks: List[Tuple[str, int | dict]] = [],
+        decoder_blocks: List[Tuple[str, int | dict]] | None = None,
         patch_size: int = 4,
         ...
     ):
         super().__init__()
+        if decoder_blocks is None:
+            decoder_blocks = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/video_vae.py`
at line 334, The parameter decoder_blocks currently uses a mutable default list
(decoder_blocks: List[Tuple[str, int | dict]] = []), which can lead to shared
state bugs; change its default to None and inside the function (or __init__)
check if decoder_blocks is None and then assign an empty list (e.g.,
decoder_blocks = []), ensuring subsequent mutations are local; update any type
hints or usages of decoder_blocks accordingly and keep the parameter name
decoder_blocks to locate the change.
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/timestep_embedding.py-45-45 (1)

45-45: ⚠️ Potential issue | 🟡 Minor

post_act_fn parameter is declared but never used.

The post_act_fn parameter is accepted but never used to initialize self.post_act, which is always set to None (line 62). The forward method checks self.post_act but it will always be None.

This appears to be dead code. Either implement the post-activation or remove the parameter.

Option 1: Remove unused parameter
     def __init__(
         self,
         in_channels: int,
         time_embed_dim: int,
         out_dim: int | None = None,
-        post_act_fn: str | None = None,
         cond_proj_dim: int | None = None,
         sample_proj_bias: bool = True,
         make_linear=None,
     ):
Option 2: Implement the functionality
         self.linear_2 = make_linear(time_embed_dim, time_embed_dim_out, bias=sample_proj_bias)
-        self.post_act = None
+        if post_act_fn == "silu":
+            self.post_act = torch.nn.SiLU()
+        elif post_act_fn is not None:
+            raise ValueError(f"Unknown post_act_fn: {post_act_fn}")
+        else:
+            self.post_act = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/timestep_embedding.py`
at line 45, The constructor argument post_act_fn is accepted but never assigned
to self.post_act (so self.post_act remains None) causing the forward path that
checks self.post_act to be dead; fix by either removing the post_act_fn
parameter and all uses of self.post_act in the class, or implement it by mapping
post_act_fn to an activation callable and assign it to self.post_act in __init__
(e.g., support names like "gelu", "relu" or a passed callable) and then let
forward call self.post_act(tensor) when present; update the __init__ signature
and the forward method accordingly (look for post_act_fn, self.post_act,
__init__ and forward in this file).
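Option 2 can be sketched independently of torch: resolve the activation name to a callable at construction time and raise on unknown names. The scalar SiLU and the `_ACTIVATIONS` registry below are illustrative stand-ins for `torch.nn.SiLU`; the helper name is hypothetical.

```python
import math

_ACTIVATIONS = {
    "silu": lambda x: x / (1.0 + math.exp(-x)),  # scalar x * sigmoid(x), for illustration
    "relu": lambda x: max(0.0, x),
}

def resolve_post_act(post_act_fn):
    # None means "no post-activation"; anything else must be a known name.
    if post_act_fn is None:
        return None
    try:
        return _ACTIVATIONS[post_act_fn]
    except KeyError:
        raise ValueError(f"Unknown post_act_fn: {post_act_fn!r}") from None

act = resolve_post_act("silu")
print(round(act(1.0), 3))      # 0.731
print(resolve_post_act(None))  # None
```

Assigning the resolved callable to `self.post_act` in `__init__` makes the existing `if self.post_act is not None` branch in `forward` reachable, while an unknown name fails loudly instead of being silently dropped.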
tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/tiling.py-177-193 (1)

177-193: ⚠️ Potential issue | 🟡 Minor

Validate mapper cardinality per axis before building the Cartesian product.

Right now only the final product sizes are compared via zip(..., strict=True). A mapper that returns a different number of output_slices or masks_1d than input intervals can still pair the wrong tiles if the per-axis counts happen to multiply to the same total.

🧭 Possible fix
         starts = dimension_intervals.starts
         ends = dimension_intervals.ends
         input_slices = [slice(s, e) for s, e in zip(starts, ends, strict=True)]
         output_slices, masks_1d = mappers[axis_index](dimension_intervals)
+        if len(output_slices) != len(input_slices) or len(masks_1d) != len(input_slices):
+            raise ValueError(
+                f"Mapper for axis {axis_index} must return one output slice and mask per input interval"
+            )
         full_dim_input_slices.append(input_slices)
         full_dim_output_slices.append(output_slices)
         full_dim_masks_1d.append(masks_1d)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/tiling.py`
around lines 177 - 193, The code builds Cartesian products of per-axis
input/output slices and masks but only enforces total cardinality via zip(...,
strict=True), which can mask per-axis mismatches; before calling
itertools.product, iterate each axis (using intervals.dimension_intervals and
mappers[axis_index]) and validate that for that axis the counts match (e.g.,
len(input_slices) == len(output_slices) == len(masks_1d)); if any axis
mismatches, raise a clear ValueError including the axis index and the three
lengths; only after all per-axis counts are verified, proceed to build
tile_in_coords/tile_out_coords/tile_mask_1ds and create Tile(in_coords=...,
out_coords=..., masks_1d=...).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 85afa3b5-cb20-4dbb-a8b8-e5f71c730a59

📥 Commits

Reviewing files that changed from the base of the PR and between 69b6203 and 541d0bd.

📒 Files selected for processing (65)
  • .pre-commit-config.yaml
  • LICENSE
  • examples/visual_gen/README.md
  • examples/visual_gen/serve/README.md
  • examples/visual_gen/serve/configs/ltx2.yml
  • examples/visual_gen/visual_gen_examples.sh
  • examples/visual_gen/visual_gen_ltx2.py
  • tensorrt_llm/_torch/autotuner.py
  • tensorrt_llm/_torch/visual_gen/checkpoints/weight_loader.py
  • tensorrt_llm/_torch/visual_gen/config.py
  • tensorrt_llm/_torch/visual_gen/executor.py
  • tensorrt_llm/_torch/visual_gen/models/__init__.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/NOTICE
  • tensorrt_llm/_torch/visual_gen/models/ltx2/__init__.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/__init__.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/adaln.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/attention.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/__init__.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/attention.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/audio_vae.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/causal_conv_2d.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/causality_axis.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/model_configurator.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/ops.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/resnet.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/upsample.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/audio_vae/vocoder.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/connector.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/diffusion_steps.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/guiders.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/modality.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/normalization.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/patchifier.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/perturbations.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/protocols.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/rope.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/scheduler_adapter.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/schedulers.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/text_projection.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/timestep_embedding.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/transformer_args.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/types.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/utils_ltx2.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/__init__.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/convolution.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/enums.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/model_configurator.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/normalization.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/ops.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/resnet.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/sampling.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/tiling.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/ltx2_core/video_vae/video_vae.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/transformer_ltx2.py
  • tensorrt_llm/_torch/visual_gen/parallelism.py
  • tensorrt_llm/_torch/visual_gen/pipeline.py
  • tensorrt_llm/_torch/visual_gen/pipeline_loader.py
  • tensorrt_llm/_torch/visual_gen/pipeline_registry.py
  • tensorrt_llm/_torch/visual_gen/quantization/loader.py
  • tensorrt_llm/_torch/visual_gen/teacache.py
  • tensorrt_llm/llmapi/visual_gen.py
  • tests/unittest/_torch/visual_gen/test_ltx2_attention.py
  • tests/unittest/_torch/visual_gen/test_ltx2_pipeline.py
  • tests/unittest/_torch/visual_gen/test_ltx2_transformer.py

@tensorrt-cicd
Copy link
Collaborator

PR_Github #38152 [ run ] completed with state SUCCESS. Commit: 541d0bd
/LLM/main/L0_MergeRequest_PR pipeline #29558 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation


Supports two layouts:

* **Pipeline (diffusers)** -- ``model_index.json`` with component
Copy link

Member


Better to add links to these layouts for reference.

Copy link
Collaborator Author


Agreed, we should have a reference to describe the LTX-2 format. I am thinking about two options:

  1. Add it inline (comments) in weight loading code.
  2. In README

Option 2 might be overkill because our README is already packed with information, but the format discussion could get pretty long if we move it into code comments. What do you think?

A sample layout reference

 LTX-2 Specific Checkpoint Format

  Similar to standard HF single safetensors:
  - Single .safetensors file containing all weights
  - Standard safetensors binary format

  Key differences:

  1. Embedded config in metadata — the safetensors header contains a "config" key with the full JSON config for all components (transformer, VAE, audio VAE,
  vocoder). Standard HF models keep config in a separate config.json.
  2. Non-standard weight key prefixes:
    - Transformer: model.diffusion_model.* (not transformer.* or bare keys)
    - Video VAE: vae.decoder.*
    - Audio VAE: audio_vae.decoder.*
    - Vocoder: vocoder.*
  3. Multiple components in one file — the single checkpoint bundles the denoiser, video VAE, audio VAE, vocoder, and connectors together. Standard HF checkpoints
  are typically one model per file.
  4. Text encoder is separate — Gemma3 lives in its own directory and is loaded via the standard from_pretrained() path.

  Detection Logic in the TRT-LLM codebase
  1. No model_index.json present → not diffusers
  2. Safetensors metadata "config" key contains both "transformer" and "vae" → LTX2Pipeline
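The detection logic in step 2 can be prototyped with the standard library alone, since a safetensors file starts with an 8-byte little-endian header length followed by a JSON header whose `__metadata__` entry holds string-valued metadata. The sketch below fabricates an in-memory header for illustration; the `config` key layout follows the description above, but the helper names are hypothetical.

```python
import io
import json
import struct

def read_safetensors_metadata(f) -> dict:
    # Read only the JSON header: 8-byte LE length prefix, then the header bytes.
    (header_len,) = struct.unpack("<Q", f.read(8))
    header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})

def looks_like_ltx2(metadata: dict) -> bool:
    # LTX-2 embeds the full component config as a JSON string under "config".
    cfg = json.loads(metadata.get("config", "{}"))
    return "transformer" in cfg and "vae" in cfg

# Fabricated in-memory checkpoint header, for illustration only:
cfg = json.dumps({"transformer": {}, "vae": {}, "audio_vae": {}, "vocoder": {}})
header = json.dumps({"__metadata__": {"config": cfg}}).encode()
blob = io.BytesIO(struct.pack("<Q", len(header)) + header)
print(looks_like_ltx2(read_safetensors_metadata(blob)))  # True
```

Reading only the length-prefixed header avoids loading any tensor data, so the check stays cheap even for a multi-gigabyte monolithic checkpoint.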

Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
@yibinl-nvidia yibinl-nvidia force-pushed the dev-yibinl-ltx2-support branch from 015f0f4 to c9578a4 Compare March 12, 2026 03:18