Commit 2eee701

[TRTLLM-10617][feat] LTX-2 Model Support (#12009)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
1 parent 5cc0ccd commit 2eee701

67 files changed: +9991 −81 lines

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -1442,7 +1442,7 @@ repos:
       additional_dependencies:
         - tomli
       # add ignore words list
-      args: ["-L", "Mor,ans,thirdparty,subtiles,PARD,pard", "--skip", "ATTRIBUTIONS-*.md,*.svg", "--skip", "security_scanning/*", "--skip", "tensorrt_llm/_torch/visual_gen/jit_kernels/*"]
+      args: ["-L", "Mor,ans,thirdparty,subtiles,PARD,pard,therefrom", "--skip", "ATTRIBUTIONS-*.md,*.svg", "--skip", "security_scanning/*", "--skip", "tensorrt_llm/_torch/visual_gen/jit_kernels/*"]
       exclude: 'scripts/attribution/data/cas/.*$'
 - repo: https://github.com/astral-sh/ruff-pre-commit
   rev: v0.9.4

LICENSE

Lines changed: 396 additions & 0 deletions
Large diffs are not rendered by default.

docs/source/models/visual-generation.md

Lines changed: 3 additions & 1 deletion
@@ -30,8 +30,9 @@ TensorRT-LLM **VisualGen** provides a unified inference stack for diffusion mode
 | `Wan-AI/Wan2.1-I2V-14B-720P-Diffusers` | Image-to-Video |
 | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | Text-to-Video |
 | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | Image-to-Video |
+| `Lightricks/LTX-Video` | Text-to-Video (with Audio), Image-to-Video (with Audio) |
 
-Models are auto-detected from the `model_index.json` file in the checkpoint directory. The `AutoPipeline` registry selects the appropriate pipeline class automatically.
+Models are auto-detected from the checkpoint directory. Diffusers-format models are detected via `model_index.json`; LTX-2 monolithic safetensors checkpoints are detected via embedded metadata. The `AutoPipeline` registry selects the appropriate pipeline class automatically.
 
 ### Feature Matrix
 
@@ -41,6 +42,7 @@ Models are auto-detected from the `model_index.json` file in the checkpoint dire
 | **FLUX.2** | Yes | Yes | Yes | No [^1] | Yes | No | Yes | Yes | Yes |
 | **Wan 2.1** | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
 | **Wan 2.2** | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes |
+| **LTX-2** | Yes | Yes | No | Yes | Yes | No | No | Yes | Yes |
 
 [^1]: FLUX models use embedded guidance and do not have a separate negative prompt path, so CFG parallelism is not applicable.
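The two-step detection order documented above can be sketched as follows. This is a hypothetical illustration of the described behavior, not the actual `AutoPipeline` code; the safetensors header layout assumed here is the standard 8-byte little-endian length prefix followed by a JSON header with an optional `__metadata__` map.

```python
import json
import struct
from pathlib import Path


def detect_checkpoint_format(checkpoint_dir: str) -> str:
    """Sketch of checkpoint auto-detection: Diffusers layout first, then
    monolithic safetensors metadata."""
    root = Path(checkpoint_dir)
    index = root / "model_index.json"
    if index.exists():
        # Diffusers layout: the pipeline class name lives in model_index.json.
        return json.loads(index.read_text()).get("_class_name", "unknown")
    for st in sorted(root.glob("*.safetensors")):
        # Monolithic safetensors: the first 8 bytes are a little-endian u64
        # header length; the JSON header's optional "__metadata__" key holds
        # embedded string metadata.
        with open(st, "rb") as f:
            (header_len,) = struct.unpack("<Q", f.read(8))
            header = json.loads(f.read(header_len))
        return header.get("__metadata__", {}).get("format", "safetensors")
    raise ValueError(f"Unrecognized checkpoint layout: {checkpoint_dir}")
```

The `"format"` metadata key is an assumed placeholder; whatever key the LTX-2 checkpoints actually embed, the detection flow stays the same.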

examples/visual_gen/README.md

Lines changed: 85 additions & 14 deletions
@@ -139,22 +139,92 @@ python visual_gen_wan_i2v.py \
 ```
 
 
+## LTX2 (Text/Image-to-Video with Audio)
+
+LTX2 generates video **with audio** from text prompts or input images.
+It uses a Gemma3 text encoder (provided separately via `--text_encoder_path`)
+and supports BF16, FP8, and FP4 precision checkpoints.
+
+Please refer to `tensorrt_llm/_torch/visual_gen/models/ltx2/LTX_2_CHECKPOINT_FORMAT.md` for model checkpoint info.
+
+### Basic Usage
+
+**Text-to-Video (single GPU):**
+```bash
+python visual_gen_ltx2.py \
+  --model_path ${MODEL_ROOT}/LTX-2-checkpoint/ \
+  --text_encoder_path ${MODEL_ROOT}/gemma-3-12b-it \
+  --prompt "A cute cat playing piano" \
+  --height 720 --width 1280 --num_frames 121 \
+  --steps 40 --guidance_scale 4.0 --seed 42 \
+  --output_path output_t2v.mp4
+```
+
+**Image-to-Video:**
+```bash
+python visual_gen_ltx2.py \
+  --model_path ${MODEL_ROOT}/LTX-2-checkpoint/ \
+  --text_encoder_path ${MODEL_ROOT}/gemma-3-12b-it \
+  --prompt "A cute cat playing piano" \
+  --image ${PROJECT_ROOT}/examples/visual_gen/cat_piano.png \
+  --image_cond_strength 1.0 \
+  --height 720 --width 1280 --num_frames 121 \
+  --steps 40 --seed 42 \
+  --output_path output_i2v.mp4
+```
+
+### Precision Variants
+
+LTX2 ships checkpoints at three precision levels. Simply point `--model_path` at the
+appropriate directory:
+
+```bash
+# FP8
+python visual_gen_ltx2.py \
+  --model_path ${MODEL_ROOT}/LTX-2-checkpoint/fp8/ \
+  --text_encoder_path ${MODEL_ROOT}/gemma-3-12b-it \
+  --prompt "A cute cat playing piano" \
+  --height 720 --width 1280 --num_frames 121 \
+  --output_path output_fp8.mp4
+
+# FP4
+python visual_gen_ltx2.py \
+  --model_path ${MODEL_ROOT}/LTX-2-checkpoint/fp4/ \
+  --text_encoder_path ${MODEL_ROOT}/gemma-3-12b-it \
+  --prompt "A cute cat playing piano" \
+  --height 512 --width 768 --num_frames 121 \
+  --output_path output_fp4.mp4
+```
+
+---
+
 ## Common Arguments
 
-| Argument | FLUX | WAN | Default | Description |
-|----------|------|-----|---------|-------------|
-| `--height` | | | 1024 / 720 | Output height |
-| `--width` | | | 1024 / 1280 | Output width |
-| `--num_frames` | | | 81 | Number of frames |
-| `--steps` | | | 50 | Denoising steps |
-| `--guidance_scale` | | | 3.5 / 5.0 | Guidance strength |
-| `--seed` | | | 42 | Random seed |
-| `--enable_teacache` | | | False | Cache optimization |
-| `--teacache_thresh` | | | 0.2 | TeaCache similarity threshold |
-| `--attention_backend` | | | VANILLA | `VANILLA`, `TRTLLM`, or `FA4` |
-| `--cfg_size` | | | 1 | CFG parallelism |
-| `--ulysses_size` | | | 1 | Sequence parallelism |
-| `--linear_type` | | | default | Quantization type |
+| Argument | FLUX | WAN | LTX2 | Default | Description |
+|----------|------|-----|------|---------|-------------|
+| `--model_path` | | | | | Path to model checkpoint directory |
+| `--text_encoder_path` | | | | | Path to Gemma3 text encoder |
+| `--prompt` | | | | | Text prompt for generation |
+| `--negative_prompt` | | | | *(built-in)* | Negative prompt |
+| `--height` | | | | 1024 / 720 | Output height |
+| `--width` | | | | 1024 / 1280 | Output width |
+| `--num_frames` | | | | 81 / 121 | Number of frames |
+| `--frame_rate` | | | | 24.0 | Output frame rate (fps) |
+| `--steps` | | | | 50 / 40 | Denoising steps |
+| `--guidance_scale` | | | | 3.5 / 5.0 / 4.0 | Guidance strength |
+| `--seed` | | | | 42 | Random seed |
+| `--image` | | | | None | Input image for image-to-video |
+| `--image_cond_strength` | | | | 1.0 | Image conditioning strength |
+| `--enable_teacache` | | | | False | Cache optimization |
+| `--teacache_thresh` | | | | 0.2 | TeaCache similarity threshold |
+| `--attention_backend` | | | | VANILLA | `VANILLA`, `TRTLLM`, or `FA4` |
+| `--cfg_size` | | | | 1 | CFG parallelism |
+| `--ulysses_size` | | | | 1 | Sequence parallelism |
+| `--linear_type` | | | | default | Quantization type |
+| `--enhance_prompt` | | | | False | Gemma3 prompt enhancement |
+| `--stg_scale` | | | | 0.0 | Spatiotemporal guidance scale |
+| `--modality_scale` | | | | 1.0 | Cross-modal guidance scale |
+| `--rescale_scale` | | | | 0.0 | Variance-preserving rescale factor |
 
 ## Troubleshooting
 
@@ -182,6 +252,7 @@ python visual_gen_wan_i2v.py \
 
 - **FLUX**: `.png` (image)
 - **WAN**: `.mp4` if FFmpeg is installed, otherwise `.avi` (video)
+- **LTX2**: `.mp4` (video with audio) if FFmpeg is installed, otherwise `.avi` (video)
 
 ## Serving

examples/visual_gen/serve/README.md

Lines changed: 49 additions & 4 deletions
@@ -42,6 +42,7 @@ Before running these examples, ensure you have:
    trtllm-serve $LLM_MODEL_DIR/Wan2.1-T2V-1.3B-Diffusers --extra_visual_gen_options ./configs/wan.yml
    trtllm-serve $LLM_MODEL_DIR/FLUX.1-dev --extra_visual_gen_options ./configs/flux1.yml
    trtllm-serve $LLM_MODEL_DIR/FLUX.2-dev --extra_visual_gen_options ./configs/flux2.yml
+   trtllm-serve $LLM_MODEL_DIR/LTX-2/ --extra_visual_gen_options ./configs/ltx2.yml
 
    # Run server on background:
    trtllm-serve $LLM_MODEL_DIR/Wan2.1-T2V-1.3B-Diffusers --extra_visual_gen_options ./configs/wan.yml > /tmp/serve.log 2>&1 &
@@ -50,6 +51,7 @@ Before running these examples, ensure you have:
    tail -f /tmp/serve.log
 
    ```
+For LTX-2, you must set a valid `text_encoder_path` in `./configs/ltx2.yml`.
 
 ## Examples

@@ -58,6 +60,7 @@ Current supported & tested models:
 1. WAN T2V/I2V for video generation (t2v, ti2v, delete_video)
 2. FLUX.1 for image generation (t2i)
 3. FLUX.2 for image generation (t2i)
+4. LTX-2 for video generation with audio (t2v, ti2v)
 
 ### 1. Synchronous Image Generation (`sync_image_gen.py`)
@@ -118,14 +121,27 @@ python sync_video_gen.py --mode t2v \
   --prompt "A serene sunset over the ocean" \
   --duration 5.0 --fps 30 --size 512x512 \
   --output my_video.mp4
+
+# LTX-2: Text-to-Video (generates video with audio)
+python sync_video_gen.py --mode t2v \
+  --model ltx2 \
+  --prompt "A cute cat playing with a ball in the park" \
+  --duration 5.0 --fps 24 --size 1280x720
+
+# LTX-2: Image-to-Video
+python sync_video_gen.py --mode ti2v \
+  --model ltx2 \
+  --prompt "She turns around and smiles, then slowly walks out of the frame" \
+  --image ./media/woman_skyline_original_720p.jpeg \
+  --duration 5.0 --fps 24 --size 1280x720
 ```
 
 **Command-Line Arguments:**
 - `--mode` - Generation mode: `t2v` or `ti2v` (default: t2v)
 - `--prompt` - Text prompt for video generation (required)
 - `--image` - Path to reference image (required for ti2v mode)
 - `--base-url` - API server URL (default: http://localhost:8000/v1)
-- `--model` - Model name (default: wan)
+- `--model` - Model name (default: wan). Use `ltx2` for LTX-2.
 - `--duration` - Video duration in seconds (default: 4.0)
 - `--fps` - Frames per second (default: 24)
 - `--size` - Video resolution in WxH format (default: 256x256)
@@ -171,14 +187,27 @@ python async_video_gen.py --mode t2v \
   --prompt "A serene sunset over the ocean" \
   --duration 5.0 --fps 30 --size 512x512 \
   --output my_video.mp4
+
+# LTX-2: Async Text-to-Video (generates video with audio)
+python async_video_gen.py --mode t2v \
+  --model ltx2 \
+  --prompt "A cool cat on a motorcycle in the night" \
+  --duration 5.0 --fps 24 --size 1280x720
+
+# LTX-2: Async Image-to-Video
+python async_video_gen.py --mode ti2v \
+  --model ltx2 \
+  --prompt "She turns around and smiles, then slowly walks out of the frame" \
+  --image ./media/woman_skyline_original_720p.jpeg \
+  --duration 5.0 --fps 24 --size 1280x720
 ```
 
 **Command-Line Arguments:**
 - `--mode` - Generation mode: `t2v` or `ti2v` (default: t2v)
 - `--prompt` - Text prompt for video generation (required)
 - `--image` - Path to reference image (required for ti2v mode)
 - `--base-url` - API server URL (default: http://localhost:8000/v1)
-- `--model` - Model name (default: wan)
+- `--model` - Model name (default: wan). Use `ltx2` for LTX-2.
 - `--duration` - Video duration in seconds (default: 4.0)
 - `--fps` - Frames per second (default: 24)
 - `--size` - Video resolution in WxH format (default: 256x256)
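The serve clients express length as `--duration`/`--fps` while the underlying pipeline takes a frame count. A minimal sketch of that conversion, assuming frames = round(seconds × fps) + 1, which matches the values appearing in this commit (5.0 s at 24 fps alongside `--num_frames 121`); the helper name is hypothetical and the real server-side logic may differ:

```python
def duration_to_num_frames(seconds: float, fps: float) -> int:
    """Hypothetical (seconds, fps) -> num_frames mapping; assumes
    frames = round(seconds * fps) + 1, consistent with 5.0 s @ 24 fps -> 121."""
    if seconds <= 0 or fps <= 0:
        raise ValueError("seconds and fps must be positive")
    return int(round(seconds * fps)) + 1
```

Under this assumption the client defaults above (4.0 s, 24 fps) correspond to 97 frames, and the LTX-2 examples (5.0 s, 24 fps) to 121.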
@@ -249,13 +278,16 @@ You can customize these by:
 - `response_format`: "b64_json" or "url"
 
 ### Video Generation
-- `model`: Model identifier (e.g., "wan")
+- `model`: Model identifier (e.g., "wan", "ltx2")
 - `prompt`: Text description
-- `size`: Video resolution (e.g., "256x256", "512x512")
+- `size`: Video resolution (e.g., "256x256", "512x512", "1280x720")
 - `seconds`: Duration in seconds
 - `fps`: Frames per second
 - `input_reference`: Reference image file (for TI2V mode)
 
+> **Note:** LTX-2 generates video **with audio**. The `ltx2.yml` config must include
+> `text_encoder_path` pointing to a Gemma3 model (e.g., `google/gemma-3-12b-it`).
+
 ## Quick Reference - curl Examples
 
 ### Text-to-Video (JSON)
@@ -270,6 +302,19 @@ curl -X POST "http://localhost:8000/v1/videos" \
   }'
 ```
 
+### Text-to-Video with LTX-2 (JSON, generates video with audio)
+```bash
+curl -X POST "http://localhost:8000/v1/videos" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "ltx2",
+    "prompt": "A cool cat on a motorcycle",
+    "seconds": 5.0,
+    "fps": 24,
+    "size": "1280x720"
+  }'
+```
+
 ### Text+Image-to-Video (Multipart with File Upload)
 ```bash
 curl -X POST "http://localhost:8000/v1/videos" \
examples/visual_gen/serve/configs/ltx2.yml

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
+text_encoder_path: google/gemma-3-12b-it
+linear:
+  type: default
+attention:
+  backend: VANILLA
+parallel:
+  dit_cfg_size: 1
+  dit_ulysses_size: 1
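A hypothetical sanity check for the `parallel` block above, assuming that CFG parallelism and Ulysses sequence parallelism compose multiplicatively across GPUs (each CFG branch runs on its own Ulysses group). The function name is illustrative; the actual trtllm-serve validation may differ.

```python
def required_gpus(dit_cfg_size: int, dit_ulysses_size: int) -> int:
    """Hypothetical GPU-count check: CFG branches x Ulysses group size."""
    if dit_cfg_size < 1 or dit_ulysses_size < 1:
        raise ValueError("parallel sizes must be positive")
    return dit_cfg_size * dit_ulysses_size
```

Under this assumption, the single-GPU config above needs `required_gpus(1, 1) == 1` device, while e.g. `dit_cfg_size: 2` with `dit_ulysses_size: 4` would require 8.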
