Commit 2eee701

[TRTLLM-10617][feat] LTX-2 Model Support (#12009)
Signed-off-by: Yibin Li <109242046+yibinl-nvidia@users.noreply.github.com>
1 parent 5cc0ccd commit 2eee701

67 files changed: +9991 −81 lines

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -1442,7 +1442,7 @@ repos:
       additional_dependencies:
         - tomli
       # add ignore words list
-      args: ["-L", "Mor,ans,thirdparty,subtiles,PARD,pard", "--skip", "ATTRIBUTIONS-*.md,*.svg", "--skip", "security_scanning/*", "--skip", "tensorrt_llm/_torch/visual_gen/jit_kernels/*"]
+      args: ["-L", "Mor,ans,thirdparty,subtiles,PARD,pard,therefrom", "--skip", "ATTRIBUTIONS-*.md,*.svg", "--skip", "security_scanning/*", "--skip", "tensorrt_llm/_torch/visual_gen/jit_kernels/*"]
       exclude: 'scripts/attribution/data/cas/.*$'
 - repo: https://github.com/astral-sh/ruff-pre-commit
   rev: v0.9.4

LICENSE

Lines changed: 396 additions & 0 deletions
Large diffs are not rendered by default.

docs/source/models/visual-generation.md

Lines changed: 3 additions & 1 deletion
@@ -30,8 +30,9 @@ TensorRT-LLM **VisualGen** provides a unified inference stack for diffusion mode
 | `Wan-AI/Wan2.1-I2V-14B-720P-Diffusers` | Image-to-Video |
 | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | Text-to-Video |
 | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | Image-to-Video |
+| `Lightricks/LTX-Video` | Text-to-Video (with Audio), Image-to-Video (with Audio) |
 
-Models are auto-detected from the `model_index.json` file in the checkpoint directory. The `AutoPipeline` registry selects the appropriate pipeline class automatically.
+Models are auto-detected from the checkpoint directory. Diffusers-format models are detected via `model_index.json`; LTX-2 monolithic safetensors checkpoints are detected via embedded metadata. The `AutoPipeline` registry selects the appropriate pipeline class automatically.
 
 ### Feature Matrix
 
@@ -41,6 +42,7 @@ Models are auto-detected from the `model_index.json` file in the checkpoint dire
 | **FLUX.2** | Yes | Yes | Yes | No [^1] | Yes | No | Yes | Yes | Yes |
 | **Wan 2.1** | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
 | **Wan 2.2** | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes |
+| **LTX-2** | Yes | Yes | No | Yes | Yes | No | No | Yes | Yes |
 
 [^1]: FLUX models use embedded guidance and do not have a separate negative prompt path, so CFG parallelism is not applicable.
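The two-step detection order documented above can be sketched as follows. This is a hypothetical illustration of the described behavior, not the actual `AutoPipeline` code; the safetensors header layout assumed here is the standard 8-byte little-endian length prefix followed by a JSON header with an optional `__metadata__` map.

```python
import json
import struct
from pathlib import Path


def detect_checkpoint_format(checkpoint_dir: str) -> str:
    """Sketch of checkpoint auto-detection: Diffusers layout first, then
    monolithic safetensors metadata."""
    root = Path(checkpoint_dir)
    index = root / "model_index.json"
    if index.exists():
        # Diffusers layout: the pipeline class name lives in model_index.json.
        return json.loads(index.read_text()).get("_class_name", "unknown")
    for st in sorted(root.glob("*.safetensors")):
        # Monolithic safetensors: the first 8 bytes are a little-endian u64
        # header length; the JSON header's optional "__metadata__" key holds
        # embedded string metadata.
        with open(st, "rb") as f:
            (header_len,) = struct.unpack("<Q", f.read(8))
            header = json.loads(f.read(header_len))
        return header.get("__metadata__", {}).get("format", "safetensors")
    raise ValueError(f"Unrecognized checkpoint layout: {checkpoint_dir}")
```

The `"format"` metadata key is an assumed placeholder; whatever key the LTX-2 checkpoints actually embed, the detection flow stays the same.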

examples/visual_gen/README.md

Lines changed: 85 additions & 14 deletions
@@ -139,22 +139,92 @@ python visual_gen_wan_i2v.py \
 ```
 
 
+## LTX2 (Text/Image-to-Video with Audio)
+
+LTX2 generates video **with audio** from text prompts or input images.
+It uses a Gemma3 text encoder (provided separately via `--text_encoder_path`)
+and supports BF16, FP8, and FP4 precision checkpoints.
+
+Please refer to `tensorrt_llm/_torch/visual_gen/models/ltx2/LTX_2_CHECKPOINT_FORMAT.md` for model checkpoint info.
+
+### Basic Usage
+
+**Text-to-Video (single GPU):**
+```bash
+python visual_gen_ltx2.py \
+  --model_path ${MODEL_ROOT}/LTX-2-checkpoint/ \
+  --text_encoder_path ${MODEL_ROOT}/gemma-3-12b-it \
+  --prompt "A cute cat playing piano" \
+  --height 720 --width 1280 --num_frames 121 \
+  --steps 40 --guidance_scale 4.0 --seed 42 \
+  --output_path output_t2v.mp4
+```
+
+**Image-to-Video:**
+```bash
+python visual_gen_ltx2.py \
+  --model_path ${MODEL_ROOT}/LTX-2-checkpoint/ \
+  --text_encoder_path ${MODEL_ROOT}/gemma-3-12b-it \
+  --prompt "A cute cat playing piano" \
+  --image ${PROJECT_ROOT}/examples/visual_gen/cat_piano.png \
+  --image_cond_strength 1.0 \
+  --height 720 --width 1280 --num_frames 121 \
+  --steps 40 --seed 42 \
+  --output_path output_i2v.mp4
+```
+
+### Precision Variants
+
+LTX2 ships checkpoints at three precision levels. Simply point `--model_path` at the
+appropriate directory:
+
+```bash
+# FP8
+python visual_gen_ltx2.py \
+  --model_path ${MODEL_ROOT}/LTX-2-checkpoint/fp8/ \
+  --text_encoder_path ${MODEL_ROOT}/gemma-3-12b-it \
+  --prompt "A cute cat playing piano" \
+  --height 720 --width 1280 --num_frames 121 \
+  --output_path output_fp8.mp4
+
+# FP4
+python visual_gen_ltx2.py \
+  --model_path ${MODEL_ROOT}/LTX-2-checkpoint/fp4/ \
+  --text_encoder_path ${MODEL_ROOT}/gemma-3-12b-it \
+  --prompt "A cute cat playing piano" \
+  --height 512 --width 768 --num_frames 121 \
+  --output_path output_fp4.mp4
+```
+
+---
+
 ## Common Arguments
 
-| Argument | FLUX | WAN | Default | Description |
-|----------|------|-----|---------|-------------|
-| `--height` | | | 1024 / 720 | Output height |
-| `--width` | | | 1024 / 1280 | Output width |
-| `--num_frames` | | | 81 | Number of frames |
-| `--steps` | | | 50 | Denoising steps |
-| `--guidance_scale` | | | 3.5 / 5.0 | Guidance strength |
-| `--seed` | | | 42 | Random seed |
-| `--enable_teacache` | | | False | Cache optimization |
-| `--teacache_thresh` | | | 0.2 | TeaCache similarity threshold |
-| `--attention_backend` | | | VANILLA | `VANILLA`, `TRTLLM`, or `FA4` |
-| `--cfg_size` | | | 1 | CFG parallelism |
-| `--ulysses_size` | | | 1 | Sequence parallelism |
-| `--linear_type` | | | default | Quantization type |
+| Argument | FLUX | WAN | LTX2 | Default | Description |
+|----------|------|-----|------|---------|-------------|
+| `--model_path` | | | | | Path to model checkpoint directory |
+| `--text_encoder_path` | | | | | Path to Gemma3 text encoder |
+| `--prompt` | | | | | Text prompt for generation |
+| `--negative_prompt` | | | | *(built-in)* | Negative prompt |
+| `--height` | | | | 1024 / 720 | Output height |
+| `--width` | | | | 1024 / 1280 | Output width |
+| `--num_frames` | | | | 81 / 121 | Number of frames |
+| `--frame_rate` | | | | 24.0 | Output frame rate (fps) |
+| `--steps` | | | | 50 / 40 | Denoising steps |
+| `--guidance_scale` | | | | 3.5 / 5.0 / 4.0 | Guidance strength |
+| `--seed` | | | | 42 | Random seed |
+| `--image` | | | | None | Input image for image-to-video |
+| `--image_cond_strength` | | | | 1.0 | Image conditioning strength |
+| `--enable_teacache` | | | | False | Cache optimization |
+| `--teacache_thresh` | | | | 0.2 | TeaCache similarity threshold |
+| `--attention_backend` | | | | VANILLA | `VANILLA`, `TRTLLM`, or `FA4` |
+| `--cfg_size` | | | | 1 | CFG parallelism |
+| `--ulysses_size` | | | | 1 | Sequence parallelism |
+| `--linear_type` | | | | default | Quantization type |
+| `--enhance_prompt` | | | | False | Gemma3 prompt enhancement |
+| `--stg_scale` | | | | 0.0 | Spatiotemporal guidance scale |
+| `--modality_scale` | | | | 1.0 | Cross-modal guidance scale |
+| `--rescale_scale` | | | | 0.0 | Variance-preserving rescale factor |
 
 ## Troubleshooting
 
@@ -182,6 +252,7 @@ python visual_gen_wan_i2v.py \
 
 - **FLUX**: `.png` (image)
 - **WAN**: `.mp4` if FFmpeg is installed, otherwise `.avi` (video)
+- **LTX2**: `.mp4` (video with audio) if FFmpeg is installed, otherwise `.avi` (video)
 
 ## Serving

examples/visual_gen/serve/README.md

Lines changed: 49 additions & 4 deletions
@@ -42,6 +42,7 @@ Before running these examples, ensure you have:
    trtllm-serve $LLM_MODEL_DIR/Wan2.1-T2V-1.3B-Diffusers --extra_visual_gen_options ./configs/wan.yml
    trtllm-serve $LLM_MODEL_DIR/FLUX.1-dev --extra_visual_gen_options ./configs/flux1.yml
    trtllm-serve $LLM_MODEL_DIR/FLUX.2-dev --extra_visual_gen_options ./configs/flux2.yml
+   trtllm-serve $LLM_MODEL_DIR/LTX-2/ --extra_visual_gen_options ./configs/ltx2.yml
 
    # Run server on background:
    trtllm-serve $LLM_MODEL_DIR/Wan2.1-T2V-1.3B-Diffusers --extra_visual_gen_options ./configs/wan.yml > /tmp/serve.log 2>&1 &
@@ -50,6 +51,7 @@ Before running these examples, ensure you have:
    tail -f /tmp/serve.log
 
    ```
+For LTX-2, you must set a valid `text_encoder_path` in `./configs/ltx2.yml`.
 
 ## Examples

@@ -58,6 +60,7 @@ Current supported & tested models:
 1. WAN T2V/I2V for video generation (t2v, ti2v, delete_video)
 2. FLUX.1 for image generation (t2i)
 3. FLUX.2 for image generation (t2i)
+4. LTX-2 for video generation with audio (t2v, ti2v)
 
 ### 1. Synchronous Image Generation (`sync_image_gen.py`)
@@ -118,14 +121,27 @@ python sync_video_gen.py --mode t2v \
   --prompt "A serene sunset over the ocean" \
   --duration 5.0 --fps 30 --size 512x512 \
   --output my_video.mp4
+
+# LTX-2: Text-to-Video (generates video with audio)
+python sync_video_gen.py --mode t2v \
+  --model ltx2 \
+  --prompt "A cute cat playing with a ball in the park" \
+  --duration 5.0 --fps 24 --size 1280x720
+
+# LTX-2: Image-to-Video
+python sync_video_gen.py --mode ti2v \
+  --model ltx2 \
+  --prompt "She turns around and smiles, then slowly walks out of the frame" \
+  --image ./media/woman_skyline_original_720p.jpeg \
+  --duration 5.0 --fps 24 --size 1280x720
 ```
 
 **Command-Line Arguments:**
 - `--mode` - Generation mode: `t2v` or `ti2v` (default: t2v)
 - `--prompt` - Text prompt for video generation (required)
 - `--image` - Path to reference image (required for ti2v mode)
 - `--base-url` - API server URL (default: http://localhost:8000/v1)
-- `--model` - Model name (default: wan)
+- `--model` - Model name (default: wan). Use `ltx2` for LTX-2.
 - `--duration` - Video duration in seconds (default: 4.0)
 - `--fps` - Frames per second (default: 24)
 - `--size` - Video resolution in WxH format (default: 256x256)
@@ -171,14 +187,27 @@ python async_video_gen.py --mode t2v \
   --prompt "A serene sunset over the ocean" \
   --duration 5.0 --fps 30 --size 512x512 \
   --output my_video.mp4
+
+# LTX-2: Async Text-to-Video (generates video with audio)
+python async_video_gen.py --mode t2v \
+  --model ltx2 \
+  --prompt "A cool cat on a motorcycle in the night" \
+  --duration 5.0 --fps 24 --size 1280x720
+
+# LTX-2: Async Image-to-Video
+python async_video_gen.py --mode ti2v \
+  --model ltx2 \
+  --prompt "She turns around and smiles, then slowly walks out of the frame" \
+  --image ./media/woman_skyline_original_720p.jpeg \
+  --duration 5.0 --fps 24 --size 1280x720
 ```
 
 **Command-Line Arguments:**
 - `--mode` - Generation mode: `t2v` or `ti2v` (default: t2v)
 - `--prompt` - Text prompt for video generation (required)
 - `--image` - Path to reference image (required for ti2v mode)
 - `--base-url` - API server URL (default: http://localhost:8000/v1)
-- `--model` - Model name (default: wan)
+- `--model` - Model name (default: wan). Use `ltx2` for LTX-2.
 - `--duration` - Video duration in seconds (default: 4.0)
 - `--fps` - Frames per second (default: 24)
 - `--size` - Video resolution in WxH format (default: 256x256)
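The serve clients express length as `--duration`/`--fps` while the underlying pipeline takes a frame count. A minimal sketch of that conversion, assuming frames = round(seconds × fps) + 1, which matches the values appearing in this commit (5.0 s at 24 fps alongside `--num_frames 121`); the helper name is hypothetical and the real server-side logic may differ:

```python
def duration_to_num_frames(seconds: float, fps: float) -> int:
    """Hypothetical (seconds, fps) -> num_frames mapping; assumes
    frames = round(seconds * fps) + 1, consistent with 5.0 s @ 24 fps -> 121."""
    if seconds <= 0 or fps <= 0:
        raise ValueError("seconds and fps must be positive")
    return int(round(seconds * fps)) + 1
```

Under this assumption the client defaults above (4.0 s, 24 fps) correspond to 97 frames, and the LTX-2 examples (5.0 s, 24 fps) to 121.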
@@ -249,13 +278,16 @@ You can customize these by:
 - `response_format`: "b64_json" or "url"
 
 ### Video Generation
-- `model`: Model identifier (e.g., "wan")
+- `model`: Model identifier (e.g., "wan", "ltx2")
 - `prompt`: Text description
-- `size`: Video resolution (e.g., "256x256", "512x512")
+- `size`: Video resolution (e.g., "256x256", "512x512", "1280x720")
 - `seconds`: Duration in seconds
 - `fps`: Frames per second
 - `input_reference`: Reference image file (for TI2V mode)
 
+> **Note:** LTX-2 generates video **with audio**. The `ltx2.yml` config must include
+> `text_encoder_path` pointing to a Gemma3 model (e.g., `google/gemma-3-12b-it`).
+
 ## Quick Reference - curl Examples
 
 ### Text-to-Video (JSON)
@@ -270,6 +302,19 @@ curl -X POST "http://localhost:8000/v1/videos" \
   }'
 ```
 
+### Text-to-Video with LTX-2 (JSON, generates video with audio)
+```bash
+curl -X POST "http://localhost:8000/v1/videos" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "ltx2",
+    "prompt": "A cool cat on a motorcycle",
+    "seconds": 5.0,
+    "fps": 24,
+    "size": "1280x720"
+  }'
+```
+
 ### Text+Image-to-Video (Multipart with File Upload)
 ```bash
 curl -X POST "http://localhost:8000/v1/videos" \
examples/visual_gen/serve/configs/ltx2.yml

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
+text_encoder_path: google/gemma-3-12b-it
+linear:
+  type: default
+attention:
+  backend: VANILLA
+parallel:
+  dit_cfg_size: 1
+  dit_ulysses_size: 1
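A hypothetical sanity check for the `parallel` block above, assuming that CFG parallelism and Ulysses sequence parallelism compose multiplicatively across GPUs (each CFG branch runs on its own Ulysses group). The function name is illustrative; the actual trtllm-serve validation may differ.

```python
def required_gpus(dit_cfg_size: int, dit_ulysses_size: int) -> int:
    """Hypothetical GPU-count check: CFG branches x Ulysses group size."""
    if dit_cfg_size < 1 or dit_ulysses_size < 1:
        raise ValueError("parallel sizes must be positive")
    return dit_cfg_size * dit_ulysses_size
```

Under this assumption, the single-GPU config above needs `required_gpus(1, 1) == 1` device, while e.g. `dit_cfg_size: 2` with `dit_ulysses_size: 4` would require 8.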
