
[Model] support Ltx2 text-to-video image-to-video #841

Open
david6666666 wants to merge 17 commits into vllm-project:main from david6666666:ltx2

Conversation

@david6666666
Collaborator

@david6666666 david6666666 commented Jan 19, 2026


Purpose

Support LTX-2 text-to-video and image-to-video generation; refer to huggingface/diffusers#12915.

Test Plan

t2v:

python text_to_video.py \
  --model "/workspace/models/Lightricks/LTX-2" \
  --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
  --negative_prompt "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" \
  --height 512 --width 768 --num_frames 121 \
  --num_inference_steps 40 --guidance_scale 4.0 \
  --frame_rate 24 --fps 24 \
  --seed 0 \
  --enable-cpu-offload \
  --output ltx2_t2v_diff.mp4

diffusers:

import torch
from diffusers.pipelines.ltx2 import LTX2Pipeline
from diffusers.pipelines.ltx2.export_utils import encode_video

generator = torch.Generator("cuda").manual_seed(0)
pipe = LTX2Pipeline.from_pretrained("/workspace/models/Lightricks/LTX-2", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = "A cinematic close-up of ocean waves at golden hour."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

frame_rate = 24.0
video, audio = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    guidance_scale=4.0,
    output_type="np",
    generator=generator,
    return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_sample.mp4",
)

i2v:

python examples/offline_inference/image_to_video/image_to_video.py \
  --model "/workspace/models/Lightricks/LTX-2" \
  --model_class_name "LTX2ImageToVideoPipeline" \
  --image astronaut.jpg \
  --prompt "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a breath-taking, movie-like shot." \
  --negative_prompt "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static." \
  --height 512 --width 768 --num_frames 121 \
  --num_inference_steps 40 --guidance_scale 4.0 \
  --frame_rate 24 \
  --seed 0 \
  --output ltx2_i2v_diff.mp4

diffusers:

import torch
from diffusers.pipelines.ltx2 import LTX2ImageToVideoPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.utils import load_image

generator = torch.Generator("cuda").manual_seed(0)
pipe = LTX2ImageToVideoPipeline.from_pretrained("/workspace/models/Lightricks/LTX-2", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

image = load_image(
    "./astronaut.jpg"
)
prompt = "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a breath-taking, movie-like shot."
negative_prompt = "shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static."

frame_rate = 24.0
video, audio = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=40,
    guidance_scale=4.0,
    output_type="np",
    generator=generator,
    return_dict=False,
)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_i2v.mp4",
)

Test Result

t2v:

ltx2_t2v_diff.mp4

i2v:

ltx2_i2v_diff.mp4

Speedups on A100-80G (height=256, width=384; baseline 39 s):

  • cache-dit: 39 s -> 26 s
  • ulysses_degree 2: 39 s -> 38 s
  • ring_degree 2: 39 s -> 38 s
  • cfg 2: 39 s -> 29 s
  • tp 2: 39 s -> 38 s


Checklist

LTX-2

  • Accuracy alignment achieved
  • Support TI2V inference
  • Ensure structural and code-style consistency across modules
  • Support joint audio generation
  • Support SP
  • Support TP
  • Support CFG parallel
  • Validate Cache-DiT
  • Validate CPU offloading
  • Clean up code
  • Address review comments

LTX-2 follow-up PRs:

  • Two-stage pipeline
  • Text encoder (Gemma 3): support TP (tensor parallelism)
  • Performance optimization

@david6666666 david6666666 force-pushed the ltx2 branch 6 times, most recently from cb1a09e to 3f3a885 Compare January 21, 2026 09:17
@david6666666 david6666666 added this to the v0.14.0 milestone Jan 26, 2026
@david6666666 david6666666 force-pushed the ltx2 branch 4 times, most recently from 5c4a679 to 72bb6c8 Compare January 27, 2026 08:59
@david6666666 david6666666 marked this pull request as ready for review January 27, 2026 09:33
@david6666666
Collaborator Author

@ZJY0516 @SamitHuang @wtomin ptal, thx


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 346be1b2ba


Comment on lines 544 to 552
sp_size = getattr(self.od_config.parallel_config, "sequence_parallel_size", 1)
if sp_size > 1 and latent_length < sp_size:
    pad_len = sp_size - latent_length
    if latents is not None:
        pad_shape = list(latents.shape)
        pad_shape[2] = pad_len
        padding = torch.zeros(pad_shape, dtype=latents.dtype, device=latents.device)
        latents = torch.cat([latents, padding], dim=2)
    latent_length = sp_size


P2: Pad audio latents for sequence-parallel sharding

When sequence_parallel_size > 1, the LTX2 transformer shards audio_hidden_states with SequenceParallelInput (auto-pad is off), so the sequence length must be evenly divisible across ranks. Here prepare_audio_latents only pads when latent_length < sp_size, but it does nothing when latent_length is larger yet not divisible (e.g., default 121 frames @ 24fps → latent_length≈126, sp_size=4). That yields uneven shards and will fail during all‑gather or produce mismatched audio in SP runs. Consider padding latent_length up to the next multiple of sp_size (or enabling auto‑pad in the SP plan) instead of only handling the < sp_size case.
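
A minimal sketch of the suggested fix (hypothetical helper name; assumes the same layout as the quoted snippet, with the temporal axis on dim=2):

import torch


def pad_audio_latents_for_sp(
    latents: torch.Tensor, latent_length: int, sp_size: int
) -> tuple[torch.Tensor, int]:
    # Round the temporal length up to the next multiple of sp_size so every
    # sequence-parallel rank receives an equally sized shard.
    if sp_size > 1 and latent_length % sp_size != 0:
        pad_len = sp_size - (latent_length % sp_size)
        pad_shape = list(latents.shape)
        pad_shape[2] = pad_len
        padding = torch.zeros(pad_shape, dtype=latents.dtype, device=latents.device)
        latents = torch.cat([latents, padding], dim=2)
        latent_length += pad_len
    return latents, latent_length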


Contributor

Copilot AI left a comment


Pull request overview

This pull request adds comprehensive support for the LTX-2 (Lightricks) text-to-video and image-to-video models with integrated audio generation capabilities, aligning with the diffusers library implementation (PR #12915).

Changes:

  • Implements LTX2 text-to-video and image-to-video pipelines with joint audio generation
  • Adds LTX2VideoTransformer3DModel with audio-video cross-attention blocks
  • Integrates cache-dit support for LTX2 transformer blocks
  • Extends example scripts to handle audio output alongside video frames

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.

Summary per file:

  • vllm_omni/diffusion/models/ltx2/pipeline_ltx2.py: Core LTX2 text-to-video pipeline with audio generation support
  • vllm_omni/diffusion/models/ltx2/pipeline_ltx2_image2video.py: LTX2 image-to-video pipeline with conditioning mask and audio
  • vllm_omni/diffusion/models/ltx2/ltx2_transformer.py: Audio-visual transformer with a2v/v2a cross-attention blocks and RoPE
  • vllm_omni/diffusion/models/ltx2/__init__.py: Module exports for LTX2 components
  • vllm_omni/diffusion/registry.py: Registers LTX2 pipeline classes and post-processing functions
  • vllm_omni/diffusion/request.py: Adds audio_latents, frame_rate, output_type, and decode parameters
  • vllm_omni/diffusion/diffusion_engine.py: Extends engine to extract and route audio payloads from dict outputs
  • vllm_omni/entrypoints/omni_diffusion.py: Allows model_class_name override for custom pipeline selection
  • vllm_omni/entrypoints/async_omni_diffusion.py: Allows model_class_name override in async entrypoint
  • vllm_omni/diffusion/cache/cache_dit_backend.py: Adds cache-dit support for LTX2 transformer blocks
  • examples/offline_inference/text_to_video/text_to_video.py: Enhanced to handle LTX2 audio+video output and encode_video export
  • examples/offline_inference/text_to_video/text_to_video.md: Documents LTX2 usage example with frame_rate and audio_sample_rate
  • examples/offline_inference/image_to_video/image_to_video.py: Enhanced for LTX2 I2V with audio output and model class override
Comments suppressed due to low confidence (1)

examples/offline_inference/text_to_video/text_to_video.py:100

  • This assignment to 'parallel_config' is unnecessary as it is redefined before this value is used.
    parallel_config = DiffusionParallelConfig(


Comment on lines +39 to +52
class LTX2ImageToVideoPipeline(LTX2Pipeline):
    support_image_input = True

Copilot AI Jan 28, 2026


The LTX2ImageToVideoPipeline should also inherit from SupportAudioOutput and declare support_audio_output = True. Although it inherits support_image_input from the pattern in the codebase, it also produces audio output like its parent LTX2Pipeline.

Both class variables should be declared:

  • support_image_input = True (already present)
  • support_audio_output = True (missing)

And the class should inherit from both protocols:
class LTX2ImageToVideoPipeline(LTX2Pipeline, SupportAudioOutput):

Note: Once LTX2Pipeline properly inherits from SupportAudioOutput, this class will inherit it automatically, but it's clearer to be explicit about all supported interfaces.

Suggested change:

-class LTX2ImageToVideoPipeline(LTX2Pipeline):
-    support_image_input = True
+class LTX2ImageToVideoPipeline(LTX2Pipeline, SupportAudioOutput):
+    support_image_input = True
+    support_audio_output = True

Comment on lines 105 to 169
# Configure parallel settings (only SP is supported for Wan)
# Note: cfg_parallel and tensor_parallel are not implemented for Wan models
parallel_config = DiffusionParallelConfig(
    ulysses_degree=args.ulysses_degree,
    ring_degree=args.ring_degree,
)

Copilot AI Jan 28, 2026


The parallel_config is defined twice with identical content (lines 100-103 and lines 107-110). This is redundant code duplication. Remove one of these duplicate blocks.

The comment also mentions "only SP is supported for Wan" which may not be accurate for all models in this script (e.g., LTX2).

num_inference_steps=args.num_inference_steps,
num_frames=args.num_frames,
frame_rate=frame_rate,
enable_cpu_offload=True,

Copilot AI Jan 28, 2026


The enable_cpu_offload parameter is hardcoded to True in the generate call, but it should respect the command-line argument args.enable_cpu_offload. This overrides the user's choice and always enables CPU offloading.

Change to: enable_cpu_offload=args.enable_cpu_offload,

Suggested change:

-    enable_cpu_offload=True,
+    enable_cpu_offload=args.enable_cpu_offload,

return mu


class LTX2Pipeline(nn.Module):

Copilot AI Jan 28, 2026


The LTX2Pipeline class should inherit from SupportAudioOutput and declare support_audio_output = True as a class variable. This is necessary for the diffusion engine to properly identify that this pipeline produces audio output and handle it correctly.

The pattern is established in other audio-producing pipelines like StableAudioPipeline (see vllm_omni/diffusion/models/stable_audio/pipeline_stable_audio.py:61). Without this, the supports_audio_output() check in diffusion_engine.py:32-36 will return False, causing audio output to be incorrectly handled.

Add the import: from vllm_omni.diffusion.models.interface import SupportAudioOutput
And update the class declaration to: class LTX2Pipeline(nn.Module, SupportAudioOutput):
Then add: support_audio_output = True as a class variable.
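
A minimal sketch consolidating the three changes described above (the import path and attribute names are taken from this comment, not verified against the repo):

from torch import nn

from vllm_omni.diffusion.models.interface import SupportAudioOutput


class LTX2Pipeline(nn.Module, SupportAudioOutput):
    # Lets supports_audio_output() in diffusion_engine.py detect this pipeline.
    support_audio_output = True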

Comment on lines +375 to +397
width,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,

Copilot AI Jan 28, 2026


Overridden method signature does not match the call, where it is passed too many arguments. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
Overridden method signature does not match the call, where it is passed an argument named 'image'. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
Overridden method signature does not match the call, where it is passed an argument named 'latents'. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.

Suggested change:

     width,
+    image=None,
+    latents=None,
     prompt_embeds=None,
     negative_prompt_embeds=None,
     prompt_attention_mask=None,
     negative_prompt_attention_mask=None,
+    **kwargs,

dtype: torch.dtype | None = None,
device: torch.device | None = None,
generator: torch.Generator | None = None,
latents: torch.Tensor | None = None,

Copilot AI Jan 28, 2026


Overridden method signature does not match the call, where it is passed too many arguments. The overriding method LTX2ImageToVideoPipeline.prepare_latents matches the call.

Suggested change:

     latents: torch.Tensor | None = None,
+    *args: Any,
+    **kwargs: Any,

Comment on lines +129 to +152
def check_inputs(
    self,
    image,
    height,
    width,
    prompt,
    latents=None,
    prompt_embeds=None,
    negative_prompt_embeds=None,
    prompt_attention_mask=None,
    negative_prompt_attention_mask=None,
):

Copilot AI Jan 28, 2026


This method requires at least 5 positional arguments, whereas overridden LTX2Pipeline.check_inputs may be called with 4. This call correctly calls the base method, but does not match the signature of the overriding method.

Comment on lines 232 to 233
except Exception:
    pass

Copilot AI Jan 28, 2026


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change:

-except Exception:
-    pass
+except Exception as exc:  # noqa: BLE001
+    # If ring-parallel utilities are unavailable or misconfigured,
+    # fall back to using the unsharded attention_mask.
+    logger.debug(
+        "Failed to shard attention mask for sequence parallelism; "
+        "continuing without sharding: %s",
+        exc,
+    )

@@ -2,11 +2,12 @@
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

"""
Contributor


Update this model's name in docs/models/supported_models.md, and if acceleration methods are applicable, also update it in docs/user_guide/diffusion/diffusion_acceleration.md and docs/user_guide/diffusion/parallelism_acceleration.md.

@david6666666 david6666666 removed this from the v0.14.0 milestone Jan 28, 2026
@david6666666 david6666666 linked an issue Jan 29, 2026 that may be closed by this pull request
1 task
@david6666666 david6666666 force-pushed the ltx2 branch 6 times, most recently from c2dc5df to 84e0305 Compare February 2, 2026 09:17
@hsliuustc0106 hsliuustc0106 requested a review from Copilot February 2, 2026 11:42
- `--vae_use_slicing`: Enable VAE slicing for memory optimization.
- `--vae_use_tiling`: Enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--tensor_parallel_size`: tensor parallel size (effective for models that support TP, e.g. LTX2).
Collaborator


How about the other inference examples?



class LTX2VideoTransformer3DModel(
    ModelMixin, ConfigMixin, AttentionMixin, FromOriginalModelMixin, PeftAdapterMixin, CacheMixin
Contributor


Remove diffusers' Mixin classes, because they are not needed.

torch.distributed.all_reduce(tensor)

def forward(self, x: torch.Tensor) -> torch.Tensor:
    x_dtype = x.dtype
Contributor


In other models, like z_image and flux, they simply use vLLM's original RMSNorm layer.

I am wondering why TensorParallelRMSNorm is required in this model?

Collaborator Author

@david6666666 david6666666 Feb 4, 2026


Added a note:
RMSNorm that computes stats across TP shards for q/k norm.
LTX2 uses qk_norm="rms_norm_across_heads" while Q/K are tensor-parallel
sharded. A local RMSNorm would compute statistics on only the local shard,
which changes the normalization when TP > 1. We all-reduce the squared
sum to match the global RMS across all heads.
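
For illustration only, a minimal sketch of this idea (hypothetical class and argument names, not the PR's actual implementation): the local sum of squares is all-reduced across the tensor-parallel group so the RMS statistic matches the unsharded computation.

import torch
import torch.distributed as dist
from torch import nn


class TensorParallelRMSNormSketch(nn.Module):
    def __init__(self, local_dim: int, tp_size: int, eps: float = 1e-6, tp_group=None):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(local_dim))
        self.eps = eps
        self.global_dim = local_dim * tp_size  # full (unsharded) channel count
        self.tp_group = tp_group

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_dtype = x.dtype
        x = x.float()
        # Sum of squares over this rank's shard of the channel dimension.
        sq_sum = x.pow(2).sum(dim=-1, keepdim=True)
        # Accumulate across TP ranks so the mean covers all heads.
        if dist.is_initialized() and self.tp_group is not None:
            dist.all_reduce(sq_sum, op=dist.ReduceOp.SUM, group=self.tp_group)
        rms = torch.rsqrt(sq_sum / self.global_dim + self.eps)
        return (x * rms * self.weight).to(dtype=x_dtype)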


layers: list[nn.Module] = [
    ColumnParallelApproxGELU(dim, inner_dim, approximate="tanh", bias=bias),
    nn.Dropout(dropout),
Collaborator


There is no dropout during inference. Perhaps using nn.Identity would be better if we need a placeholder.

return out.to(dtype=x_dtype)


class LTX2AudioVideoAttnProcessor:
Collaborator


Could we refactor this? It's a little messy now.

Collaborator Author


Done

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.



Comment on lines 119 to 145
if supports_audio_output(self.od_config.model_class_name):
    audio_payload = outputs[0] if len(outputs) == 1 else outputs
    return [
        OmniRequestOutput.from_diffusion(
            request_id=request_id,
            images=[],
            prompt=prompt,
            metrics=metrics,
            latents=output.trajectory_latents,
            multimodal_output={"audio": audio_payload},
            final_output_type="audio",
        ),
    ]
else:
    mm_output = {}
    if audio_payload is not None:
        mm_output["audio"] = audio_payload
    return [
        OmniRequestOutput.from_diffusion(
            request_id=request_id,
            images=outputs,
            prompt=prompt,
            metrics=metrics,
            latents=output.trajectory_latents,
            multimodal_output=mm_output,
        ),
    ]

Copilot AI Feb 6, 2026


Logic inconsistency in audio handling. When supports_audio_output() returns False (line 119), the code falls through to line 133 where it tries to use audio_payload extracted from the dict at line 99. However, this means models that return audio via dict (like LTX2) would be classified as not supporting audio output (due to missing class attribute) but would still have their audio handled here. This creates confusion about which code path handles audio. Consider clarifying the distinction between models that return ONLY audio (audio_output=True, final_output_type="audio") vs models that return video+audio (using dict with both).

sample (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
The hidden states output conditioned on the `encoder_hidden_states` input, representing the visual output
of the model. This is typically a video (spatiotemporal) output.
audio_sample (`torch.Tensor` of shape `(batch_size, TODO)`):

Copilot AI Feb 6, 2026


Incomplete TODO in docstring. The shape documentation for audio_sample is incomplete with "TODO" placeholder. Should specify the actual shape, likely (batch_size, audio_channels, audio_length) or similar based on the LTX2 audio VAE output.

Suggested change:

-    audio_sample (`torch.Tensor` of shape `(batch_size, TODO)`):
+    audio_sample (`torch.Tensor` of shape `(batch_size, audio_channels, audio_length)`):

Comment on lines +27 to +46
)


def _unwrap_request_tensor(value: Any) -> Any:
    if isinstance(value, list):
        return value[0] if value else None
    return value


def _get_prompt_field(prompt: Any, key: str) -> Any:
    if isinstance(prompt, str):
        return None
    value = prompt.get(key)
    if value is None:
        additional = prompt.get("additional_information")
        if isinstance(additional, dict):
            value = additional.get(key)
    return _unwrap_request_tensor(value)



Copilot AI Feb 6, 2026


Code duplication: The helper functions _unwrap_request_tensor and _get_prompt_field are duplicated in both pipeline_ltx2.py (lines 86-100) and pipeline_ltx2_image2video.py (lines 30-44). These should be moved to a shared utility module to avoid maintenance issues and ensure consistency.

Suggested change:

-)
-def _unwrap_request_tensor(value: Any) -> Any:
-    if isinstance(value, list):
-        return value[0] if value else None
-    return value
-def _get_prompt_field(prompt: Any, key: str) -> Any:
-    if isinstance(prompt, str):
-        return None
-    value = prompt.get(key)
-    if value is None:
-        additional = prompt.get("additional_information")
-        if isinstance(additional, dict):
-            value = additional.get(key)
-    return _unwrap_request_tensor(value)
+    _unwrap_request_tensor,
+    _get_prompt_field,
+)

Comment on lines +227 to +275
audio = None
if isinstance(frames, list):
    frames = frames[0] if frames else None

# Check if it's an OmniRequestOutput
if hasattr(first_item, "final_output_type"):
    if first_item.final_output_type != "image":
        raise ValueError(
            f"Unexpected output type '{first_item.final_output_type}', expected 'image' for video generation."
        )

    # Pipeline mode: extract from nested request_output
    if hasattr(first_item, "is_pipeline_output") and first_item.is_pipeline_output:
        if isinstance(first_item.request_output, list) and len(first_item.request_output) > 0:
            inner_output = first_item.request_output[0]
            if isinstance(inner_output, OmniRequestOutput) and hasattr(inner_output, "images"):
                frames = inner_output.images[0] if inner_output.images else None
        if frames is None:
            raise ValueError("No video frames found in output.")
    # Diffusion mode: use direct images field
    elif hasattr(first_item, "images") and first_item.images:
        frames = first_item.images
if isinstance(frames, OmniRequestOutput):
    if frames.final_output_type != "image":
        raise ValueError(
            f"Unexpected output type '{frames.final_output_type}', expected 'image' for video generation."
        )
    if frames.multimodal_output and "audio" in frames.multimodal_output:
        audio = frames.multimodal_output["audio"]
    if frames.is_pipeline_output and frames.request_output is not None:
        inner_output = frames.request_output
        if isinstance(inner_output, list):
            inner_output = inner_output[0] if inner_output else None
        if isinstance(inner_output, OmniRequestOutput):
            if inner_output.multimodal_output and "audio" in inner_output.multimodal_output:
                audio = inner_output.multimodal_output["audio"]
            frames = inner_output
if isinstance(frames, OmniRequestOutput):
    if frames.images:
        if len(frames.images) == 1 and isinstance(frames.images[0], tuple) and len(frames.images[0]) == 2:
            frames, audio = frames.images[0]
        elif len(frames.images) == 1 and isinstance(frames.images[0], dict):
            audio = frames.images[0].get("audio")
            frames = frames.images[0].get("frames") or frames.images[0].get("video")
        else:
            frames = frames.images
    else:
        raise ValueError("No video frames found in OmniRequestOutput.")

if isinstance(frames, list) and frames:
    first_item = frames[0]
    if isinstance(first_item, tuple) and len(first_item) == 2:
        frames, audio = first_item
    elif isinstance(first_item, dict):
        audio = first_item.get("audio")
        frames = first_item.get("frames") or first_item.get("video")
    elif isinstance(first_item, list):
        frames = first_item

if isinstance(frames, tuple) and len(frames) == 2:
    frames, audio = frames
elif isinstance(frames, dict):
    audio = frames.get("audio")
    frames = frames.get("frames") or frames.get("video")

if frames is None:
    raise ValueError("No video frames found in output.")

Copilot AI Feb 6, 2026


Complex and fragile output unpacking logic. Lines 227-275 contain deeply nested conditionals to extract frames and audio from various possible output formats. This is brittle and hard to maintain. Consider creating a dedicated helper function or class to standardize output format handling, possibly in a shared utility module. The same complex logic is also duplicated in image_to_video.py lines 303-351.
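
As a rough illustration of that suggestion, a hypothetical helper that normalizes only the plain-container cases (tuples, dicts, and single-element lists); unwrapping OmniRequestOutput would remain with the caller:

from typing import Any


def extract_frames_and_audio(item: Any) -> tuple[Any, Any]:
    # Normalize (frames, audio) tuples, {"frames"/"video", "audio"} dicts,
    # and single-element lists of either into a (frames, audio) pair.
    audio = None
    if isinstance(item, list) and item:
        item = item[0]
    if isinstance(item, tuple) and len(item) == 2:
        frames, audio = item
    elif isinstance(item, dict):
        audio = item.get("audio")
        frames = item.get("frames") or item.get("video")
    else:
        frames = item
    return frames, audio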

Comment on lines +349 to +350
if isinstance(raw_image, str):
    raw_image = PIL.Image.open(raw_image).convert("RGB")

Copilot AI Feb 6, 2026


Potential security issue: File path from user input opened without validation. At line 350, if raw_image is a string, it's directly passed to PIL.Image.open() without any path validation or sanitization. This could allow path traversal attacks if user input isn't properly validated upstream. Consider adding path validation or restricting to safe directories.
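
A minimal sketch of one way to do that (hypothetical helper and allowed_dir parameter, not part of this PR):

from pathlib import Path

import PIL.Image


def open_image_safely(path_str: str, allowed_dir: str = ".") -> PIL.Image.Image:
    # Resolve the user-supplied path and require it to stay inside allowed_dir.
    base = Path(allowed_dir).resolve()
    path = Path(path_str).resolve()
    if not path.is_relative_to(base):
        raise ValueError(f"Refusing to open image outside {base}: {path}")
    return PIL.Image.open(path).convert("RGB")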

prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,

Copilot AI Feb 6, 2026


Overridden method signature does not match the call, where it is passed too many arguments. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
Overridden method signature does not match the call, where it is passed an argument named 'image'. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
Overridden method signature does not match the call, where it is passed an argument named 'latents'. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.

Suggested change:

     negative_prompt_attention_mask=None,
+    image=None,
+    latents=None,
+    **kwargs,

Comment on lines +143 to +151
image,
height,
width,
prompt,
latents=None,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,

Copilot AI Feb 6, 2026


This method requires at least 5 positional arguments, whereas overridden LTX2Pipeline.check_inputs may be called with 4. This call correctly calls the base method, but does not match the signature of the overriding method.

Suggested change
image,
height,
width,
prompt,
latents=None,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
prompt,
height,
width,
image=None,
latents=None,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
*args,
**kwargs,

Signed-off-by: David Chen <530634352@qq.com>
@david6666666
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 69ee1fcc6e


conditioning_mask[:, :, 0] = 1.0

noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
latents = init_latents * conditioning_mask + noise * (1 - conditioning_mask)


P1: Repeat I2V init latents per output before mixing with noise

prepare_latents computes batch_size as prompts * num_videos_per_prompt, but init_latents is built only from the input images (one latent per prompt). When batching multiple prompts with num_outputs_per_prompt > 1, init_latents has shape [num_prompts, ...] while conditioning_mask/noise use [num_prompts * num_outputs_per_prompt, ...], so init_latents * conditioning_mask cannot broadcast correctly and generation fails. This path needs to duplicate image latents per requested output (e.g., repeat-interleave by num_videos_per_prompt) before the blend.
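
A minimal sketch of the fix (hypothetical function and shapes, reusing the variable names from the description above):

import torch


def blend_init_latents(
    init_latents: torch.Tensor,       # [num_prompts, ...] image latents
    conditioning_mask: torch.Tensor,  # [num_prompts * num_videos_per_prompt, ...]
    noise: torch.Tensor,              # same batch size as conditioning_mask
    num_videos_per_prompt: int,
) -> torch.Tensor:
    # Duplicate each prompt's image latents so the batch dimension matches the
    # mask/noise batch before blending.
    if num_videos_per_prompt > 1:
        init_latents = init_latents.repeat_interleave(num_videos_per_prompt, dim=0)
    return init_latents * conditioning_mask + noise * (1 - conditioning_mask)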


Collaborator Author


Fixed

Signed-off-by: David Chen <530634352@qq.com>
@david6666666 david6666666 added the ready label to trigger buildkite CI label Feb 6, 2026

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[New Model]: Lightricks/LTX-2

4 participants