[Model] Support LTX-2 text-to-video and image-to-video #841
david6666666 wants to merge 17 commits into vllm-project:main
Conversation
@ZJY0516 @SamitHuang @wtomin PTAL, thanks.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 346be1b2ba
sp_size = getattr(self.od_config.parallel_config, "sequence_parallel_size", 1)
if sp_size > 1 and latent_length < sp_size:
    pad_len = sp_size - latent_length
    if latents is not None:
        pad_shape = list(latents.shape)
        pad_shape[2] = pad_len
        padding = torch.zeros(pad_shape, dtype=latents.dtype, device=latents.device)
        latents = torch.cat([latents, padding], dim=2)
    latent_length = sp_size
Pad audio latents for sequence-parallel sharding
When sequence_parallel_size > 1, the LTX2 transformer shards audio_hidden_states with SequenceParallelInput (auto-pad is off), so the sequence length must be evenly divisible across ranks. Here prepare_audio_latents only pads when latent_length < sp_size, but it does nothing when latent_length is larger yet not divisible (e.g., default 121 frames @ 24fps → latent_length≈126, sp_size=4). That yields uneven shards and will fail during all‑gather or produce mismatched audio in SP runs. Consider padding latent_length up to the next multiple of sp_size (or enabling auto‑pad in the SP plan) instead of only handling the < sp_size case.
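For illustration, a minimal sketch of padding up to the next multiple of sp_size (helper name and signature are made up; the real fix may instead enable auto-pad in the SP plan):

```python
import torch


def pad_to_multiple(latents: torch.Tensor | None, latent_length: int, sp_size: int):
    """Pad the audio latent sequence so it divides evenly across SP ranks."""
    if sp_size <= 1:
        return latents, latent_length
    # Round latent_length up to the next multiple of sp_size.
    target_len = ((latent_length + sp_size - 1) // sp_size) * sp_size
    pad_len = target_len - latent_length
    if pad_len > 0 and latents is not None:
        pad_shape = list(latents.shape)
        pad_shape[2] = pad_len
        padding = torch.zeros(pad_shape, dtype=latents.dtype, device=latents.device)
        latents = torch.cat([latents, padding], dim=2)
    return latents, target_len
```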
Pull request overview
This pull request adds comprehensive support for the LTX-2 (Lightricks) text-to-video and image-to-video models with integrated audio generation capabilities, aligning with the diffusers library implementation (PR #12915).
Changes:
- Implements LTX2 text-to-video and image-to-video pipelines with joint audio generation
- Adds LTX2VideoTransformer3DModel with audio-video cross-attention blocks
- Integrates cache-dit support for LTX2 transformer blocks
- Extends example scripts to handle audio output alongside video frames
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| vllm_omni/diffusion/models/ltx2/pipeline_ltx2.py | Core LTX2 text-to-video pipeline with audio generation support |
| vllm_omni/diffusion/models/ltx2/pipeline_ltx2_image2video.py | LTX2 image-to-video pipeline with conditioning mask and audio |
| vllm_omni/diffusion/models/ltx2/ltx2_transformer.py | Audio-visual transformer with a2v/v2a cross-attention blocks and RoPE |
| vllm_omni/diffusion/models/ltx2/__init__.py | Module exports for LTX2 components |
| vllm_omni/diffusion/registry.py | Registers LTX2 pipeline classes and post-processing functions |
| vllm_omni/diffusion/request.py | Adds audio_latents, frame_rate, output_type, and decode parameters |
| vllm_omni/diffusion/diffusion_engine.py | Extends engine to extract and route audio payloads from dict outputs |
| vllm_omni/entrypoints/omni_diffusion.py | Allows model_class_name override for custom pipeline selection |
| vllm_omni/entrypoints/async_omni_diffusion.py | Allows model_class_name override in async entrypoint |
| vllm_omni/diffusion/cache/cache_dit_backend.py | Adds cache-dit support for LTX2 transformer blocks |
| examples/offline_inference/text_to_video/text_to_video.py | Enhanced to handle LTX2 audio+video output and encode_video export |
| examples/offline_inference/text_to_video/text_to_video.md | Documents LTX2 usage example with frame_rate and audio_sample_rate |
| examples/offline_inference/image_to_video/image_to_video.py | Enhanced for LTX2 I2V with audio output and model class override |
Comments suppressed due to low confidence (1)
examples/offline_inference/text_to_video/text_to_video.py:100
- This assignment to 'parallel_config' is unnecessary as it is redefined before this value is used.
parallel_config = DiffusionParallelConfig(
class LTX2ImageToVideoPipeline(LTX2Pipeline):
    support_image_input = True
The LTX2ImageToVideoPipeline should also inherit from SupportAudioOutput and declare support_audio_output = True. Although it inherits support_image_input from the pattern in the codebase, it also produces audio output like its parent LTX2Pipeline.
Both class variables should be declared:
- support_image_input = True (already present)
- support_audio_output = True (missing)
And the class should inherit from both protocols:
class LTX2ImageToVideoPipeline(LTX2Pipeline, SupportAudioOutput):
Note: Once LTX2Pipeline properly inherits from SupportAudioOutput, this class will inherit it automatically, but it's clearer to be explicit about all supported interfaces.
Suggested change:

class LTX2ImageToVideoPipeline(LTX2Pipeline, SupportAudioOutput):
    support_image_input = True
    support_audio_output = True
# Configure parallel settings (only SP is supported for Wan)
# Note: cfg_parallel and tensor_parallel are not implemented for Wan models
parallel_config = DiffusionParallelConfig(
    ulysses_degree=args.ulysses_degree,
    ring_degree=args.ring_degree,
)
The parallel_config is defined twice with identical content (lines 100-103 and lines 107-110). This is redundant code duplication. Remove one of these duplicate blocks.
The comment also mentions "only SP is supported for Wan" which may not be accurate for all models in this script (e.g., LTX2).
num_inference_steps=args.num_inference_steps,
num_frames=args.num_frames,
frame_rate=frame_rate,
enable_cpu_offload=True,
The enable_cpu_offload parameter is hardcoded to True in the generate call, but it should respect the command-line argument args.enable_cpu_offload. This overrides the user's choice and always enables CPU offloading.
Change to: enable_cpu_offload=args.enable_cpu_offload,
Suggested change:

enable_cpu_offload=args.enable_cpu_offload,
    return mu


class LTX2Pipeline(nn.Module):
The LTX2Pipeline class should inherit from SupportAudioOutput and declare support_audio_output = True as a class variable. This is necessary for the diffusion engine to properly identify that this pipeline produces audio output and handle it correctly.
The pattern is established in other audio-producing pipelines like StableAudioPipeline (see vllm_omni/diffusion/models/stable_audio/pipeline_stable_audio.py:61). Without this, the supports_audio_output() check in diffusion_engine.py:32-36 will return False, causing audio output to be incorrectly handled.
Add the import: from vllm_omni.diffusion.models.interface import SupportAudioOutput
And update the class declaration to: class LTX2Pipeline(nn.Module, SupportAudioOutput):
Then add: support_audio_output = True as a class variable.
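Putting the three pieces together, a minimal sketch of the suggested declaration (assuming the import path named above; the rest of the class is unchanged):

```python
from torch import nn

from vllm_omni.diffusion.models.interface import SupportAudioOutput


class LTX2Pipeline(nn.Module, SupportAudioOutput):
    # Class flag checked by supports_audio_output() in diffusion_engine.py.
    support_audio_output = True
    # ... rest of the pipeline unchanged
```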
width,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
Overridden method signature does not match the call, which passes too many arguments. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
Overridden method signature does not match the call, which passes an argument named 'image'. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
Overridden method signature does not match the call, which passes an argument named 'latents'. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
Suggested change:

width,
image=None,
latents=None,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
**kwargs,
dtype: torch.dtype | None = None,
device: torch.device | None = None,
generator: torch.Generator | None = None,
latents: torch.Tensor | None = None,
Overridden method signature does not match the call, which passes too many arguments. The overriding method LTX2ImageToVideoPipeline.prepare_latents matches the call.
Suggested change:

latents: torch.Tensor | None = None,
*args: Any,
**kwargs: Any,
def check_inputs(
    self,
    image,
    height,
    width,
    prompt,
    latents=None,
    prompt_embeds=None,
    negative_prompt_embeds=None,
    prompt_attention_mask=None,
    negative_prompt_attention_mask=None,
):
This method requires at least 5 positional arguments, whereas overridden LTX2Pipeline.check_inputs may be called with 4. This call correctly calls the base method, but does not match the signature of the overriding method.
except Exception:
    pass
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

except Exception as exc:  # noqa: BLE001
    # If ring-parallel utilities are unavailable or misconfigured,
    # fall back to using the unsharded attention_mask.
    logger.debug(
        "Failed to shard attention mask for sequence parallelism; "
        "continuing without sharding: %s",
        exc,
    )
@@ -2,11 +2,12 @@
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

"""
Update this model's name in docs/models/supported_models.md and, if acceleration methods are applicable, in docs/user_guide/diffusion/diffusion_acceleration.md and docs/user_guide/diffusion/parallelism_acceleration.md.
- `--vae_use_slicing`: Enable VAE slicing for memory optimization.
- `--vae_use_tiling`: Enable VAE tiling for memory optimization.
- `--cfg_parallel_size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](../../../docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--tensor_parallel_size`: tensor parallel size (effective for models that support TP, e.g. LTX2).
How about the other inference examples?
class LTX2VideoTransformer3DModel(
    ModelMixin, ConfigMixin, AttentionMixin, FromOriginalModelMixin, PeftAdapterMixin, CacheMixin
Remove diffusers' Mixin classes, because they are not needed.
        torch.distributed.all_reduce(tensor)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_dtype = x.dtype
Other models, like z_image and flux, simply use the original vLLM RMSNorm layer.
I am wondering why TensorParallelRMSNorm is required in this model?
Added a note:
RMSNorm that computes stats across TP shards for q/k norm. LTX2 uses qk_norm="rms_norm_across_heads" while Q/K are tensor-parallel sharded. A local RMSNorm would compute statistics on only the local shard, which changes the normalization when TP > 1. We all-reduce the squared sum to match the global RMS across all heads.
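For reference, a minimal sketch of that idea (class and argument names here are illustrative and may not match the PR's actual implementation):

```python
import torch
import torch.distributed as dist
from torch import nn


class TensorParallelRMSNorm(nn.Module):
    """RMSNorm whose statistics span all TP shards of the last dim (sketch)."""

    def __init__(self, local_dim: int, tp_world_size: int, eps: float = 1e-6,
                 tp_group: dist.ProcessGroup | None = None):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(local_dim))
        self.eps = eps
        self.tp_world_size = tp_world_size
        self.tp_group = tp_group

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_dtype = x.dtype
        x = x.float()
        # Local sum of squares over this rank's shard of the last dimension.
        sq_sum = x.pow(2).sum(dim=-1, keepdim=True)
        if self.tp_world_size > 1:
            # Combine with the other shards so the RMS covers the full width.
            dist.all_reduce(sq_sum, group=self.tp_group)
        global_dim = x.shape[-1] * self.tp_world_size
        x = x * torch.rsqrt(sq_sum / global_dim + self.eps)
        return (x * self.weight).to(dtype=x_dtype)
```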
layers: list[nn.Module] = [
    ColumnParallelApproxGELU(dim, inner_dim, approximate="tanh", bias=bias),
    nn.Dropout(dropout),
There is no dropout during inference. Perhaps using nn.Identity would be better if we need a placeholder.
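For example, a small sketch of that placeholder pattern (the helper name is made up):

```python
from torch import nn


def make_dropout(p: float) -> nn.Module:
    # Dropout is inert at inference time; Identity makes the no-op explicit
    # when p == 0 or the block is inference-only.
    return nn.Dropout(p) if p > 0.0 else nn.Identity()
```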
        return out.to(dtype=x_dtype)


class LTX2AudioVideoAttnProcessor:
Could we refactor this? It's a little messy now.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 9 comments.
if supports_audio_output(self.od_config.model_class_name):
    audio_payload = outputs[0] if len(outputs) == 1 else outputs
    return [
        OmniRequestOutput.from_diffusion(
            request_id=request_id,
            images=[],
            prompt=prompt,
            metrics=metrics,
            latents=output.trajectory_latents,
            multimodal_output={"audio": audio_payload},
            final_output_type="audio",
        ),
    ]
else:
    mm_output = {}
    if audio_payload is not None:
        mm_output["audio"] = audio_payload
    return [
        OmniRequestOutput.from_diffusion(
            request_id=request_id,
            images=outputs,
            prompt=prompt,
            metrics=metrics,
            latents=output.trajectory_latents,
            multimodal_output=mm_output,
        ),
    ]
Logic inconsistency in audio handling. When supports_audio_output() returns False (line 119), the code falls through to line 133 where it tries to use audio_payload extracted from the dict at line 99. However, this means models that return audio via dict (like LTX2) would be classified as not supporting audio output (due to missing class attribute) but would still have their audio handled here. This creates confusion about which code path handles audio. Consider clarifying the distinction between models that return ONLY audio (audio_output=True, final_output_type="audio") vs models that return video+audio (using dict with both).
sample (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
    The hidden states output conditioned on the `encoder_hidden_states` input, representing the visual output
    of the model. This is typically a video (spatiotemporal) output.
audio_sample (`torch.Tensor` of shape `(batch_size, TODO)`):
Incomplete TODO in docstring. The shape documentation for audio_sample is incomplete with "TODO" placeholder. Should specify the actual shape, likely (batch_size, audio_channels, audio_length) or similar based on the LTX2 audio VAE output.
Suggested change:

audio_sample (`torch.Tensor` of shape `(batch_size, audio_channels, audio_length)`):
)


def _unwrap_request_tensor(value: Any) -> Any:
    if isinstance(value, list):
        return value[0] if value else None
    return value


def _get_prompt_field(prompt: Any, key: str) -> Any:
    if isinstance(prompt, str):
        return None
    value = prompt.get(key)
    if value is None:
        additional = prompt.get("additional_information")
        if isinstance(additional, dict):
            value = additional.get(key)
    return _unwrap_request_tensor(value)
Code duplication: The helper functions _unwrap_request_tensor and _get_prompt_field are duplicated in both pipeline_ltx2.py (lines 86-100) and pipeline_ltx2_image2video.py (lines 30-44). These should be moved to a shared utility module to avoid maintenance issues and ensure consistency.
Suggested change: drop the duplicated definitions here and import the shared helpers instead:

    _unwrap_request_tensor,
    _get_prompt_field,
)
audio = None
if isinstance(frames, list):
    frames = frames[0] if frames else None

# Check if it's an OmniRequestOutput
if hasattr(first_item, "final_output_type"):
    if first_item.final_output_type != "image":
        raise ValueError(
            f"Unexpected output type '{first_item.final_output_type}', expected 'image' for video generation."
        )

# Pipeline mode: extract from nested request_output
if hasattr(first_item, "is_pipeline_output") and first_item.is_pipeline_output:
    if isinstance(first_item.request_output, list) and len(first_item.request_output) > 0:
        inner_output = first_item.request_output[0]
        if isinstance(inner_output, OmniRequestOutput) and hasattr(inner_output, "images"):
            frames = inner_output.images[0] if inner_output.images else None
    if frames is None:
        raise ValueError("No video frames found in output.")
# Diffusion mode: use direct images field
elif hasattr(first_item, "images") and first_item.images:
    frames = first_item.images

if isinstance(frames, OmniRequestOutput):
    if frames.final_output_type != "image":
        raise ValueError(
            f"Unexpected output type '{frames.final_output_type}', expected 'image' for video generation."
        )
    if frames.multimodal_output and "audio" in frames.multimodal_output:
        audio = frames.multimodal_output["audio"]
    if frames.is_pipeline_output and frames.request_output is not None:
        inner_output = frames.request_output
        if isinstance(inner_output, list):
            inner_output = inner_output[0] if inner_output else None
        if isinstance(inner_output, OmniRequestOutput):
            if inner_output.multimodal_output and "audio" in inner_output.multimodal_output:
                audio = inner_output.multimodal_output["audio"]
            frames = inner_output

if isinstance(frames, OmniRequestOutput):
    if frames.images:
        if len(frames.images) == 1 and isinstance(frames.images[0], tuple) and len(frames.images[0]) == 2:
            frames, audio = frames.images[0]
        elif len(frames.images) == 1 and isinstance(frames.images[0], dict):
            audio = frames.images[0].get("audio")
            frames = frames.images[0].get("frames") or frames.images[0].get("video")
        else:
            frames = frames.images
    else:
        raise ValueError("No video frames found in OmniRequestOutput.")

if isinstance(frames, list) and frames:
    first_item = frames[0]
    if isinstance(first_item, tuple) and len(first_item) == 2:
        frames, audio = first_item
    elif isinstance(first_item, dict):
        audio = first_item.get("audio")
        frames = first_item.get("frames") or first_item.get("video")
    elif isinstance(first_item, list):
        frames = first_item

if isinstance(frames, tuple) and len(frames) == 2:
    frames, audio = frames
elif isinstance(frames, dict):
    audio = frames.get("audio")
    frames = frames.get("frames") or frames.get("video")

if frames is None:
    raise ValueError("No video frames found in output.")
Complex and fragile output unpacking logic. Lines 227-275 contain deeply nested conditionals to extract frames and audio from various possible output formats. This is brittle and hard to maintain. Consider creating a dedicated helper function or class to standardize output format handling, possibly in a shared utility module. The same complex logic is also duplicated in image_to_video.py lines 303-351.
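As a rough sketch of what a shared helper could look like (the function name and the set of accepted payload shapes are assumptions drawn from the logic above, not an existing utility):

```python
from typing import Any


def split_frames_and_audio(item: Any) -> tuple[Any, Any]:
    """Best-effort extraction of (frames, audio) from one output payload."""
    audio = None
    # Unwrap single-element lists produced by batched generation.
    while isinstance(item, list) and len(item) == 1:
        item = item[0]
    if isinstance(item, tuple) and len(item) == 2:
        item, audio = item
    elif isinstance(item, dict):
        audio = item.get("audio")
        item = item.get("frames") or item.get("video")
    if item is None:
        raise ValueError("No video frames found in output.")
    return item, audio
```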
if isinstance(raw_image, str):
    raw_image = PIL.Image.open(raw_image).convert("RGB")
Potential security issue: File path from user input opened without validation. At line 350, if raw_image is a string, it's directly passed to PIL.Image.open() without any path validation or sanitization. This could allow path traversal attacks if user input isn't properly validated upstream. Consider adding path validation or restricting to safe directories.
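One possible validation sketch (the allowed-root policy and helper name are assumptions, not part of the PR):

```python
from pathlib import Path

import PIL.Image


def load_image_checked(path_str: str, allowed_root: str = ".") -> PIL.Image.Image:
    """Resolve the path and refuse anything outside the allowed root."""
    path = Path(path_str).resolve()
    root = Path(allowed_root).resolve()
    if not path.is_relative_to(root):
        raise ValueError(f"Image path {path} is outside the allowed directory {root}")
    return PIL.Image.open(path).convert("RGB")
```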
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
Overridden method signature does not match the call, which passes too many arguments. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
Overridden method signature does not match the call, which passes an argument named 'image'. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
Overridden method signature does not match the call, which passes an argument named 'latents'. The overriding method LTX2ImageToVideoPipeline.check_inputs matches the call.
Suggested change:

negative_prompt_attention_mask=None,
image=None,
latents=None,
**kwargs,
image,
height,
width,
prompt,
latents=None,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
This method requires at least 5 positional arguments, whereas overridden LTX2Pipeline.check_inputs may be called with 4. This call correctly calls the base method, but does not match the signature of the overriding method.
Suggested change:

prompt,
height,
width,
image=None,
latents=None,
prompt_embeds=None,
negative_prompt_embeds=None,
prompt_attention_mask=None,
negative_prompt_attention_mask=None,
*args,
**kwargs,
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 69ee1fcc6e
conditioning_mask[:, :, 0] = 1.0

noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
latents = init_latents * conditioning_mask + noise * (1 - conditioning_mask)
Repeat I2V init latents per output before mixing with noise
prepare_latents computes batch_size as prompts * num_videos_per_prompt, but init_latents is built only from the input images (one latent per prompt). When batching multiple prompts with num_outputs_per_prompt > 1, init_latents has shape [num_prompts, ...] while conditioning_mask/noise use [num_prompts * num_outputs_per_prompt, ...], so init_latents * conditioning_mask cannot broadcast correctly and generation fails. This path needs to duplicate image latents per requested output (e.g., repeat-interleave by num_videos_per_prompt) before the blend.
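A minimal sketch of the repeat-interleave idea (helper name is made up; shapes follow the description above):

```python
import torch


def expand_init_latents(init_latents: torch.Tensor, num_videos_per_prompt: int) -> torch.Tensor:
    """Repeat image latents once per requested output so they broadcast against
    conditioning_mask / noise of shape [num_prompts * num_videos_per_prompt, ...]."""
    if num_videos_per_prompt > 1:
        init_latents = init_latents.repeat_interleave(num_videos_per_prompt, dim=0)
    return init_latents
```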
Purpose
Support LTX-2 text-to-video and image-to-video; see huggingface/diffusers#12915 for the reference implementation.
Test Plan
t2v:
diffusers:
i2v:
diffusers:
Test Result
t2v:
ltx2_t2v_diff.mp4
i2v:
ltx2_i2v_diff.mp4
A100-80G, height=256, width=384:
- cache-dit: 39s -> 26s
- ulysses_degree 2: 39s -> 38s
- ring_degree 2: 39s -> 38s
- cfg 2: 39s -> 29s
- tp 2: 39s -> 38s
Checklist
LTX-2
LTX-2 follow-up PRs:
Essential Elements of an Effective PR Description Checklist
- Update supported_models.md and examples for a new model.