
Runtime-Driven ONNX Export for Diffusion Pipelines #118

Open
naomili0924 wants to merge 8 commits into huggingface:main from naomili0924:text_to_video_ort_pipeline

Conversation

naomili0924 (Contributor) commented on Feb 7, 2026

Motivation

Previously, exporting a text-to-video (or similar diffusion) pipeline to ONNX required:

  • Writing a dedicated OnnxConfig
  • Manually defining dummy inputs
  • Manually specifying dynamic axes
  • Hardcoding architecture-specific dimensions
  • Submitting a new PR for every new model architecture

I found exporting text-to-video pipelines this way time-consuming, and the approach does not scale for rapidly evolving diffusion pipelines.

This PR introduces a runtime-driven export mechanism integrated into ORTPipelineForText2Video.
Instead of relying on handcrafted OnnxConfig classes, export now works in two stages.

Inference-Based Dummy Input Tracing

The user provides real inference kwargs, for example:

inf_kwargs = {
    "prompt": prompt,
    "negative_prompt": negative_prompt,
    "height": 240,
    "width": 416,
    "num_frames": 21,
    "guidance_scale": 5.0
}

The model is executed:
output = model(**inf_kwargs).frames[0]

Dummy inputs are derived directly from real inference execution.
This ensures:

  • Correct shapes
  • Valid input signatures
  • No manual dummy tensor construction
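
As a rough illustration of the idea (not the exact code in this PR), tracing of this kind can be done with PyTorch forward pre-hooks: run the pipeline once with the real inf_kwargs and record the tensors each sub-module actually receives. The helper below is hypothetical and only a sketch.

import torch

def capture_dummy_inputs(module: torch.nn.Module):
    # Hypothetical helper: record the keyword arguments of the first real
    # forward call on `module` so they can be reused as ONNX export dummy inputs.
    # Positional args are ignored in this sketch for brevity.
    captured = {}

    def hook(mod, args, kwargs):
        if not captured:  # keep only the first call
            captured.update(
                {k: v.detach().cpu() if torch.is_tensor(v) else v for k, v in kwargs.items()}
            )

    handle = module.register_forward_pre_hook(hook, with_kwargs=True)
    return captured, handle

# Usage sketch, assuming a loaded Diffusers pipeline `pipe` and the real `inf_kwargs`:
# captured, handle = capture_dummy_inputs(pipe.transformer)
# _ = pipe(**inf_kwargs)   # one real inference pass
# handle.remove()
# dummy_inputs = captured  # shapes and dtypes come from actual execution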

Config-Guided Dynamic Axis Estimation
Dynamic axes are estimated using:
module_arch_fields = {
    "text_encoder": ["d_model", "vocab_size"],
    "transformer": ["in_channels", "text_dim"],
    "vae_decoder": ["base_dim", "z_dim"],
    "vae_encoder": ["base_dim", "z_dim"],
}

Instead of hardcoding shapes inside custom OnnxConfig classes, selected architectural fields from the model config are used to resolve dimensions (a sketch of the idea follows the list below).

This allows the exporter to:

  • Adapt to different architectures
  • Avoid per-model export logic
  • Maintain generality across pipelines
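
To make the idea concrete, here is a hedged sketch (a hypothetical helper, not the PR's actual implementation): any axis whose size matches a listed architectural config field is treated as a fixed architecture dimension, and every remaining axis (batch, frames, height, width, ...) is marked dynamic.

def estimate_dynamic_axes(dummy_inputs, config, arch_fields):
    # Hypothetical sketch: sizes that match an architectural config field are
    # considered static; all other axes are exported as dynamic.
    static_sizes = {getattr(config, f) for f in arch_fields if hasattr(config, f)}
    dynamic_axes = {}
    for name, tensor in dummy_inputs.items():
        if not hasattr(tensor, "shape"):
            continue  # skip non-tensor inputs
        axes = {
            dim: f"{name}_dim_{dim}"
            for dim, size in enumerate(tensor.shape)
            if size not in static_sizes
        }
        if axes:
            dynamic_axes[name] = axes
    return dynamic_axes

# e.g. estimate_dynamic_axes(dummy_inputs, pipe.transformer.config, ["in_channels", "text_dim"])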

Design Principle
✅ One Implementation → Multiple Pipelines
With this design, a single implementation successfully exports multiple text-to-video pipelines without requiring architecture-specific OnnxConfig classes.

Successfully exported and validated:

  • Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  • hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v

Both pipelines were exported using the same runtime-driven mechanism:

  • Dummy inputs traced from actual inference
  • Dynamic axes resolved via module_arch_fields
  • No custom per-architecture ONNX config required

This demonstrates that the approach generalizes across different diffusion architectures.

The following pipelines were tested but could not be exported due to upstream loading/runtime issues in DiffusionPipeline.from_pretrained:

  • THUDM/CogVideoX
  • genmoai/Mochi

The export logic itself does not appear to be the limiting factor.
The failure occurs during pipeline initialization, likely due to:

  • Incomplete or inconsistent model card configuration
  • Missing components in the Diffusers config
  • Mismatch between pipeline class and model metadata

Future Work:
1. Unify symbolic dynamic axis naming
Avoid defining dynamic axes independently per module and ensure consistent symbolic naming across components.

2. Model dynamic shape constraints
Handle dependent dimensions (e.g., a + b, 2 * frames) safely.
Without explicit constraints, changing dynamic inputs may break graphs where derived dimensions are used internally.

3. Add export equivalence validation
Compare PyTorch and ONNX Runtime outputs to ensure structural and numerical consistency after export.
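
For point 3, a minimal sketch of what such a check could look like (assuming the exported module returns a single tensor; the helper name is illustrative, not part of this PR):

import numpy as np
import torch

def check_export_equivalence(torch_module, ort_session, dummy_inputs, atol=1e-3, rtol=1e-3):
    # Run the same traced dummy inputs through PyTorch and ONNX Runtime
    # and compare the outputs numerically.
    with torch.no_grad():
        torch_out = torch_module(**dummy_inputs)
    ort_feeds = {k: v.cpu().numpy() for k, v in dummy_inputs.items() if torch.is_tensor(v)}
    ort_out = ort_session.run(None, ort_feeds)[0]
    np.testing.assert_allclose(torch_out.cpu().numpy(), ort_out, rtol=rtol, atol=atol)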

To export Hunyuan:

import torch
from diffusers.utils import export_to_video

from optimum.onnxruntime.modeling_diffusion import ORTPipelineForText2Video

model_id = "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v"
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]

device = "cuda:0"
seed=1

prompt = "A cat walks on the grass, realistic"
generator = torch.Generator(device=device).manual_seed(seed)
num_frames=5
num_inference_steps=50

inf_kwargs = {
    "prompt": prompt,
    "generator": generator,
    "num_frames": num_frames,
    "num_inference_steps": num_inference_steps,
}

module_arch_fields = {
    "text_encoder": ["hidden_size", "vocab_size"],
    "text_encoder_2": ["d_model", "vocab_size"],
    "transformer": ["in_channels", "out_channels","text_embed_dim", "text_embed_2_dim", "image_embed_dim"],
    "vae_encoder": ["in_channels", "out_channels", "latent_channels"],
    "vae_decoder": ["in_channels", "out_channels", "latent_channels"],
}

pipe = ORTPipelineForText2Video.from_pretrained(
    model_id,
    provider=providers[0],  # Force GPU
    torch_dtype=torch.float16,
    inf_kwargs=inf_kwargs,
    module_arch_fields=module_arch_fields,
)

output = pipe(**inf_kwargs).frames[0]
export_to_video(output, "output.mp4", fps=15)

To export Wan:


import torch
from diffusers.utils import export_to_video

from optimum.onnxruntime.modeling_diffusion import ORTPipelineForText2Video

wan_list = [
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    "ali-vilab/text-to-video-ms-1.7b",
]

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]

prompt = "A cat walks on the grass, realistic"
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

inf_kwargs = {
    "prompt": prompt,
    "negative_prompt": negative_prompt,
    "height": 240,
    "width":416,
    "num_frames": 21,
    "guidance_scale": 5.0
}

module_arch_fields = {
    "text_encoder": ["d_model", "vocab_size"],
    "transformer": ["in_channels", "text_dim"],
    "vae_decoder": ["base_dim", "z_dim"],
    "vae_encoder": ["base_dim", "z_dim"],
}


pipe = ORTPipelineForText2Video.from_pretrained(
    wan_list[0],
    provider=providers[0],  # Force GPU
    torch_dtype=torch.float16,
    inf_kwargs=inf_kwargs,
    module_arch_fields=module_arch_fields,
    export_by_inference=True,
)

print("Loaded successfully on:", pipe.device)
prompt = "A cat walks on the grass grass grass"
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

output = pipe(**inf_kwargs).frames[0]
export_to_video(output, "output.mp4", fps=15)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread on optimum/exporters/onnx/__main__.py (outdated):
original_task = task
task = TasksManager.map_from_synonym(task)

print(inf_kwargs)
Collaborator: Debugging lines?

naomili0924 (Author): sorry, it's not finished yet.

naomili0924 marked this pull request as draft on February 16, 2026 at 23:55
naomili0924 force-pushed the text_to_video_ort_pipeline branch 3 times, most recently from dc208fd to 21d667c, on February 17, 2026 at 06:57
naomili0924 force-pushed the text_to_video_ort_pipeline branch from 21d667c to d244942 on February 17, 2026 at 07:23
naomili0924 force-pushed the text_to_video_ort_pipeline branch 2 times, most recently from 2258fc4 to effdab0, on February 22, 2026 at 03:45
naomili0924 changed the title from "introduce text_to_video ort pipeline" to "Runtime-Driven ONNX Export for Diffusion Pipelines" on February 22, 2026
naomili0924 force-pushed the text_to_video_ort_pipeline branch 5 times, most recently from 33abf74 to 125212c, on February 22, 2026 at 07:36
naomili0924 force-pushed the text_to_video_ort_pipeline branch from 125212c to 5ccd6cd on February 22, 2026 at 07:39
naomili0924 requested a review from xadupre on February 22, 2026 at 07:45
naomili0924 marked this pull request as ready for review on February 22, 2026 at 07:47