
Runtime-Driven ONNX Export for Diffusion Pipelines #118

Open
naomili0924 wants to merge 8 commits into huggingface:main from naomili0924:text_to_video_ort_pipeline

Conversation

naomili0924 (Contributor) commented on Feb 7, 2026

Motivation

Previously, exporting a text-to-video (or similar diffusion) pipeline to ONNX required:

  • Writing a dedicated OnnxConfig
  • Manually defining dummy inputs
  • Manually specifying dynamic axes
  • Hardcoding architecture-specific dimensions
  • Submitting a new PR for every new model architecture

I found exporting text-to-video pipelines this way time-consuming, and the approach does not scale for rapidly evolving diffusion pipelines.

This PR introduces a runtime-driven export mechanism integrated into ORTPipelineForText2Video.
Instead of relying on handcrafted OnnxConfig classes, export now works in two stages.

Inference-Based Dummy Input Tracing

The user provides real inference kwargs, for example:

inf_kwargs = {
    "prompt": prompt,
    "negative_prompt": negative_prompt,
    "height": 240,
    "width": 416,
    "num_frames": 21,
    "guidance_scale": 5.0
}

The model is executed:
output = model(**inf_kwargs).frames[0]

Dummy inputs are derived directly from real inference execution.
This ensures:

  • Correct shapes
  • Valid input signatures
  • No manual dummy tensor construction
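
As a rough illustration of the idea (not the exact code in this PR), tracing of this kind can be done with PyTorch forward pre-hooks: run the pipeline once with the real inf_kwargs and record the tensors each sub-module actually receives. The helper below is hypothetical and only a sketch.

import torch

def capture_dummy_inputs(module: torch.nn.Module):
    # Hypothetical helper: record the keyword arguments of the first real
    # forward call on `module` so they can be reused as ONNX export dummy inputs.
    # Positional args are ignored in this sketch for brevity.
    captured = {}

    def hook(mod, args, kwargs):
        if not captured:  # keep only the first call
            captured.update(
                {k: v.detach().cpu() if torch.is_tensor(v) else v for k, v in kwargs.items()}
            )

    handle = module.register_forward_pre_hook(hook, with_kwargs=True)
    return captured, handle

# Usage sketch, assuming a loaded Diffusers pipeline `pipe` and the real `inf_kwargs`:
# captured, handle = capture_dummy_inputs(pipe.transformer)
# _ = pipe(**inf_kwargs)   # one real inference pass
# handle.remove()
# dummy_inputs = captured  # shapes and dtypes come from actual execution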

Config-Guided Dynamic Axis Estimation
Dynamic axes are estimated using:
module_arch_fields = {
    "text_encoder": ["d_model", "vocab_size"],
    "transformer": ["in_channels", "text_dim"],
    "vae_decoder": ["base_dim", "z_dim"],
    "vae_encoder": ["base_dim", "z_dim"],
}

Instead of hardcoding shapes inside custom OnnxConfig classes, selected architectural fields from the model config are used to resolve dimensions (a sketch of the idea follows the list below).

This allows the exporter to:

  • Adapt to different architectures
  • Avoid per-model export logic
  • Maintain generality across pipelines
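
To make the idea concrete, here is a hedged sketch (a hypothetical helper, not the PR's actual implementation): any axis whose size matches a listed architectural config field is treated as a fixed architecture dimension, and every remaining axis (batch, frames, height, width, ...) is marked dynamic.

def estimate_dynamic_axes(dummy_inputs, config, arch_fields):
    # Hypothetical sketch: sizes that match an architectural config field are
    # considered static; all other axes are exported as dynamic.
    static_sizes = {getattr(config, f) for f in arch_fields if hasattr(config, f)}
    dynamic_axes = {}
    for name, tensor in dummy_inputs.items():
        if not hasattr(tensor, "shape"):
            continue  # skip non-tensor inputs
        axes = {
            dim: f"{name}_dim_{dim}"
            for dim, size in enumerate(tensor.shape)
            if size not in static_sizes
        }
        if axes:
            dynamic_axes[name] = axes
    return dynamic_axes

# e.g. estimate_dynamic_axes(dummy_inputs, pipe.transformer.config, ["in_channels", "text_dim"])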

Design Principle
✅ One Implementation → Multiple Pipelines
With this design, a single implementation successfully exports multiple text-to-video pipelines without requiring architecture-specific OnnxConfig classes.

Successfully exported and validated:

  • Wan-AI/Wan2.1-T2V-1.3B-Diffusers
  • hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v

Both pipelines were exported using the same runtime-driven mechanism:

  • Dummy inputs traced from actual inference
  • Dynamic axes resolved via module_arch_fields
  • No custom per-architecture ONNX config required

This demonstrates that the approach generalizes across different diffusion architectures.

The following pipelines were tested but could not be exported due to upstream loading/runtime issues in DiffusionPipeline.from_pretrained:

  • THUDM/CogVideoX
  • genmoai/Mochi

The export logic itself does not appear to be the limiting factor.
The failure occurs during pipeline initialization, likely due to:

  • Incomplete or inconsistent model card configuration
  • Missing components in the Diffusers config
  • Mismatch between pipeline class and model metadata

Future Work:
1. Unify symbolic dynamic axis naming
Avoid defining dynamic axes independently per module and ensure consistent symbolic naming across components.

2. Model dynamic shape constraints
Handle dependent dimensions (e.g., a + b, 2 * frames) safely.
Without explicit constraints, changing dynamic inputs may break graphs where derived dimensions are used internally.

3. Add export equivalence validation
Compare PyTorch and ONNX Runtime outputs to ensure structural and numerical consistency after export.
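
For point 3, a minimal sketch of what such a check could look like (assuming the exported module returns a single tensor; the helper name is illustrative, not part of this PR):

import numpy as np
import torch

def check_export_equivalence(torch_module, ort_session, dummy_inputs, atol=1e-3, rtol=1e-3):
    # Run the same traced dummy inputs through PyTorch and ONNX Runtime
    # and compare the outputs numerically.
    with torch.no_grad():
        torch_out = torch_module(**dummy_inputs)
    ort_feeds = {k: v.cpu().numpy() for k, v in dummy_inputs.items() if torch.is_tensor(v)}
    ort_out = ort_session.run(None, ort_feeds)[0]
    np.testing.assert_allclose(torch_out.cpu().numpy(), ort_out, rtol=rtol, atol=atol)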

To export Hunyuan:

import torch
from diffusers.utils import export_to_video

from optimum.onnxruntime.modeling_diffusion import ORTPipelineForText2Video

model_id = "hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v"
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]

device = "cuda:0"
seed=1

prompt = "A cat walks on the grass, realistic"
generator = torch.Generator(device=device).manual_seed(seed)
num_frames=5
num_inference_steps=50

inf_kwargs = {
    "prompt": prompt,
    "generator": generator,
    "num_frames": num_frames,
    "num_inference_steps": num_inference_steps,
}

module_arch_fields = {
    "text_encoder": ["hidden_size", "vocab_size"],
    "text_encoder_2": ["d_model", "vocab_size"],
    "transformer": ["in_channels", "out_channels","text_embed_dim", "text_embed_2_dim", "image_embed_dim"],
    "vae_encoder": ["in_channels", "out_channels", "latent_channels"],
    "vae_decoder": ["in_channels", "out_channels", "latent_channels"],
}

pipe = ORTPipelineForText2Video.from_pretrained(
    model_id,
    provider=providers[0],  # Force GPU
    torch_dtype=torch.float16,
    inf_kwargs=inf_kwargs,
    module_arch_fields=module_arch_fields,
)

output = pipe(**inf_kwargs).frames[0]
export_to_video(output, "output.mp4", fps=15)

To export Wan:


import torch
from diffusers.utils import export_to_video

from optimum.onnxruntime.modeling_diffusion import ORTPipelineForText2Video

wan_list = [
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    "ali-vilab/text-to-video-ms-1.7b",
]

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]

prompt = "A cat walks on the grass, realistic"
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

inf_kwargs = {
    "prompt": prompt,
    "negative_prompt": negative_prompt,
    "height": 240,
    "width":416,
    "num_frames": 21,
    "guidance_scale": 5.0
}

module_arch_fields = {
    "text_encoder": ["d_model", "vocab_size"],
    "transformer": ["in_channels", "text_dim"],
    "vae_decoder": ["base_dim", "z_dim"],
    "vae_encoder": ["base_dim", "z_dim"],
}


pipe = ORTPipelineForText2Video.from_pretrained(
    wan_list[0],
    provider=providers[0],  # Force GPU
    torch_dtype=torch.float16,
    inf_kwargs=inf_kwargs,
    module_arch_fields=module_arch_fields,
    export_by_inference=True,
)

print("Loaded successfully on:", pipe.device)
prompt = "A cat walks on the grass grass grass"
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

output = pipe(**inf_kwargs).frames[0]
export_to_video(output, "output.mp4", fps=15)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread on optimum/exporters/onnx/__main__.py (outdated):
original_task = task
task = TasksManager.map_from_synonym(task)

print(inf_kwargs)
Collaborator: Debugging lines?

naomili0924 (Author): sorry, it's not finished yet.

naomili0924 marked this pull request as draft on February 16, 2026 at 23:55
naomili0924 force-pushed the text_to_video_ort_pipeline branch 3 times, most recently from dc208fd to 21d667c, on February 17, 2026 at 06:57
naomili0924 force-pushed the text_to_video_ort_pipeline branch from 21d667c to d244942 on February 17, 2026 at 07:23
naomili0924 force-pushed the text_to_video_ort_pipeline branch 2 times, most recently from 2258fc4 to effdab0, on February 22, 2026 at 03:45
naomili0924 changed the title from "introduce text_to_video ort pipeline" to "Runtime-Driven ONNX Export for Diffusion Pipelines" on February 22, 2026
naomili0924 force-pushed the text_to_video_ort_pipeline branch 5 times, most recently from 33abf74 to 125212c, on February 22, 2026 at 07:36
naomili0924 force-pushed the text_to_video_ort_pipeline branch from 125212c to 5ccd6cd on February 22, 2026 at 07:39
naomili0924 requested a review from xadupre on February 22, 2026 at 07:45
naomili0924 marked this pull request as ready for review on February 22, 2026 at 07:47