Human-Centric Video Model


Welcome to the V1 repository! Here you'll find the model weights and inference code for the Text-to-Video, Image-to-Video, and Video-to-Video models.

📑 TODO List

  • V1 (Text2Video Model)

    • Checkpoints
    • Inference Code
    • Web Demo (Gradio)
    • User-Level GPU Inference on RTX4090
    • Parallel Inference on Multi-GPUs
    • Prompt Rewrite && Prompt Guidance
    • CFG-distilled Model
    • Lite Model
    • 720P Version
    • ComfyUI
  • V1 (Image2Video Model)

    • Checkpoints
    • Inference Code
    • Web Demo (Gradio)
    • User-Level GPU Inference on RTX4090
    • Parallel Inference on Multi-GPUs
    • Prompt Rewrite && Prompt Guidance
    • CFG-distilled Model
    • Lite Model
    • 720P Version
    • ComfyUI
  • V1 (Video2Video Model)

    • Checkpoints
    • Inference Code
    • Web Demo (Gradio)
    • User-Level GPU Inference on RTX4090
    • Parallel Inference on Multi-GPUs
    • Prompt Rewrite && Prompt Guidance
    • CFG-distilled Model
    • Lite Model
    • 720P Version
    • ComfyUI

🌟 Overview

V1 is an advanced open-source human-centric video model. By fine-tuning HunyuanVideo on O(10M) high-quality film and television clips, V1 offers three key advantages:

🔑 Key Features

1. Advanced Model Capabilities

  1. Open-Source Leadership: The Text-to-Video model achieves state-of-the-art (SOTA) performance among open-source models, comparable to proprietary models like Kling and Hailuo.
  2. Advanced Facial Animation: Captures 33 distinct facial expressions with over 400 natural movement combinations, accurately reflecting human emotions.
  3. Cinematic Lighting and Aesthetics: Trained on high-quality Hollywood-level film and television data, each generated frame exhibits cinematic quality in composition, actor positioning, and camera angles.

2. Self-Developed Data Cleaning and Annotation Pipeline

The model is built on a self-developed data cleaning and annotation pipeline, creating a vast dataset of high-quality film, television, and documentary content.

  • Expression Classification: Categorizes human facial expressions into 33 distinct types.
  • Character Spatial Awareness: Utilizes 3D human reconstruction technology to understand spatial relationships between multiple people in a video, enabling film-level character positioning.
  • Action Recognition: Constructs over 400 action semantic units to achieve a precise understanding of human actions.
  • Scene Understanding: Conducts cross-modal correlation analysis of clothing, scenes, and plots.

3. Multi-Stage Image-to-Video Pretraining

Our multi-stage pretraining pipeline, inspired by the HunyuanVideo design, consists of the following stages:

  • Stage 1: Model Domain Transfer Pretraining: We use a large dataset (O(10M) of film and television content) to adapt the text-to-video model to the human-centric video domain.
  • Stage 2: Image-to-Video Model Pretraining: We convert the text-to-video model from Stage 1 into an image-to-video model by adjusting the conv-in parameters (a common way to do this is sketched after this list). This new model is then pretrained on the same dataset used in Stage 1.
  • Stage 3: High-Quality Fine-Tuning: We fine-tune the image-to-video model on a high-quality subset of the original dataset, ensuring superior performance and quality.
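The conv-in adjustment in Stage 2 is typically done by widening the input convolution so it accepts the extra conditioning channels, with the new weight slices zero-initialized so the adapted model initially behaves exactly like the Stage 1 text-to-video model. The PyTorch sketch below illustrates that general technique only; it is not this repository's code, and the channel counts are hypothetical.

import torch
import torch.nn as nn

def expand_conv_in(conv_in: nn.Conv3d, extra_in_channels: int) -> nn.Conv3d:
    # Build a wider conv-in, copy the pretrained weights, and zero-initialize
    # the slices that will see the new image-conditioning channels.
    out_c, in_c = conv_in.out_channels, conv_in.in_channels
    new_conv = nn.Conv3d(in_c + extra_in_channels, out_c,
                         kernel_size=conv_in.kernel_size,
                         stride=conv_in.stride,
                         padding=conv_in.padding,
                         bias=conv_in.bias is not None)
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :in_c] = conv_in.weight   # keep the T2V weights
        if conv_in.bias is not None:
            new_conv.bias.copy_(conv_in.bias)
    return new_conv

# Hypothetical usage: 16 latent channels plus 16 image-conditioning channels.
conv_in = expand_conv_in(nn.Conv3d(16, 128, kernel_size=3, padding=1), 16)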

📊 Benchmark Results

I evaluated the text-to-video model using VBench and compared it with other leading open-source models.

Based on the benchmark results, V1 demonstrates SOTA performance among open-source Text-to-Video (T2V) models. Specifically, our model achieves an overall score of 82.43, which is higher than other open-source models such as VideoCrafter-2.0 VEnhancer (82.24) and CogVideoX1.5-5B (82.17). Additionally, our model achieves the highest scores in several key metrics, including Dynamic Degree and Multiple Objects, indicating our model's superior ability to handle complex video generation tasks.

| Models | Overall | Quality Score | Semantic Score | Image Quality | Dynamic Degree | Multiple Objects | Spatial Relationship |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenSora V1.3 | 77.23 | 80.14 | 65.62 | 56.21 | 30.28 | 43.58 | 51.61 |
| AnimateDiff-V2 | 80.27 | 82.90 | 69.75 | 70.10 | 40.83 | 36.88 | 34.60 |
| VideoCrafter-2.0 VEnhancer | 82.24 | 83.54 | 77.06 | 65.35 | 63.89 | 68.84 | 57.55 |
| CogVideoX1.5-5B | 82.17 | 82.78 | 79.76 | 65.02 | 50.93 | 69.65 | 80.25 |
| HunyuanVideo 540P | 81.23 | 83.49 | 72.22 | 66.31 | 51.67 | 70.45 | 63.46 |
| V1 540P (T2V) | 82.43 | 84.62 | 73.68 | 67.15 | 72.50 | 71.61 | 70.83 |

📦 Model Introduction

| Model Name | Resolution | Video Length (frames) | FPS |
| --- | --- | --- | --- |
| V1-Hunyuan-I2V | 544px960p | 97 | 24 |
| V1-Hunyuan-T2V | 544px960p | 97 | 24 |
| V1-SVD-V2V | 544px960p | 97 | 24 |

Infer Introduction

Infer is an efficient video generation inference framework that significantly speeds up inference for high-quality video generation without any loss in quality.

Multi-GPU Inference Support: The framework accommodates Context Parallel, CFG Parallel, and VAE Parallel methodologies, facilitating rapid and lossless video production to meet the stringent low-latency demands of online environments.
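As a rough illustration of one of these methods (not the framework's actual implementation), CFG parallelism with two ranks can be sketched as follows: one rank evaluates the conditional branch of classifier-free guidance and the other the unconditional branch, and the two noise predictions are then exchanged so every rank can apply the guidance formula. The transformer call signature below is hypothetical.

import torch
import torch.distributed as dist

def cfg_parallel_step(transformer, latents, t, cond_emb, uncond_emb, guidance_scale):
    # Rank 0 runs the conditional branch, rank 1 the unconditional branch.
    rank = dist.get_rank()
    text_emb = cond_emb if rank == 0 else uncond_emb
    noise_pred = transformer(latents, t, text_emb)   # hypothetical call signature
    # Exchange the two predictions so both ranks can combine them.
    gathered = [torch.empty_like(noise_pred) for _ in range(2)]
    dist.all_gather(gathered, noise_pred)
    noise_cond, noise_uncond = gathered
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)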

User-Level GPU Deployment: By employing model quantization and parameter-level offload strategies, the system significantly reduces GPU memory requirements, catering to the needs of consumer-grade graphics cards with limited VRAM.
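As a sketch of what parameter-level offloading means in practice (again, the general technique rather than this repository's implementation): each transformer block's weights live in pinned CPU memory and are copied to the GPU only while that block is executing, trading extra PCIe traffic for a much lower VRAM peak.

import torch.nn as nn

def attach_offload_hooks(block: nn.Module, device: str = "cuda") -> nn.Module:
    # Keep the block's weights on the CPU in pinned memory (faster H2D copies).
    block.to("cpu")
    for p in block.parameters():
        p.data = p.data.pin_memory()
    def to_gpu(mod, args):                 # runs just before the block's forward
        mod.to(device, non_blocking=True)
    def back_to_cpu(mod, args, output):    # runs right after the forward
        mod.to("cpu")                      # a real implementation would reuse the pinned copy
        return output
    block.register_forward_pre_hook(to_gpu)
    block.register_forward_hook(back_to_cpu)
    return block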

Superior Inference Performance: Demonstrating exceptional efficiency, the framework delivers a 58.3% end-to-end speedup over HunyuanVideo xDiT, setting a new benchmark for inference speed.

Excellent Usability: Built upon the open-source framework Diffusers and featuring a non-intrusive parallel implementation approach, the system ensures a seamless and user-friendly experience.

🛠️ Running Guide

Begin by cloning the repository:

git clone https://github.com/
cd infer

Installation Guide for Linux

We recommend Python 3.10 and CUDA version 12.2 for the manual installation.

# Install pip dependencies
pip install -r requirements.txt

When sufficient VRAM is available (e.g., on A800), the lossless version can be run directly.

Note: When generating videos, the prompt should start with "FPS-24, ", since we adopted the FPS-conditioning training method from Movie Gen during training.

Model = "V1-Hunyuan-T2V"
python3 video_generate.py \
    --model_id ${Model} \
    --task_type t2v \
    --guidance_scale 6.0 \
    --height 544 \
    --width 960 \
    --num_frames 97 \
    --prompt "FPS-24, A cat wearing sunglasses and working as a lifeguard at a pool" \
    --embedded_guidance_scale 1.0
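Because the framework builds on Diffusers, the same generation can in principle also be scripted from Python. The sketch below assumes the checkpoint is published in the Diffusers HunyuanVideo layout; the model path is a placeholder, and this is not the repository's documented API.

import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video

# Placeholder path; assumes a Diffusers-format V1-Hunyuan-T2V checkpoint.
pipe = HunyuanVideoPipeline.from_pretrained("path/to/V1-Hunyuan-T2V",
                                            torch_dtype=torch.bfloat16).to("cuda")
frames = pipe(prompt="FPS-24, A cat wearing sunglasses and working as a lifeguard at a pool",
              height=544, width=960, num_frames=97, guidance_scale=6.0).frames[0]
export_to_video(frames, "output.mp4", fps=24)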

User-Level GPU Inference (RTX4090)

We list the height/width/frame settings we recommend in the following table.

| Resolution | h/w = 9:16 | h/w = 16:9 | h/w = 1:1 |
| --- | --- | --- | --- |
| 544p | 544px960px97f | 960px544px97f | 720px720px97f |

Using Command Line

# Model: If using i2v, switch to V1-Hunyuan-I2V.
# quant: Enable FP8 weight-only quantization.
# offload: Enable model offloading.
# high_cpu_memory: Enable pinned memory to reduce the overhead of model offloading.
# parameters_level: Further reduce GPU VRAM usage.
# task_type: Supports both t2v and i2v. For an i2v task, you must also pass --image.
Model="V1-Hunyuan-T2V"
python3 video_generate.py \
    --model_id ${Model} \
    --task_type t2v \
    --guidance_scale 6.0 \
    --height 544 \
    --width 960 \
    --num_frames 97 \
    --prompt "FPS-24, A cat wearing sunglasses and working as a lifeguard at a pool" \
    --embedded_guidance_scale 1.0 \
    --quant \
    --offload \
    --high_cpu_memory \
    --parameters_level

The example above generates a 544px960px97f 4-second video on a single RTX 4090 with full VRAM optimization, peaking at 18.5 GB of VRAM. At maximum VRAM capacity, a 544px960px289f 12-second video can be produced (using --sequence_batch, taking ~1.5 h on one RTX 4090; adding GPUs greatly reduces the time).
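The --quant flag above enables FP8 weight-only quantization. The snippet below is only a sketch of the underlying idea (per-tensor scaling into torch.float8_e4m3fn storage, dequantized back to the activation dtype at matmul time); it is not the framework's actual quantizer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FP8WeightOnlyLinear(nn.Module):
    # Stores the weight in FP8 (half of FP16 memory) and dequantizes on the fly.
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data
        self.scale = w.abs().max() / torch.finfo(torch.float8_e4m3fn).max
        self.weight_fp8 = (w / self.scale).to(torch.float8_e4m3fn)
        self.bias = linear.bias
    def forward(self, x):
        w = self.weight_fp8.to(x.dtype) * self.scale   # dequantize per call
        return F.linear(x, w, self.bias)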

🚀 Parallel Inference on Multiple GPUs

# Model: If using i2v, switch to V1-Hunyuan-I2V.
# quant: Enable FP8 weight-only quantization
# offload: Enable offload model
# high_cpu_memory: Enable pinned memory to reduce the overhead of model offloading.
# gpu_num: Number of GPUs used.
Model = "V1-Hunyuan-T2V"
python3 video_generate.py \
    --model_id ${Model} \
    --guidance_scale 6.0 \
    --height 544 \
    --width 960 \
    --num_frames 97 \
    --prompt "FPS-24, A cat wearing sunglasses and working as a lifeguard at a pool" \
    --embedded_guidance_scale 1.0 \
    --quant \
    --offload \
    --high_cpu_memory \
    --gpu_num $GPU_NUM

Performance Comparison

This test compares the end-to-end latency of Infer and HunyuanVideo xDiT for 544p video generation on both the A800 (high-performance computing GPU) and the RTX 4090 (consumer-grade GPU). The results demonstrate Infer's superior inference speed and efficiency.

Testing Parameters

| Resolution | Video Size | Transformer Steps | guidance_scale |
| --- | --- | --- | --- |
| 540p | 544px960px97f | 30 | 6 |

Note: The comparisons below focus on T2V and I2V tasks, which share the same underlying fine-tuned HunyuanVideo model. V2V performance (using Stable Video Diffusion) may differ and will be evaluated in a future update.

User-Level GPU Inference (RTX4090)

In practice, HunyuanVideo xDiT cannot perform inference on the RTX 4090 due to insufficient VRAM. To make the comparison possible, we applied fixes on top of the official offload, FP8 model weights, and VAE tiling:
a) Optimizing the model loading and initialization logic to avoid fully loading the FP16 model into memory.
b) Reducing the VAE tiling size to lower memory usage.

For deploying Infer on the RTX 4090, the following measures ensure sufficient VRAM and efficient inference:
a) Model Quantization: Apply FP8 weight-only quantization so the model can be fully loaded into memory.
b) Offload Strategy: Enable parameter-level offloading to further reduce VRAM usage.
c) Multi-GPU Parallelism: Activate context parallelism, CFG parallelism, and VAE parallelism for distributed processing.
d) Computation Optimization: Optimize attention layer calculations with SageAttn and enable torch.compile for the transformer (supporting both 4-GPU and 8-GPU configurations); see the sketch below.
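A minimal sketch of how VAE tiling and the torch.compile optimization in d) could be applied, assuming a Diffusers-style pipeline object named pipe (the exact attribute names in this repository may differ):

import torch

def apply_vae_tiling_and_compile(pipe):
    # Decode the latent video in smaller spatial tiles to cap the VAE's peak VRAM.
    pipe.vae.enable_tiling()
    # Compile the transformer once; later denoising steps reuse the optimized graph.
    pipe.transformer = torch.compile(pipe.transformer)
    return pipe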

| GPU Num | HunyuanVideo + xDiT | Infer |
| --- | --- | --- |
| 1 | VRAM OOM | 889.31s |
| 2 | VRAM OOM | 453.69s |
| 4 | 464.3s | 293.3s |
| 8 | Cannot split video sequence into ulysses_degree x ring_degree | 159.43s |

The table above summarizes the end-to-end latency results for generating 544p 4-second videos on the RTX 4090 using HunyuanVideo xDiT and Infer. The following conclusions can be drawn:

  • Under the same RTX 4090 resource conditions (4 GPUs), the Infer version is 58.3% faster end-to-end than HunyuanVideo xDiT (293.3s vs. 464.3s); see the short calculation after this list.
  • The Infer version features a more robust deployment strategy, supporting inference deployment across 1 to 8 GPUs at the user level.
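For reference, the 58.3% figure is a speed ratio computed from the 4-GPU row of the table above:

xdit_latency, infer_latency = 464.3, 293.3
print(f"{xdit_latency / infer_latency - 1.0:.1%} faster")   # -> 58.3% faster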

A800

On the A800 (80 GB), testing focused on comparing the performance of HunyuanVideo xDiT and Infer without compromising output quality.

| GPU Num | HunyuanVideo + xDiT | Infer |
| --- | --- | --- |
| 1 | 884.20s | 771.03s |
| 2 | 487.22s | 387.01s |
| 4 | 263.48s | 205.49s |
| 8 | Cannot split video sequence into ulysses_degree x ring_degree | 107.41s |

The table above summarizes the end-to-end latency results for generating 544p 4-second videos on the A800 using HunyuanVideo xDiT and Infer. The following conclusions can be drawn:

  • Under the same A800 resource conditions, the Infer version is 14.7% to 28.2% faster end-to-end than the official HunyuanVideo xDiT version, depending on the number of GPUs; see the calculation after this list.

  • The Infer version features a more robust multi-GPU deployment strategy.
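The 14.7% to 28.2% range above is likewise a speed ratio, computed per GPU count from the A800 table:

for gpus, (xdit, infer) in {1: (884.20, 771.03), 2: (487.22, 387.01), 4: (263.48, 205.49)}.items():
    print(gpus, f"{xdit / infer - 1.0:.1%} faster")   # -> 14.7%, 25.9%, 28.2%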

🔒 Data Security

This model uses a fine-tuned version of Tencent's HunyuanVideo for T2V and I2V tasks (see licenses for details), and Stable Video Diffusion (SVD) for V2V tasks. The following data security considerations apply:

  • Local Inference: After an initial download of the model weights from Hugging Face (e.g., V1-Hunyuan-T2V), all inference happens locally on your machine. User inputs (prompts, images, videos) are not sent to external servers during runtime.
  • Initial Download: The model weights are downloaded from Hugging Face, which may expose metadata (e.g., IP address) to Hugging Face and potentially Tencent. Use a VPN or proxy during the initial download to mitigate this risk.
  • Telemetry Concerns: No telemetry was found in the inference code or the Tencent/HunyuanVideo repository. However, the model weights are serialized PyTorch .pt files, which could theoretically contain malicious code due to pickle vulnerabilities. To mitigate this, the framework uses weights_only=True when loading weights (ensure this is applied in video_infer.py); see the loading snippet after this list.
  • Tencent's Involvement: As a Chinese company, Tencent is subject to China's data security laws, which could allow government access to data if it were collected. Since inference is local, this risk is minimal, but users should be cautious about potential vulnerabilities in the weights.
  • Recommendations:
    • Monitor network activity during inference (e.g., using Wireshark) to confirm no data is sent externally.
    • Avoid processing sensitive or personal data with this framework.
    • Consider using alternative models (e.g., Stable Video Diffusion for all tasks) if data security is a critical concern.
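Loading serialized PyTorch weights with weights_only=True, as recommended above, restricts unpickling to tensors and primitive types and refuses arbitrary code execution. The checkpoint path below is a placeholder.

import torch

state_dict = torch.load("checkpoints/V1-Hunyuan-T2V/model.pt",   # placeholder path
                        map_location="cpu", weights_only=True)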

Acknowledgements

I would like to thank the contributors of HunyuanVideo, ParaAttention, Diffusers, and Stability AI (for Stable Video Diffusion) for their open research and contributions. Additional contributors are acknowledged in the licenses.
