
Feature Request: Optimization of Image2Image Pipeline for Video Workloads (Async Execution & Embedding Caching) #3258

@Ashitpatel001

Description

ov::genai::Image2ImagePipeline is currently optimized for single-image generation. When applied to video workloads (frame-by-frame style transfer), its synchronous execution model creates significant performance bottlenecks that prevent real-time or near-real-time throughput on edge hardware (GPU/NPU).

I have successfully implemented a C++ Video-to-Video Style Transfer sample using Image2ImagePipeline (integrating OpenCV VideoCapture with OpenVINO GenAI) and have identified several architectural limitations that throttle throughput.

Observed Bottlenecks

  1. Synchronous Blocking: The pipe.generate() call blocks the main thread completely. This prevents the application from performing parallel tasks, such as decoding the next video frame or encoding the previous one, creating a "stop-and-wait" execution pattern.

  2. Redundant Text Encoding: The text prompt (e.g., "cyberpunk anime style") is re-tokenized and re-encoded for every single frame of the video, despite the prompt remaining constant throughout the entire stream. This wastes CPU cycles.

  3. Latent Initialization: Latents are discarded and re-initialized from random noise for every frame. This ignores temporal continuity, which is crucial for video stability, and forces the model to "hallucinate" new details on every frame (causing flickering).

  4. Resource Contention: Video I/O (OpenCV read/write) occurs on the same thread as the heavy inference load, leading to suboptimal resource utilization.

Reproduction Context

I have validated this behavior with a working C++ baseline implementation:

  • Pipeline: Video-to-Video Style Transfer (OpenCV + OpenVINO GenAI)
  • Model: dreamlike-anime-1.0 (OpenVINO IR)
  • Hardware: Intel GPU (verified on iGPU/dGPU)
  • Resolution: 512x512
  • Steps: 15

Baseline Code Snippet (Current Implementation):

// This loop runs synchronously, demonstrating the bottlenecks
while (true) {
    cap >> frame; // Video decoding (cv::VideoCapture)
    if (frame.empty()) break; // End of stream

    // ... Pre-processing (resize/color convert, wrap cv::Mat as ov::Tensor) ...

    // BOTTLENECK 1: Prompt is re-tokenized and re-encoded every iteration
    // BOTTLENECK 2: generate() blocks here, preventing parallel I/O
    ov::Tensor output = pipe.generate(
        prompt,
        input_tensor,
        ov::genai::strength(0.6f),
        ov::genai::num_inference_steps(15)
    );

    // ... Post-processing & saving (cv::VideoWriter) ...
}

Proposed Enhancements
To enable efficient Video Style Transfer on OpenVINO, I propose the following optimizations:

Prompt Embedding Caching: Introduce an API or internal mechanism to encode the prompt once and reuse the embeddings for subsequent generate() calls.
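
A minimal sketch of what the call site could look like; encode_prompt(), the embeddings-accepting generate() overload, and the read_next_frame() helper are all hypothetical names for this proposal, not existing OpenVINO GenAI API:

// HYPOTHETICAL API: run the text encoder once, outside the frame loop
auto prompt_embeds = pipe.encode_prompt(prompt);

while (read_next_frame(cap, input_tensor)) { // illustrative decode/pre-process helper
    // Reuse the cached embeddings; no tokenization or text-encoder run per frame
    ov::Tensor output = pipe.generate(
        prompt_embeds,
        input_tensor,
        ov::genai::strength(0.6f),
        ov::genai::num_inference_steps(15)
    );
    // ... post-processing & saving ...
}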

Async Inference Support: Expose an asynchronous API for generate() to allow the application to prepare the next frame (decoding/resizing) while the GPU is processing the current UNet steps.
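
Until such an API lands, the overlap can be approximated in application code by wrapping the existing blocking generate() in std::async; a rough sketch, where preprocess() is an illustrative helper converting cv::Mat to ov::Tensor:

#include <future>

cv::Mat frame;
cap >> frame;
ov::Tensor current = preprocess(frame);

while (true) {
    // Run the blocking generate() for frame N on a worker thread...
    auto pending = std::async(std::launch::async, [&] {
        return pipe.generate(prompt, current,
                             ov::genai::strength(0.6f),
                             ov::genai::num_inference_steps(15));
    });

    // ...while the main thread decodes frame N+1 in parallel.
    cv::Mat next_frame;
    cap >> next_frame;

    ov::Tensor output = pending.get(); // join once frame N is done
    // ... post-processing & saving of output ...

    if (next_frame.empty()) break;
    current = preprocess(next_frame);
}

A native async API in the pipeline itself would remove the extra thread and allow deeper pipelining, e.g., overlapping the VAE decode of one frame with the next frame's UNet steps.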

Latent Reuse: Allow passing existing latents (or the previous frame's latents) to warm-start the generation, improving temporal consistency and potentially reducing step count.
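
One possible call-site shape for this, where initial_latents() and get_last_latents() are proposed, hypothetical names rather than current API:

// HYPOTHETICAL API: warm-start denoising from the previous frame's latents
ov::Tensor prev_latents; // empty on the first frame -> fall back to random noise

ov::Tensor output = pipe.generate(
    prompt,
    input_tensor,
    ov::genai::strength(0.6f),
    ov::genai::num_inference_steps(15),
    ov::genai::initial_latents(prev_latents) // proposed property
);

prev_latents = pipe.get_last_latents(); // proposed accessor for the final latents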

Impact
Implementing these changes would make OpenVINO GenAI a viable solution for real-time video effects and AR filters on edge devices, clearly distinguishing it from cloud-only pipelines.

I have working baseline code ready and would be interested in contributing these optimizations or submitting the Video Style Transfer sample as a reference implementation to the repository.

Demo recordings:

  • Screen.Recording.2026-01-31.225550.mp4
  • Screen.Recording.2026-01-31.195412.mp4
