Description
ov::genai::Image2ImagePipeline is currently optimized for single-image generation. When applied to video workloads (frame-by-frame style transfer), its synchronous execution model creates significant performance bottlenecks that prevent real-time or near-real-time performance on Edge hardware (GPU/NPU).
I have successfully implemented a C++ Video-to-Video Style Transfer sample using Image2ImagePipeline (integrating OpenCV VideoCapture with OpenVINO GenAI) and have identified several architectural limitations that throttle throughput.
Observed Bottlenecks
- Synchronous Blocking: The pipe.generate() call blocks the main thread completely. This prevents the application from performing parallel tasks, such as decoding the next video frame or encoding the previous one, creating a "stop-and-wait" execution pattern.
- Redundant Text Encoding: The text prompt (e.g., "cyberpunk anime style") is re-tokenized and re-encoded for every single frame of the video, despite the prompt remaining constant throughout the entire stream. This wastes CPU cycles.
- Latent Initialization: Latents are discarded and re-initialized from random noise for every frame. This ignores temporal continuity, which is crucial for video stability, and forces the model to "hallucinate" new details every frame (causing flickering).
- Resource Contention: Video I/O (OpenCV read/write) occurs on the same thread as the heavy inference load, leading to suboptimal resource utilization. This last point can be partially hidden with application-side threading (see the sketch after this list), but the first three require pipeline-level support.
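For illustration, here is a minimal sketch (not part of the sample itself) of how the resource-contention point can be partially worked around today by prefetching the next frame on a worker thread while the blocking generate() call runs. preprocess() and postprocess() are hypothetical helpers, and pipe, cap, and prompt are the same objects as in the baseline snippet further below:

```cpp
#include <future>
#include <opencv2/opencv.hpp>

// Overlap video decoding with the blocking generate() call.
cv::Mat current;
cap >> current;
while (!current.empty()) {
    // Decode the next frame in the background while the GPU is busy.
    std::future<cv::Mat> next = std::async(std::launch::async, [&cap]() {
        cv::Mat f;
        cap >> f;
        return f;
    });

    ov::Tensor input_tensor = preprocess(current);  // hypothetical helper: resize, BGR->RGB, wrap as ov::Tensor
    ov::Tensor output = pipe.generate(
        prompt,
        input_tensor,
        ov::genai::strength(0.6f),
        ov::genai::num_inference_steps(15)
    );
    postprocess(output);                            // hypothetical helper: convert back to cv::Mat and write

    current = next.get();                           // by now the next frame is usually already decoded
}
```

This hides only the decode latency; the redundant prompt encoding and latent re-initialization cannot be addressed from user code, which motivates the proposals below.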
Reproduction Context
I have validated this behavior with a working C++ baseline implementation:
- Pipeline: Video-to-Video Style Transfer (OpenCV + OpenVINO GenAI)
- Model: dreamlike-anime-1.0 (OpenVINO IR)
- Hardware: Intel GPU (verified on iGPU/dGPU)
- Resolution: 512x512
- Steps: 15
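For context, a minimal sketch of the setup the baseline loop below assumes; the model path is a placeholder and the header location may differ between releases:

```cpp
#include <string>
#include <opencv2/opencv.hpp>
#include "openvino/genai/image_generation/image2image_pipeline.hpp"

// Illustrative setup for the baseline loop; adjust paths/device to your environment.
cv::VideoCapture cap("input.mp4");
cv::Mat frame;
ov::genai::Image2ImagePipeline pipe("dreamlike-anime-1.0-ov", "GPU");
std::string prompt = "cyberpunk anime style";
```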
Baseline Code Snippet (Current Implementation):
```cpp
// This loop runs synchronously, demonstrating the bottlenecks
while (true) {
    cap >> frame;               // Video decoding
    if (frame.empty()) break;   // End of stream
    // ... Pre-processing (resize / color convert) ...

    // BOTTLENECK 1: Prompt is re-encoded every iteration
    // BOTTLENECK 2: Execution blocks here, preventing parallel I/O
    ov::Tensor output = pipe.generate(
        prompt,
        input_tensor,
        ov::genai::strength(0.6f),
        ov::genai::num_inference_steps(15)
    );

    // ... Post-processing & saving ...
}
```
Proposed Enhancements
To enable efficient Video Style Transfer on OpenVINO, I propose the following optimizations:
- Prompt Embedding Caching: Introduce an API or internal mechanism to encode the prompt once and reuse the embeddings for subsequent generate() calls.
- Async Inference Support: Expose an asynchronous API for generate() so the application can prepare the next frame (decoding/resizing) while the GPU is processing the current UNet steps.
- Latent Reuse: Allow passing existing latents (or the previous frame's latents) to warm-start the generation, improving temporal consistency and potentially reducing the step count. A sketch of how these three could compose is shown after this list.
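To make the request concrete, here is a purely illustrative sketch of one possible API shape. None of these methods (encode_prompt, start_async, latents, request.wait) exist in ov::genai today, and the maintainers may well prefer a different surface; this only shows how the three proposals would fit into a video loop:

```cpp
// HYPOTHETICAL API -- illustrative only; nothing below exists in ov::genai today.
auto prompt_embeds = pipe.encode_prompt(prompt);        // (1) encode the prompt once, reuse per frame

ov::Tensor prev_latents;                                // (3) carry latents across frames for temporal stability
while (cap.read(frame)) {
    auto request = pipe.start_async(                    // (2) non-blocking submission
        prompt_embeds,
        preprocess(frame),                              // hypothetical helper, as in the baseline
        ov::genai::strength(0.6f),
        ov::genai::num_inference_steps(15),
        ov::genai::latents(prev_latents));              // warm-start from the previous frame's latents

    // Decode / encode other frames here while the GPU runs the UNet steps.

    ov::Tensor output = request.wait();
    prev_latents = request.latents();                   // retrieve latents for the next iteration
    postprocess(output);                                // hypothetical helper: convert and write the frame
}
```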
Impact
Implementing these changes would make OpenVINO GenAI a viable solution for real-time video effects and AR filters on Edge devices, significantly distinguishing it from cloud-only pipelines.
I have a working baseline implementation ready and would be interested in contributing to these optimizations or submitting the Video Style Transfer sample as a reference implementation to the repository.