
Feature Request: Optimization of Image2Image Pipeline for Video Workloads (Async Execution & Embedding Caching) #3258

@Ashitpatel001

Description

ov::genai::Image2ImagePipeline is currently optimized for single-image generation. When applied to video workloads (frame-by-frame style transfer), its synchronous execution model creates significant performance bottlenecks that prevent real-time or near-real-time throughput on edge hardware (GPU/NPU).

I have successfully implemented a C++ Video-to-Video Style Transfer sample using Image2ImagePipeline (integrating OpenCV VideoCapture with OpenVINO GenAI) and have identified several architectural limitations that throttle throughput.

Observed Bottlenecks

  1. Synchronous Blocking: The pipe.generate() call blocks the main thread completely. This prevents the application from performing parallel tasks, such as decoding the next video frame or encoding the previous one, creating a "stop-and-wait" execution pattern.

  2. Redundant Text Encoding: The text prompt (e.g., "cyberpunk anime style") is re-tokenized and re-encoded for every single frame of the video, despite the prompt remaining constant throughout the entire stream. This wastes CPU cycles.

  3. Latent Initialization: Latents are discarded and re-initialized from random noise for every frame. This ignores temporal continuity, which is crucial for video stability, and forces the model to "hallucinate" new details on every frame (causing flickering).

  4. Resource Contention: Video I/O (OpenCV read/write) occurs on the same thread as the heavy inference load, leading to suboptimal resource utilization.

Reproduction Context

I have validated this behavior with a working C++ baseline implementation:

  • Pipeline: Video-to-Video Style Transfer (OpenCV + OpenVINO GenAI)
  • Model: dreamlike-anime-1.0 (OpenVINO IR)
  • Hardware: Intel GPU (verified on iGPU/dGPU)
  • Resolution: 512x512
  • Steps: 15

Baseline Code Snippet (Current Implementation):

// This loop runs synchronously, demonstrating the bottlenecks
while (true) {
    cap >> frame; // Video decoding (cv::VideoCapture)
    if (frame.empty()) break; // End of stream

    // ... Pre-processing (resize/color convert, wrap cv::Mat as ov::Tensor) ...

    // BOTTLENECK 1: Prompt is re-tokenized and re-encoded every iteration
    // BOTTLENECK 2: generate() blocks here, preventing parallel I/O
    ov::Tensor output = pipe.generate(
        prompt,
        input_tensor,
        ov::genai::strength(0.6f),
        ov::genai::num_inference_steps(15)
    );

    // ... Post-processing & saving (cv::VideoWriter) ...
}

Proposed Enhancements
To enable efficient Video Style Transfer on OpenVINO, I propose the following optimizations:

Prompt Embedding Caching: Introduce an API or internal mechanism to encode the prompt once and reuse the embeddings for subsequent generate() calls.
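
A minimal sketch of what the call site could look like; encode_prompt(), the embeddings-accepting generate() overload, and the read_next_frame() helper are all hypothetical names for this proposal, not existing OpenVINO GenAI API:

// HYPOTHETICAL API: run the text encoder once, outside the frame loop
auto prompt_embeds = pipe.encode_prompt(prompt);

while (read_next_frame(cap, input_tensor)) { // illustrative decode/pre-process helper
    // Reuse the cached embeddings; no tokenization or text-encoder run per frame
    ov::Tensor output = pipe.generate(
        prompt_embeds,
        input_tensor,
        ov::genai::strength(0.6f),
        ov::genai::num_inference_steps(15)
    );
    // ... post-processing & saving ...
}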

Async Inference Support: Expose an asynchronous API for generate() to allow the application to prepare the next frame (decoding/resizing) while the GPU is processing the current UNet steps.
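
Until such an API lands, the overlap can be approximated in application code by wrapping the existing blocking generate() in std::async; a rough sketch, where preprocess() is an illustrative helper converting cv::Mat to ov::Tensor:

#include <future>

cv::Mat frame;
cap >> frame;
ov::Tensor current = preprocess(frame);

while (true) {
    // Run the blocking generate() for frame N on a worker thread...
    auto pending = std::async(std::launch::async, [&] {
        return pipe.generate(prompt, current,
                             ov::genai::strength(0.6f),
                             ov::genai::num_inference_steps(15));
    });

    // ...while the main thread decodes frame N+1 in parallel.
    cv::Mat next_frame;
    cap >> next_frame;

    ov::Tensor output = pending.get(); // join once frame N is done
    // ... post-processing & saving of output ...

    if (next_frame.empty()) break;
    current = preprocess(next_frame);
}

A native async API in the pipeline itself would remove the extra thread and allow deeper pipelining, e.g., overlapping the VAE decode of one frame with the next frame's UNet steps.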

Latent Reuse: Allow passing existing latents (or the previous frame's latents) to warm-start the generation, improving temporal consistency and potentially reducing step count.
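
One possible call-site shape for this, where initial_latents() and get_last_latents() are proposed, hypothetical names rather than current API:

// HYPOTHETICAL API: warm-start denoising from the previous frame's latents
ov::Tensor prev_latents; // empty on the first frame -> fall back to random noise

ov::Tensor output = pipe.generate(
    prompt,
    input_tensor,
    ov::genai::strength(0.6f),
    ov::genai::num_inference_steps(15),
    ov::genai::initial_latents(prev_latents) // proposed property
);

prev_latents = pipe.get_last_latents(); // proposed accessor for the final latents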

Impact
Implementing these changes would make OpenVINO GenAI a viable solution for real-time video effects and AR filters on edge devices, clearly distinguishing it from cloud-only pipelines.

I have working baseline code ready and would be interested in contributing these optimizations or submitting the Video Style Transfer sample as a reference implementation to the repository.

Demo recordings:

  • Screen.Recording.2026-01-31.225550.mp4
  • Screen.Recording.2026-01-31.195412.mp4
