Feature Request: Stagger Parallel Vision Requests for Better Resource Utilization #1654

@HavenCTO

Description

When multiple parallel vision requests run simultaneously, they all hit the same processing phase at the same time: first image encoding, then image decoding, then text inference. This causes resource contention rather than efficient pipelining.

Current Behavior:

- Parallel requests execute in lockstep through each phase
- Image encoding, image decoding, and text inference each happen simultaneously across all requests
- Resources are overwhelmed at each phase rather than utilized continuously
Expected Behavior:

- Parallel requests should be staggered so that phases overlap
- While one request is in image encoding, another could be in image decoding and another in token generation
- Both CPU and GPU should be utilized continuously across all parallel slots
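To make the requested behavior concrete, here is a minimal sketch of the staggered pipeline idea. All names (`STAGES`, `run_request`) are hypothetical and not part of any existing codebase; each stage gets its own lock, so while one request holds the decode stage another can occupy the encode stage, instead of all requests contending for the same phase at once:

```python
import threading
import time

# Hypothetical 3-stage pipeline. One lock per stage serializes that stage,
# but different stages can run concurrently across different requests,
# giving the overlapped (pipelined) execution described above.
STAGES = ["image_encode", "image_decode", "text_infer"]
stage_locks = {s: threading.Lock() for s in STAGES}

log = []                      # (request_id, stage) entries, in execution order
log_lock = threading.Lock()

def run_request(req_id: int) -> None:
    """Drive one request through every pipeline stage in order."""
    for stage in STAGES:
        with stage_locks[stage]:      # only one request per stage at a time
            with log_lock:
                log.append((req_id, stage))
            time.sleep(0.01)          # placeholder for the real work

# Launch three parallel requests; starting them in sequence naturally
# staggers them through the pipeline.
threads = [threading.Thread(target=run_request, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This is only an illustration of the scheduling shape being asked for; a real implementation would stagger the actual encoder/decoder/inference phases inside the server's slot scheduler rather than wrap them in coarse locks.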
