When multiple vision requests run in parallel, they all reach the same processing phase at the same time: image encoding, then image decoding, then text inference. This causes resource contention at each phase rather than efficient pipelining.
Current Behavior:
Parallel requests execute in lockstep through each phase
All requests perform image encoding at once, then all perform image decoding, then all perform text inference
Each phase's resources are saturated in bursts rather than utilized continuously
Expected Behavior:
Parallel requests should be staggered so phases overlap
While one request is in image encoding, another could be in image decoding and a third in token generation
Continuous utilization of both CPU and GPU across all parallel slots
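The staggering described above can be sketched as a stage pipeline: each phase gets its own worker fed by a queue, so as soon as one request leaves image encoding, the next request can enter it while the first moves on to image decoding. This is a minimal illustration in Python threads, not the actual server implementation; the stage functions, their costs, and the queue wiring are all placeholder assumptions standing in for the real image-encode / image-decode / text-inference code paths.

```python
import queue
import threading
import time

# Placeholder stage functions; real code would run the vision encoder,
# decode the resulting image embeddings, and generate tokens here.
def image_encode(req):
    time.sleep(0.01)  # stand-in for vision-encoder work
    return req

def image_decode(req):
    time.sleep(0.01)  # stand-in for decoding the image embeddings
    return req

def text_infer(req):
    time.sleep(0.01)  # stand-in for token generation
    return req

SENTINEL = object()  # marks end-of-stream through the pipeline

def stage_worker(fn, inbox, outbox):
    """Run one pipeline stage: pull a request, process it, pass it on."""
    while True:
        req = inbox.get()
        if req is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(fn(req))

def run_pipeline(requests):
    """Push requests through encode -> decode -> infer with overlap.

    Because each stage has its own worker and queue, request N+1 can be
    in image encoding while request N is in image decoding, which is the
    staggered behavior described above.
    """
    q_enc, q_dec, q_inf, done = (queue.Queue() for _ in range(4))
    stages = [
        threading.Thread(target=stage_worker, args=(image_encode, q_enc, q_dec)),
        threading.Thread(target=stage_worker, args=(image_decode, q_dec, q_inf)),
        threading.Thread(target=stage_worker, args=(text_infer, q_inf, done)),
    ]
    for t in stages:
        t.start()
    for r in requests:
        q_enc.put(r)
    q_enc.put(SENTINEL)
    for t in stages:
        t.join()
    results = []
    while True:
        item = done.get()
        if item is SENTINEL:
            return results
        results.append(item)
```

Note that Python threads only overlap here because the placeholder stages sleep (releasing the GIL, as real GPU kernels and native decode code would); the point of the sketch is the queue wiring, which keeps all three phases busy once the pipeline fills.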