Summary
vLLM-Omni already provides the multi-stage serving framework with pluggable connectors, Cache-DiT acceleration, and OpenAI-compatible APIs. AIBrix provides Kubernetes-native inference infrastructure with gateway routing, autoscaling, PrisKV KV cache store, and StormService orchestration.
This proposal integrates them deeply — making AIBrix the best platform to run multi-modal models powered by vLLM-Omni.
Motivation
Multi-modal AI is moving from research demos to production workloads. Applications now combine text chat, image generation, video generation, speech recognition (ASR), and text-to-speech (TTS) in a single user experience. Serving these workloads efficiently requires solving problems that neither standalone LLM inference nor single-model diffusion serving addresses:
- Heterogeneous pipeline stages — An omni pipeline chains ASR (1.7B params, lightweight) → LLM (235B params, compute-heavy prefill, memory-bound decode) → DiT (burst GPU for diffusion steps) → TTS (real-time audio streaming). Each stage has fundamentally different resource profiles.
- Inter-stage data transfer — KV caches, visual tokens, and audio embeddings must flow between stages with minimal latency. CPU-staged copies are a bottleneck.
- Independent scaling — Image generation traffic spikes don't correlate with text chat load. Scaling the entire pipeline uniformly wastes GPUs.
- GPU cost — A naive deployment dedicates one GPU per model. A full omni pipeline (ASR + LLM + DiT + TTS) therefore needs a minimum of four GPUs, even when most models sit idle most of the time.
- Cloud-native orchestration — Production deployments run on Kubernetes. Ray adds a second distributed runtime on top of K8s, increasing operational complexity.
Proposed Change
TODO: a detailed design proposal will follow.
Alternatives Considered
No response