[RFC]: AIBrix Multi-Modality: Best-in-Class Omni-Modal Serving Platform #1966

@Jeffwan

Description

Summary

vLLM-Omni already provides the multi-stage serving framework with pluggable connectors, Cache-DiT acceleration, and OpenAI-compatible APIs. AIBrix provides Kubernetes-native inference infrastructure with gateway routing, autoscaling, PrisKV KV cache store, and StormService orchestration.

This proposal integrates them deeply — making AIBrix the best platform to run multi-modal models powered by vLLM-Omni.

Motivation

Multi-modal AI is moving from research demos to production workloads. Applications now combine text chat, image generation, video generation, speech recognition (ASR), and text-to-speech (TTS) in a single user experience. Serving these workloads efficiently requires solving problems that neither standalone LLM inference nor single-model diffusion serving addresses:

  • Heterogeneous pipeline stages — An omni pipeline chains ASR (1.7B params, lightweight) → LLM (235B params, compute-heavy prefill, memory-bound decode) → DiT (burst GPU for diffusion steps) → TTS (real-time audio streaming). Each stage has fundamentally different resource profiles.
  • Inter-stage data transfer — KV caches, visual tokens, and audio embeddings must flow between stages with minimal latency. CPU-staged copies are a bottleneck.
  • Independent scaling — Image generation traffic spikes don't correlate with text chat load. Scaling the entire pipeline uniformly wastes GPUs.
  • GPU cost — A naive deployment dedicates one GPU per model. A full omni pipeline (ASR + LLM + DiT + TTS) therefore needs at least four GPUs, even when most of the models are idle most of the time.
  • Cloud-native orchestration — Production deployments run on Kubernetes. Ray adds a second distributed runtime on top of K8s, increasing operational complexity.
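
The resource asymmetry above can be sketched as a toy capacity model. This is purely illustrative: only the parameter sizes come from the bullets above, while the GPU-per-replica and replica counts are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    params_b: float       # model size in billions of parameters (from the RFC)
    gpus_per_replica: int # hypothetical GPU footprint per replica
    replicas: int         # scaled independently per stage

# Hypothetical omni pipeline matching the stages described above:
# ASR -> LLM -> DiT -> TTS, each with a different resource profile.
pipeline = [
    Stage("asr", 1.7, 1, 1),    # lightweight speech recognition
    Stage("llm", 235.0, 8, 2),  # compute-heavy prefill, memory-bound decode
    Stage("dit", 14.0, 2, 1),   # bursty GPU use during diffusion steps
    Stage("tts", 0.5, 1, 1),    # real-time audio streaming
]

def total_gpus(stages):
    return sum(s.gpus_per_replica * s.replicas for s in stages)

# With per-stage scaling, an image-generation spike only grows the DiT
# stage; uniform scaling would multiply every stage's footprint instead.
print(total_gpus(pipeline))  # 1 + 16 + 2 + 1 = 20
```

The point of the sketch is that each stage's `replicas` field is an independent knob, which is exactly what uniform pipeline scaling takes away.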

Proposed Change

TODO: a detailed proposal will follow.

Alternatives Considered

No response

Metadata

Labels

kind/enhancement (New feature or request), priority/critical-urgent (Highest priority. Must be actively worked on as someone's top priority right now.)
