🚀 Feature Description and Motivation
Currently, Aibrix's ModelRouter routes requests at the Pod level: a request is dispatched to a Pod as a whole rather than to an individual inference server or service running inside it. In scenarios where multiple inference services (such as DP or TP domains) run within a single Pod, this approach lacks the granularity needed to distribute load effectively and utilize resources efficiently.
Problem:
Pod-Level Routing: The ModelRouter routes requests only to the entire Pod, not to individual inference servers within it. This is inefficient when a Pod hosts multiple services (e.g., multiple GPUs or DP domains).
Multi-Server Pods: When a Pod contains multiple inference services (such as DP domains or GPUs), routing requests to the Pod as a whole can lead to resource bottlenecks, inefficient load balancing, and suboptimal performance.
Lack of Granularity: There is no mechanism to route requests to a specific server within a Pod, so some servers can become overloaded while others remain underutilized.
Use Case
In scenarios where a Pod contains two different DP domains or multiple GPUs, we want the ModelRouter to route requests to specific servers within the Pod:
Example Pod with two servers (DP domains):
Pod 1:
├── dp0: http://pod1:8000
└── dp1: http://pod1:8001
Currently, ModelRouter would route requests to Pod 1 as a whole, potentially leading to inefficiencies. With server-level routing, ModelRouter would be able to route requests directly to either dp0 or dp1 based on factors like load and health.
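To make the desired behavior concrete, here is a minimal sketch of routing to a specific server within a Pod rather than to the Pod itself. All names here (`ServerEndpoint`, `pick_server`) are illustrative and not part of AIBrix's actual API; the selection policy (least active requests) is just one possible load signal.

```python
from dataclasses import dataclass

@dataclass
class ServerEndpoint:
    """One inference server (e.g., a DP domain) inside a Pod.

    Hypothetical type for illustration only.
    """
    name: str
    url: str
    active_requests: int = 0

def pick_server(servers):
    """Pick the least-loaded server within a Pod (illustrative policy)."""
    return min(servers, key=lambda s: s.active_requests)

# The two DP domains from the example Pod above.
servers = [
    ServerEndpoint("dp0", "http://pod1:8000", active_requests=3),
    ServerEndpoint("dp1", "http://pod1:8001", active_requests=1),
]
target = pick_server(servers)  # dp1, since it has fewer in-flight requests
```

With Pod-level routing, both requests streams would hit `http://pod1:8000`-style Pod addressing; with server-level routing, the request above would be sent to `dp1` at `http://pod1:8001`.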
I am requesting the implementation of server-level routing in Aibrix's ModelRouter to address the limitations described above.
Proposed Solution
I propose an enhancement to Aibrix's ModelRouter to support server-level routing. The key idea is to:
Track and register each inference server (e.g., dp0, dp1, etc.) within a Pod separately, enabling granular routing decisions.
Route requests to specific servers (rather than Pods) based on load, health, and other relevant factors.
Introduce health checks and failure management mechanisms at the server level to enhance fault tolerance and reliability.
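The three points above could be combined in a per-Pod server registry. The sketch below is a minimal illustration under assumed semantics (mark a server unhealthy after a fixed number of consecutive failures, route to the least-loaded healthy server); `ServerRegistry` and its methods are hypothetical names, not AIBrix code.

```python
from dataclasses import dataclass

@dataclass
class ServerState:
    """Tracked state for one inference server (illustrative)."""
    url: str
    healthy: bool = True
    consecutive_failures: int = 0
    active_requests: int = 0

class ServerRegistry:
    """Registers each server (dp0, dp1, ...) within a Pod separately."""

    FAILURE_THRESHOLD = 3  # assumed policy: 3 consecutive failures -> unhealthy

    def __init__(self):
        self._servers = {}

    def register(self, name, url):
        self._servers[name] = ServerState(url=url)

    def report_success(self, name):
        s = self._servers[name]
        s.consecutive_failures = 0
        s.healthy = True

    def report_failure(self, name):
        s = self._servers[name]
        s.consecutive_failures += 1
        if s.consecutive_failures >= self.FAILURE_THRESHOLD:
            s.healthy = False

    def route(self):
        """Return (name, url) of the least-loaded healthy server."""
        healthy = [(n, s) for n, s in self._servers.items() if s.healthy]
        if not healthy:
            raise RuntimeError("no healthy server in Pod")
        name, s = min(healthy, key=lambda item: item[1].active_requests)
        return name, s.url

# Usage: after dp0 fails its health checks, traffic falls back to dp1.
reg = ServerRegistry()
reg.register("dp0", "http://pod1:8000")
reg.register("dp1", "http://pod1:8001")
for _ in range(3):
    reg.report_failure("dp0")
chosen = reg.route()
```

In a real implementation, failure reports would come from active health probes or per-request error tracking, and the load signal could incorporate queue depth or KV-cache utilization rather than a simple request count.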