| title | Load Balancing |
|---|---|
| description | Multi-deployment routing strategies, failover, and circuit breakers |
| section | models |
| order | 2 |
Put multiple deployments behind a single model name. VoidLLM handles routing, failover, and health-aware traffic management.
models:
- name: gpt-4o
strategy: round-robin
max_retries: 2
aliases: [default]
deployments:
- name: azure-east
provider: azure
base_url: https://eastus.openai.azure.com
api_key: ${AZURE_EAST_KEY}
azure_deployment: my-gpt4o-east
- name: azure-west
provider: azure
base_url: https://westus.openai.azure.com
api_key: ${AZURE_WEST_KEY}
azure_deployment: my-gpt4o-west
- name: openai-direct
provider: openai
base_url: https://api.openai.com/v1
api_key: ${OPENAI_KEY}Your app sends model: "default". VoidLLM picks a deployment based on the strategy.
| Strategy | Description | Best for |
|---|---|---|
round-robin |
Equal distribution, next in rotation | Default, even load |
least-latency |
Routes to the deployment with lowest recent P50 | Latency-sensitive apps |
weighted |
Distribute by configured weight | Cost optimization (cheap provider gets more) |
priority |
Always use highest-priority, fall back when down | Primary/backup scenarios |
deployments:
- name: cheap-provider
weight: 3 # gets 75% of traffic
- name: premium-provider
weight: 1 # gets 25% of trafficdeployments:
- name: primary
priority: 1 # used first (lower = higher priority)
- name: backup
priority: 2 # used only when primary is downWhen a deployment returns 5xx, times out, or can't connect, VoidLLM retries on the next available deployment. The max_retries setting controls how many deployments to try before giving up.
This happens transparently - the client sees a normal response. Usage tracking records which deployment actually handled the request.
Each deployment has its own circuit breaker. After consecutive failures, the circuit opens and the deployment is temporarily skipped.
settings:
circuit_breaker:
enabled: true
threshold: 5 # failures before opening
timeout: 30s # how long to stay open
half_open_max: 1 # test requests before closingVoidLLM continuously probes each deployment's health. Unhealthy deployments are excluded from routing automatically. If all deployments are unhealthy, VoidLLM falls back to trying them anyway (better than returning nothing).
settings:
health_check:
health:
enabled: true
interval: 30s
functional:
enabled: true
interval: 5mThe functional probe sends a minimal request to each deployment (e.g., a 1-token completion) to verify end-to-end connectivity. The health probe checks basic reachability.
All deployments within a model should serve the same (or equivalent) LLM. VoidLLM sends the model name from the request to every upstream, so each deployment must recognize it.
This works for:
- Same model across regions (Azure East + West)
- Same model across providers (Azure + OpenAI direct)
This does not work for mixing different LLMs (e.g., GPT-4o + Llama 70B) because vLLM wouldn't recognize "gpt-4o" as a valid model name. Cross-model failover chains are on the Enterprise roadmap.
Load balancing, failover, circuit breakers, and health probing are all included in the free Community tier.