Summary
cloud-api surfaces backend exhaustion as HTTP 529 (CompletionError::ServiceOverloaded → status_overloaded() in crates/api/src/routes/common.rs:13). OpenRouter's provider uptime calculation excludes 429 but counts every 500+ response (including 529) as downtime. So our overload signal, which we deliberately chose as "retry with backoff," is read by OpenRouter as a hard outage and degrades our routing rank:
- 95%+ uptime → normal routing
- 80–94% → lower priority
- <80% → fallback-only
429 (per-org concurrent-slot exhaustion, try_acquire_concurrent_slot in crates/services/src/completions/mod.rs:1016) is correctly excluded by OpenRouter and is exactly the "back off" signal OR wants. OpenRouter explicitly asks providers to return early 429s under load and to not queue.
This does NOT block the OpenRouter listing — it affects post-listing routing rank, not eligibility.
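To make the tiering concrete, here is a minimal sketch of how an uptime figure could fall out of a window of response codes. The formula is an assumption based only on the description above (429 dropped from the sample, every 500+ response counted as downtime); OpenRouter's actual calculation is not documented here, and `uptime_percent`, `routing_tier`, and `RoutingTier` are hypothetical names. The handling of values between 94% and 95% is also assumed.

```rust
/// Routing tier implied by the thresholds listed above (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum RoutingTier {
    Normal,
    LowerPriority,
    FallbackOnly,
}

/// Assumed uptime formula: 429s are dropped from the sample entirely,
/// and every status >= 500 (529 included) counts as a failed request.
/// Returns None when no countable requests remain.
pub fn uptime_percent(statuses: &[u16]) -> Option<f64> {
    let counted: Vec<u16> = statuses.iter().copied().filter(|s| *s != 429).collect();
    if counted.is_empty() {
        return None;
    }
    let failed = counted.iter().filter(|s| **s >= 500).count();
    Some(100.0 * (counted.len() - failed) as f64 / counted.len() as f64)
}

/// Maps an uptime percentage onto the three tiers.
/// Assumption: anything below 95.0 but at or above 80.0 is "lower priority".
pub fn routing_tier(uptime: f64) -> RoutingTier {
    if uptime >= 95.0 {
        RoutingTier::Normal
    } else if uptime >= 80.0 {
        RoutingTier::LowerPriority
    } else {
        RoutingTier::FallbackOnly
    }
}
```

Under these assumptions, a window of 90 successes plus 10 overloads scores 90% (lower priority) when the overloads are 529s, and 100% (normal routing) when the same overloads are returned as 429s.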
Current behavior (confirmed in code)
| Condition | Domain error | HTTP | OR treats as |
|---|---|---|---|
| Per-(org,model) concurrent cap hit (DEFAULT_CONCURRENT_LIMIT = 64, ports.rs:9) | RateLimitExceeded | 429 | back off (excluded from uptime) ✅ |
| Upstream 429 (provider rate limit) | RateLimitExceeded | 429 | back off ✅ |
| All backends exhausted after retry_with_fallback rotation (SGLang --max-queued-requests 503s) | ServiceOverloaded | 529 | downtime ❌ |
| Upstream 503 | ServiceOverloaded | 529 | downtime ❌ |
Note: rotation-SNI (PR #637 / cloud-api.md "HTTP 529 vs 503") already made 529 rare in prod — ~1 event/9h, down 360× from before rotation. So the routing-rank exposure is small today, but it is a real and avoidable signal.
Proposed fix (recommended)
Add an OpenRouter-scoped 529→429 remap at the route layer, opt-in, with zero behavior change for every other client:
- The chat-completions handler already has both headers: HeaderMap and api_key.organization.id in scope at the error arms (crates/api/src/routes/completions.rs:1126, error arms at :1528/:1634). So either of these scoping mechanisms is a few lines:
  - Header-based: e.g. honor a request header (x-overload-status: 429). Clean, but requires OpenRouter to send a custom header on every request (may not be configurable on their side).
  - Org-scoped (recommended): when the request's org is the dedicated OpenRouter integration org, map ServiceOverloaded → 429 instead of 529. Self-contained, needs no cooperation from OR, and matches OR's documented "early 429, don't queue" expectation.
- Keep 529 as the default for all other clients (Anthropic-style "site overloaded, retry with backoff" is the correct semantic for our own SDK/UI clients and for honest observability).
- Body should still carry a clear Retry-After-style hint and service_overloaded error type so the meaning isn't lost.
Effort: ~half a day. One small route-layer branch + a unit test mirroring test_map_domain_error_service_overloaded. No service-layer, DB, or migration changes. The mapping function map_domain_error_to_status would gain an org/header-aware variant used only by the chat and responses handlers. The 5 other per-handler ServiceOverloaded arms (audio/rerank/embeddings/privacy/score) do not serve OR-listed chat models, so they can keep 529.
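The org-scoped option could look roughly like the sketch below. It is not the existing map_domain_error_to_status: the `CompletionError` enum here is a two-variant stand-in, and `map_status_for_org` and its `overload_as_429_org` parameter are hypothetical names chosen for illustration. Org IDs are shown as plain strings; the real type may differ.

```rust
/// Simplified stand-in for the real domain error; only the two variants
/// discussed in this issue are modelled.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CompletionError {
    RateLimitExceeded,
    ServiceOverloaded,
}

/// Hypothetical org-aware variant of the status mapping.
/// `overload_as_429_org` is the configured OpenRouter integration org, if any.
/// When it is None the remap is disabled and behavior is unchanged.
pub fn map_status_for_org(
    err: CompletionError,
    request_org: &str,
    overload_as_429_org: Option<&str>,
) -> u16 {
    match err {
        // Already a 429 for every client.
        CompletionError::RateLimitExceeded => 429,
        CompletionError::ServiceOverloaded => {
            if overload_as_429_org == Some(request_org) {
                // OpenRouter path: overload becomes a "back off" signal.
                429
            } else {
                // Default for all other clients stays 529.
                529
            }
        }
    }
}
```

Keeping the opt-in org as an `Option` means an unset configuration leaves every client on 529, which matches the "zero behavior change" goal. The response body would still need to carry the service_overloaded error type and the retry hint; that part is not shown here.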
Alternative considered: keep 529 everywhere
Rationale to keep: 529 is the honest, correct status; rotation-SNI already made it rare; and remapping hides a real overload from our own dashboards if applied globally. Rejected as a global default but the org-scoped remap above gets the best of both — honest 529 for us, OR-friendly 429 for the OR path.
Right-sized concurrent limit for the OpenRouter org
The per-(org,model) cap must be below the smallest single-backend admission window so that under OR-driven load we shed with 429 (our own cap) before the request ever reaches a saturated SGLang/vLLM backend that would 503→529. Backend capacity (cvm-conf + /machines replica counts, 2026-06-08):
| Model | Per-backend admission | Replicas | Notes |
|---|---|---|---|
| zai-org/GLM-5.1-FP8 | ~128 running + --max-queued-requests 8 | 5 | SGLang, queue-capped |
| Qwen/Qwen3.5-122B-A10B | memory-bound, unbounded queue | 1 | single-replica risk model |
| deepseek-ai/DeepSeek-V4-Flash | --max-running-requests 128 | 1 | |
| Qwen/Qwen3.6-27B-FP8 | --max-running-requests 128 | 1 | |
| Qwen/Qwen3.6-35B-A3B-FP8 | --max-running-requests 128 | 3 | |
| google/gemma-4-31B-it | --max-running-requests 64 | 3 | smallest running cap |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | --max-num-seqs 64 | 2 | |
| openai/gpt-oss-120b | vLLM default | 2 | |
| Qwen/Qwen3-VL-30B-A3B-Instruct | --max-num-seqs 16 | 2 | smallest seq cap (vision) |
The cap is a single scalar applied per (org, model), so it must respect the smallest single-backend window to guarantee a 429 instead of a 529 across the board. The binding constraints are gemma-4-31B (64 running) and the single-replica Qwen3.5-122B (no queue cap). A cap of 48 leaves headroom under the 64-running floor while still allowing meaningful OR throughput; it sits well under GLM-5.1's ~136/backend and the 128-running models. One row falls outside this: Qwen3-VL-30B's --max-num-seqs 16 is below 48, so if OR traffic reaches that model, the 429-before-529 guarantee needs a lower cap or a per-model cap. (If we later split per-model caps, OR could get a higher cap on the multi-replica 128-running models.)
PATCH /v1/admin/organizations/{OPENROUTER_ORG_ID}/concurrent-limit
Content-Type: application/json
Authorization: Bearer <NEAR_AI_CLOUD_ADMIN_ACCESS_TOKEN>
{"concurrent_limit": 48}
Exact curl (fill in the OpenRouter org UUID + admin token; do NOT run blindly):
curl -sS -X PATCH \
"https://cloud-api.near.ai/v1/admin/organizations/<OPENROUTER_ORG_ID>/concurrent-limit" \
-H "Authorization: Bearer <NEAR_AI_CLOUD_ADMIN_ACCESS_TOKEN>" \
-H "Content-Type: application/json" \
-d '{"concurrent_limit": 48}'
Cache note: read path has a 5-min moka TTL; the PATCH invalidates locally (PR #618) but the other replica (cpu01↔cpu02) waits out its own TTL (cloud-api.md "Multi-instance caveat").
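The cap reasoning in this section can be checked mechanically against the capacity table. The sketch below is illustrative only: `Backend`, `BACKENDS`, and `models_not_covered` are hypothetical names, the windows are copied from the numeric rows of the table (GLM-5.1 taken as 128 running + 8 queued), and it assumes --max-num-seqs acts as an admission window in the same way the table treats it. Rows without a number (Qwen3.5-122B, gpt-oss-120b) are left out.

```rust
/// Per-backend admission window for one model (hypothetical type).
pub struct Backend {
    pub model: &'static str,
    pub window: u32,
}

/// Numeric rows from the capacity table above.
pub const BACKENDS: [Backend; 7] = [
    Backend { model: "zai-org/GLM-5.1-FP8", window: 136 },
    Backend { model: "deepseek-ai/DeepSeek-V4-Flash", window: 128 },
    Backend { model: "Qwen/Qwen3.6-27B-FP8", window: 128 },
    Backend { model: "Qwen/Qwen3.6-35B-A3B-FP8", window: 128 },
    Backend { model: "google/gemma-4-31B-it", window: 64 },
    Backend { model: "Qwen/Qwen3-30B-A3B-Instruct-2507", window: 64 },
    Backend { model: "Qwen/Qwen3-VL-30B-A3B-Instruct", window: 16 },
];

/// Models whose single-backend window is not strictly above `cap`,
/// i.e. where the per-org cap would not shed load before the backend does.
pub fn models_not_covered(cap: u32) -> Vec<&'static str> {
    BACKENDS
        .iter()
        .filter(|b| b.window <= cap)
        .map(|b| b.model)
        .collect()
}
```

With a cap of 48, the only numeric row this check flags is Qwen/Qwen3-VL-30B-A3B-Instruct (window 16); raising the cap to 64 would also flag the two 64-window models. Whether the vision model matters depends on whether OR traffic is routed to it.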
Acceptance
- For the OpenRouter org, ServiceOverloaded returns 429, not 529; all other clients still get 529.
- concurrent_limit set to 48 (or final agreed value) via the PATCH above.
Refs: crates/api/src/routes/common.rs:13,18; crates/services/src/completions/mod.rs:794,1016; crates/services/src/completions/ports.rs:9; docs/cloud-api.md (HTTP 529 vs 503; Per-org concurrent-request limit); docs/inference.md (--max-queued-requests).
🤖 Generated with Claude Code