OpenRouter routing rank: map overload to 429 (not 529) on the OR path + right-size its concurrent limit #738

@Evrard-Nil

Summary

cloud-api surfaces backend exhaustion as HTTP 529 (CompletionError::ServiceOverloaded → status_overloaded() in crates/api/src/routes/common.rs:13). OpenRouter's provider uptime calculation excludes 429 but counts every response with status 500 or above (including 529) as downtime. So our overload signal, which we deliberately chose to mean "retry with backoff," is read by OpenRouter as a hard outage and degrades our routing rank:

  • 95%+ uptime → normal routing
  • 80–94% → lower priority
  • <80% → fallback-only

429 (per-org concurrent-slot exhaustion, try_acquire_concurrent_slot in crates/services/src/completions/mod.rs:1016) is correctly excluded by OpenRouter and is exactly the "back off" signal OR wants. OpenRouter explicitly asks providers to return early 429s under load and to not queue.
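The effect on routing rank can be illustrated with a small sketch. The uptime formula here is an assumption, not OpenRouter's published calculation: 429s are dropped from the sample, any status of 500 or above counts as downtime, and everything else counts as up. The names `uptime_percent`, `tier`, and `Tier` are hypothetical, and treating exactly 95% and 80% as inclusive lower bounds is also an assumption.

```rust
/// Routing tier, following the three thresholds listed above.
#[derive(Debug, PartialEq)]
enum Tier {
    Normal,
    LowerPriority,
    FallbackOnly,
}

/// Uptime percentage over a window of response status codes.
/// Assumed formula: 429s are excluded from the sample, any status >= 500
/// counts as downtime, everything else counts as up.
/// Returns `None` when nothing is left to measure.
fn uptime_percent(statuses: &[u16]) -> Option<f64> {
    let counted: Vec<u16> = statuses.iter().copied().filter(|s| *s != 429).collect();
    if counted.is_empty() {
        return None;
    }
    let up = counted.iter().filter(|s| **s < 500).count();
    Some(100.0 * up as f64 / counted.len() as f64)
}

/// Map an uptime percentage to a tier (boundary handling is assumed).
fn tier(uptime: f64) -> Tier {
    if uptime >= 95.0 {
        Tier::Normal
    } else if uptime >= 80.0 {
        Tier::LowerPriority
    } else {
        Tier::FallbackOnly
    }
}
```

Under these assumptions, a window of 90 successes plus 10 overload responses lands in the lower-priority tier when the overloads are reported as 529, and stays in normal routing when the same events are reported as 429.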

This does NOT block the OpenRouter listing — it affects post-listing routing rank, not eligibility.

Current behavior (confirmed in code)

| Condition | Domain error | HTTP | OR treats as |
| --- | --- | --- | --- |
| Per-(org,model) concurrent cap hit (DEFAULT_CONCURRENT_LIMIT = 64, ports.rs:9) | RateLimitExceeded | 429 | back off (excluded from uptime) ✅ |
| Upstream 429 (provider rate limit) | RateLimitExceeded | 429 | back off ✅ |
| All backends exhausted after retry_with_fallback rotation (SGLang --max-queued-requests 503s) | ServiceOverloaded | 529 | downtime |
| Upstream 503 | ServiceOverloaded | 529 | downtime |

Note: rotation-SNI (PR #637 / cloud-api.md "HTTP 529 vs 503") already made 529 rare in prod — ~1 event/9h, down 360× from before rotation. So the routing-rank exposure is small today, but it is a real and avoidable signal.

Proposed fix (recommended)

Add an OpenRouter-scoped 529→429 remap at the route layer, opt-in, with zero behavior change for every other client:

  • The chat-completions handler already has both the request headers (headers: HeaderMap) and api_key.organization.id in scope at the error arms (crates/api/src/routes/completions.rs:1126, error arms at :1528/:1634). So either of these scoping mechanisms is a few lines:
    • Header-based: e.g. honor a request header (x-overload-status: 429) — clean, but requires OpenRouter to send a custom header on every request (may not be configurable on their side).
    • Org-scoped (recommended): when the request's org is the dedicated OpenRouter integration org, map ServiceOverloaded → 429 instead of 529. Self-contained, needs no cooperation from OR, and matches OR's documented "early 429, don't queue" expectation.
  • Keep 529 as the default for all other clients (Anthropic-style "site overloaded, retry with backoff" is the correct semantic for our own SDK/UI clients and for honest observability).
  • Body should still carry a clear Retry-After-style hint and service_overloaded error type so the meaning isn't lost.

Effort: ~half a day. One small route-layer branch + a unit test mirroring test_map_domain_error_service_overloaded. No service-layer, DB, or migration changes. The mapping function map_domain_error_to_status would gain an org/header-aware variant used only by the chat + responses handlers (the 5 per-handler ServiceOverloaded arms — audio/rerank/embeddings/privacy/score — are not OR-listed chat models, so they can keep 529).
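A minimal sketch of the org-scoped variant follows. It is an illustration, not the code in crates/api: `DomainError` is a two-variant stand-in for the real error type, the signature given to `map_domain_error_to_status` is assumed, status codes are plain `u16` values, and `map_domain_error_to_status_for_org` with its `openrouter_org_id` parameter is a hypothetical name.

```rust
/// Hypothetical stand-in for the domain errors named in this issue.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DomainError {
    RateLimitExceeded,
    ServiceOverloaded,
}

/// Default mapping: overload stays 529 for every client.
/// (Assumed signature; the real function lives in the route layer.)
fn map_domain_error_to_status(err: DomainError) -> u16 {
    match err {
        DomainError::RateLimitExceeded => 429,
        DomainError::ServiceOverloaded => 529,
    }
}

/// Org-aware variant (hypothetical name): remap 529 to 429 only when the
/// caller is the configured OpenRouter org. `None` disables the remap.
fn map_domain_error_to_status_for_org(
    err: DomainError,
    request_org_id: &str,
    openrouter_org_id: Option<&str>,
) -> u16 {
    let is_openrouter = openrouter_org_id == Some(request_org_id);
    match (err, is_openrouter) {
        (DomainError::ServiceOverloaded, true) => 429,
        // Every other combination falls through to the default mapping.
        _ => map_domain_error_to_status(err),
    }
}
```

Passing the OpenRouter org ID as an `Option` keeps the remap opt-in: with no ID configured, every caller gets the existing mapping, which is the zero-behavior-change property the proposal asks for.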

Alternative considered: keep 529 everywhere

Rationale to keep: 529 is the honest, correct status; rotation-SNI already made it rare; and remapping hides a real overload from our own dashboards if applied globally. Rejected as a global default but the org-scoped remap above gets the best of both — honest 529 for us, OR-friendly 429 for the OR path.

Right-sized concurrent limit for the OpenRouter org

The per-(org,model) cap must be below the smallest single-backend admission window so that under OR-driven load we shed with 429 (our own cap) before the request ever reaches a saturated SGLang/vLLM backend that would 503→529. Backend capacity (cvm-conf + /machines replica counts, 2026-06-08):

| Model | Per-backend admission | Replicas | Notes |
| --- | --- | --- | --- |
| zai-org/GLM-5.1-FP8 | ~128 running + --max-queued-requests 8 | 5 | SGLang, queue-capped |
| Qwen/Qwen3.5-122B-A10B | memory-bound, unbounded queue | 1 | single-replica risk model |
| deepseek-ai/DeepSeek-V4-Flash | --max-running-requests 128 | 1 | |
| Qwen/Qwen3.6-27B-FP8 | --max-running-requests 128 | 1 | |
| Qwen/Qwen3.6-35B-A3B-FP8 | --max-running-requests 128 | 3 | |
| google/gemma-4-31B-it | --max-running-requests 64 | 3 | smallest running cap |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | --max-num-seqs 64 | 2 | |
| openai/gpt-oss-120b | vLLM default | 2 | |
| Qwen/Qwen3-VL-30B-A3B-Instruct | --max-num-seqs 16 | 2 | smallest seq cap (vision) |

The cap is a single scalar applied per (org, model), so it must respect the smallest single-backend window to guarantee a 429 instead of a 529 across the board. The binding constraints are taken to be gemma-4-31B (64 running) and the single-replica Qwen3.5-122B (no queue cap). A cap of 48 leaves headroom under the 64-running floor while still allowing meaningful OR throughput; it sits well under GLM-5.1's ~136/backend and the 128-running models. Note that Qwen3-VL-30B-A3B-Instruct (--max-num-seqs 16) is below 48, so the 64-running floor holds only if that model is out of scope for the OR path or its limit queues rather than rejects; this should be confirmed before the value is finalized. (If we later split per-model caps, OR could get a higher cap on the multi-replica 128-running models.)
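The check behind a candidate cap can be sketched as follows. The windows are transcribed from the capacity table; reading --max-running-requests and --max-num-seqs as a hard admission window is an assumption, as is summing GLM-5.1's running and queued limits to 136. `BACKENDS` and `models_not_covered` are hypothetical names.

```rust
/// Per-backend admission window, transcribed from the capacity table.
/// `None` means the table gives no numeric bound (unbounded queue or vLLM default).
const BACKENDS: &[(&str, Option<u32>)] = &[
    ("zai-org/GLM-5.1-FP8", Some(136)), // ~128 running + 8 queued
    ("Qwen/Qwen3.5-122B-A10B", None),
    ("deepseek-ai/DeepSeek-V4-Flash", Some(128)),
    ("Qwen/Qwen3.6-27B-FP8", Some(128)),
    ("Qwen/Qwen3.6-35B-A3B-FP8", Some(128)),
    ("google/gemma-4-31B-it", Some(64)),
    ("Qwen/Qwen3-30B-A3B-Instruct-2507", Some(64)),
    ("openai/gpt-oss-120b", None),
    ("Qwen/Qwen3-VL-30B-A3B-Instruct", Some(16)),
];

/// Models whose numeric window is at or below the proposed cap, i.e. models
/// for which the per-org cap would not shed load before the backend does.
fn models_not_covered(cap: u32) -> Vec<&'static str> {
    BACKENDS
        .iter()
        .filter(|(_, window)| matches!(window, Some(w) if *w <= cap))
        .map(|(name, _)| *name)
        .collect()
}
```

With a cap of 48, the only model returned is Qwen/Qwen3-VL-30B-A3B-Instruct. Models with no numeric bound are never returned, so the single-replica Qwen3.5-122B risk still has to be judged separately.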

PATCH /v1/admin/organizations/{OPENROUTER_ORG_ID}/concurrent-limit
Content-Type: application/json
Authorization: Bearer <NEAR_AI_CLOUD_ADMIN_ACCESS_TOKEN>

{"concurrent_limit": 48}

Exact curl (fill in the OpenRouter org UUID + admin token; do NOT run blindly):

curl -sS -X PATCH \
  "https://cloud-api.near.ai/v1/admin/organizations/<OPENROUTER_ORG_ID>/concurrent-limit" \
  -H "Authorization: Bearer <NEAR_AI_CLOUD_ADMIN_ACCESS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"concurrent_limit": 48}'

Cache note: read path has a 5-min moka TTL; the PATCH invalidates locally (PR #618) but the other replica (cpu01↔cpu02) waits out its own TTL (cloud-api.md "Multi-instance caveat").
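The multi-instance caveat can be modeled with a toy, std-only sketch. It does not use moka; `Replica`, `read`, and `invalidate` are hypothetical names, time is an explicit counter in seconds, and `db_limit` stands in for whatever the read path loads on a cache miss.

```rust
/// TTL of the read-path cache, in seconds (5 minutes, per the cache note).
const TTL_SECS: u64 = 300;

/// Toy model of one API replica's cached concurrent limit.
struct Replica {
    cached: Option<(u32, u64)>, // (limit, time the entry was loaded)
}

impl Replica {
    /// Read through the cache; reload from `db_limit` when empty or expired.
    fn read(&mut self, now: u64, db_limit: u32) -> u32 {
        match self.cached {
            Some((limit, loaded_at)) if now - loaded_at < TTL_SECS => limit,
            _ => {
                self.cached = Some((db_limit, now));
                db_limit
            }
        }
    }

    /// Local invalidation, as performed on the replica that handles the PATCH.
    fn invalidate(&mut self) {
        self.cached = None;
    }
}
```

In this model, the replica that served the PATCH returns 48 on its next read, while the other replica keeps returning its cached value until 300 seconds after it last loaded the entry. Verification after the PATCH should therefore allow up to one full TTL before expecting the new limit on both instances.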

Acceptance

  • OR-path (org-scoped) ServiceOverloaded returns 429, not 529; all other clients still get 529.
  • Unit test asserting the org/header-scoped remap.
  • OpenRouter org concurrent_limit set to 48 (or final agreed value) via the PATCH above.
  • Verified on cloud-stg-api before prod.

Refs: crates/api/src/routes/common.rs:13,18; crates/services/src/completions/mod.rs:794,1016; crates/services/src/completions/ports.rs:9; docs/cloud-api.md (HTTP 529 vs 503; Per-org concurrent-request limit); docs/inference.md (--max-queued-requests).

🤖 Generated with Claude Code

Labels: enhancement
