OpenRouter routing rank: map overload to 429 (not 529) on the OR path + right-size its concurrent limit #738

## Summary

cloud-api surfaces backend exhaustion as **HTTP 529** (`CompletionError::ServiceOverloaded` → `status_overloaded()` in `crates/api/src/routes/common.rs:13`). OpenRouter's provider uptime calculation **excludes 429 but counts every 500+ response (including 529) as downtime**. So our overload signal, which we deliberately chose as "retry with backoff," is read by OpenRouter as a hard outage and degrades our routing rank:

- 95%+ uptime → normal routing
- 80–94% → lower priority
- <80% → fallback-only
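To make the rank exposure concrete, here is a minimal sketch of the uptime arithmetic as described above. OpenRouter's actual formula and measurement window are not given in this issue, so the function names, the treatment of non-429 4xx responses as "up", and the exact threshold boundaries are assumptions.

```rust
/// Routing tier implied by the thresholds listed above.
#[derive(Debug, PartialEq)]
enum RoutingTier {
    Normal,
    LowerPriority,
    FallbackOnly,
}

/// Uptime percentage over a window of HTTP status codes.
/// Assumptions: 429s are dropped from the sample entirely, every status
/// >= 500 (529 included) counts as a failure, everything else counts as up.
fn uptime_percent(statuses: &[u16]) -> Option<f64> {
    let counted: Vec<u16> = statuses.iter().copied().filter(|&s| s != 429).collect();
    if counted.is_empty() {
        return None; // nothing left to measure once 429s are excluded
    }
    let failures = counted.iter().filter(|&&s| s >= 500).count();
    Some(100.0 * (counted.len() - failures) as f64 / counted.len() as f64)
}

/// Map an uptime percentage onto the three tiers (boundaries assumed inclusive).
fn routing_tier(uptime: f64) -> RoutingTier {
    if uptime >= 95.0 {
        RoutingTier::Normal
    } else if uptime >= 80.0 {
        RoutingTier::LowerPriority
    } else {
        RoutingTier::FallbackOnly
    }
}
```

Under these assumptions, a window of 90 successes and 10 overload responses scores 90% (lower priority) when the overloads are 529s, and 100% (normal routing) when they are 429s.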

429 (per-org concurrent-slot exhaustion, `try_acquire_concurrent_slot` in `crates/services/src/completions/mod.rs:1016`) is correctly excluded by OpenRouter and is exactly the "back off" signal OR wants. OpenRouter explicitly asks providers to return early 429s under load and to **not queue**.
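The "early 429, don't queue" behavior amounts to a non-blocking slot acquire. The sketch below illustrates the idea only; it is not the code behind `try_acquire_concurrent_slot`, and `ConcurrencyLimiter` and `SlotGuard` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for one per-(org, model) slot counter.
struct ConcurrencyLimiter {
    in_flight: AtomicUsize,
    limit: usize,
}

/// RAII guard: the slot is released when the guard is dropped.
struct SlotGuard {
    limiter: Arc<ConcurrencyLimiter>,
}

impl Drop for SlotGuard {
    fn drop(&mut self) {
        self.limiter.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}

impl ConcurrencyLimiter {
    fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self { in_flight: AtomicUsize::new(0), limit })
    }

    /// Non-blocking acquire: returns None immediately when the cap is hit.
    /// The caller would surface None as HTTP 429 instead of queueing.
    fn try_acquire(self: &Arc<Self>) -> Option<SlotGuard> {
        let limit = self.limit;
        self.in_flight
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < limit { Some(n + 1) } else { None }
            })
            .ok()
            .map(|_| SlotGuard { limiter: Arc::clone(self) })
    }
}
```

Because the acquire never waits, a request over the cap is rejected in constant time and never reaches a backend queue.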

**This does NOT block the OpenRouter listing** — it affects post-listing routing rank, not eligibility.

## Current behavior (confirmed in code)

| Condition | Domain error | HTTP | OR treats as |
|---|---|---|---|
| Per-(org,model) concurrent cap hit (`DEFAULT_CONCURRENT_LIMIT = 64`, `ports.rs:9`) | `RateLimitExceeded` | **429** | back off (excluded from uptime) ✅ |
| Upstream 429 (provider rate limit) | `RateLimitExceeded` | **429** | back off ✅ |
| All backends exhausted after `retry_with_fallback` rotation (SGLang `--max-queued-requests` 503s) | `ServiceOverloaded` | **529** | **downtime** ❌ |
| Upstream 503 | `ServiceOverloaded` | **529** | **downtime** ❌ |

Note: rotation-SNI (PR #637 / cloud-api.md "HTTP 529 vs 503") already made 529 **rare** in prod — ~1 event/9h, down 360× from before rotation. So the routing-rank exposure is small today, but it is a real and avoidable signal.
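The table can be restated as a small mapping. This is a simplified mirror written for illustration, not the code in `common.rs`: the enum lists only the two variants named above, and `from_upstream`, `current_status`, and `counts_as_downtime` are hypothetical helper names.

```rust
/// Simplified mirror of the two domain errors in the table.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CompletionError {
    RateLimitExceeded,
    ServiceOverloaded,
}

/// Upstream status to domain error, for the two upstream rows of the table.
fn from_upstream(status: u16) -> Option<CompletionError> {
    match status {
        429 => Some(CompletionError::RateLimitExceeded),
        503 => Some(CompletionError::ServiceOverloaded),
        _ => None, // other statuses are outside the scope of this sketch
    }
}

/// Domain error to the HTTP status returned today.
fn current_status(err: CompletionError) -> u16 {
    match err {
        CompletionError::RateLimitExceeded => 429,
        CompletionError::ServiceOverloaded => 529,
    }
}

/// How OpenRouter reads a status per the summary: 429 is excluded,
/// anything >= 500 counts against uptime.
fn counts_as_downtime(status: u16) -> bool {
    status != 429 && status >= 500
}
```

Composing the three functions reproduces the last column: an upstream 503 becomes `ServiceOverloaded`, then 529, which counts as downtime, while an upstream 429 stays a 429 and does not.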

## Proposed fix (recommended)

Add an **OpenRouter-scoped 529→429 remap at the route layer**, opt-in, with zero behavior change for every other client:

- The chat-completions handler already has both `headers: HeaderMap` and `api_key.organization.id` in scope at the error arms (`crates/api/src/routes/completions.rs:1126`, error arms at `:1528`/`:1634`). So either of these scoping mechanisms is a few lines:
  - **Header-based**: e.g. honor a request header (`x-overload-status: 429`) — clean, but requires OpenRouter to send a custom header on every request (may not be configurable on their side).
  - **Org-scoped** (recommended): when the request's org is the dedicated OpenRouter integration org, map `ServiceOverloaded` → 429 instead of 529. Self-contained, needs no cooperation from OR, and matches OR's documented "early 429, don't queue" expectation.
- Keep 529 as the default for all other clients (Anthropic-style "site overloaded, retry with backoff" is the correct semantic for our own SDK/UI clients and for honest observability).
- The remapped response should still carry a clear `Retry-After`-style hint and the `service_overloaded` error type in its body so the meaning isn't lost.

**Effort: ~half a day.** One small route-layer branch + a unit test mirroring `test_map_domain_error_service_overloaded`. No service-layer, DB, or migration changes. The mapping function `map_domain_error_to_status` would gain an org/header-aware variant used only by the chat + responses handlers. The 5 other per-handler `ServiceOverloaded` arms (audio/rerank/embeddings/privacy/score) serve endpoints that are not OR-listed chat models, so they can keep 529.
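A minimal sketch of what the org/header-aware variant could look like. Everything here is hypothetical: `RemapContext`, `status_for`, and the idea that the OpenRouter org id arrives from configuration are assumptions for illustration, and a real change would likely implement only one of the two scoping mechanisms.

```rust
/// Simplified mirror of the two domain errors involved.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CompletionError {
    RateLimitExceeded,
    ServiceOverloaded,
}

/// Request-scoped inputs the route layer already has in hand.
struct RemapContext<'a> {
    /// Org id of the authenticated API key.
    request_org_id: &'a str,
    /// Dedicated OpenRouter org id, assumed to come from config; None disables the remap.
    openrouter_org_id: Option<&'a str>,
    /// Value of the `x-overload-status` request header, if present.
    overload_header: Option<&'a str>,
}

/// Status for a domain error, with the opt-in 529 -> 429 remap.
fn status_for(err: CompletionError, ctx: &RemapContext) -> u16 {
    match err {
        CompletionError::RateLimitExceeded => 429,
        CompletionError::ServiceOverloaded => {
            // Org-scoped: the request belongs to the OpenRouter integration org.
            let org_match = ctx.openrouter_org_id == Some(ctx.request_org_id);
            // Header-based: the caller explicitly asked for 429.
            let header_opt_in = ctx.overload_header.map(str::trim) == Some("429");
            if org_match || header_opt_in { 429 } else { 529 }
        }
    }
}
```

The unit test would then assert three cases: `ServiceOverloaded` for the OpenRouter org yields 429, the same error for any other org yields 529, and `RateLimitExceeded` yields 429 regardless of context.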

### Alternative considered: keep 529 everywhere

Rationale to keep: 529 is the honest, correct status; rotation-SNI already made it rare; and remapping hides a real overload from our own dashboards if applied globally. **Rejected as the sole behavior**, because it leaves the OR routing-rank exposure in place. The org-scoped remap above gets the best of both: honest 529 for us, OR-friendly 429 for the OR path.

## Right-sized concurrent limit for the OpenRouter org

The per-(org,model) cap must be **below** the smallest single-backend admission window so that under OR-driven load we shed with **429 (our own cap)** *before* the request ever reaches a saturated SGLang/vLLM backend that would 503→529. Backend capacity (cvm-conf + `/machines` replica counts, 2026-06-08):

| Model | Per-backend admission | Replicas | Notes |
|---|---|---|---|
| zai-org/GLM-5.1-FP8 | ~128 running + `--max-queued-requests 8` | **5** | SGLang, queue-capped |
| Qwen/Qwen3.5-122B-A10B | memory-bound, **unbounded queue** | **1** | single-replica risk model |
| deepseek-ai/DeepSeek-V4-Flash | `--max-running-requests 128` | 1 | |
| Qwen/Qwen3.6-27B-FP8 | `--max-running-requests 128` | 1 | |
| Qwen/Qwen3.6-35B-A3B-FP8 | `--max-running-requests 128` | 3 | |
| google/gemma-4-31B-it | `--max-running-requests 64` | 3 | smallest running cap |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | `--max-num-seqs 64` | 2 | |
| openai/gpt-oss-120b | vLLM default | 2 | |
| Qwen/Qwen3-VL-30B-A3B-Instruct | `--max-num-seqs 16` | 2 | smallest seq cap (vision) |

The cap is a single scalar applied **per (org, model)**, so it must respect the *smallest* single-backend window to guarantee a 429 instead of a 529 across the board. The binding constraints are gemma-4-31B (64 running) and the single-replica Qwen3.5-122B (no queue cap). A cap of **48** leaves headroom under the 64-running floor while still allowing meaningful OR throughput; it sits well under GLM-5.1's ~136/backend and the 128-running models. One exception: Qwen3-VL-30B-A3B-Instruct (`--max-num-seqs 16`) has a window below 48, so this cap does not shield it; it needs either a per-model cap or confirmation that it is outside the OR listing. (If we later split per-model caps, OR could get a higher cap on the multi-replica 128-running models.)
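The check behind a candidate cap can be written down directly from the table. In the sketch below the numbers are copied from the table, `None` marks rows without an explicit number, GLM-5.1 is entered as 128 running + 8 queued, and the names `Backend`, `models_below_cap`, and `headroom` are hypothetical.

```rust
/// Per-backend admission window for one model, taken from the table above.
/// `window` is None where the table gives no explicit number.
struct Backend {
    model: &'static str,
    window: Option<u32>,
}

const BACKENDS: &[Backend] = &[
    Backend { model: "zai-org/GLM-5.1-FP8", window: Some(136) }, // ~128 running + 8 queued
    Backend { model: "Qwen/Qwen3.5-122B-A10B", window: None },   // memory-bound, unbounded queue
    Backend { model: "deepseek-ai/DeepSeek-V4-Flash", window: Some(128) },
    Backend { model: "Qwen/Qwen3.6-27B-FP8", window: Some(128) },
    Backend { model: "Qwen/Qwen3.6-35B-A3B-FP8", window: Some(128) },
    Backend { model: "google/gemma-4-31B-it", window: Some(64) },
    Backend { model: "Qwen/Qwen3-30B-A3B-Instruct-2507", window: Some(64) },
    Backend { model: "openai/gpt-oss-120b", window: None },      // vLLM default
    Backend { model: "Qwen/Qwen3-VL-30B-A3B-Instruct", window: Some(16) },
];

/// Models whose explicit per-backend window is smaller than `org_cap`,
/// i.e. models a single org could saturate before hitting its own cap.
fn models_below_cap(org_cap: u32) -> Vec<&'static str> {
    BACKENDS
        .iter()
        .filter(|b| matches!(b.window, Some(w) if w < org_cap))
        .map(|b| b.model)
        .collect()
}

/// Slots left between the org cap and a given backend floor.
fn headroom(floor: u32, org_cap: u32) -> u32 {
    floor.saturating_sub(org_cap)
}
```

With a cap of 48, `headroom(64, 48)` is 16 slots under the 64-running floor, and `models_below_cap(48)` returns only the Qwen3-VL vision model, whose 16-sequence window is the one row the scalar cap does not cover.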

```
PATCH /v1/admin/organizations/{OPENROUTER_ORG_ID}/concurrent-limit
Content-Type: application/json
Authorization: Bearer <NEAR_AI_CLOUD_ADMIN_ACCESS_TOKEN>

{"concurrent_limit": 48}
```

Exact curl (fill in the OpenRouter org UUID + admin token; do NOT run blindly):

```
curl -sS -X PATCH \
  "https://cloud-api.near.ai/v1/admin/organizations/<OPENROUTER_ORG_ID>/concurrent-limit" \
  -H "Authorization: Bearer <NEAR_AI_CLOUD_ADMIN_ACCESS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"concurrent_limit": 48}'
```

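Cache note: the read path has a 5-min moka TTL. The PATCH invalidates the cache on the replica that handles it (PR #618), but the *other* replica (cpu01↔cpu02) waits out its own TTL, so the new limit can take up to 5 minutes to apply everywhere (cloud-api.md "Multi-instance caveat").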
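The multi-instance caveat can be illustrated with a toy model of two replicas, each holding its own TTL cache. This is not moka's API: `ReplicaCache` is a hypothetical stand-in, time is a plain counter in seconds, and the `db_limit` argument simulates whatever the database holds at read time.

```rust
use std::collections::HashMap;

/// Cache lifetime in seconds (the 5-minute TTL from the note above).
const TTL_SECS: u64 = 300;

/// One replica's local cache: org id -> (limit, time the entry was loaded).
struct ReplicaCache {
    entries: HashMap<String, (u32, u64)>,
}

impl ReplicaCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Return the cached limit while it is fresh; otherwise reload from
    /// `db_limit` and cache that value with the current timestamp.
    fn get(&mut self, org: &str, now: u64, db_limit: u32) -> u32 {
        if let Some(&(limit, loaded_at)) = self.entries.get(org) {
            if now.saturating_sub(loaded_at) < TTL_SECS {
                return limit;
            }
        }
        self.entries.insert(org.to_string(), (db_limit, now));
        db_limit
    }

    /// Local invalidation, as done by the replica that served the PATCH.
    fn invalidate(&mut self, org: &str) {
        self.entries.remove(org);
    }
}
```

If both replicas cache a limit of 64 at t=0 and one of them serves the PATCH at t=10, that replica reads 48 on its next lookup, while the other keeps returning 64 until its entry expires at t=300.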

## Acceptance

- [ ] OR-path (org-scoped) `ServiceOverloaded` returns 429, not 529; all other clients still get 529.
- [ ] Unit test asserting the org/header-scoped remap.
- [ ] OpenRouter org `concurrent_limit` set to 48 (or final agreed value) via the PATCH above.
- [ ] Verified on cloud-stg-api before prod.

Refs: `crates/api/src/routes/common.rs:13,18`; `crates/services/src/completions/mod.rs:794,1016`; `crates/services/src/completions/ports.rs:9`; docs/cloud-api.md (HTTP 529 vs 503; Per-org concurrent-request limit); docs/inference.md (`--max-queued-requests`).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

