cloud-api retries media-fetch upstream 500s: 4× amplification → backend saturation (#687)

## Symptom

cloud-api `services::inference_provider_pool` treats upstream `5xx` from the inference proxy as `retryable_http_5xx` and retries the same payload 3 times. When the upstream `5xx` is actually caused by **bad client input** (a multimodal request referencing a broken/unsupported media URL), each such request gets **4 attempts** (1 + 3 retries) of identical work, amplifying load 4× on the backends.

Observed across two incidents on `google/gemma-4-31B-it` and `Qwen/Qwen3.5-122B-A10B` over the past two days. Both show the same retry-amplification pattern; on 2026-05-27 it combined with an SGLang gemma4 engine crash bug (fixed separately in cvm-compose-files v0.0.196) to produce a saturation outage on both gemma backends.

## Evidence

cloud-api logs (`message: All providers failed for model`, `error_kind: http_5xx`, `retry_decision: retryable_http_5xx`) — same `request_id` appears 4× with the identical underlying `error_detail`:

```
2026-05-26 17:07:21  request id (e.g. 9bb71bb8), attempt 1
2026-05-26 17:07:24  same request, retry 1 → same error_detail
2026-05-26 17:07:26  retry 2 → same error_detail
2026-05-26 17:07:30  retry 3 → "All providers failed for model"
```

The underlying `error_detail` bodies are always one of these patterns from the inference engine (vLLM or SGLang) trying to fetch/decode a client-supplied media URL:

- `HTTP error 500: 404, message='Not Found', url='https://www.facebook.com/v24.0/...'`
- `HTTP error 500: Internal server error: An exception occurred while loading IMAGE data at index 0: Error while loading data ... 403 Client Error: Forbidden for url: https://external.fsyd16-2.fna.fbcdn.net/...`
- `HTTP error 500: Internal server error: An exception occurred while loading VIDEO data at index 0: ... 429 Client Error: Too Many Requests for url: https://www.google.com/sorry/...` (YouTube)
- `HTTP error 500: Internal server error: ... cannot identify image file <_io.BytesIO ...>` (e.g. base64 mp4 sent as `image_url`)
- `HTTP error 500: ... SingleStreamDecoder, ... Failed to open input buffer: Invalid data found when processing input` (torchcodec on a broken video URL)

These are **permanent client errors** — retrying the same payload cannot succeed.

## Impact

- 4× the compute per malformed request (three wasted retries), plus higher latency for the abusive client with no benefit (the request always ends in a 5xx after ~5–8s).
- Under a sustained flood of bad-media requests, the amplification saturates the gemma backends (observed 2026-05-27 ~16:50–17:00 UTC: gpu11 backend queue grew to ~150, both backends timed out user traffic until the queue drained).
- The original 2026-05-26 17:07:30 "All providers failed" alert, and many like it, was not an infra failure: these were retried bad inputs.
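To put a rough number on the amplification, here is a back-of-the-envelope sketch. It assumes a fraction of incoming requests carries broken media and always fails, that every such request is retried the full number of times, and that all other requests succeed on the first attempt. The function name `load_multiplier` is hypothetical and does not exist in cloud-api.

```rust
/// Average number of backend attempts per incoming request.
///
/// Assumptions (illustrative only): `bad_fraction` of requests always fail
/// and are retried `max_retries` times; every other request succeeds on
/// its first attempt.
fn load_multiplier(bad_fraction: f64, max_retries: u32) -> f64 {
    // Each request costs one attempt; each bad request adds `max_retries` more.
    1.0 + bad_fraction * f64::from(max_retries)
}
```

With 3 retries, a flood made up entirely of bad-media requests (`bad_fraction = 1.0`) gives the 4× figure above, and a 50/50 mix still gives 2.5× the backend work.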

## Proposed fix

In the retry classifier (the same module and pattern as the existing 400→`Client error from provider, not retrying` path and PR #611's `classify_provider_error`), pattern-match the upstream `error_detail` body against the strings below and downgrade any match to a **non-retryable client error**, even when the upstream HTTP status is 500:

- `Error while loading data` / `An exception occurred while loading (IMAGE|VIDEO) data`
- `cannot identify image file`
- `Failed to open input buffer`

Ideally, also surface these as 422 to the client so they know the problem is their input, not our infra. Keep the match behind a small allowlist of substrings to avoid masking real backend bugs.
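A minimal sketch of what such a classifier could look like, assuming plain substring matching on the `error_detail` body of an upstream 5xx. Every name here (`RetryDecision`, `MEDIA_INPUT_ERROR_MARKERS`, `classify_5xx`, `run_with_retries`) is hypothetical; the real code in `services::inference_provider_pool` and the `classify_provider_error` helper from PR #611 may be shaped quite differently.

```rust
/// Hypothetical outcome of classifying an upstream 5xx response.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RetryDecision {
    /// Possibly transient: retrying may help.
    Retryable,
    /// Caused by client input: the same payload cannot succeed.
    NonRetryableClientError,
}

/// Allowlist of substrings taken from the `error_detail` patterns in this
/// issue. Kept small on purpose so real backend bugs stay retryable.
const MEDIA_INPUT_ERROR_MARKERS: &[&str] = &[
    "Error while loading data",
    "An exception occurred while loading IMAGE data",
    "An exception occurred while loading VIDEO data",
    "cannot identify image file",
    "Failed to open input buffer",
];

/// Classify the body of an upstream 5xx by substring match.
fn classify_5xx(error_detail: &str) -> RetryDecision {
    let is_media_input_error = MEDIA_INPUT_ERROR_MARKERS
        .iter()
        .any(|marker| error_detail.contains(*marker));
    if is_media_input_error {
        RetryDecision::NonRetryableClientError
    } else {
        RetryDecision::Retryable
    }
}

/// Toy retry loop: calls `call_upstream` until it succeeds, the error is
/// classified as non-retryable, or `max_retries` retries are used up.
/// Returns the number of attempts made. `Err` carries the `error_detail`.
fn run_with_retries<F>(max_retries: u32, mut call_upstream: F) -> u32
where
    F: FnMut() -> Result<(), String>,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        match call_upstream() {
            Ok(()) => return attempts,
            Err(detail) => {
                let exhausted = attempts > max_retries;
                let permanent =
                    classify_5xx(&detail) == RetryDecision::NonRetryableClientError;
                if exhausted || permanent {
                    return attempts;
                }
            }
        }
    }
}
```

In this sketch, an upstream that always fails with an unrecognized error still gets all 4 attempts at `max_retries = 3`, while one that fails with `Failed to open input buffer` stops after the first. Substring matching is used instead of a regex so the allowlist stays easy to audit; the trade-off is that the markers must track the engines' error wording.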

## Followups (out of scope here)

- Inference engines (vLLM, SGLang) should return 4xx for media-fetch/decode failures, not 500.
- Consider rate-limiting clients that produce sustained malformed-media bursts.

## References

- cvm-compose-files PR #48 (vLLM→SGLang on gemma-4-31B-it, surfaced this clearly)
- cvm-compose-files v0.0.196 (separate SGLang crash fix, `--disable-fast-image-processor`)
- This repo PR #611 (`classify_provider_error` pattern for upstream auth errors, the same shape of fix)
