cloud-api retries media-fetch upstream 500s — 4× amplification → backend saturation #687

@Evrard-Nil

Description

Symptom

cloud-api services::inference_provider_pool treats upstream 5xx from the inference proxy as retryable_http_5xx and retries the same payload 3 times. When the upstream 5xx is actually caused by bad client input (a multimodal request referencing a broken/unsupported media URL), each such request gets 4 attempts (1 + 3 retries) of identical work, amplifying load 4× on the backends.

Observed across two incidents on google/gemma-4-31B-it and Qwen/Qwen3.5-122B-A10B over the past two days. Both showed the same retry-amplification pattern; on 2026-05-27 it combined with an SGLang gemma4 engine crash bug (separate cvm-compose-files fix v0.0.196) to produce a saturation outage on both gemma backends.

Evidence

cloud-api logs (message: All providers failed for model, error_kind: http_5xx, retry_decision: retryable_http_5xx) — same request_id appears 4× with the identical underlying error_detail:

```
2026-05-26 17:07:21 — request id e.g. 9bb71bb8 attempts 1
2026-05-26 17:07:24 — same request, retry 1 → same error_detail
2026-05-26 17:07:26 — retry 2 → same error_detail
2026-05-26 17:07:30 — retry 3 → "All providers failed for model"
```

The underlying error_detail bodies are always one of these patterns from the inference engine (vLLM or SGLang) trying to fetch/decode a client-supplied media URL:

  • `HTTP error 500: 404, message='Not Found', url='https://www.facebook.com/v24.0/...'`
  • `HTTP error 500: Internal server error: An exception occurred while loading IMAGE data at index 0: Error while loading data ... 403 Client Error: Forbidden for url: https://external.fsyd16-2.fna.fbcdn.net/...`
  • `HTTP error 500: Internal server error: An exception occurred while loading VIDEO data at index 0: ... 429 Client Error: Too Many Requests for url: https://www.google.com/sorry/...` (YouTube)
  • `HTTP error 500: Internal server error: ... cannot identify image file <_io.BytesIO ...>` (e.g. base64 mp4 sent as `image_url`)
  • `HTTP error 500: ... SingleStreamDecoder, ... Failed to open input buffer: Invalid data found when processing input` (torchcodec on a broken video URL)

These are permanent client errors — retrying the same payload cannot succeed.

Impact

  • 4× the compute per malformed request (1 original attempt + 3 redundant retries). Latency for the abusive client also increases without benefit, since the request always ends in a 5xx after ~5–8s.
  • Under a sustained flood of bad-media requests, the amplification saturates the gemma backends (observed 2026-05-27 ~16:50–17:00 UTC: gpu11 backend queue grew to ~150, both backends timed out user traffic until the queue drained).
  • The original 2026-05-26 17:07:30 "All providers failed" alert, and many like it, did not indicate infra failures; they were retried bad inputs.
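
The 4× figure follows directly from the shape of a bounded retry loop. Below is a minimal sketch, not the actual cloud-api code: `attempts_made` and its parameters are hypothetical names introduced for illustration. It counts how many backend calls one request causes when the failure is permanent, with and without a predicate that refuses to retry.

```rust
/// Hypothetical model of a bounded retry loop (not the cloud-api implementation).
/// Calls `call` once, then retries up to `max_retries` times while it keeps
/// failing and `should_retry` accepts the error detail.
/// Returns the total number of backend attempts made.
pub fn attempts_made<F, R>(max_retries: u32, mut call: F, should_retry: R) -> u32
where
    F: FnMut() -> Result<(), String>,
    R: Fn(&str) -> bool,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        match call() {
            Ok(()) => return attempts,
            Err(detail) => {
                // Stop once the retry budget is spent or the error is
                // classified as not worth retrying.
                if attempts > max_retries || !should_retry(&detail) {
                    return attempts;
                }
            }
        }
    }
}
```

With `max_retries = 3` and a call that always fails, a predicate that always retries yields 4 attempts, while one that rejects the error yields 1.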

Proposed fix

In the retry classifier (same module/pattern as the existing 400→`Client error from provider, not retrying` path, and PR #611's `classify_provider_error` pattern), pattern-match the upstream `error_detail` body on these strings and downgrade to non-retryable client error even when the upstream HTTP status is 500:

  • `Error while loading data` / `An exception occurred while loading (IMAGE|VIDEO) data`
  • `cannot identify image file`
  • `Failed to open input buffer`

(Ideally, also surface these as 422 to the client so they know the problem is their input, not our infra. Keep the match behind a small allowlist of substrings to avoid masking real backend bugs.)
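
A minimal sketch of the allowlist check, assuming the classifier has the upstream `error_detail` body available as a string. The names `RetryDecision`, `MEDIA_ERROR_MARKERS`, `is_client_media_error`, and `classify_http_5xx` are hypothetical and are not taken from cloud-api or from PR #611; the real classifier's types and signature may differ. Plain substring matching is used instead of a regex so the sketch needs only the standard library.

```rust
/// Hypothetical retry decision; the real cloud-api type may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetryDecision {
    /// Possibly transient upstream failure: a retry may succeed.
    Retryable,
    /// Caused by the client's input: the same payload cannot succeed.
    NonRetryableClientError,
}

/// Allowlist of substrings from the proposed fix. Kept deliberately small
/// so that unrelated backend 5xx errors stay retryable.
const MEDIA_ERROR_MARKERS: &[&str] = &[
    "Error while loading data",
    "An exception occurred while loading IMAGE data",
    "An exception occurred while loading VIDEO data",
    "cannot identify image file",
    "Failed to open input buffer",
];

/// True if the upstream error body contains any allowlisted marker.
pub fn is_client_media_error(error_detail: &str) -> bool {
    MEDIA_ERROR_MARKERS
        .iter()
        .any(|marker| error_detail.contains(*marker))
}

/// Classifies an upstream 5xx body: allowlisted media-fetch/decode failures
/// are downgraded to non-retryable, everything else stays retryable.
pub fn classify_http_5xx(error_detail: &str) -> RetryDecision {
    if is_client_media_error(error_detail) {
        RetryDecision::NonRetryableClientError
    } else {
        RetryDecision::Retryable
    }
}
```

Under this sketch, the first pattern listed under Evidence (`404, message='Not Found', url=...`) contains none of the allowlisted substrings and would still be retried, so the allowlist may need one more entry if that case should be covered too.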

Followups (out of scope here)

  • Inference engines (vLLM, SGLang) should return 4xx for media-fetch/decode failures, not 500.
  • Consider rate-limiting clients that produce sustained malformed-media bursts.
