Symptom
cloud-api services::inference_provider_pool treats upstream 5xx from the inference proxy as retryable_http_5xx and retries the same payload 3 times. When the upstream 5xx is actually caused by bad client input (a multimodal request referencing a broken/unsupported media URL), each such request gets 4 attempts (1 + 3 retries) of identical work, amplifying load 4× on the backends.
Observed across two incidents on google/gemma-4-31B-it and Qwen/Qwen3.5-122B-A10B over the past two days. Same retry-amplification pattern; in 2026-05-27 it combined with a SGLang gemma4 engine crash bug (separate cvm-compose-files fix v0.0.196) to produce a saturation outage on both gemma backends.
Evidence
cloud-api logs (message: All providers failed for model, error_kind: http_5xx, retry_decision: retryable_http_5xx) — same request_id appears 4× with the identical underlying error_detail:
```
2026-05-26 17:07:21 — request id e.g. 9bb71bb8 attempts 1
2026-05-26 17:07:24 — same request, retry 1 → same error_detail
2026-05-26 17:07:26 — retry 2 → same error_detail
2026-05-26 17:07:30 — retry 3 → "All providers failed for model"
```
The underlying error_detail bodies are always one of these patterns from the inference engine (vLLM or SGLang) trying to fetch/decode a client-supplied media URL:
- `HTTP error 500: 404, message='Not Found', url='https://www.facebook.com/v24.0/...'\`
- `HTTP error 500: Internal server error: An exception occurred while loading IMAGE data at index 0: Error while loading data ... 403 Client Error: Forbidden for url: https://external.fsyd16-2.fna.fbcdn.net/...\`
- `HTTP error 500: Internal server error: An exception occurred while loading VIDEO data at index 0: ... 429 Client Error: Too Many Requests for url: https://www.google.com/sorry/...\` (YouTube)
- `HTTP error 500: Internal server error: ... cannot identify image file <_io.BytesIO ...>` (e.g. base64 mp4 sent as `image_url`)
- `HTTP error 500: ... SingleStreamDecoder, ... Failed to open input buffer: Invalid data found when processing input` (torchcodec on a broken video URL)
These are permanent client errors — retrying the same payload cannot succeed.
Impact
- 4× extra compute per malformed request, latency to the abusive client increased without benefit (it always 5xxes after ~5–8s).
- Under a sustained flood of bad-media requests, the amplification saturates the gemma backends (observed 2026-05-27 ~16:50–17:00 UTC: gpu11 backend queue grew to ~150, both backends timed out user traffic until the queue drained).
- The original 2026-05-26 17:07:30 "All providers failed" alert (and many like it) were not infra failures — they were retried bad inputs.
Proposed fix
In the retry classifier (same module/pattern as the existing 400→`Client error from provider, not retrying` path, and PR #611's `classify_provider_error` pattern), pattern-match the upstream `error_detail` body on these strings and downgrade to non-retryable client error even when the upstream HTTP status is 500:
- `Error while loading data` / `An exception occurred while loading (IMAGE|VIDEO) data`
- `cannot identify image file`
- `Failed to open input buffer`
(Ideally also surface as 422 to the client so they know it's their input, not our infra. Behind a small allowlist of substrings to avoid masking real backend bugs.)
Followups (out of scope here)
- Inference engines (vLLM, SGLang) should return 4xx for media-fetch/decode failures, not 500.
- Consider rate-limiting clients that produce sustained malformed-media bursts.
References
Symptom
cloud-api
services::inference_provider_pooltreats upstream5xxfrom the inference proxy asretryable_http_5xxand retries the same payload 3 times. When the upstream5xxis actually caused by bad client input (a multimodal request referencing a broken/unsupported media URL), each such request gets 4 attempts (1 + 3 retries) of identical work, amplifying load 4× on the backends.Observed across two incidents on
google/gemma-4-31B-itandQwen/Qwen3.5-122B-A10Bover the past two days. Same retry-amplification pattern; in 2026-05-27 it combined with a SGLang gemma4 engine crash bug (separate cvm-compose-files fix v0.0.196) to produce a saturation outage on both gemma backends.Evidence
cloud-api logs (
message: All providers failed for model,error_kind: http_5xx,retry_decision: retryable_http_5xx) — samerequest_idappears 4× with the identical underlyingerror_detail:```
2026-05-26 17:07:21 — request id e.g. 9bb71bb8 attempts 1
2026-05-26 17:07:24 — same request, retry 1 → same error_detail
2026-05-26 17:07:26 — retry 2 → same error_detail
2026-05-26 17:07:30 — retry 3 → "All providers failed for model"
```
The underlying
error_detailbodies are always one of these patterns from the inference engine (vLLM or SGLang) trying to fetch/decode a client-supplied media URL:These are permanent client errors — retrying the same payload cannot succeed.
Impact
Proposed fix
In the retry classifier (same module/pattern as the existing 400→`Client error from provider, not retrying` path, and PR #611's `classify_provider_error` pattern), pattern-match the upstream `error_detail` body on these strings and downgrade to non-retryable client error even when the upstream HTTP status is 500:
(Ideally also surface as 422 to the client so they know it's their input, not our infra. Behind a small allowlist of substrings to avoid masking real backend bugs.)
Followups (out of scope here)
References
--disable-fast-image-processor)