Skip to content

inference_provider_pool: don't retry upstream 5xx caused by client media-fetch errors#688

Open
Evrard-Nil wants to merge 3 commits into
mainfrom
retry-classify-media-fetch
Open

inference_provider_pool: don't retry upstream 5xx caused by client media-fetch errors#688
Evrard-Nil wants to merge 3 commits into
mainfrom
retry-classify-media-fetch

Conversation

@Evrard-Nil
Copy link
Copy Markdown
Collaborator

Fixes #687.

Summary

vLLM and SGLang return HTTP 500 when they can't fetch or decode a multimodal media URL the client supplied (broken Facebook CDN image URLs, geo-blocked YouTube videos, base64 mp4 sent as image_url, etc.). These are permanent client-input errors — the same payload re-runs the same fetch and produces the same failure. But classify_retry_decision treated every 5xx the same way → retryable_http_5xx, so each malformed request got 4 attempts (1 + 3 retries) of identical work.

Under sustained bad-media load this 4× amplifies until the backends saturate. Observed twice:

  • 2026-05-26 17:07 UTCgoogle/gemma-4-31B-it + Qwen/Qwen3.5-122B-A10B, ~40 All providers failed for model events in 20 min, same request_id retried 4× each on identical broken URLs (https://www.facebook.com/v24.0/... returning 404, https://www.youtube.com/watch?v=2w_pbUrVZHY failing torchcodec).
  • 2026-05-27 ~16:50 UTC — gemma backends saturated (gpu11 queue grew to ~150) until the queue drained. cloud-api gemma went 0/5 then recovered.

Fix

is_client_media_fetch_error(message) substring-matches the upstream error_detail body for high-confidence client-fetch markers:

  • loading image data / loading video data (sglang + vLLM)
  • cannot identify image file (PIL UnidentifiedImageError)
  • Failed to open input buffer (torchcodec)
  • , url='http… combined with , message='… (aiohttp wrapper format: HTTP error 500: 404, message='Not Found', url='https://www.facebook.com/...')

In classify_retry_decision, the 5xx branch now consults this helper and returns non_retryable_client_media_error when it matches. The inner retry loop already gates on starts_with("retryable_"), so the label change is sufficient to stop the retry — no other plumbing needed.

Genuine backend 5xx (no media markers — KV-cache evictions, NCCL hiccups, OOM, etc.) still classify as retryable_http_5xx and retry as before.

Same shape as #611

This is the classify_provider_error pattern from PR #611 (upstream auth-error masking), applied to the retry decision instead of the response classification. The two complement each other: #611 protects callers from seeing our backend-creds bugs; this protects backends from being retry-amplified by client-input bugs.

What this does not do

  • Doesn't change the response status the client sees (still 500 from the underlying upstream message). A followup could surface these as 422 with a cleaner message — out of scope to keep the diff minimal.
  • Doesn't change error_kind (http_5xx stays accurate at the engine layer; only retry_decision flips).
  • Doesn't address the root cause upstream (engines should return 4xx for media-fetch failures) — see referenced issue.

Test plan

  • cargo test -p services --lib test_classify_retry_decision passes (added 5 new assertions: 4 verbatim from prod logs, 1 negative case to confirm non-media 500s still retry).
  • cargo check -p services clean.
  • CI green.
  • On rollout, watch the cloud-api All providers failed for model rate for google/gemma-4-31B-it and Qwen/Qwen3.5-122B-A10B — expect a significant drop in "4-attempts-same-error" patterns. Total RPS to inference backends should fall by ~3× of the malformed-media RPS.

…dia-fetch errors

vLLM and SGLang return HTTP 500 when they can't fetch or decode a
multimodal media URL the client supplied (e.g. broken Facebook CDN
image URLs, geo-blocked YouTube videos, base64 mp4 sent as image_url).
These are permanent client-input errors — the same payload re-runs the
same fetch and produces the same failure — but `classify_retry_decision`
treated every 5xx as `retryable_http_5xx`, so each malformed request
got 4 attempts (1 + 3 retries) of identical work.

Under sustained bad-media load (observed 2026-05-26 17:07 on
google/gemma-4-31B-it and Qwen/Qwen3.5-122B-A10B; recurred 2026-05-27
on gemma after the SGLang switch) the 4x amplification saturated both
backends and produced full-timeout windows on cloud-api.

Add a substring check on the upstream error_detail body: when an HTTP
500 message matches a high-confidence client-media-fetch pattern
('loading IMAGE/VIDEO data', 'cannot identify image file',
'Failed to open input buffer', or the aiohttp wrapper format
"HTTP error 500: 4xx, message='...', url='http..."), classify as
`non_retryable_client_media_error` so the inner retry loop bails out
after the first attempt. Genuine backend 5xx (no media markers) still
retry as before.

Same shape as PR #611's classify_provider_error pattern, applied to
the retry decision instead of the response classification.

Refs #687.
Copilot AI review requested due to automatic review settings May 28, 2026 08:51
@Evrard-Nil Evrard-Nil had a problem deploying to Cloud API test env May 28, 2026 08:51 — with GitHub Actions Failure
@Evrard-Nil Evrard-Nil requested a review from lloydmak99 May 28, 2026 08:51
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a helper function is_client_media_fetch_error to classify certain HTTP 500 errors from inference engines (vLLM, SGLang) as non-retryable client media errors, preventing unnecessary retries on broken client-supplied URLs. Feedback points out that the substring check for the aiohttp wrapper format is too broad and could lead to false positives on genuine backend errors. It suggests using a more precise regular expression instead.

Comment on lines +1385 to +1394
fn is_client_media_fetch_error(message: &str) -> bool {
let lower = message.to_lowercase();
lower.contains("loading image data")
|| lower.contains("loading video data")
|| lower.contains("cannot identify image file")
|| lower.contains("failed to open input buffer")
// aiohttp wrapper format: "HTTP error 500: 4xx, message='...', url='http..."
// — the inference engine collapsed a client-fetch 4xx into a 500.
|| (lower.contains(", url='http") && lower.contains(", message='"))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The substring check lower.contains(", url='http") && lower.contains(", message='") is too broad and can lead to false positives. For example, if a genuine backend error (such as a database or internal service connection failure) returns a 5xx status code and happens to include those substrings in its error message, it will be incorrectly classified as a non-retryable client media error, preventing necessary retries.

Since the regex crate is already imported in this file, we can use a precise regular expression to match the aiohttp wrapper format and ensure that the collapsed status code is indeed a 4xx client error.

    fn is_client_media_fetch_error(message: &str) -> bool {
        let lower = message.to_lowercase();
        if lower.contains("loading image data")
            || lower.contains("loading video data")
            || lower.contains("cannot identify image file")
            || lower.contains("failed to open input buffer")
        {
            return true;
        }

        // aiohttp wrapper format: "HTTP error 500: 4xx, message='...', url='http..."
        // — the inference engine collapsed a client-fetch 4xx into a 500.
        static AIOHTTP_RE: std::sync::OnceLock<regex::Regex> = std::sync::OnceLock::new();
        let re = AIOHTTP_RE.get_or_init(|| {
            regex::Regex::new(r"(?i)http error 500:\s*4\d{2},\s*message='.*?',\s*url='https?://").unwrap()
        });
        re.is_match(message)
    }

@claude
Copy link
Copy Markdown

claude Bot commented May 28, 2026

Review

Tightly scoped fix; the implementation matches the description and the test cases are pulled verbatim from prod logs (good). No critical issues found.

Minor observations (non-blocking)

  • aiohttp-wrapper heuristic is broader than the media path. (lower.contains(", url='http") && lower.contains(", message='")) (crates/services/src/inference_provider_pool/mod.rs:1393) does not actually require a media-fetch context — any aiohttp-style 5xx that includes both substrings is now classified non-retryable. In practice the only URLs engines fetch through aiohttp are client-supplied media, so retries against the same payload would not have helped anyway. But if vLLM/SGLang ever start using aiohttp to talk to internal services (KV cache, sidecar, etc.) and surface those errors with the same wrapper format, this branch could swallow legitimately retryable backend 5xx. Worth a follow-up to require a media marker alongside the url/message pair if you see false negatives on retry telemetry.

  • to_lowercase() allocation is fine. Called only on the error path, negligible overhead.

  • Privacy check. The helper consumes the message but only returns a &'static str label — no caller-controlled content leaks into logs. Matches the patterns set by classify_error_kind and classify_provider_error. ✓

  • Backward compatibility. The retry loop still gates on starts_with("retryable_"), so the new non_retryable_client_media_error label is sufficient to short-circuit retries without touching the outer plumbing. No state-migration or rolling-update concerns. ✓

Suggestion for the followup PR (out of scope)

As the PR body notes, the response status stays at 500 even though these are unambiguously client-input errors. Surfacing them as 422 with a sanitized message (no URL echoed back — privacy) would be a strict UX improvement and align with the error_kind taxonomy. Tracking issue would be useful.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts retry classification in InferenceProviderPool so upstream 5xx responses that are actually permanent client media-fetch/decode failures (broken image/video URLs, unsupported media, etc.) do not get retried and amplified into repeated backend work. This targets a specific operational failure mode where inference engines (vLLM/SGLang) return 500 for client-supplied media URL failures.

Changes:

  • Add is_client_media_fetch_error() heuristic matcher for high-confidence media-fetch/decode failure markers in upstream error messages.
  • Update classify_retry_decision() to return a non-retryable label for 5xx responses matching those markers.
  • Extend test_classify_retry_decision with new cases covering common production error strings and a negative control.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

|| lower.contains("failed to open input buffer")
// aiohttp wrapper format: "HTTP error 500: 4xx, message='...', url='http..."
// — the inference engine collapsed a client-fetch 4xx into a 500.
|| (lower.contains(", url='http") && lower.contains(", message='"))
/// the same failure. Treat these as non-retryable so one broken URL
/// from a client doesn't get amplified into 4x backend work.
fn is_client_media_fetch_error(message: &str) -> bool {
let lower = message.to_lowercase();
assert_eq!(
InferenceProviderPool::classify_retry_decision(&CompletionError::HttpError {
status_code: 500,
message: "Internal server error: An exception occurred while loading VIDEO data at index 0: Error while loading data https://www.youtube.com/watch?v=2w_pbUrVZHY: SingleStreamDecoder, Failed to open input buffer: Invalid data found when processing input".to_string(),
assert_eq!(
InferenceProviderPool::classify_retry_decision(&CompletionError::HttpError {
status_code: 500,
message: "HTTP error 500: 404, message='Not Found', url='https://www.facebook.com/v24.0/122181986942783109_1506385084189441'".to_string(),
@Evrard-Nil Evrard-Nil requested a review from PierreLeGuen May 28, 2026 09:06
Copy link
Copy Markdown
Contributor

@PierreLeGuen PierreLeGuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes for one retry-classification regression:

  • crates/services/src/inference_provider_pool/mod.rs:1391: the aiohttp-wrapper branch is broader than the comment says. It marks any 500 body containing , message=' and , url='http as non_retryable_client_media_error, even when the wrapped aiohttp status is another 5xx such as HTTP error 500: 503, message='Service Unavailable', url='https://...'. That skips the existing provider retry behavior for transient upstream/backend 5xxs just because the error string includes a URL. Please constrain this branch to wrapped 4xx media-fetch statuses, or otherwise require a stronger media-fetch marker, and add a negative test for wrapped 5xx staying retryable_http_5xx.

Local check run: cargo test -p services test_classify_retry_decision passed. GitHub lint/security/claude-review checks passed when inspected; the GitHub Test Suite was still pending.

… hygiene

PierreLeGuen, Gemini, and Copilot all flagged the same regression: the
aiohttp-wrapper branch matched any 5xx body containing both
`, url='http` and `, message='`, which would swallow legitimately
retryable wrapped 5xx like `HTTP error 500: 503, message='Service
Unavailable', url='...'`.

Replace the two substring checks with a precise regex against the
wrapper shape that requires the wrapped status to be a 4xx:

  http error 500:\s*4\d{2},\s*message='[^']*',\s*url='https?://

Compiled once via OnceLock, run on the ascii-lowercased message
(swapped to_lowercase -> to_ascii_lowercase — all markers are ASCII
and this path runs at high volume during malformed-media incidents,
per Copilot).

Tests:
- Add the regression case PierreLeGuen called out: 500 wrapping a 503
  must stay `retryable_http_5xx`.
- Add a generic-noise case: a non-wrapper 5xx that happens to contain
  url=... and message=... stays retryable (guards against drift toward
  the old loose substring check).
- Redact verbatim production URLs (Facebook ID, YouTube ID) to
  example.test dummies — the matcher only depends on the marker
  substrings, not the actual host (Copilot).
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env May 28, 2026 11:51 — with GitHub Actions Inactive
@Evrard-Nil
Copy link
Copy Markdown
Collaborator Author

Thanks all — pushed 0906f30 addressing the convergent feedback in one go.

@PierreLeGuen — fixed the aiohttp-wrapper regression. The branch now requires the wrapped status to be a 4xx via regex (http error 500:\s*4\d{2},\s*message='[^']*',\s*url='https?://), so HTTP error 500: 503, message='Service Unavailable', url='...' stays retryable_http_5xx. Added the exact negative test you requested (assertion lines 3601 / 3635 in the diff).

@gemini-code-assist — adopted the regex approach almost verbatim; only difference is I lowercase the message once and match against that (instead of (?i)) since I'm already paying the lowercase cost for the substring branches.

@copilot

  • Switched to to_ascii_lowercase() — markers are all ASCII.
  • Redacted the verbatim Facebook ID + YouTube ID in test fixtures to example.test dummies; the matcher only depends on the marker substrings, not the host.
  • Re your aiohttp tightening note — same fix as PierreLeGuen's, regex with 4xx anchor.

Test count is now 7 assertions (4 positive, 3 negative). The new "500 wrapping 503" and "generic noise with url=/message=" negatives guard against drift back to the old loose substring check.

The Test Suite hang on the first run was in step "Run unit tests" (48m, no completed_at — looks like a step-level hang). My added tests are pure substring/regex matching with no I/O, so unlikely to be the cause. Watching the rerun on this push.

@lloydmak99 @PierreLeGuen ready for another look.

@Evrard-Nil Evrard-Nil requested a review from PierreLeGuen May 28, 2026 11:51
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Thanks for the update, Evrard-Nil. The regex-based approach for the aiohttp wrapper, combined with the explicit 4xx check, correctly addresses the potential regression while maintaining the necessary performance characteristics. The added test coverage for both positive and negative cases (especially the 500-wrapping-503 scenario) provides strong confidence in the fix. This looks solid.

Copy link
Copy Markdown
Contributor

@PierreLeGuen PierreLeGuen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking findings:

  • crates/services/src/inference_provider_pool/mod.rs:1405,1712,1735: the aiohttp 500-wrapping-4xx case is tested against the raw message, but the production retry gate classifies last_error after sanitize_completion_error stores it. sanitize_error_message redacts https://... to [URL_REDACTED], so the regex requiring url='https?:// no longer matches and those wrapped client-fetch 4xx errors still retry as retryable_http_5xx. Please classify/store the retry decision before sanitizing, or make the matcher/tests cover the sanitized message used by the retry gate.

  • crates/services/src/inference_provider_pool/mod.rs:1687-1697,1735: the new non-retryable client-media 5xx path is still counted as a provider health failure before the retry decision is applied. Sustained bad client media can still increment provider_failure_counts and demote healthy backends, even though the cause is client input. Please gate the failure counter on the same raw-error retry decision.

Checks run locally:

  • cargo test -p services --lib test_classify_retry_decision passed.
  • cargo check -p services passed.
  • cargo fmt --check failed on formatting in the changed test lines; GitHub Lint is failing too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cloud-api retries media-fetch upstream 500s — 4× amplification → backend saturation

4 participants