fix(observability): drop transient upstream HTTP from Sentry (429/408/502/503/504)#1529
fix(observability): drop transient upstream HTTP from Sentry (429/408/502/503/504)#1529oxoxDev wants to merge 7 commits into
Conversation
…3/504 Transient upstream statuses are retried + faled-over by reliable.rs; the aggregate "all providers exhausted" event already covers genuine outages. Reporting each individual transient failure floods Sentry — OPENHUMAN-TAURI-2E (~1393 events/day), -84 (~1050), -T (~871) all originate from this path. Joins 429 (already excluded) with 408/502/503/504 in the no-report set. Extends the existing classifier test to cover all five transient codes and verifies auth/4xx/5xx-permanent codes still report. Refs OPENHUMAN-TAURI-2E, 84, T Refs tinyhumansai#1472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reliable.rs already retries rate-limited / upstream-unhealthy failures across providers + models, and emits a single aggregated "All providers/models failed" Sentry event when every attempt is exhausted. The per-iteration report_error() in tool_loop duplicates that signal — OPENHUMAN-TAURI-3Y / 3Z (~46 events) sit on top of the raw per-attempt floods (-2E/84/T, ~3300 events). Classify the error before reporting: rate-limit + upstream-unhealthy demote to warn-level structured log; permanent failures still escalate to Sentry via report_error. Refs OPENHUMAN-TAURI-3Y, 3Z Refs tinyhumansai#1472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p events Defense-in-depth: even with call-site classifier (should_report_provider_http_failure) excluding 429/408/502/503/504, any future call site that emits a tracing::error! with the same shape (failure=non_2xx + status=<transient>) would still hit Sentry. is_transient_provider_http_failure inspects tags on Sentry events at the before_send hook and short-circuits the same set. Aggregate all_exhausted reports + auth/4xx/permanent failures unchanged. Four unit tests cover the dropped + kept variants. Refs OPENHUMAN-TAURI-2E, 84, T Refs tinyhumansai#1472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughPer-attempt provider HTTP failures with statuses 408, 429, 502, 503, 504 are classified as transient and suppressed from per-iteration Sentry reporting via provider-level predicates, tool-loop conditional reporting, a core classifier, and a before_send defense-in-depth hook; tests and a runtime smoke test validate behavior. ChangesTransient provider failure suppression
Sequence DiagramsequenceDiagram
participant Client
participant ToolLoop
participant ProviderChat
participant SentryClient
participant TestTransport
Client->>ToolLoop: trigger tool call
ToolLoop->>ProviderChat: make provider.chat request
ProviderChat-->>ToolLoop: error (status)
ToolLoop->>ToolLoop: classify transient (is_rate_limited/is_upstream_unhealthy)
alt transient
ToolLoop->>ToolLoop: log warning, skip report_error
else non-transient
ToolLoop->>SentryClient: report_error event
end
SentryClient->>SentryClient: before_send filter (is_transient_provider_http_failure)
alt dropped
SentryClient->>TestTransport: nothing (event suppressed)
else kept
SentryClient->>TestTransport: deliver envelope
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/openhuman/providers/ops.rs (1)
156-164: ⚡ Quick winAvoid duplicated transient-status lists across observability layers.
This status set is now duplicated here and in
src/main.rs; keeping one shared source of truth will prevent future drift between call-site suppression andbefore_sendfiltering.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/providers/ops.rs` around lines 156 - 164, The transient HTTP status set is duplicated between should_report_provider_http_failure and the before_send filtering in main.rs; extract the shared list into a single public constant or helper function (e.g., a pub static/const or pub fn transient_statuses() -> &'static [reqwest::StatusCode]) in a common module, update should_report_provider_http_failure to reference that shared symbol instead of enumerating statuses, and update the before_send filter to use the same shared symbol so both call sites use one source of truth (ensure the type is reqwest::StatusCode and exported for use by both modules).src/main.rs (1)
154-162: ⚡ Quick winNarrow transient filter matching to provider-shaped events.
Current matching on only
failure+statuscan suppress unrelated events that happen to use the same tags. Add one provider-specific guard (e.g.,providertag present) to avoid accidental drops.Suggested hardening
fn is_transient_provider_http_failure(event: &sentry::protocol::Event<'_>) -> bool { let tags = &event.tags; + if tags.get("provider").is_none() { + return false; + } if tags.get("failure").map(String::as_str) != Some("non_2xx") { return false; } matches!( tags.get("status").map(String::as_str), Some("429") | Some("408") | Some("502") | Some("503") | Some("504") ) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/main.rs` around lines 154 - 162, The current is_transient_provider_http_failure function only checks tags "failure" and "status", which can match unrelated events; update this function to first verify a provider-specific tag exists (e.g., require tags.get("provider").is_some() or check tags.get("provider").map(String::as_str) matches expected provider names) before applying the existing "failure" == "non_2xx" and transient "status" matches; modify the guard at the start of is_transient_provider_http_failure to return false when the provider tag is missing or not one of the provider-shaped values so only provider-related events are filtered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/agent/harness/tool_loop.rs`:
- Around line 419-421: The transient-error detection misses HTTP 408 (and other
upstream transient statuses), causing report_error() to run for per-iteration
timeouts; update the helper(s) so transient classification includes status code
checks: either extend is_rate_limited() to return true when status.as_u16() ==
408 (and consider 502, 504) or extend is_upstream_unhealthy() to check
resp/status codes (408, 502, 504) in addition to its text patterns, then ensure
tool_loop.rs transient branch (where is_rate_limited(&e) ||
is_upstream_unhealthy(&e) is evaluated) will treat 408 as transient; add unit
tests in reliable_tests.rs that construct error/responses representing 408 (and
optionally 502/504) and assert the helpers return true.
---
Nitpick comments:
In `@src/main.rs`:
- Around line 154-162: The current is_transient_provider_http_failure function
only checks tags "failure" and "status", which can match unrelated events;
update this function to first verify a provider-specific tag exists (e.g.,
require tags.get("provider").is_some() or check
tags.get("provider").map(String::as_str) matches expected provider names) before
applying the existing "failure" == "non_2xx" and transient "status" matches;
modify the guard at the start of is_transient_provider_http_failure to return
false when the provider tag is missing or not one of the provider-shaped values
so only provider-related events are filtered.
In `@src/openhuman/providers/ops.rs`:
- Around line 156-164: The transient HTTP status set is duplicated between
should_report_provider_http_failure and the before_send filtering in main.rs;
extract the shared list into a single public constant or helper function (e.g.,
a pub static/const or pub fn transient_statuses() -> &'static
[reqwest::StatusCode]) in a common module, update
should_report_provider_http_failure to reference that shared symbol instead of
enumerating statuses, and update the before_send filter to use the same shared
symbol so both call sites use one source of truth (ensure the type is
reqwest::StatusCode and exported for use by both modules).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a8e9e9c3-4c21-4bc0-981f-98615d38ed29
📒 Files selected for processing (3)
src/main.rssrc/openhuman/agent/harness/tool_loop.rssrc/openhuman/providers/ops.rs
Promote is_transient_provider_http_failure from main.rs into openhuman_core::core::observability so an integration test can drive the real sentry::init + before_send + transport chain end-to-end. tests/observability_smoke.rs uses sentry::test::TestTransport (gated behind sentry's `test` feature in dev-dependencies) to capture envelopes after they pass before_send. Four scenarios: * drops_per_attempt_429_503_504_408_502 — captured=0 * keeps_permanent_failures (400/401/403/404/500) — captured=5 * keeps_aggregate_all_exhausted_event — captured=1 * keeps_event_missing_status_tag — captured=1 This proves the runtime path matches the unit-tested logic and catches any regression where sentry-rust changes the before_send contract or tag visibility. Refs OPENHUMAN-TAURI-2E, 84, T Refs tinyhumansai#1472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/core/observability.rs`:
- Around line 83-89: The drop rule currently returns true for any event with
tags.get("failure") == "non_2xx" and transient status codes, which can suppress
unrelated events; tighten it by also checking the provider provenance tag (e.g.,
require tags.get("domain").map(String::as_str) == Some("provider")) before
returning true in the same block (the code around the matches! call using tags,
status, and failure in observability.rs, likely inside your before_send logic).
Update the condition so that all three conditions must hold (failure ==
"non_2xx", status in {429,408,502,503,504}, and domain == "provider"), and add a
regression test that constructs an event with failure="non_2xx" and a transient
status but domain!="provider" and asserts the event is kept.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3982d145-2a2a-42e7-a197-37094db5d869
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
Cargo.tomlsrc/core/observability.rssrc/main.rstests/observability_smoke.rs
✅ Files skipped from review due to trivial changes (1)
- Cargo.toml
Per CodeRabbit review on tinyhumansai#1529: the transient status set (408/429/502/ 503/504) was duplicated between providers/ops.rs (typed StatusCode match) and core/observability.rs (string match in before_send). Extract a single pub const TRANSIENT_PROVIDER_HTTP_STATUSES: &[u16] so the two suppression layers stay in lockstep — update here and both call sites pick it up. No behavior change: every status that was filtered before is still filtered; every status that was reported before is still reported. The before_send filter now parses the status tag as u16 (with the parse-failure path joining the existing "missing tag" branch — keep, not drop). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per CodeRabbit review on tinyhumansai#1529 (Critical, tool_loop.rs:419-421): the transient classifier was missing 408/502/504. `is_rate_limited()` covers 429 only; `is_upstream_unhealthy()` was text-pattern based ("no healthy upstream", "service unavailable") and didn't catch the formatted "<provider> API error (<status>): …" string that `ops::api_error` actually emits. Result: 408/502/504 leaked per-iteration Sentry events even though `should_report_provider_http_ failure` filtered them at the api_error site — the tool_loop guard `is_rate_limited(&e) || is_upstream_unhealthy(&e)` returned false and called `report_error` for each retried attempt. Fix: extend is_upstream_unhealthy with (a) a `reqwest::Error` downcast that checks `status.as_u16()` against {408, 502, 503, 504} for typed errors, and (b) text-pattern fallbacks ("408 request timeout", "502 bad gateway", "504 gateway timeout") that match the `ops::api_error` format ultimately surfacing through anyhow's to_string() chain. Tests: three new cases in reliable_tests.rs covering the formatted strings; existing 503/no-healthy-upstream/service-unavailable cases and the negative case still pass. Refs OPENHUMAN-TAURI-2E / -84 / -T (per-iteration noise classes the parent PR targets). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per CodeRabbit review on tinyhumansai#1529 (Major, observability.rs:81-89): the before_send filter matched on `failure=non_2xx` + `status` alone, so an unrelated subsystem emitting those tags would have its event silently dropped. Add a `domain == "llm_provider"` precondition so only provider-originated events qualify for the transient filter. All current emit sites at `providers/ops.rs:218-227` already pass `"llm_provider"` as the domain argument to `report_error`, which the helper writes as a tag — so behavior is preserved for the target events; the guard only narrows what *future* call sites can match. Tests updated to include the domain tag in the existing fixtures, plus two new regression cases: events with no domain tag, and events from a different domain (`scheduler`) reusing the same `failure`/`status` shape. Both must surface to Sentry rather than being dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hey @graycyrus — ready for review when you have a minute. Addressed all three items from the CodeRabbit pass in three micro-commits (one logical change each, GPG-signed):
All targeted tests pass locally:
No changes to the headline transient-set semantics or the aggregate Let me know if anything else looks off. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/openhuman/providers/reliable.rs (1)
91-106: ⚡ Quick winUnify transient-status classification here to reduce drift and text-fragility
is_upstream_unhealthystill hardcodes transient codes/patterns locally. Since transient suppression now has a shared status set, this path can drift over time and may miss numeric-only text forms unless the exact phrase matches.Proposed refactor
pub(crate) fn is_upstream_unhealthy(err: &anyhow::Error) -> bool { + let is_upstream_transient = |code: u16| { + code != 429 + && crate::core::observability::TRANSIENT_PROVIDER_HTTP_STATUSES.contains(&code) + }; + if let Some(reqwest_err) = err.downcast_ref::<reqwest::Error>() { if let Some(status) = reqwest_err.status() { - if matches!(status.as_u16(), 408 | 502 | 503 | 504) { + if is_upstream_transient(status.as_u16()) { return true; } } } - let lower = err.to_string().to_lowercase(); + let lower = err.to_string().to_lowercase(); + for token in lower.split(|c: char| !c.is_ascii_digit()) { + if let Ok(code) = token.parse::<u16>() { + if is_upstream_transient(code) { + return true; + } + } + } lower.contains("no healthy upstream") || lower.contains("upstream unavailable") || lower.contains("service unavailable") - || lower.contains("408 request timeout") - || lower.contains("502 bad gateway") - || lower.contains("504 gateway timeout") }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/providers/reliable.rs` around lines 91 - 106, The is_upstream_unhealthy function currently hardcodes transient HTTP codes and phrase matches which can drift; replace the local hardcoded checks with the shared transient-status classifier (e.g., the centralized set/function used elsewhere — call it something like TRANSIENT_STATUSES or is_transient_status/ is_transient_http_status) by: 1) when downcasting to reqwest::Error, query the shared status-set with the numeric status code instead of matching 408/502/503/504 inline; 2) for string-based errors, normalize the error string and search for any status codes or reason phrases that the shared classifier exposes (or call a helper that tests the string against the shared transient patterns) rather than repeating literal phrase checks in is_upstream_unhealthy; update imports and function calls to reference the shared symbol so this function stays in sync with the single source of truth.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/openhuman/providers/reliable.rs`:
- Around line 91-106: The is_upstream_unhealthy function currently hardcodes
transient HTTP codes and phrase matches which can drift; replace the local
hardcoded checks with the shared transient-status classifier (e.g., the
centralized set/function used elsewhere — call it something like
TRANSIENT_STATUSES or is_transient_status/ is_transient_http_status) by: 1) when
downcasting to reqwest::Error, query the shared status-set with the numeric
status code instead of matching 408/502/503/504 inline; 2) for string-based
errors, normalize the error string and search for any status codes or reason
phrases that the shared classifier exposes (or call a helper that tests the
string against the shared transient patterns) rather than repeating literal
phrase checks in is_upstream_unhealthy; update imports and function calls to
reference the shared symbol so this function stays in sync with the single
source of truth.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e74df2ca-6201-4d53-a225-bb99f34968b4
📒 Files selected for processing (5)
src/core/observability.rssrc/openhuman/providers/ops.rssrc/openhuman/providers/reliable.rssrc/openhuman/providers/reliable_tests.rstests/observability_smoke.rs
✅ Files skipped from review due to trivial changes (1)
- src/openhuman/providers/reliable_tests.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/observability_smoke.rs
- src/core/observability.rs
Summary
reliable.rs).agent.provider_chatfailures towarn!; the aggregateall_exhaustedevent still surfaces real outages.before_sendfilter inmain.rsdrops residual transient events that bypass the call-site classifier.Problem
Three Sentry issues bury actionable signal under per-attempt transient-upstream noise:
llm_provider.api_error … 429 Too Many Requests… 503 Service Unavailable… 504 Gateway Timeoutllm_provider.reliable_chat … All providers/models failed(cascade)agent.provider_chat … All providers/models failed(cascade)providers/ops.rs::should_report_provider_http_failureonly excluded 429; 503/504 still went per-attempt.agent/harness/tool_loop.rsreported every provider_chat error, duplicating reliable.rs's aggregate.Solution
providers/ops.rs— expand the no-report set to{429, 408, 502, 503, 504}. Doc comment updated to name the Sentry issues; existing test extended to cover all five transient codes plus auth/4xx/5xx-permanent codes that still report.agent/harness/tool_loop.rs— classify the chat error via existingis_rate_limited/is_upstream_unhealthyhelpers before callingreport_error. Transient →tracing::warn!only; permanent → unchanged Sentry path.main.rs—before_sendfilter inspects Sentry event tags (failure=non_2xx+statusin transient set) and short-circuits. Four unit tests cover dropped vs kept variants (transient drop, permanent keep, aggregateall_exhaustedkeep, missing tag keep).The aggregate
report_errorinreliable.rs::reliable_chat_*(failure=all_exhausted) intentionally still fires — that's the cascade signal for genuine outages, one event per request.Submission Checklist
skips_sentry_report_for_transient_upstream_statusesto cover transient set + permanent set; added fourtransient_filter_testsfor thebefore_sendfilterbefore_sendfilter is exercised by the targeted tests.cargo test --lib openhuman::providers::ops::tests::skips_sentry_report_for_transient_upstream_statuses+cargo test --bin openhuman-core transient_filterboth pass; full lib suite not run locally per workspace guidance (let CI matrix run it).N/A: observability-only fix, no new user-visible feature row## Related—N/A: behaviour-only changeN/A: no release-cut surface touchedCloses #NNNin the## RelatedsectionImpact
reliable_chat … all_exhausted) untouched. No log lost — every transient drop still emits a structuredwarn!trace withprovider,model,iteration,errorfields.reliable.rsexactly as before).before_sendunchanged; new filter runs strictly before existing logic.BTreeMapget +matches!on a five-arm pattern — negligible per-event overhead.Related
should_report_provider_http_failurechange since it returns 400 — wait, 400 is reportable; deliberately left alone (legitimate misconfig surface)AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/obs-rate-limitValidation Run
pnpm --filter openhuman-app format:check— N/A: no app/ changespnpm typecheck— N/A: no TS changescargo test --lib openhuman::providers::ops::tests::skips_sentry_report_for_transient_upstream_statuses(1 passed)cargo test --bin openhuman-core transient_filter(4 passed)cargo fmt --checkclean;cargo check --manifest-path Cargo.toml --binscleanapp/src-tauri/changesValidation Blocked
command:N/Aerror:N/Aimpact:N/ABehavior Changes
before_senddefense-in-depth filter for any future bypassParity Contract
all_exhaustedSentry event still fires for genuine outagesDuplicate / Superseded PR Handling
Summary by CodeRabbit
Bug Fixes
Tests
Documentation