fix(observability): drop transient upstream HTTP from Sentry (429/408/502/503/504) by oxoxDev · Pull Request #1529 · tinyhumansai/openhuman

oxoxDev · 2026-05-12T10:34:33Z

Summary

Treat 429/408/502/503/504 from upstream providers as transient — no per-attempt Sentry report (already retried + failed-over by reliable.rs).
Demote transient agent.provider_chat failures to warn!; the aggregate all_exhausted event still surfaces real outages.
Defense-in-depth before_send filter in main.rs drops residual transient events that bypass the call-site classifier.

Problem

Three Sentry issues bury actionable signal under per-attempt transient-upstream noise:

Sentry ID	Symptom	Events
OPENHUMAN-TAURI-2E	`llm_provider.api_error … 429 Too Many Requests`	~1393 / day
OPENHUMAN-TAURI-84	`… 503 Service Unavailable`	~1050
OPENHUMAN-TAURI-T	`… 504 Gateway Timeout`	~871
OPENHUMAN-TAURI-3Y	`llm_provider.reliable_chat … All providers/models failed` (cascade)	23
OPENHUMAN-TAURI-3Z	`agent.provider_chat … All providers/models failed` (cascade)	23

providers/ops.rs::should_report_provider_http_failure only excluded 429; 503/504 still went per-attempt. agent/harness/tool_loop.rs reported every provider_chat error, duplicating reliable.rs's aggregate.

Solution

providers/ops.rs — expand the no-report set to {429, 408, 502, 503, 504}. Doc comment updated to name the Sentry issues; existing test extended to cover all five transient codes plus auth/4xx/5xx-permanent codes that still report.
agent/harness/tool_loop.rs — classify the chat error via existing is_rate_limited / is_upstream_unhealthy helpers before calling report_error. Transient → tracing::warn! only; permanent → unchanged Sentry path.
main.rs — before_send filter inspects Sentry event tags (failure=non_2xx + status in transient set) and short-circuits. Four unit tests cover dropped vs kept variants (transient drop, permanent keep, aggregate all_exhausted keep, missing tag keep).

The aggregate report_error in reliable.rs::reliable_chat_* (failure=all_exhausted) intentionally still fires — that's the cascade signal for genuine outages, one event per request.

Submission Checklist

Tests added or updated (happy path + at least one failure / edge case) — extended skips_sentry_report_for_transient_upstream_statuses to cover transient set + permanent set; added four transient_filter_tests for the before_send filter
Diff coverage ≥ 80% — every changed line in the classifier, tool_loop guard, and before_send filter is exercised by the targeted tests. cargo test --lib openhuman::providers::ops::tests::skips_sentry_report_for_transient_upstream_statuses + cargo test --bin openhuman-core transient_filter both pass; full lib suite not run locally per workspace guidance (let CI matrix run it).
Coverage matrix updated — N/A: observability-only fix, no new user-visible feature row
All affected feature IDs from the matrix are listed in the PR description under ## Related — N/A: behaviour-only change
No new external network dependencies introduced (mock backend used per Testing Strategy)
Manual smoke checklist updated if this touches release-cut surfaces — N/A: no release-cut surface touched
Linked issue closed via Closes #NNN in the ## Related section

Impact

Runtime/platform: desktop (macOS, Windows, Linux) — Rust core only. No frontend / Tauri-shell changes.
Observability: Sentry event volume drops ~3300+/day on the top-three transient noise issues. Real outage signal (reliable_chat … all_exhausted) untouched. No log lost — every transient drop still emits a structured warn! trace with provider, model, iteration, error fields.
Compatibility: no API/RPC surface change; no behavioral change for retry/fallback (the error still propagates through reliable.rs exactly as before).
Security: secret scrubbing path in before_send unchanged; new filter runs strictly before existing logic.
Performance: tag lookup is a BTreeMap get + matches! on a five-arm pattern — negligible per-event overhead.

Closes OPENHUMAN-TAURI-2E
Closes OPENHUMAN-TAURI-84
Closes OPENHUMAN-TAURI-T
Closes OPENHUMAN-TAURI-3Y
Closes OPENHUMAN-TAURI-3Z
Refs Track and fix active Sentry issues #1472 (Sentry triage meta-issue)
Follow-up PR(s)/TODOs: insufficient-budget (-3M, ~4 events) handled by same should_report_provider_http_failure change since it returns 400 — wait, 400 is reportable; deliberately left alone (legitimate misconfig surface)

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Key: N/A — Sentry-driven, not Linear
URL: N/A

Commit & Branch

Branch: fix/obs-rate-limit
Commit SHA: see git log on PR

Validation Run

pnpm --filter openhuman-app format:check — N/A: no app/ changes
pnpm typecheck — N/A: no TS changes
Focused tests:
- cargo test --lib openhuman::providers::ops::tests::skips_sentry_report_for_transient_upstream_statuses (1 passed)
- cargo test --bin openhuman-core transient_filter (4 passed)
Rust fmt/check (if changed): cargo fmt --check clean; cargo check --manifest-path Cargo.toml --bins clean
Tauri fmt/check (if changed): N/A — no app/src-tauri/ changes

Validation Blocked

command: N/A
error: N/A
impact: N/A

Behavior Changes

Intended behavior change: per-attempt 429/408/502/503/504 LLM provider failures no longer produce Sentry events; agent.provider_chat transient errors no longer produce Sentry events; new before_send defense-in-depth filter for any future bypass
User-visible effect: none direct — observability change. Indirect benefit: real Sentry issues become legible because transient noise drops.

Parity Contract

Legacy behavior preserved: error still propagates through reliable.rs retry/fallback exactly as before; aggregate all_exhausted Sentry event still fires for genuine outages
Guard/fallback/dispatch parity checks: tests assert both "drop transient" and "keep permanent" / "keep aggregate" branches

Duplicate / Superseded PR Handling

Duplicate PR(s): N/A
Canonical PR: this PR
Resolution: N/A

Summary by CodeRabbit

Bug Fixes
- Reduce monitoring noise by suppressing per-attempt reports for transient upstream/provider HTTP failures (expanded set: rate limits, timeouts, temporary outages) while preserving permanent and aggregate failure reports.
Tests
- Add runtime smoke tests and unit tests to validate transient filtering and ensure permanent/aggregate events still report.
Documentation
- Expanded observability docs describing the reporting/filtering behavior.

…3/504 Transient upstream statuses are retried + faled-over by reliable.rs; the aggregate "all providers exhausted" event already covers genuine outages. Reporting each individual transient failure floods Sentry — OPENHUMAN-TAURI-2E (~1393 events/day), -84 (~1050), -T (~871) all originate from this path. Joins 429 (already excluded) with 408/502/503/504 in the no-report set. Extends the existing classifier test to cover all five transient codes and verifies auth/4xx/5xx-permanent codes still report. Refs OPENHUMAN-TAURI-2E, 84, T Refs tinyhumansai#1472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

reliable.rs already retries rate-limited / upstream-unhealthy failures across providers + models, and emits a single aggregated "All providers/models failed" Sentry event when every attempt is exhausted. The per-iteration report_error() in tool_loop duplicates that signal — OPENHUMAN-TAURI-3Y / 3Z (~46 events) sit on top of the raw per-attempt floods (-2E/84/T, ~3300 events). Classify the error before reporting: rate-limit + upstream-unhealthy demote to warn-level structured log; permanent failures still escalate to Sentry via report_error. Refs OPENHUMAN-TAURI-3Y, 3Z Refs tinyhumansai#1472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…p events Defense-in-depth: even with call-site classifier (should_report_provider_http_failure) excluding 429/408/502/503/504, any future call site that emits a tracing::error! with the same shape (failure=non_2xx + status=<transient>) would still hit Sentry. is_transient_provider_http_failure inspects tags on Sentry events at the before_send hook and short-circuits the same set. Aggregate all_exhausted reports + auth/4xx/permanent failures unchanged. Four unit tests cover the dropped + kept variants. Refs OPENHUMAN-TAURI-2E, 84, T Refs tinyhumansai#1472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-12T10:35:23Z

📝 Walkthrough

Walkthrough

Per-attempt provider HTTP failures with statuses 408, 429, 502, 503, 504 are classified as transient and suppressed from per-iteration Sentry reporting via provider-level predicates, tool-loop conditional reporting, a core classifier, and a before_send defense-in-depth hook; tests and a runtime smoke test validate behavior.

Changes

Transient provider failure suppression

Layer / File(s)	Summary
Provider HTTP transient status classification `src/openhuman/providers/ops.rs`	`should_report_provider_http_failure` expanded to treat 408, 429, 502, 503, 504 as transient (do not report per-attempt); tests updated to reflect the broader transient set.
Provider upstream-unhealthy classifier & tests `src/openhuman/providers/reliable.rs`, `src/openhuman/providers/reliable_tests.rs`	`is_upstream_unhealthy` now downcasts `reqwest::Error`, treats 408/502/503/504 as transient, and recognizes additional textual upstream-unavailable markers; tests added for 408/502/504 cases.
Tool-loop transient failure handling `src/openhuman/agent/harness/tool_loop.rs`	`provider.chat` error handling treats rate-limit/upstream-unhealthy as transient: logs a warning and skips per-iteration `report_error`; non-transient failures still call `report_error`.
Core observability classifier and tests `src/core/observability.rs`	Adds `pub const TRANSIENT_PROVIDER_HTTP_STATUSES` and `pub fn is_transient_provider_http_failure` that inspects Sentry event tags (`domain=llm_provider`, `failure=non_2xx`, numeric `status`) and unit tests covering drop/keep/edge cases.
Sentry before_send wiring and runtime smoke tests `src/main.rs`, `tests/observability_smoke.rs`, `Cargo.toml`	Adds a `before_send` defense-in-depth filter returning `None` for events classified transient; adds `sentry` dev-dep with `test` feature and runtime smoke tests using `TestTransport` to assert transient events are dropped while others are delivered.

Sequence Diagram

sequenceDiagram
  participant Client
  participant ToolLoop
  participant ProviderChat
  participant SentryClient
  participant TestTransport

  Client->>ToolLoop: trigger tool call
  ToolLoop->>ProviderChat: make provider.chat request
  ProviderChat-->>ToolLoop: error (status)
  ToolLoop->>ToolLoop: classify transient (is_rate_limited/is_upstream_unhealthy)
  alt transient
    ToolLoop->>ToolLoop: log warning, skip report_error
  else non-transient
    ToolLoop->>SentryClient: report_error event
  end
  SentryClient->>SentryClient: before_send filter (is_transient_provider_http_failure)
  alt dropped
    SentryClient->>TestTransport: nothing (event suppressed)
  else kept
    SentryClient->>TestTransport: deliver envelope
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

tinyhumansai/openhuman#1264: Prior PR that introduced/centralized core observability and related classifier work.
tinyhumansai/openhuman#1512: Recent PR that generalized 429-only suppression; this change further centralizes and broadens transient handling.

Suggested reviewers

senamakel

Poem

🐰 I hop through logs with careful paws,

I sniff the transient stormy flaws.
408, 429, gateways of woe,
Quietly I let the small ones go.
Sentry breathes easy — less bustle, more calm.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: preventing Sentry reporting of transient upstream HTTP failures (429/408/502/503/504) while preserving retry and aggregate signals.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

src/openhuman/providers/ops.rs (1)

156-164: ⚡ Quick win

Avoid duplicated transient-status lists across observability layers.

This status set is now duplicated here and in src/main.rs; keeping one shared source of truth will prevent future drift between call-site suppression and before_send filtering.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/providers/ops.rs` around lines 156 - 164, The transient HTTP
status set is duplicated between should_report_provider_http_failure and the
before_send filtering in main.rs; extract the shared list into a single public
constant or helper function (e.g., a pub static/const or pub fn
transient_statuses() -> &'static [reqwest::StatusCode]) in a common module,
update should_report_provider_http_failure to reference that shared symbol
instead of enumerating statuses, and update the before_send filter to use the
same shared symbol so both call sites use one source of truth (ensure the type
is reqwest::StatusCode and exported for use by both modules).

src/main.rs (1)

154-162: ⚡ Quick win

Narrow transient filter matching to provider-shaped events.

Current matching on only failure + status can suppress unrelated events that happen to use the same tags. Add one provider-specific guard (e.g., provider tag present) to avoid accidental drops.

Suggested hardening

 fn is_transient_provider_http_failure(event: &sentry::protocol::Event<'_>) -> bool {
     let tags = &event.tags;
+    if tags.get("provider").is_none() {
+        return false;
+    }
     if tags.get("failure").map(String::as_str) != Some("non_2xx") {
         return false;
     }
     matches!(
         tags.get("status").map(String::as_str),
         Some("429") | Some("408") | Some("502") | Some("503") | Some("504")
     )
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main.rs` around lines 154 - 162, The current
is_transient_provider_http_failure function only checks tags "failure" and
"status", which can match unrelated events; update this function to first verify
a provider-specific tag exists (e.g., require tags.get("provider").is_some() or
check tags.get("provider").map(String::as_str) matches expected provider names)
before applying the existing "failure" == "non_2xx" and transient "status"
matches; modify the guard at the start of is_transient_provider_http_failure to
return false when the provider tag is missing or not one of the provider-shaped
values so only provider-related events are filtered.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/agent/harness/tool_loop.rs`:
- Around line 419-421: The transient-error detection misses HTTP 408 (and other
upstream transient statuses), causing report_error() to run for per-iteration
timeouts; update the helper(s) so transient classification includes status code
checks: either extend is_rate_limited() to return true when status.as_u16() ==
408 (and consider 502, 504) or extend is_upstream_unhealthy() to check
resp/status codes (408, 502, 504) in addition to its text patterns, then ensure
tool_loop.rs transient branch (where is_rate_limited(&e) ||
is_upstream_unhealthy(&e) is evaluated) will treat 408 as transient; add unit
tests in reliable_tests.rs that construct error/responses representing 408 (and
optionally 502/504) and assert the helpers return true.

---

Nitpick comments:
In `@src/main.rs`:
- Around line 154-162: The current is_transient_provider_http_failure function
only checks tags "failure" and "status", which can match unrelated events;
update this function to first verify a provider-specific tag exists (e.g.,
require tags.get("provider").is_some() or check
tags.get("provider").map(String::as_str) matches expected provider names) before
applying the existing "failure" == "non_2xx" and transient "status" matches;
modify the guard at the start of is_transient_provider_http_failure to return
false when the provider tag is missing or not one of the provider-shaped values
so only provider-related events are filtered.

In `@src/openhuman/providers/ops.rs`:
- Around line 156-164: The transient HTTP status set is duplicated between
should_report_provider_http_failure and the before_send filtering in main.rs;
extract the shared list into a single public constant or helper function (e.g.,
a pub static/const or pub fn transient_statuses() -> &'static
[reqwest::StatusCode]) in a common module, update
should_report_provider_http_failure to reference that shared symbol instead of
enumerating statuses, and update the before_send filter to use the same shared
symbol so both call sites use one source of truth (ensure the type is
reqwest::StatusCode and exported for use by both modules).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a8e9e9c3-4c21-4bc0-981f-98615d38ed29

📥 Commits

Reviewing files that changed from the base of the PR and between 78d1f3d and f380d0e.

📒 Files selected for processing (3)

src/main.rs
src/openhuman/agent/harness/tool_loop.rs
src/openhuman/providers/ops.rs

Promote is_transient_provider_http_failure from main.rs into openhuman_core::core::observability so an integration test can drive the real sentry::init + before_send + transport chain end-to-end. tests/observability_smoke.rs uses sentry::test::TestTransport (gated behind sentry's `test` feature in dev-dependencies) to capture envelopes after they pass before_send. Four scenarios: * drops_per_attempt_429_503_504_408_502 — captured=0 * keeps_permanent_failures (400/401/403/404/500) — captured=5 * keeps_aggregate_all_exhausted_event — captured=1 * keeps_event_missing_status_tag — captured=1 This proves the runtime path matches the unit-tested logic and catches any regression where sentry-rust changes the before_send contract or tag visibility. Refs OPENHUMAN-TAURI-2E, 84, T Refs tinyhumansai#1472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/core/observability.rs`:
- Around line 83-89: The drop rule currently returns true for any event with
tags.get("failure") == "non_2xx" and transient status codes, which can suppress
unrelated events; tighten it by also checking the provider provenance tag (e.g.,
require tags.get("domain").map(String::as_str) == Some("provider")) before
returning true in the same block (the code around the matches! call using tags,
status, and failure in observability.rs, likely inside your before_send logic).
Update the condition so that all three conditions must hold (failure ==
"non_2xx", status in {429,408,502,503,504}, and domain == "provider"), and add a
regression test that constructs an event with failure="non_2xx" and a transient
status but domain!="provider" and asserts the event is kept.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3982d145-2a2a-42e7-a197-37094db5d869

📥 Commits

Reviewing files that changed from the base of the PR and between f380d0e and 7c2fa76.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (4)

Cargo.toml
src/core/observability.rs
src/main.rs
tests/observability_smoke.rs

✅ Files skipped from review due to trivial changes (1)

Cargo.toml

Per CodeRabbit review on tinyhumansai#1529: the transient status set (408/429/502/ 503/504) was duplicated between providers/ops.rs (typed StatusCode match) and core/observability.rs (string match in before_send). Extract a single pub const TRANSIENT_PROVIDER_HTTP_STATUSES: &[u16] so the two suppression layers stay in lockstep — update here and both call sites pick it up. No behavior change: every status that was filtered before is still filtered; every status that was reported before is still reported. The before_send filter now parses the status tag as u16 (with the parse-failure path joining the existing "missing tag" branch — keep, not drop). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per CodeRabbit review on tinyhumansai#1529 (Critical, tool_loop.rs:419-421): the transient classifier was missing 408/502/504. `is_rate_limited()` covers 429 only; `is_upstream_unhealthy()` was text-pattern based ("no healthy upstream", "service unavailable") and didn't catch the formatted "<provider> API error (<status>): …" string that `ops::api_error` actually emits. Result: 408/502/504 leaked per-iteration Sentry events even though `should_report_provider_http_ failure` filtered them at the api_error site — the tool_loop guard `is_rate_limited(&e) || is_upstream_unhealthy(&e)` returned false and called `report_error` for each retried attempt. Fix: extend is_upstream_unhealthy with (a) a `reqwest::Error` downcast that checks `status.as_u16()` against {408, 502, 503, 504} for typed errors, and (b) text-pattern fallbacks ("408 request timeout", "502 bad gateway", "504 gateway timeout") that match the `ops::api_error` format ultimately surfacing through anyhow's to_string() chain. Tests: three new cases in reliable_tests.rs covering the formatted strings; existing 503/no-healthy-upstream/service-unavailable cases and the negative case still pass. Refs OPENHUMAN-TAURI-2E / -84 / -T (per-iteration noise classes the parent PR targets). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per CodeRabbit review on tinyhumansai#1529 (Major, observability.rs:81-89): the before_send filter matched on `failure=non_2xx` + `status` alone, so an unrelated subsystem emitting those tags would have its event silently dropped. Add a `domain == "llm_provider"` precondition so only provider-originated events qualify for the transient filter. All current emit sites at `providers/ops.rs:218-227` already pass `"llm_provider"` as the domain argument to `report_error`, which the helper writes as a tag — so behavior is preserved for the target events; the guard only narrows what *future* call sites can match. Tests updated to include the domain tag in the existing fixtures, plus two new regression cases: events with no domain tag, and events from a different domain (`scheduler`) reusing the same `failure`/`status` shape. Both must surface to Sentry rather than being dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

oxoxDev · 2026-05-12T11:45:25Z

Hey @graycyrus — ready for review when you have a minute.

Addressed all three items from the CodeRabbit pass in three micro-commits (one logical change each, GPG-signed):

Commit	What	CR item
`4b5ede24`	Extract `TRANSIENT_PROVIDER_HTTP_STATUSES: &[u16]` in `core/observability.rs`; both `ops::should_report_provider_http_failure` and `is_transient_provider_http_failure` now reference the same slice — no more drift between the two layers. Pure refactor, no behavior change.	⚡ Nitpick — duplicated status set
`1388d790`	The real bug. `is_upstream_unhealthy` was only text-pattern matching ("no healthy upstream" / "service unavailable") and missed the `"<provider> API error (408 Request Timeout): …"` format that `ops::api_error` actually emits. Added a `reqwest::Error` downcast for typed 408/502/503/504 plus the three matching text fallbacks. `is_rate_limited` (429) stays unchanged. Three new tests in `reliable_tests.rs` covering each new pattern.	🔴 Critical — 408 leaked per-iteration
`0ebe64fa`	Added `domain == "llm_provider"` precondition to `is_transient_provider_http_failure`. Every current emit site already passes that domain to `report_error`, so behavior is preserved for target events — the guard only narrows what future call sites can accidentally match. Updated fixtures + added two regression tests (no-domain-tag and `scheduler` domain reusing the same shape both surface).	🟠 Major — filter scope too broad

All targeted tests pass locally:

cargo test --lib core::observability::tests::transient_filter — 6 passed
cargo test --lib openhuman::providers::reliable::tests::upstream_unhealthy — 7 passed (3 new + 4 existing)
cargo test --test observability_smoke — 4 passed (integration via sentry::test::TestTransport)
cargo fmt --check clean, cargo check --lib --tests clean (only pre-existing warnings in unrelated webview_accounts).

No changes to the headline transient-set semantics or the aggregate all_exhausted cascade signal — the per-attempt 429/408/502/503/504 drop and the actionable 4xx/5xx-permanent surface are still the contract. CI is re-running.

Let me know if anything else looks off.

coderabbitai

🧹 Nitpick comments (1)

src/openhuman/providers/reliable.rs (1)

91-106: ⚡ Quick win

Unify transient-status classification here to reduce drift and text-fragility

is_upstream_unhealthy still hardcodes transient codes/patterns locally. Since transient suppression now has a shared status set, this path can drift over time and may miss numeric-only text forms unless the exact phrase matches.

Proposed refactor

 pub(crate) fn is_upstream_unhealthy(err: &anyhow::Error) -> bool {
+    let is_upstream_transient = |code: u16| {
+        code != 429
+            && crate::core::observability::TRANSIENT_PROVIDER_HTTP_STATUSES.contains(&code)
+    };
+
     if let Some(reqwest_err) = err.downcast_ref::<reqwest::Error>() {
         if let Some(status) = reqwest_err.status() {
-            if matches!(status.as_u16(), 408 | 502 | 503 | 504) {
+            if is_upstream_transient(status.as_u16()) {
                 return true;
             }
         }
     }
-    let lower = err.to_string().to_lowercase();
+    let lower = err.to_string().to_lowercase();
+    for token in lower.split(|c: char| !c.is_ascii_digit()) {
+        if let Ok(code) = token.parse::<u16>() {
+            if is_upstream_transient(code) {
+                return true;
+            }
+        }
+    }
     lower.contains("no healthy upstream")
         || lower.contains("upstream unavailable")
         || lower.contains("service unavailable")
-        || lower.contains("408 request timeout")
-        || lower.contains("502 bad gateway")
-        || lower.contains("504 gateway timeout")
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/providers/reliable.rs` around lines 91 - 106, The
is_upstream_unhealthy function currently hardcodes transient HTTP codes and
phrase matches which can drift; replace the local hardcoded checks with the
shared transient-status classifier (e.g., the centralized set/function used
elsewhere — call it something like TRANSIENT_STATUSES or is_transient_status/
is_transient_http_status) by: 1) when downcasting to reqwest::Error, query the
shared status-set with the numeric status code instead of matching
408/502/503/504 inline; 2) for string-based errors, normalize the error string
and search for any status codes or reason phrases that the shared classifier
exposes (or call a helper that tests the string against the shared transient
patterns) rather than repeating literal phrase checks in is_upstream_unhealthy;
update imports and function calls to reference the shared symbol so this
function stays in sync with the single source of truth.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/openhuman/providers/reliable.rs`:
- Around line 91-106: The is_upstream_unhealthy function currently hardcodes
transient HTTP codes and phrase matches which can drift; replace the local
hardcoded checks with the shared transient-status classifier (e.g., the
centralized set/function used elsewhere — call it something like
TRANSIENT_STATUSES or is_transient_status/ is_transient_http_status) by: 1) when
downcasting to reqwest::Error, query the shared status-set with the numeric
status code instead of matching 408/502/503/504 inline; 2) for string-based
errors, normalize the error string and search for any status codes or reason
phrases that the shared classifier exposes (or call a helper that tests the
string against the shared transient patterns) rather than repeating literal
phrase checks in is_upstream_unhealthy; update imports and function calls to
reference the shared symbol so this function stays in sync with the single
source of truth.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e74df2ca-6201-4d53-a225-bb99f34968b4

📥 Commits

Reviewing files that changed from the base of the PR and between 7c2fa76 and 0ebe64f.

📒 Files selected for processing (5)

src/core/observability.rs
src/openhuman/providers/ops.rs
src/openhuman/providers/reliable.rs
src/openhuman/providers/reliable_tests.rs
tests/observability_smoke.rs

✅ Files skipped from review due to trivial changes (1)

src/openhuman/providers/reliable_tests.rs

🚧 Files skipped from review as they are similar to previous changes (2)

tests/observability_smoke.rs
src/core/observability.rs

oxoxDev and others added 3 commits May 12, 2026 16:00

oxoxDev requested a review from a team May 12, 2026 10:34

coderabbitai Bot requested changes May 12, 2026

View reviewed changes

Comment thread src/openhuman/agent/harness/tool_loop.rs

coderabbitai Bot requested changes May 12, 2026

View reviewed changes

Comment thread src/core/observability.rs Outdated

oxoxDev and others added 3 commits May 12, 2026 17:11

coderabbitai Bot reviewed May 12, 2026

View reviewed changes

coderabbitai Bot approved these changes May 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(observability): drop transient upstream HTTP from Sentry (429/408/502/503/504)#1529

fix(observability): drop transient upstream HTTP from Sentry (429/408/502/503/504)#1529
oxoxDev wants to merge 7 commits into
tinyhumansai:mainfrom
oxoxDev:fix/obs-rate-limit

oxoxDev commented May 12, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 12, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

oxoxDev commented May 12, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

oxoxDev commented May 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Submission Checklist

Impact

Related

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Commit & Branch

Validation Run

Validation Blocked

Behavior Changes

Parity Contract

Duplicate / Superseded PR Handling

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

oxoxDev commented May 12, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

oxoxDev commented May 12, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 12, 2026 •

edited

Loading