fix(observability): drop transient upstream HTTP from Sentry (429/408/502/503/504)#1529

Open
oxoxDev wants to merge 7 commits into tinyhumansai:main from oxoxDev:fix/obs-rate-limit

@oxoxDev oxoxDev commented May 12, 2026

Summary

  • Treat 429/408/502/503/504 from upstream providers as transient — no per-attempt Sentry report (already retried + failed-over by reliable.rs).
  • Demote transient agent.provider_chat failures to warn!; the aggregate all_exhausted event still surfaces real outages.
  • Defense-in-depth before_send filter in main.rs drops residual transient events that bypass the call-site classifier.

Problem

Three Sentry issues bury actionable signal under per-attempt transient-upstream noise, and two more duplicate the cascade signal:

| Sentry ID | Symptom | Events |
| --- | --- | --- |
| OPENHUMAN-TAURI-2E | llm_provider.api_error … 429 Too Many Requests | ~1393 / day |
| OPENHUMAN-TAURI-84 | … 503 Service Unavailable | ~1050 |
| OPENHUMAN-TAURI-T | … 504 Gateway Timeout | ~871 |
| OPENHUMAN-TAURI-3Y | llm_provider.reliable_chat … All providers/models failed (cascade) | 23 |
| OPENHUMAN-TAURI-3Z | agent.provider_chat … All providers/models failed (cascade) | 23 |

providers/ops.rs::should_report_provider_http_failure only excluded 429, so 503/504 were still reported to Sentry on every attempt. agent/harness/tool_loop.rs reported every provider_chat error, duplicating the aggregate event that reliable.rs already emits.

Solution

  1. providers/ops.rs — expand the no-report set to {429, 408, 502, 503, 504}. Doc comment updated to name the Sentry issues; existing test extended to cover all five transient codes plus auth/4xx/5xx-permanent codes that still report.
  2. agent/harness/tool_loop.rs — classify the chat error via existing is_rate_limited / is_upstream_unhealthy helpers before calling report_error. Transient → tracing::warn! only; permanent → unchanged Sentry path.
  3. main.rs: a before_send filter inspects Sentry event tags (failure=non_2xx + status in transient set) and short-circuits by dropping the event. Four unit tests cover dropped vs kept variants (transient drop, permanent keep, aggregate all_exhausted keep, missing tag keep).
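
The tool_loop guard in step 2 can be sketched roughly as follows. This is an illustration only, under stated assumptions: the real is_rate_limited / is_upstream_unhealthy helpers take an error value rather than a bare status code, and the Escalation enum and classify_chat_failure name are hypothetical.

```rust
/// Where a failed provider.chat attempt should be surfaced.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq, Eq)]
pub enum Escalation {
    /// Transient failure: structured warn-level log only.
    WarnOnly,
    /// Permanent failure: escalate to Sentry via report_error.
    Sentry,
}

/// Stand-in for the real helper; assumed to be true for HTTP 429 only.
pub fn is_rate_limited(status: u16) -> bool {
    status == 429
}

/// Stand-in for the real helper; assumed to cover 408/502/503/504.
pub fn is_upstream_unhealthy(status: u16) -> bool {
    matches!(status, 408 | 502 | 503 | 504)
}

/// Classify before reporting: transient errors are demoted to a warning,
/// everything else keeps the existing Sentry path.
pub fn classify_chat_failure(status: u16) -> Escalation {
    if is_rate_limited(status) || is_upstream_unhealthy(status) {
        Escalation::WarnOnly
    } else {
        Escalation::Sentry
    }
}
```

With these stand-ins, a 429 or 503 maps to WarnOnly while a 401 or 500 still maps to Sentry; the error itself is expected to propagate unchanged in both cases.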

The aggregate report_error in reliable.rs::reliable_chat_* (failure=all_exhausted) intentionally still fires — that's the cascade signal for genuine outages, one event per request.
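
A rough sketch of that "one aggregate event per request" behaviour, assuming a simplified model: attempts are pre-computed (provider, outcome) pairs, and Reporter is a hypothetical in-memory stand-in for the logging and Sentry sinks. None of these names come from reliable.rs.

```rust
/// Outcome of one upstream attempt: Ok(reply) or Err(http_status).
pub type Attempt = Result<String, u16>;

/// Hypothetical sink that records what would be logged or sent to Sentry.
#[derive(Debug, Default)]
pub struct Reporter {
    pub warnings: Vec<String>,
    pub sentry_events: Vec<String>,
}

/// Walks the attempts in order. Each failure only produces a warning;
/// a single aggregate event is emitted if every attempt fails.
pub fn reliable_chat<I>(attempts: I, reporter: &mut Reporter) -> Option<String>
where
    I: IntoIterator<Item = (&'static str, Attempt)>,
{
    let mut failures = Vec::new();
    for (provider, outcome) in attempts {
        match outcome {
            // First success wins: no Sentry event at all.
            Ok(reply) => return Some(reply),
            Err(status) => {
                reporter
                    .warnings
                    .push(format!("provider={} status={}", provider, status));
                failures.push(format!("{}:{}", provider, status));
            }
        }
    }
    // Every attempt failed: exactly one cascade event for the request.
    reporter.sentry_events.push(format!(
        "failure=all_exhausted attempts=[{}]",
        failures.join(", ")
    ));
    None
}
```

The point of the design is that per-attempt noise stays in the logs, while the Sentry event count scales with failed requests rather than failed attempts.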

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) — extended skips_sentry_report_for_transient_upstream_statuses to cover transient set + permanent set; added four transient_filter_tests for the before_send filter
  • Diff coverage ≥ 80% — every changed line in the classifier, tool_loop guard, and before_send filter is exercised by the targeted tests. cargo test --lib openhuman::providers::ops::tests::skips_sentry_report_for_transient_upstream_statuses + cargo test --bin openhuman-core transient_filter both pass; full lib suite not run locally per workspace guidance (let CI matrix run it).
  • Coverage matrix updated — N/A: observability-only fix, no new user-visible feature row
  • All affected feature IDs from the matrix are listed in the PR description under ## Related - N/A: behaviour-only change
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • Manual smoke checklist updated if this touches release-cut surfaces — N/A: no release-cut surface touched
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Runtime/platform: desktop (macOS, Windows, Linux) — Rust core only. No frontend / Tauri-shell changes.
  • Observability: Sentry event volume drops ~3300+/day on the top-three transient noise issues. Real outage signal (reliable_chat … all_exhausted) untouched. No log lost — every transient drop still emits a structured warn! trace with provider, model, iteration, error fields.
  • Compatibility: no API/RPC surface change; no behavioral change for retry/fallback (the error still propagates through reliable.rs exactly as before).
  • Security: secret scrubbing path in before_send unchanged; new filter runs strictly before existing logic.
  • Performance: tag lookup is a BTreeMap get + matches! on a five-arm pattern — negligible per-event overhead.

Related

  • Closes OPENHUMAN-TAURI-2E
  • Closes OPENHUMAN-TAURI-84
  • Closes OPENHUMAN-TAURI-T
  • Closes OPENHUMAN-TAURI-3Y
  • Closes OPENHUMAN-TAURI-3Z
  • Refs Track and fix active Sentry issues #1472 (Sentry triage meta-issue)
  • Follow-up PR(s)/TODOs: insufficient-budget (-3M, ~4 events) is not covered by the should_report_provider_http_failure change: it returns 400, which remains reportable and is deliberately left alone (legitimate misconfig surface)

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A — Sentry-driven, not Linear
  • URL: N/A

Commit & Branch

  • Branch: fix/obs-rate-limit
  • Commit SHA: see git log on PR

Validation Run

  • pnpm --filter openhuman-app format:check — N/A: no app/ changes
  • pnpm typecheck — N/A: no TS changes
  • Focused tests:
    • cargo test --lib openhuman::providers::ops::tests::skips_sentry_report_for_transient_upstream_statuses (1 passed)
    • cargo test --bin openhuman-core transient_filter (4 passed)
  • Rust fmt/check (if changed): cargo fmt --check clean; cargo check --manifest-path Cargo.toml --bins clean
  • Tauri fmt/check (if changed): N/A — no app/src-tauri/ changes

Validation Blocked

  • command: N/A
  • error: N/A
  • impact: N/A

Behavior Changes

  • Intended behavior change: per-attempt 429/408/502/503/504 LLM provider failures no longer produce Sentry events; agent.provider_chat transient errors no longer produce Sentry events; new before_send defense-in-depth filter for any future bypass
  • User-visible effect: none direct — observability change. Indirect benefit: real Sentry issues become legible because transient noise drops.

Parity Contract

  • Legacy behavior preserved: error still propagates through reliable.rs retry/fallback exactly as before; aggregate all_exhausted Sentry event still fires for genuine outages
  • Guard/fallback/dispatch parity checks: tests assert both "drop transient" and "keep permanent" / "keep aggregate" branches

Duplicate / Superseded PR Handling

  • Duplicate PR(s): N/A
  • Canonical PR: this PR
  • Resolution: N/A

Summary by CodeRabbit

  • Bug Fixes

    • Reduce monitoring noise by suppressing per-attempt reports for transient upstream/provider HTTP failures (expanded set: rate limits, timeouts, temporary outages) while preserving permanent and aggregate failure reports.
  • Tests

    • Add runtime smoke tests and unit tests to validate transient filtering and ensure permanent/aggregate events still report.
  • Documentation

    • Expanded observability docs describing the reporting/filtering behavior.

oxoxDev and others added 3 commits May 12, 2026 16:00
…3/504

Transient upstream statuses are retried + failed-over by reliable.rs; the
aggregate "all providers exhausted" event already covers genuine outages.
Reporting each individual transient failure floods Sentry — OPENHUMAN-TAURI-2E
(~1393 events/day), -84 (~1050), -T (~871) all originate from this path.

Joins 429 (already excluded) with 408/502/503/504 in the no-report set.
Extends the existing classifier test to cover all five transient codes
and verifies auth/4xx/5xx-permanent codes still report.

Refs OPENHUMAN-TAURI-2E, 84, T
Refs tinyhumansai#1472

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reliable.rs already retries rate-limited / upstream-unhealthy failures
across providers + models, and emits a single aggregated
"All providers/models failed" Sentry event when every attempt is
exhausted. The per-iteration report_error() in tool_loop duplicates
that signal — OPENHUMAN-TAURI-3Y / 3Z (~46 events) sit on top of the
raw per-attempt floods (-2E/84/T, ~3300 events).

Classify the error before reporting: rate-limit + upstream-unhealthy
demote to warn-level structured log; permanent failures still escalate
to Sentry via report_error.

Refs OPENHUMAN-TAURI-3Y, 3Z
Refs tinyhumansai#1472

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p events

Defense-in-depth: even with call-site classifier
(should_report_provider_http_failure) excluding 429/408/502/503/504,
any future call site that emits a tracing::error! with the same shape
(failure=non_2xx + status=<transient>) would still hit Sentry.

is_transient_provider_http_failure inspects tags on Sentry events at
the before_send hook and short-circuits the same set. Aggregate
all_exhausted reports + auth/4xx/permanent failures unchanged.

Four unit tests cover the dropped + kept variants.

Refs OPENHUMAN-TAURI-2E, 84, T
Refs tinyhumansai#1472

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@oxoxDev oxoxDev requested a review from a team May 12, 2026 10:34
coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Per-attempt provider HTTP failures with statuses 408, 429, 502, 503, 504 are classified as transient and suppressed from per-iteration Sentry reporting via provider-level predicates, tool-loop conditional reporting, a core classifier, and a before_send defense-in-depth hook; tests and a runtime smoke test validate behavior.

Changes

Transient provider failure suppression

| Layer | File(s) | Summary |
| --- | --- | --- |
| Provider HTTP transient status classification | src/openhuman/providers/ops.rs | should_report_provider_http_failure expanded to treat 408, 429, 502, 503, 504 as transient (do not report per-attempt); tests updated to reflect the broader transient set. |
| Provider upstream-unhealthy classifier & tests | src/openhuman/providers/reliable.rs, src/openhuman/providers/reliable_tests.rs | is_upstream_unhealthy now downcasts reqwest::Error, treats 408/502/503/504 as transient, and recognizes additional textual upstream-unavailable markers; tests added for 408/502/504 cases. |
| Tool-loop transient failure handling | src/openhuman/agent/harness/tool_loop.rs | provider.chat error handling treats rate-limit/upstream-unhealthy as transient: logs a warning and skips per-iteration report_error; non-transient failures still call report_error. |
| Core observability classifier and tests | src/core/observability.rs | Adds pub const TRANSIENT_PROVIDER_HTTP_STATUSES and pub fn is_transient_provider_http_failure that inspects Sentry event tags (domain=llm_provider, failure=non_2xx, numeric status) and unit tests covering drop/keep/edge cases. |
| Sentry before_send wiring and runtime smoke tests | src/main.rs, tests/observability_smoke.rs, Cargo.toml | Adds a before_send defense-in-depth filter returning None for events classified transient; adds sentry dev-dep with test feature and runtime smoke tests using TestTransport to assert transient events are dropped while others are delivered. |

Sequence Diagram

sequenceDiagram
  participant Client
  participant ToolLoop
  participant ProviderChat
  participant SentryClient
  participant TestTransport

  Client->>ToolLoop: trigger tool call
  ToolLoop->>ProviderChat: make provider.chat request
  ProviderChat-->>ToolLoop: error (status)
  ToolLoop->>ToolLoop: classify transient (is_rate_limited/is_upstream_unhealthy)
  alt transient
    ToolLoop->>ToolLoop: log warning, skip report_error
  else non-transient
    ToolLoop->>SentryClient: report_error event
  end
  SentryClient->>SentryClient: before_send filter (is_transient_provider_http_failure)
  alt dropped
    SentryClient->>TestTransport: nothing (event suppressed)
  else kept
    SentryClient->>TestTransport: deliver envelope
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • senamakel

Poem

🐰 I hop through logs with careful paws,

I sniff the transient stormy flaws.
408, 429, gateways of woe,
Quietly I let the small ones go.
Sentry breathes easy — less bustle, more calm.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: preventing Sentry reporting of transient upstream HTTP failures (429/408/502/503/504) while preserving retry and aggregate signals. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/openhuman/providers/ops.rs (1)

156-164: ⚡ Quick win

Avoid duplicated transient-status lists across observability layers.

This status set is now duplicated here and in src/main.rs; keeping one shared source of truth will prevent future drift between call-site suppression and before_send filtering.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/providers/ops.rs` around lines 156 - 164, The transient HTTP
status set is duplicated between should_report_provider_http_failure and the
before_send filtering in main.rs; extract the shared list into a single public
constant or helper function (e.g., a pub static/const or pub fn
transient_statuses() -> &'static [reqwest::StatusCode]) in a common module,
update should_report_provider_http_failure to reference that shared symbol
instead of enumerating statuses, and update the before_send filter to use the
same shared symbol so both call sites use one source of truth (ensure the type
is reqwest::StatusCode and exported for use by both modules).
src/main.rs (1)

154-162: ⚡ Quick win

Narrow transient filter matching to provider-shaped events.

Current matching on only failure + status can suppress unrelated events that happen to use the same tags. Add one provider-specific guard (e.g., provider tag present) to avoid accidental drops.

Suggested hardening
 fn is_transient_provider_http_failure(event: &sentry::protocol::Event<'_>) -> bool {
     let tags = &event.tags;
+    if tags.get("provider").is_none() {
+        return false;
+    }
     if tags.get("failure").map(String::as_str) != Some("non_2xx") {
         return false;
     }
     matches!(
         tags.get("status").map(String::as_str),
         Some("429") | Some("408") | Some("502") | Some("503") | Some("504")
     )
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/main.rs` around lines 154 - 162, The current
is_transient_provider_http_failure function only checks tags "failure" and
"status", which can match unrelated events; update this function to first verify
a provider-specific tag exists (e.g., require tags.get("provider").is_some() or
check tags.get("provider").map(String::as_str) matches expected provider names)
before applying the existing "failure" == "non_2xx" and transient "status"
matches; modify the guard at the start of is_transient_provider_http_failure to
return false when the provider tag is missing or not one of the provider-shaped
values so only provider-related events are filtered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/agent/harness/tool_loop.rs`:
- Around line 419-421: The transient-error detection misses HTTP 408 (and other
upstream transient statuses), causing report_error() to run for per-iteration
timeouts; update the helper(s) so transient classification includes status code
checks: either extend is_rate_limited() to return true when status.as_u16() ==
408 (and consider 502, 504) or extend is_upstream_unhealthy() to check
resp/status codes (408, 502, 504) in addition to its text patterns, then ensure
tool_loop.rs transient branch (where is_rate_limited(&e) ||
is_upstream_unhealthy(&e) is evaluated) will treat 408 as transient; add unit
tests in reliable_tests.rs that construct error/responses representing 408 (and
optionally 502/504) and assert the helpers return true.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a8e9e9c3-4c21-4bc0-981f-98615d38ed29

📥 Commits

Reviewing files that changed from the base of the PR and between 78d1f3d and f380d0e.

📒 Files selected for processing (3)
  • src/main.rs
  • src/openhuman/agent/harness/tool_loop.rs
  • src/openhuman/providers/ops.rs

Comment thread src/openhuman/agent/harness/tool_loop.rs
Promote is_transient_provider_http_failure from main.rs into
openhuman_core::core::observability so an integration test can drive
the real sentry::init + before_send + transport chain end-to-end.

tests/observability_smoke.rs uses sentry::test::TestTransport (gated
behind sentry's `test` feature in dev-dependencies) to capture
envelopes after they pass before_send. Four scenarios:

  * drops_per_attempt_429_503_504_408_502 — captured=0
  * keeps_permanent_failures (400/401/403/404/500) — captured=5
  * keeps_aggregate_all_exhausted_event — captured=1
  * keeps_event_missing_status_tag — captured=1

This proves the runtime path matches the unit-tested logic and
catches any regression where sentry-rust changes the before_send
contract or tag visibility.

Refs OPENHUMAN-TAURI-2E, 84, T
Refs tinyhumansai#1472

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
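
For readers unfamiliar with the pattern, the smoke test's shape can be approximated without the sentry crate. The sketch below is an assumption-laden stand-in: Event, before_send, and CapturingTransport are hypothetical local types that mimic "hook returns None to drop, transport records what survives", and the filter is the two-tag shape (failure + status) described at this point in the PR.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a Sentry event: only the tags matter here.
#[derive(Debug, Clone, Default)]
pub struct Event {
    pub tags: BTreeMap<String, String>,
}

/// Builds an event from (key, value) tag pairs.
pub fn event(tags: &[(&str, &str)]) -> Event {
    Event {
        tags: tags
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}

/// before_send-style hook: None drops the event, Some(event) keeps it.
pub fn before_send(event: Event) -> Option<Event> {
    let is_non_2xx = event.tags.get("failure").map(String::as_str) == Some("non_2xx");
    let is_transient = matches!(
        event.tags.get("status").map(String::as_str),
        Some("429") | Some("408") | Some("502") | Some("503") | Some("504")
    );
    if is_non_2xx && is_transient {
        None
    } else {
        Some(event)
    }
}

/// Hypothetical capturing transport: records whatever survives before_send.
#[derive(Debug, Default)]
pub struct CapturingTransport {
    pub captured: Vec<Event>,
}

impl CapturingTransport {
    pub fn send(&mut self, event: Event) {
        if let Some(kept) = before_send(event) {
            self.captured.push(kept);
        }
    }
}
```

Feeding the five transient statuses through send should leave captured empty, while permanent statuses, an all_exhausted event, and an event with no status tag should each be recorded.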
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/core/observability.rs`:
- Around line 83-89: The drop rule currently returns true for any event with
tags.get("failure") == "non_2xx" and transient status codes, which can suppress
unrelated events; tighten it by also checking the provider provenance tag (e.g.,
require tags.get("domain").map(String::as_str) == Some("provider")) before
returning true in the same block (the code around the matches! call using tags,
status, and failure in observability.rs, likely inside your before_send logic).
Update the condition so that all three conditions must hold (failure ==
"non_2xx", status in {429,408,502,503,504}, and domain == "provider"), and add a
regression test that constructs an event with failure="non_2xx" and a transient
status but domain!="provider" and asserts the event is kept.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3982d145-2a2a-42e7-a197-37094db5d869

📥 Commits

Reviewing files that changed from the base of the PR and between f380d0e and 7c2fa76.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • Cargo.toml
  • src/core/observability.rs
  • src/main.rs
  • tests/observability_smoke.rs
✅ Files skipped from review due to trivial changes (1)
  • Cargo.toml

Comment thread src/core/observability.rs Outdated
oxoxDev and others added 3 commits May 12, 2026 17:11
Per CodeRabbit review on tinyhumansai#1529: the transient status set (408/429/502/
503/504) was duplicated between providers/ops.rs (typed StatusCode
match) and core/observability.rs (string match in before_send). Extract
a single pub const TRANSIENT_PROVIDER_HTTP_STATUSES: &[u16] so the two
suppression layers stay in lockstep — update here and both call sites
pick it up.

No behavior change: every status that was filtered before is still
filtered; every status that was reported before is still reported. The
before_send filter now parses the status tag as u16 (with the
parse-failure path joining the existing "missing tag" branch — keep,
not drop).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
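
As a rough illustration of the single-source-of-truth refactor described in this commit: the constant name and its &[u16] type come from the commit message, while the function signatures are simplified assumptions (the real call-site classifier works on a typed status code, and status_tag_is_transient is a hypothetical helper name).

```rust
/// Shared transient set, referenced by both suppression layers.
pub const TRANSIENT_PROVIDER_HTTP_STATUSES: &[u16] = &[408, 429, 502, 503, 504];

/// Call-site layer (simplified to u16): report only non-transient statuses.
pub fn should_report_provider_http_failure(status: u16) -> bool {
    !TRANSIENT_PROVIDER_HTTP_STATUSES.contains(&status)
}

/// before_send layer: the status arrives as a string tag.
/// A missing or unparseable tag is treated as "not transient",
/// so the event is kept rather than dropped.
pub fn status_tag_is_transient(tag: Option<&str>) -> bool {
    tag.and_then(|s| s.parse::<u16>().ok())
        .map_or(false, |code| TRANSIENT_PROVIDER_HTTP_STATUSES.contains(&code))
}
```

Adding or removing a status in the constant changes both layers at once, which is the drift the review comment was concerned about.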
Per CodeRabbit review on tinyhumansai#1529 (Critical, tool_loop.rs:419-421): the
transient classifier was missing 408/502/504. `is_rate_limited()`
covers 429 only; `is_upstream_unhealthy()` was text-pattern based
("no healthy upstream", "service unavailable") and didn't catch the
formatted "<provider> API error (<status>): …" string that
`ops::api_error` actually emits. Result: 408/502/504 leaked
per-iteration Sentry events even though `should_report_provider_http_
failure` filtered them at the api_error site — the tool_loop guard
`is_rate_limited(&e) || is_upstream_unhealthy(&e)` returned false and
called `report_error` for each retried attempt.

Fix: extend is_upstream_unhealthy with (a) a `reqwest::Error` downcast
that checks `status.as_u16()` against {408, 502, 503, 504} for typed
errors, and (b) text-pattern fallbacks ("408 request timeout",
"502 bad gateway", "504 gateway timeout") that match the
`ops::api_error` format ultimately surfacing through anyhow's
to_string() chain.

Tests: three new cases in reliable_tests.rs covering the formatted
strings; existing 503/no-healthy-upstream/service-unavailable cases
and the negative case still pass.

Refs OPENHUMAN-TAURI-2E / -84 / -T (per-iteration noise classes the
parent PR targets).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
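
The text-pattern half of this fix can be sketched in isolation. This is a simplified assumption: it works on a plain message string instead of an anyhow error, omits the reqwest downcast, and the function name is hypothetical. The pattern list mirrors the phrases named in the commit message and review diff.

```rust
/// Text fallback only: checks a formatted error message for known
/// upstream-unhealthy markers, case-insensitively.
pub fn is_upstream_unhealthy_message(message: &str) -> bool {
    let lower = message.to_lowercase();
    [
        "no healthy upstream",
        "upstream unavailable",
        "service unavailable",
        "408 request timeout",
        "502 bad gateway",
        "504 gateway timeout",
    ]
    .iter()
    .any(|pattern| lower.contains(pattern))
}
```

For example, a message shaped like "example API error (408 Request Timeout): upstream timed out" (provider name invented for illustration) matches the new 408 pattern, whereas a 401 Unauthorized message matches none of them and stays on the Sentry path.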
Per CodeRabbit review on tinyhumansai#1529 (Major, observability.rs:81-89): the
before_send filter matched on `failure=non_2xx` + `status` alone, so an
unrelated subsystem emitting those tags would have its event silently
dropped. Add a `domain == "llm_provider"` precondition so only
provider-originated events qualify for the transient filter.

All current emit sites at `providers/ops.rs:218-227` already pass
`"llm_provider"` as the domain argument to `report_error`, which the
helper writes as a tag — so behavior is preserved for the target
events; the guard only narrows what *future* call sites can match.

Tests updated to include the domain tag in the existing fixtures, plus
two new regression cases: events with no domain tag, and events from
a different domain (`scheduler`) reusing the same `failure`/`status`
shape. Both must surface to Sentry rather than being dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
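
A minimal sketch of the three-condition filter this commit describes, assuming the tags are available as a BTreeMap<String, String>. The real function takes a Sentry event; here it takes the tag map directly, and the tags helper is hypothetical.

```rust
use std::collections::BTreeMap;

pub const TRANSIENT_PROVIDER_HTTP_STATUSES: &[u16] = &[408, 429, 502, 503, 504];

/// Drops an event only when all three conditions hold:
/// domain == "llm_provider", failure == "non_2xx", status is transient.
pub fn is_transient_provider_http_failure(tags: &BTreeMap<String, String>) -> bool {
    if tags.get("domain").map(String::as_str) != Some("llm_provider") {
        return false;
    }
    if tags.get("failure").map(String::as_str) != Some("non_2xx") {
        return false;
    }
    tags.get("status")
        .and_then(|s| s.parse::<u16>().ok())
        .map_or(false, |code| TRANSIENT_PROVIDER_HTTP_STATUSES.contains(&code))
}

/// Hypothetical fixture helper: builds a tag map from (key, value) pairs.
pub fn tags(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

The two regression cases then read naturally: the same failure/status shape with no domain tag, or with domain=scheduler, returns false and the event surfaces to Sentry.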
oxoxDev commented May 12, 2026

Hey @graycyrus — ready for review when you have a minute.

Addressed all three items from the CodeRabbit pass in three micro-commits (one logical change each, GPG-signed):

| Commit | What | CR item |
| --- | --- | --- |
| 4b5ede24 | Extract TRANSIENT_PROVIDER_HTTP_STATUSES: &[u16] in core/observability.rs; both ops::should_report_provider_http_failure and is_transient_provider_http_failure now reference the same slice, so there is no more drift between the two layers. Pure refactor, no behavior change. | ⚡ Nitpick: duplicated status set |
| 1388d790 | The real bug. is_upstream_unhealthy was only text-pattern matching ("no healthy upstream" / "service unavailable") and missed the "<provider> API error (408 Request Timeout): …" format that ops::api_error actually emits. Added a reqwest::Error downcast for typed 408/502/503/504 plus the three matching text fallbacks. is_rate_limited (429) stays unchanged. Three new tests in reliable_tests.rs covering each new pattern. | 🔴 Critical: 408 leaked per-iteration |
| 0ebe64fa | Added domain == "llm_provider" precondition to is_transient_provider_http_failure. Every current emit site already passes that domain to report_error, so behavior is preserved for target events; the guard only narrows what future call sites can accidentally match. Updated fixtures + added two regression tests (no-domain-tag and scheduler domain reusing the same shape both surface). | 🟠 Major: filter scope too broad |

All targeted tests pass locally:

  • cargo test --lib core::observability::tests::transient_filter — 6 passed
  • cargo test --lib openhuman::providers::reliable::tests::upstream_unhealthy — 7 passed (3 new + 4 existing)
  • cargo test --test observability_smoke — 4 passed (integration via sentry::test::TestTransport)
  • cargo fmt --check clean, cargo check --lib --tests clean (only pre-existing warnings in unrelated webview_accounts).

No changes to the headline transient-set semantics or the aggregate all_exhausted cascade signal — the per-attempt 429/408/502/503/504 drop and the actionable 4xx/5xx-permanent surface are still the contract. CI is re-running.

Let me know if anything else looks off.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/openhuman/providers/reliable.rs (1)

91-106: ⚡ Quick win

Unify transient-status classification here to reduce drift and text-fragility

is_upstream_unhealthy still hardcodes transient codes/patterns locally. Since transient suppression now has a shared status set, this path can drift over time and may miss numeric-only text forms unless the exact phrase matches.

Proposed refactor
 pub(crate) fn is_upstream_unhealthy(err: &anyhow::Error) -> bool {
+    let is_upstream_transient = |code: u16| {
+        code != 429
+            && crate::core::observability::TRANSIENT_PROVIDER_HTTP_STATUSES.contains(&code)
+    };
+
     if let Some(reqwest_err) = err.downcast_ref::<reqwest::Error>() {
         if let Some(status) = reqwest_err.status() {
-            if matches!(status.as_u16(), 408 | 502 | 503 | 504) {
+            if is_upstream_transient(status.as_u16()) {
                 return true;
             }
         }
     }
-    let lower = err.to_string().to_lowercase();
+    let lower = err.to_string().to_lowercase();
+    for token in lower.split(|c: char| !c.is_ascii_digit()) {
+        if let Ok(code) = token.parse::<u16>() {
+            if is_upstream_transient(code) {
+                return true;
+            }
+        }
+    }
     lower.contains("no healthy upstream")
         || lower.contains("upstream unavailable")
         || lower.contains("service unavailable")
-        || lower.contains("408 request timeout")
-        || lower.contains("502 bad gateway")
-        || lower.contains("504 gateway timeout")
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/providers/reliable.rs` around lines 91 - 106, The
is_upstream_unhealthy function currently hardcodes transient HTTP codes and
phrase matches which can drift; replace the local hardcoded checks with the
shared transient-status classifier (e.g., the centralized set/function used
elsewhere — call it something like TRANSIENT_STATUSES or is_transient_status/
is_transient_http_status) by: 1) when downcasting to reqwest::Error, query the
shared status-set with the numeric status code instead of matching
408/502/503/504 inline; 2) for string-based errors, normalize the error string
and search for any status codes or reason phrases that the shared classifier
exposes (or call a helper that tests the string against the shared transient
patterns) rather than repeating literal phrase checks in is_upstream_unhealthy;
update imports and function calls to reference the shared symbol so this
function stays in sync with the single source of truth.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e74df2ca-6201-4d53-a225-bb99f34968b4

📥 Commits

Reviewing files that changed from the base of the PR and between 7c2fa76 and 0ebe64f.

📒 Files selected for processing (5)
  • src/core/observability.rs
  • src/openhuman/providers/ops.rs
  • src/openhuman/providers/reliable.rs
  • src/openhuman/providers/reliable_tests.rs
  • tests/observability_smoke.rs
✅ Files skipped from review due to trivial changes (1)
  • src/openhuman/providers/reliable_tests.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/observability_smoke.rs
  • src/core/observability.rs
