feat(error-tracking): integrate Sentry for tool-call and transport failures#125

Merged
alexkuzmik merged 3 commits into main from feat/sentry-integration on May 25, 2026

Conversation

@alexkuzmik
Contributor

Details

Adds Sentry-based error reporting for the failures that need a human stack trace — tool-call bugs and transport crashes — alongside the existing analytics funnel. The two are independent: analytics is the low-cardinality install/usage funnel, Sentry is the stack-trace channel. Either can be opted out without affecting the other.

New module src/opik_mcp/error_tracking.py

  • setup_sentry(settings) -> bool — initializes Sentry and binds the global scope. Pytest-guarded.
  • capture_exception(exc, *, tags=, extras=, transaction=, fingerprint=) — pushes a fresh scope per event so concurrent HTTP-mode tool calls don't pollute each other's context.
  • before_send filter: 30-event/process cap (kills retry-loop floods regardless of level — fatal included), pytest defence-in-depth.
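
The per-process cap and pytest guard described above could look roughly like the sketch below. It uses only the standard library; `make_before_send`, `running_under_pytest`, and the `in_pytest` parameter are illustrative names, not the module's actual code. The `(event, hint)` signature and the "return `None` to drop" convention follow the Sentry SDK's `before_send` hook.

```python
import os
import threading
from typing import Any, Callable, Optional

Event = dict[str, Any]


def running_under_pytest() -> bool:
    # pytest sets PYTEST_CURRENT_TEST while a test is executing.
    return "PYTEST_CURRENT_TEST" in os.environ


def make_before_send(
    max_events: int = 30,
    in_pytest: Callable[[], bool] = running_under_pytest,
) -> Callable[[Event, dict[str, Any]], Optional[Event]]:
    """Build a before_send hook that drops events past a per-process cap."""
    lock = threading.Lock()
    sent = 0

    def before_send(event: Event, hint: dict[str, Any]) -> Optional[Event]:
        nonlocal sent
        if in_pytest():
            return None  # defence in depth: never report from a test run
        with lock:
            if sent >= max_events:
                return None  # the cap ignores level, so fatal events are dropped too
            sent += 1
        return event

    return before_send
```

A hook built this way would be handed to `sentry_sdk.init(..., before_send=hook)`; the lock keeps the counter consistent when several tool calls fail concurrently.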

DSN handling

  • Hardcoded as a Settings.opik_mcp_sentry_dsn ClassVar — public ingest key, not a secret. Pydantic-settings skips ClassVars, so it's not env-overridable: prevents anyone redirecting crash reports to a foreign Sentry project. Pinned by test_settings_dsn_is_not_env_overridable.
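
To illustrate why a `ClassVar` cannot be overridden from the environment, here is a standard-library stand-in for the idea. It is not pydantic-settings: `SketchSettings`, `from_env`, and the placeholder DSN are hypothetical, and the sketch relies only on the fact that `dataclasses.fields()` omits `ClassVar` annotations, so an env loader driven by the field list never looks at the DSN.

```python
import os
from dataclasses import dataclass, fields
from typing import ClassVar, Mapping


@dataclass
class SketchSettings:
    # Hypothetical placeholder DSN; the real value is not part of this text.
    opik_mcp_sentry_dsn: ClassVar[str] = "https://public-key@example.invalid/1"
    opik_mcp_sentry_enabled: bool = True

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "SketchSettings":
        values = {}
        for f in fields(cls):  # fields() skips ClassVar annotations
            raw = env.get(f.name.upper())
            if raw is not None:
                # The only field in this sketch is a bool, so parse it as one.
                values[f.name] = raw.strip().lower() not in {"0", "false", "no"}
        return cls(**values)
```

With this shape, setting `OPIK_MCP_SENTRY_DSN` in the environment has no effect, while `OPIK_MCP_SENTRY_ENABLED=false` still works as the opt-out.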

What Sentry sees on a tool failure (read tool hitting a 500, for example)

  • Tags: tool_name=read, error_kind=opik_http_5xx, entity_type=trace, id_kind=uuid, mcp_host=claude-code, mcp_client_version=…, plus globals (release, os_type, python_version, transport, installation_type matching opik SDK taxonomy: cloud/local/self-hosted, workspace, has_workspace_id, has_api_key, github_actions, source).
  • Extras: duration_ms.
  • Transaction: read (puts tool name in the issue listing next to the exception type).
  • Fingerprint: ["{{ default }}", "read"] (splits issues per tool even when a shared helper raises).
  • User: id = workspace UUID → workspace name → install_id fallback.
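
As a rough sketch of how the per-event context listed above might be assembled before it is handed to `capture_exception`: `build_capture_kwargs` and `resolve_user_id` are hypothetical helpers, and the real wrapper may build these values differently.

```python
from typing import Any, Mapping, Optional


def resolve_user_id(
    workspace_id: Optional[str],
    workspace_name: Optional[str],
    install_id: str,
) -> str:
    # Prefer the workspace UUID, then the workspace name, then the install id.
    return workspace_id or workspace_name or install_id


def build_capture_kwargs(
    tool_name: str,
    error_kind: str,
    props: Mapping[str, Any],
    duration_ms: float,
) -> dict[str, Any]:
    # Per-call tags; the global tags (release, os_type, ...) are set once elsewhere.
    tags = {"tool_name": tool_name, "error_kind": error_kind}
    tags.update({key: str(value) for key, value in props.items()})
    return {
        "tags": tags,
        "extras": {"duration_ms": duration_ms},
        # Transaction name surfaces the tool next to the exception type.
        "transaction": tool_name,
        # Extending the default grouping splits issues per tool.
        "fingerprint": ["{{ default }}", tool_name],
    }
```

For the `read` example, `build_capture_kwargs("read", "opik_http_5xx", {"entity_type": "trace", "id_kind": "uuid"}, 12.5)` yields the transaction `read` and the fingerprint `["{{ default }}", "read"]`.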

User-side failures are skipped at the capture site, not filtered server-side. The wrapper's _USER_SIDE_ERROR_KINDS skip-list covers the auth/permission/validation/not-found/missing-config/ollie-config/tool-args buckets; they're already counted in BI under explicit error_kind values, and Sentry only needs the failures that warrant a stack trace.
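
A minimal sketch of the capture-site check, assuming one `error_kind` string per bucket. The strings in `USER_SIDE_ERROR_KINDS` are made up for illustration (only `opik_http_5xx` appears in this description), and `report_tool_failure` is a hypothetical helper that takes the capture function as a parameter.

```python
from typing import Callable

# Hypothetical error_kind values, one per user-side bucket named above.
USER_SIDE_ERROR_KINDS = frozenset(
    {
        "auth",
        "permission",
        "validation",
        "not_found",
        "missing_config",
        "ollie_config",
        "tool_args",
    }
)


def should_capture(error_kind: str) -> bool:
    # Only failures outside the user-side set warrant a stack trace.
    return error_kind not in USER_SIDE_ERROR_KINDS


def report_tool_failure(
    exc: BaseException,
    error_kind: str,
    capture: Callable[[BaseException], None],
) -> bool:
    """Call capture(exc) unless the failure is user-side; return whether it was sent."""
    if not should_capture(error_kind):
        return False
    capture(exc)
    return True
```

Deciding at the capture site means a skipped failure never reaches the SDK at all, so it cannot consume part of the per-process event cap.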

Transport-start crashes carry matching shape: phase=transport_start / error_kind=transport_crash / exception_type=<MRO-bucketed> / transport=stdio|streamable-http tags, startup transaction, ["{{ default }}", "startup", "transport_crash"] fingerprint.
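
One plausible reading of "MRO-bucketed" is to walk the exception's method resolution order and report the first class name from a fixed, low-cardinality set. The sketch below assumes that reading; the bucket list and both function names are hypothetical.

```python
from typing import Any

# Hypothetical bucket names; anything outside this set collapses to "other".
_KNOWN_BUCKETS = frozenset(
    {"ConnectionError", "TimeoutError", "OSError", "ValueError", "RuntimeError"}
)


def bucket_exception_type(exc: BaseException) -> str:
    # Walk from the most specific class to the most general one.
    for cls in type(exc).__mro__:
        if cls.__name__ in _KNOWN_BUCKETS:
            return cls.__name__
    return "other"


def transport_crash_kwargs(exc: BaseException, transport: str) -> dict[str, Any]:
    return {
        "tags": {
            "phase": "transport_start",
            "error_kind": "transport_crash",
            "exception_type": bucket_exception_type(exc),
            "transport": transport,  # "stdio" or "streamable-http"
        },
        "transaction": "startup",
        "fingerprint": ["{{ default }}", "startup", "transport_crash"],
    }
```

Bucketing by ancestor keeps the tag's cardinality bounded: a `ConnectionRefusedError` reports as `ConnectionError`, and an unrecognised custom exception reports as `other`.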

Settings ValidationError stays analytics-only — the deploy pipeline hardcodes the values that drive validation, so this path is dev-only.

Smoke scripts in scripts/ for regression checks against the live Sentry project: sentry_smoke.py (canned exceptions through fake tool bodies; covers tags/transaction/fingerprint plumbing) and sentry_smoke_realistic.py (real OpikClient → httpx → socket failure against a closed local port; covers the real stack-trace shape).

Change checklist

  • User facing
  • Documentation updated (if needed)
  • Tests added/updated (if needed)
  • Breaking changes documented (if any)

Issues

  • Resolves OPIK-6660

Testing

  • uv run pytest -q — 553 passed, 2 skipped
  • uv run ruff check . — clean
  • uv run ruff format --check . — clean
  • uv run mypy — clean (strict)
  • Verified live with smoke scripts: events land in the opik-mcp Sentry project with the expected tags, transaction, fingerprint, and stack trace.

Documentation

README's ### Telemetry section is currently analytics-only; a follow-up PR will add a parallel ### Error tracking block documenting opt-out and what gets sent. Intentionally left out here to keep the diff focused on the implementation.

🤖 Generated with Claude Code

alexkuzmik and others added 3 commits May 25, 2026 17:22
…ilures

- New `opik_mcp.error_tracking` module wraps sentry_sdk with a thin
  setup_sentry(settings) + capture_exception(exc, tags=..., extras=...,
  transaction=..., fingerprint=...) surface. before_send caps events at
  30/process; pytest guard prevents test events from ever phoning home.
- DSN hardcoded as a Settings ClassVar (not env-overridable — prevents
  redirecting crash reports to a foreign Sentry project). The only
  supported opt-out is OPIK_MCP_SENTRY_ENABLED=false.
- Tool wrapper captures non-user-side failures with full context: tags
  for tool_name, error_kind, mcp_host, mcp_client_version + props_fn
  output (entity_type, operation, is_batch, …); duration_ms as extra;
  transaction=tool_name puts the tool in the Sentry issue listing;
  fingerprint=["{{ default }}", tool_name] splits shared-helper crashes
  per tool. User-side error_kinds (auth/validation/not-found/missing
  config/ollie config/tool args) are skipped at the capture site —
  they're already bucketed in BI and would just be noise.
- __main__ transport_start catch captures with the same shape:
  phase/error_kind/exception_type tags, startup transaction, dedicated
  fingerprint. Settings ValidationError stays analytics-only (deploy
  hardcodes the values that drive validation, so it shouldn't fire in
  production).
- Global scope (set once at setup_sentry) carries release, os_type,
  python_version, transport, installation_type (cloud/local/self-hosted,
  matching opik SDK taxonomy), workspace, has_workspace_id, has_api_key,
  github_actions, source.
- scripts/sentry_smoke.py + sentry_smoke_realistic.py for regression
  checks against the live Sentry project.

Resolves OPIK-6660.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	src/opik_mcp/analytics/wrappers.py
These were useful for end-to-end verification of the Sentry integration
during development but don't belong in the shipped tree — the test suite
covers every contract (tags / transaction / fingerprint / skip-list)
without phoning home.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
awkoy self-requested a review May 25, 2026 15:45
alexkuzmik merged commit f09a47c into main May 25, 2026
1 check passed
awkoy added a commit that referenced this pull request May 25, 2026
PR #125 (Sentry integration) introduced ``_attach_mcp_client_tags``
that re-walked the ``ctx → session → client_params → clientInfo`` chain
and stamped Sentry tags with the *raw* host/version strings. The
existing ``session_initialized`` BI emit had been bucketing the same
data via ``_classify_mcp_host`` / ``_bucket_mcp_client_version`` since
PR #123 — so a host stamping ``"acme-internal-wrapper-<user>"`` would
collapse to ``"other"`` in BI but leak verbatim into Sentry.

Move the bucketing helpers + the 8-key session-prop extractor into
``analytics/mcp_client_info.py`` and have both call sites consume the
same ``collect_session_props(session)`` output. New regression test
asserts a canary host stamped as ``clientInfo.name`` buckets to
``"other"`` in Sentry tags too.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
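
The bucketing that the follow-up commit shares between BI and Sentry might be sketched as follows. The known-host set (apart from `claude-code`, which appears in the PR description), the major.minor version rule, and both function bodies are assumptions for illustration; the real `_classify_mcp_host` and `_bucket_mcp_client_version` may behave differently.

```python
import re
from typing import Optional

# Hypothetical allowlist of host names that may be reported verbatim.
_KNOWN_HOSTS = frozenset({"claude-code", "cursor", "vscode"})

# Assumed rule: keep only the leading major.minor of a version string.
_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)")


def classify_mcp_host(raw_name: Optional[str]) -> str:
    name = (raw_name or "").strip().lower()
    # Unknown hosts collapse to "other" so free-form names never leave the process.
    return name if name in _KNOWN_HOSTS else "other"


def bucket_mcp_client_version(raw_version: Optional[str]) -> str:
    match = _VERSION_RE.match((raw_version or "").strip())
    if match is None:
        return "other"
    return f"{match.group(1)}.{match.group(2)}"
```

Routing both call sites through the same helpers means a canary host such as `acme-internal-wrapper-<user>` buckets to `other` in Sentry tags exactly as it does in BI.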