Skip to content

feat(analytics): split error_kind buckets for clearer install funnel#118

Merged
awkoy merged 3 commits into
mainfrom
feat/error-kind-split
May 22, 2026
Merged

feat(analytics): split error_kind buckets for clearer install funnel#118
awkoy merged 3 commits into
mainfrom
feat/error-kind-split

Conversation

@awkoy

@awkoy awkoy commented May 22, 2026

Copy link
Copy Markdown
Contributor

Summary

error_kind=unknown is ~44% of tool failures in BI today — uninformative because it conflates network failures, bad user input, workspace mismatches, and actual bugs. This PR splits the taxonomy along dimensions that distinguish recoverable user-side issues from server-side bugs.

Before → after

Before After Triggered by
opik_http_4xx opik_auth_failed OpikAuthError (401)
opik_http_4xx opik_permission_denied OpikPermissionError (403)
opik_http_4xx opik_not_found OpikNotFoundError (404)
opik_http_4xx opik_validation_failed OpikValidationError (400/422)
comet_auth_failed comet_permission_denied CometPermissionError (403)
unknown network_error httpx.RequestError family (ConnectError, TimeoutException, ReadError)
unknown tool_args_invalid pydantic.ValidationError on MCP-coerced tool args

Unchanged: comet_auth_failed (401), missing_config, opik_http_5xx, ollie_*, cancelled, unknown (true catch-all).

Why these specific splits

  • 401 vs 403: very different user actions. 401 = key invalid / expired; 403 = key valid but wrong workspace or insufficient permissions. Conflating them hides the workspace-mismatch failure mode entirely.
  • network_error: the most common cause of "unknown" today. Lets BI distinguish "user's wifi died mid-call" from "we crashed."
  • tool_args_invalid: pydantic raises before our code runs when MCP coerces the tool's typed args. Bucketing this separately stops "user passed bad input" from polluting the genuine-bug bucket.

Why NOT a workspace_error bucket

Workspace mismatch surfaces as opik_auth_failed (401), opik_permission_denied (403), or opik_not_found (404) on workspace endpoints. The existing three cover it without needing a separate kind. Could reconsider once we see real BI shapes.

Implementation notes

  • OpikPermissionError(OpikAuthError) and CometPermissionError(CometAuthError) are subclasses for backwards compat — existing except OpikAuthError callers still catch 403.
  • _ERROR_KIND_TABLE is a linear walk: subclass rows are listed BEFORE their parents so isinstance picks the specific bucket first. New test pins this ordering.
  • httpx.RequestError sits AFTER the typed Opik/Comet rows so a typed 401 isn't mis-bucketed as a network failure.
  • httpx.HTTPStatusError is intentionally NOT in the table — a raw HTTPStatusError reaching this layer means the typed wrappers failed to classify it, which is a bug worth surfacing as unknown.

Privacy

Classifier still keys off exception class only — never .args / .message. No PII review needed for the new buckets.

BI receiver impact

Dashboards keying off opik_http_4xx will see that bucket drop to zero and the new specific buckets appear. Receiver-side filters on event_type=opik_mcp_tool_called are unaffected. Consider updating BI dashboards before this lands in a release.

Test plan

  • uv run pytest — 500 passed (was 494; +6 from new parametrize rows)
  • uv run ruff check — clean
  • uv run mypy — clean
  • New parametrized rows in test_analytics_wrappers.py:
    • All 19 (kind, exception) mappings
    • Real pydantic.ValidationError built via model_validate (can't be instantiated directly)
    • OpikPermissionError("x")opik_permission_denied (NOT parent's opik_auth_failed — pins ordering)
  • Updated test_opik_client.py + test_opik_client_read.py: 403 → OpikPermissionError (subclass means existing isinstance(exc, OpikAuthError) still passes)

🤖 Generated with Claude Code

awkoy and others added 3 commits May 22, 2026 13:24
Today error_kind=unknown is ~44% of tool failures in BI — uninformative
because it conflates "user's network died", "user passed bad args",
"workspace mismatch", and "we have a bug" into a single bucket.

This PR splits the existing taxonomy along the dimensions that actually
distinguish recoverable user-side issues from server-side bugs.

New kinds (and what they replace):
- opik_auth_failed         — was opik_http_4xx (401 only)
- opik_permission_denied   — was opik_http_4xx (403, now separate)
- opik_not_found           — was opik_http_4xx (404)
- opik_validation_failed   — was opik_http_4xx (400/422)
- comet_permission_denied  — was comet_auth_failed (403 only)
- network_error            — was unknown (httpx.RequestError family:
                              ConnectError, TimeoutException, ReadError)
- tool_args_invalid        — was unknown (pydantic.ValidationError on
                              MCP-coerced tool args)

Implementation:
- New OpikPermissionError(OpikAuthError) and CometPermissionError(CometAuthError)
  subclasses — backwards compatible: existing ``except OpikAuthError``
  callers still catch 403, but the linear _ERROR_KIND_TABLE walk in
  analytics/wrappers.py picks the subclass row first (subclass-before-
  parent ordering enforced via test).
- httpx.RequestError sits AFTER the typed Opik/Comet rows so a typed
  401 isn't mis-bucketed as a network failure.
- httpx.HTTPStatusError is intentionally NOT listed — a raw
  HTTPStatusError reaching this layer is a bug (the typed wrappers
  should have classified the status first).

Privacy: classifier still keys off exception class only — never
.args / .message — so no PII review needed for the new buckets.

Tests:
- tests/test_analytics_wrappers.py parametrizes all 19 new kind mappings,
  including a real pydantic.ValidationError built via model_validate
  (can't be instantiated directly).
- tests/test_opik_client.py + tests/test_opik_client_read.py now expect
  OpikPermissionError on 403 (was OpikAuthError; subclass means existing
  isinstance(exc, OpikAuthError) checks still pass).

BI impact: existing dashboards keying off opik_http_4xx will see that
bucket drop to zero and new specific buckets appear. Receiver-side
filter on event_type=opik_mcp_tool_called is unaffected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three findings from code review on the error_kind split:

1. HIGH — run_experiment.py also lumps 401+403 into OpikAuthError. The
   experiment-execute endpoint's 403 path silently lands in the
   opik_auth_failed bucket instead of opik_permission_denied — defeating
   the disambiguation for that endpoint entirely. Mirrored the
   opik_client._raise_for_status split here. New test covers the 403
   branch (was missing — only 401 was tested before).

2. HIGH — Comment for the tool_args_invalid bucket claimed it fires
   "when MCP coerces the tool's typed args". This is false: FastMCP's
   Tool.run wraps ValidationError from outer arg coercion into a
   ToolError BEFORE our instrument_tool wrapper sees it. The bucket
   only fires for inner ``.model_validate()`` calls inside tool bodies
   (RunExperimentConfig in server.py:399, op.pydantic_model in
   writes/dispatch.py:164,179). Comment now states that explicitly so
   future readers don't assume MCP arg-coercion is covered.

3. MEDIUM — Privacy block said "classifier keys off exception class
   only — never .args" without scoping. The exception messages DO
   carry workspace/entity strings; they're safe because nothing here
   serializes them. Tightened the comment so a future maintainer
   doesn't read it as a blanket "exception payloads are PII-clean"
   guarantee — added an explicit "no str(exc) in future fields" rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI caught a stray multi-line raise that ruff format collapses to a
single-line string. Local pre-commit didn't fire because the file
wasn't in the format-check path for the previous run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@awkoy awkoy merged commit 4c3e6f8 into main May 22, 2026
1 check passed
@awkoy awkoy mentioned this pull request May 22, 2026
5 tasks
awkoy added a commit that referenced this pull request May 22, 2026
Bundles two analytics-reliability changes since 0.1.1:

1. opik_mcp_startup_error reliability (PR #117)
   - Fixed silent-drop of invalid_config events at module-import time
     (opik_mcp/__init__.py was eagerly calling get_settings()).
   - Added _preflight_bind_check so transport_crash fires when uvicorn's
     port is occupied (uvicorn was swallowing OSError internally).
   - MRO-walking exception bucketing so PermissionError / ConnectionRefusedError
     bucket as "OSError" instead of "unknown".

2. error_kind taxonomy split (PR #118)
   - opik_http_4xx → opik_auth_failed (401), opik_permission_denied (403),
     opik_not_found (404), opik_validation_failed (400/422).
   - comet_auth_failed (mixed) → comet_auth_failed (401) + comet_permission_denied (403).
   - unknown → network_error (httpx.RequestError family) and tool_args_invalid
     (pydantic.ValidationError from inner model_validate calls).
   - OpikPermissionError / CometPermissionError subclasses preserve
     existing except-clauses via inheritance.

BI receiver impact: dashboards keying off opik_http_4xx will see that
bucket drop to zero and the new specific buckets appear. Receiver-side
filter on event_type=opik_mcp_tool_called is unaffected.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant