feat(analytics): split error_kind buckets for clearer install funnel by awkoy · Pull Request #118 · comet-ml/opik-mcp

awkoy · 2026-05-22T11:28:32Z

Summary

error_kind=unknown is ~44% of tool failures in BI today — uninformative because it conflates network failures, bad user input, workspace mismatches, and actual bugs. This PR splits the taxonomy along dimensions that distinguish recoverable user-side issues from server-side bugs.

Before → after

Before	After	Triggered by
`opik_http_4xx`	`opik_auth_failed`	`OpikAuthError` (401)
`opik_http_4xx`	`opik_permission_denied`	`OpikPermissionError` (403)
`opik_http_4xx`	`opik_not_found`	`OpikNotFoundError` (404)
`opik_http_4xx`	`opik_validation_failed`	`OpikValidationError` (400/422)
`comet_auth_failed`	`comet_permission_denied`	`CometPermissionError` (403)
`unknown`	`network_error`	`httpx.RequestError` family (ConnectError, TimeoutException, ReadError)
`unknown`	`tool_args_invalid`	`pydantic.ValidationError` on MCP-coerced tool args

Unchanged: comet_auth_failed (401), missing_config, opik_http_5xx, ollie_*, cancelled, unknown (true catch-all).

Why these specific splits

401 vs 403: very different user actions. 401 = key invalid / expired; 403 = key valid but wrong workspace or insufficient permissions. Conflating them hides the workspace-mismatch failure mode entirely.
network_error: the most common cause of "unknown" today. Lets BI distinguish "user's wifi died mid-call" from "we crashed."
tool_args_invalid: pydantic raises before our code runs when MCP coerces the tool's typed args. Bucketing this separately stops "user passed bad input" from polluting the genuine-bug bucket.

Why NOT a `workspace_error` bucket

Workspace mismatch surfaces as opik_auth_failed (401), opik_permission_denied (403), or opik_not_found (404) on workspace endpoints. The existing three cover it without needing a separate kind. Could reconsider once we see real BI shapes.

Implementation notes

OpikPermissionError(OpikAuthError) and CometPermissionError(CometAuthError) are subclasses for backwards compat — existing except OpikAuthError callers still catch 403.
_ERROR_KIND_TABLE is a linear walk: subclass rows are listed BEFORE their parents so isinstance picks the specific bucket first. New test pins this ordering.
httpx.RequestError sits AFTER the typed Opik/Comet rows so a typed 401 isn't mis-bucketed as a network failure.
httpx.HTTPStatusError is intentionally NOT in the table — a raw HTTPStatusError reaching this layer means the typed wrappers failed to classify it, which is a bug worth surfacing as unknown.

Privacy

Classifier still keys off exception class only — never .args / .message. No PII review needed for the new buckets.

BI receiver impact

Dashboards keying off opik_http_4xx will see that bucket drop to zero and the new specific buckets appear. Receiver-side filters on event_type=opik_mcp_tool_called are unaffected. Consider updating BI dashboards before this lands in a release.

Test plan

uv run pytest — 500 passed (was 494; +6 from new parametrize rows)
uv run ruff check — clean
uv run mypy — clean
New parametrized rows in test_analytics_wrappers.py:
- All 19 (kind, exception) mappings
- Real pydantic.ValidationError built via model_validate (can't be instantiated directly)
- OpikPermissionError("x") → opik_permission_denied (NOT parent's opik_auth_failed — pins ordering)
Updated test_opik_client.py + test_opik_client_read.py: 403 → OpikPermissionError (subclass means existing isinstance(exc, OpikAuthError) still passes)

🤖 Generated with Claude Code

Today error_kind=unknown is ~44% of tool failures in BI — uninformative because it conflates "user's network died", "user passed bad args", "workspace mismatch", and "we have a bug" into a single bucket. This PR splits the existing taxonomy along the dimensions that actually distinguish recoverable user-side issues from server-side bugs. New kinds (and what they replace): - opik_auth_failed — was opik_http_4xx (401 only) - opik_permission_denied — was opik_http_4xx (403, now separate) - opik_not_found — was opik_http_4xx (404) - opik_validation_failed — was opik_http_4xx (400/422) - comet_permission_denied — was comet_auth_failed (403 only) - network_error — was unknown (httpx.RequestError family: ConnectError, TimeoutException, ReadError) - tool_args_invalid — was unknown (pydantic.ValidationError on MCP-coerced tool args) Implementation: - New OpikPermissionError(OpikAuthError) and CometPermissionError(CometAuthError) subclasses — backwards compatible: existing ``except OpikAuthError`` callers still catch 403, but the linear _ERROR_KIND_TABLE walk in analytics/wrappers.py picks the subclass row first (subclass-before- parent ordering enforced via test). - httpx.RequestError sits AFTER the typed Opik/Comet rows so a typed 401 isn't mis-bucketed as a network failure. - httpx.HTTPStatusError is intentionally NOT listed — a raw HTTPStatusError reaching this layer is a bug (the typed wrappers should have classified the status first). Privacy: classifier still keys off exception class only — never .args / .message — so no PII review needed for the new buckets. Tests: - tests/test_analytics_wrappers.py parametrizes all 19 new kind mappings, including a real pydantic.ValidationError built via model_validate (can't be instantiated directly). - tests/test_opik_client.py + tests/test_opik_client_read.py now expect OpikPermissionError on 403 (was OpikAuthError; subclass means existing isinstance(exc, OpikAuthError) checks still pass). BI impact: existing dashboards keying off opik_http_4xx will see that bucket drop to zero and new specific buckets appear. Receiver-side filter on event_type=opik_mcp_tool_called is unaffected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Three findings from code review on the error_kind split: 1. HIGH — run_experiment.py also lumps 401+403 into OpikAuthError. The experiment-execute endpoint's 403 path silently lands in the opik_auth_failed bucket instead of opik_permission_denied — defeating the disambiguation for that endpoint entirely. Mirrored the opik_client._raise_for_status split here. New test covers the 403 branch (was missing — only 401 was tested before). 2. HIGH — Comment for the tool_args_invalid bucket claimed it fires "when MCP coerces the tool's typed args". This is false: FastMCP's Tool.run wraps ValidationError from outer arg coercion into a ToolError BEFORE our instrument_tool wrapper sees it. The bucket only fires for inner ``.model_validate()`` calls inside tool bodies (RunExperimentConfig in server.py:399, op.pydantic_model in writes/dispatch.py:164,179). Comment now states that explicitly so future readers don't assume MCP arg-coercion is covered. 3. MEDIUM — Privacy block said "classifier keys off exception class only — never .args" without scoping. The exception messages DO carry workspace/entity strings; they're safe because nothing here serializes them. Tightened the comment so a future maintainer doesn't read it as a blanket "exception payloads are PII-clean" guarantee — added an explicit "no str(exc) in future fields" rule. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

CI caught a stray multi-line raise that ruff format collapses to a single-line string. Local pre-commit didn't fire because the file wasn't in the format-check path for the previous run. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Bundles two analytics-reliability changes since 0.1.1: 1. opik_mcp_startup_error reliability (PR #117) - Fixed silent-drop of invalid_config events at module-import time (opik_mcp/__init__.py was eagerly calling get_settings()). - Added _preflight_bind_check so transport_crash fires when uvicorn's port is occupied (uvicorn was swallowing OSError internally). - MRO-walking exception bucketing so PermissionError / ConnectionRefusedError bucket as "OSError" instead of "unknown". 2. error_kind taxonomy split (PR #118) - opik_http_4xx → opik_auth_failed (401), opik_permission_denied (403), opik_not_found (404), opik_validation_failed (400/422). - comet_auth_failed (mixed) → comet_auth_failed (401) + comet_permission_denied (403). - unknown → network_error (httpx.RequestError family) and tool_args_invalid (pydantic.ValidationError from inner model_validate calls). - OpikPermissionError / CometPermissionError subclasses preserve existing except-clauses via inheritance. BI receiver impact: dashboards keying off opik_http_4xx will see that bucket drop to zero and the new specific buckets appear. Receiver-side filter on event_type=opik_mcp_tool_called is unaffected. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

awkoy and others added 3 commits May 22, 2026 13:24

chore: ruff format comet_client.py

083bece

CI caught a stray multi-line raise that ruff format collapses to a single-line string. Local pre-commit didn't fire because the file wasn't in the format-check path for the previous run. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

awkoy merged commit 4c3e6f8 into main May 22, 2026
1 check passed

awkoy mentioned this pull request May 22, 2026

chore(release): bump version to 0.1.3 #119

Merged

5 tasks

awkoy mentioned this pull request May 22, 2026

feat(analytics): classify WriteError hierarchy out of unknown #120

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(analytics): split error_kind buckets for clearer install funnel#118

feat(analytics): split error_kind buckets for clearer install funnel#118
awkoy merged 3 commits into
mainfrom
feat/error-kind-split

awkoy commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

awkoy commented May 22, 2026

Summary

Before → after

Why these specific splits

Why NOT a workspace_error bucket

Implementation notes

Privacy

BI receiver impact

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Why NOT a `workspace_error` bucket