feat(analytics): split error_kind buckets for clearer install funnel #118
Merged
Today error_kind=unknown is ~44% of tool failures in BI — uninformative
because it conflates "user's network died", "user passed bad args",
"workspace mismatch", and "we have a bug" into a single bucket.
This PR splits the existing taxonomy along the dimensions that actually
distinguish recoverable user-side issues from server-side bugs.
New kinds (and what they replace):
- opik_auth_failed — was opik_http_4xx (401 only)
- opik_permission_denied — was opik_http_4xx (403, now separate)
- opik_not_found — was opik_http_4xx (404)
- opik_validation_failed — was opik_http_4xx (400/422)
- comet_permission_denied — was comet_auth_failed (403 only)
- network_error — was unknown (httpx.RequestError family:
ConnectError, TimeoutException, ReadError)
- tool_args_invalid — was unknown (pydantic.ValidationError on
MCP-coerced tool args)
Implementation:
- New OpikPermissionError(OpikAuthError) and CometPermissionError(CometAuthError)
subclasses — backwards compatible: existing ``except OpikAuthError``
callers still catch 403, but the linear _ERROR_KIND_TABLE walk in
analytics/wrappers.py picks the subclass row first (subclass-before-
parent ordering enforced via test).
- httpx.RequestError sits AFTER the typed Opik/Comet rows so a typed
401 isn't mis-bucketed as a network failure.
- httpx.HTTPStatusError is intentionally NOT listed — a raw
HTTPStatusError reaching this layer is a bug (the typed wrappers
should have classified the status first).
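The ordering rules above can be sketched roughly as follows. This is a minimal illustration, not the code in analytics/wrappers.py: the exception classes are local stand-ins, `NetworkError` stands in for httpx.RequestError so the sketch needs no third-party imports, and `classify_error` is a hypothetical name.

```python
from __future__ import annotations


class OpikAuthError(Exception):
    """Stand-in for the typed 401 error."""


class OpikPermissionError(OpikAuthError):
    """Stand-in for the typed 403 error; subclassing keeps old except-clauses working."""


class NetworkError(Exception):
    """Stand-in for httpx.RequestError (assumption: avoids a third-party import)."""


# Rows are walked top to bottom, so a subclass row must precede its parent row.
# The network row sits after the typed rows, mirroring the description above.
_ERROR_KIND_TABLE: tuple[tuple[type[BaseException], str], ...] = (
    (OpikPermissionError, "opik_permission_denied"),
    (OpikAuthError, "opik_auth_failed"),
    (NetworkError, "network_error"),
)


def classify_error(exc: BaseException) -> str:
    """Return the first matching kind; only the exception class is inspected."""
    for exc_type, kind in _ERROR_KIND_TABLE:
        if isinstance(exc, exc_type):
            return kind
    # Anything unlisted falls through to the catch-all bucket.
    return "unknown"
```

Because the walk stops at the first isinstance match, swapping the first two rows would send every 403 to opik_auth_failed, which is why the ordering is pinned by a test.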
Privacy: classifier still keys off exception class only — never
.args / .message — so no PII review needed for the new buckets.
Tests:
- tests/test_analytics_wrappers.py parametrizes all 19 new kind mappings,
including a real pydantic.ValidationError built via model_validate
(can't be instantiated directly).
- tests/test_opik_client.py + tests/test_opik_client_read.py now expect
OpikPermissionError on 403 (was OpikAuthError; subclass means existing
isinstance(exc, OpikAuthError) checks still pass).
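A subclass-before-parent ordering check could look something like the sketch below. It is an assumption about the shape of such a test, not the test in tests/test_analytics_wrappers.py; `find_shadowed_rows`, `GOOD_TABLE`, `BAD_TABLE`, and the stand-in classes are hypothetical names.

```python
class OpikAuthError(Exception):
    """Stand-in for the typed 401 error."""


class OpikPermissionError(OpikAuthError):
    """Stand-in for the typed 403 error."""


def find_shadowed_rows(table):
    """Return (parent_kind, child_kind) pairs where a parent row precedes its subclass row."""
    shadowed = []
    for i, (earlier_type, earlier_kind) in enumerate(table):
        for later_type, later_kind in table[i + 1:]:
            # A later row that subclasses an earlier row can never be reached
            # by a first-match isinstance walk.
            if later_type is not earlier_type and issubclass(later_type, earlier_type):
                shadowed.append((earlier_kind, later_kind))
    return shadowed


GOOD_TABLE = (
    (OpikPermissionError, "opik_permission_denied"),
    (OpikAuthError, "opik_auth_failed"),
)

BAD_TABLE = (
    (OpikAuthError, "opik_auth_failed"),
    (OpikPermissionError, "opik_permission_denied"),
)
```

A test would then assert that the real table yields no shadowed rows, so a future row added in the wrong position fails fast instead of silently mis-bucketing events.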
BI impact: existing dashboards keying off opik_http_4xx will see that
bucket drop to zero and new specific buckets appear. Receiver-side
filter on event_type=opik_mcp_tool_called is unaffected.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three findings from code review on the error_kind split:

1. HIGH: run_experiment.py also lumps 401+403 into OpikAuthError. The
   experiment-execute endpoint's 403 path silently lands in the
   opik_auth_failed bucket instead of opik_permission_denied, defeating
   the disambiguation for that endpoint entirely. Mirrored the
   opik_client._raise_for_status split here. New test covers the 403
   branch (was missing; only 401 was tested before).

2. HIGH: Comment for the tool_args_invalid bucket claimed it fires "when
   MCP coerces the tool's typed args". This is false: FastMCP's Tool.run
   wraps ValidationError from outer arg coercion into a ToolError BEFORE
   our instrument_tool wrapper sees it. The bucket only fires for inner
   ``.model_validate()`` calls inside tool bodies (RunExperimentConfig in
   server.py:399, op.pydantic_model in writes/dispatch.py:164,179).
   Comment now states that explicitly so future readers don't assume MCP
   arg-coercion is covered.

3. MEDIUM: Privacy block said "classifier keys off exception class only —
   never .args" without scoping. The exception messages DO carry
   workspace/entity strings; they're safe because nothing here serializes
   them. Tightened the comment so a future maintainer doesn't read it as a
   blanket "exception payloads are PII-clean" guarantee, and added an
   explicit "no str(exc) in future fields" rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
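The 401/403 split described in finding 1 might be sketched like this. It is illustrative only: `raise_for_auth_status` is a hypothetical helper, the exception classes are local stand-ins, and the real opik_client._raise_for_status is not shown in this PR text.

```python
class OpikAuthError(Exception):
    """Stand-in for the typed 401 error."""


class OpikPermissionError(OpikAuthError):
    """Stand-in for the typed 403 error; still caught by `except OpikAuthError`."""


def raise_for_auth_status(status_code: int) -> None:
    """Raise a distinct exception class for 401 and 403; ignore other codes."""
    if status_code == 401:
        raise OpikAuthError("authentication failed")
    if status_code == 403:
        # Previously lumped with 401; a separate class lets analytics
        # bucket it as opik_permission_denied.
        raise OpikPermissionError("permission denied")
```

Existing callers that catch OpikAuthError keep working for both codes, while code that cares about the difference can check for the subclass first.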
CI caught a stray multi-line raise that ruff format collapses to a
single-line string. Local pre-commit didn't fire because the file wasn't
in the format-check path for the previous run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
awkoy added a commit that referenced this pull request on May 22, 2026
Bundles two analytics-reliability changes since 0.1.1:

1. opik_mcp_startup_error reliability (PR #117)
   - Fixed silent-drop of invalid_config events at module-import time
     (opik_mcp/__init__.py was eagerly calling get_settings()).
   - Added _preflight_bind_check so transport_crash fires when uvicorn's
     port is occupied (uvicorn was swallowing OSError internally).
   - MRO-walking exception bucketing so PermissionError /
     ConnectionRefusedError bucket as "OSError" instead of "unknown".

2. error_kind taxonomy split (PR #118)
   - opik_http_4xx → opik_auth_failed (401), opik_permission_denied (403),
     opik_not_found (404), opik_validation_failed (400/422).
   - comet_auth_failed (mixed) → comet_auth_failed (401) +
     comet_permission_denied (403).
   - unknown → network_error (httpx.RequestError family) and
     tool_args_invalid (pydantic.ValidationError from inner model_validate
     calls).
   - OpikPermissionError / CometPermissionError subclasses preserve
     existing except-clauses via inheritance.

BI receiver impact: dashboards keying off opik_http_4xx will see that
bucket drop to zero and the new specific buckets appear. Receiver-side
filter on event_type=opik_mcp_tool_called is unaffected.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
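The MRO-walking bucketing from the startup-error change (PR #117) can be illustrated with a short sketch. The bucket set and the name `bucket_exception` are assumptions made for the example; the actual implementation is not shown in this commit message.

```python
# Hypothetical set of class names that analytics treats as known buckets.
_KNOWN_BUCKETS = frozenset({"OSError", "ValueError"})


def bucket_exception(exc: BaseException) -> str:
    """Walk the exception's MRO and return the first ancestor with a known bucket."""
    for cls in type(exc).__mro__:
        if cls.__name__ in _KNOWN_BUCKETS:
            return cls.__name__
    # No ancestor matched, so the event lands in the catch-all bucket.
    return "unknown"
```

With an exact-name lookup, PermissionError and ConnectionRefusedError would miss the "OSError" entry; walking `__mro__` lets them inherit their parent's bucket.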
Summary
error_kind=unknown is ~44% of tool failures in BI today, which is uninformative because it conflates network failures, bad user input, workspace mismatches, and actual bugs. This PR splits the taxonomy along dimensions that distinguish recoverable user-side issues from server-side bugs.

Before → after

| Before | After | Raised by |
| --- | --- | --- |
| opik_http_4xx | opik_auth_failed | OpikAuthError (401) |
| opik_http_4xx | opik_permission_denied | OpikPermissionError (403) |
| opik_http_4xx | opik_not_found | OpikNotFoundError (404) |
| opik_http_4xx | opik_validation_failed | OpikValidationError (400/422) |
| comet_auth_failed | comet_permission_denied | CometPermissionError (403) |
| unknown | network_error | httpx.RequestError family (ConnectError, TimeoutException, ReadError) |
| unknown | tool_args_invalid | pydantic.ValidationError from inner model_validate calls in tool bodies |

Unchanged: comet_auth_failed (401), missing_config, opik_http_5xx, ollie_*, cancelled, unknown (true catch-all).

Why these specific splits

- network_error: the most common cause of "unknown" today. Lets BI distinguish "user's wifi died mid-call" from "we crashed."
- tool_args_invalid: fires when an inner .model_validate() call inside a tool body raises pydantic.ValidationError. Outer MCP arg coercion is not covered, because FastMCP's Tool.run wraps that ValidationError into a ToolError before the instrument_tool wrapper sees it. Bucketing this separately stops "user passed bad input" from polluting the genuine-bug bucket.

Why NOT a workspace_error bucket

Workspace mismatch surfaces as opik_auth_failed (401), opik_permission_denied (403), or opik_not_found (404) on workspace endpoints. The existing three cover it without needing a separate kind. Could reconsider once we see real BI shapes.

Implementation notes

- OpikPermissionError(OpikAuthError) and CometPermissionError(CometAuthError) are subclasses for backwards compat: existing except OpikAuthError callers still catch 403.
- _ERROR_KIND_TABLE is a linear walk: subclass rows are listed BEFORE their parents so isinstance picks the specific bucket first. New test pins this ordering.
- httpx.RequestError sits AFTER the typed Opik/Comet rows so a typed 401 isn't mis-bucketed as a network failure.
- httpx.HTTPStatusError is intentionally NOT in the table: a raw HTTPStatusError reaching this layer means the typed wrappers failed to classify it, which is a bug worth surfacing as unknown.

Privacy

Classifier still keys off exception class only, never .args / .message, so no PII review is needed for the new buckets. The exception messages themselves can carry workspace/entity strings; they are safe only because nothing here serializes them, so no future field should include str(exc).

BI receiver impact

Dashboards keying off opik_http_4xx will see that bucket drop to zero and the new specific buckets appear. Receiver-side filters on event_type=opik_mcp_tool_called are unaffected. Consider updating BI dashboards before this lands in a release.

Test plan

- uv run pytest: 500 passed (was 494; +6 from new parametrize rows)
- uv run ruff check: clean
- uv run mypy: clean
- test_analytics_wrappers.py:
  - pydantic.ValidationError built via model_validate (can't be instantiated directly)
  - OpikPermissionError("x") → opik_permission_denied (NOT parent's opik_auth_failed; pins ordering)
- test_opik_client.py + test_opik_client_read.py: 403 → OpikPermissionError (subclass means existing isinstance(exc, OpikAuthError) still passes)

🤖 Generated with Claude Code