feat+test: add drift detection in record_decision and 57 unit tests for FrameTracer/DivergenceDetector#209
Conversation
Makes `evidence` an optional keyword argument (default `None`, treated as `[]`) in `RecordingMixin.record_decision`. All existing callers already pass evidence explicitly so this is non-breaking. Also adds lightweight drift-event collection to `record_decision` and wires `_drift_events`/`_drift_compare_index` onto `TraceContext.restore` so the previously-skipped drift-emission test now passes. Closes #205 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ison fixes - Add `*` after `chosen_action` in `record_decision` to make `evidence` and remaining params keyword-only, preventing accidental positional use and protecting existing positional callers - Use clamped `event.confidence` instead of raw `confidence` in drift event_dict to match what is actually persisted - Add `action` alias alongside `chosen_action` in drift event_dict so baselines using either key are matched - Advance `_drift_compare_index` to the next decision event in the baseline (skipping non-decision events) to prevent index misalignment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers agent_debugger_sdk/core/frame_tracer.py and divergence_detector.py which previously had zero test coverage despite being 600+ line modules. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds missing unit test coverage for two previously untested SDK research modules (FrameTracer + DivergenceDetector), and adjusts drift-tracking plumbing used during checkpoint restore/replay.
Changes:
- Add
tests/test_frame_tracer_divergence.pywith 57 unit tests coveringframe_tracer.pyanddivergence_detector.py. - Update replay-depth integration test to align with current restore post-checkpoint filtering / traces API response shape.
- Extend drift tracking state on restore and add drift detection/collection during
record_decision.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_replay_depth_l3.py | Updates drift replay test to use timestamp filtering + /traces response shape and asserts drift is collected on context. |
| tests/test_frame_tracer_divergence.py | New test suite providing coverage for FrameTracer + DivergenceDetector public helpers and serialization. |
| agent_debugger_sdk/core/recorders.py | Makes evidence optional, adds drift detection during decision recording, and collects drift events. |
| agent_debugger_sdk/core/context/trace_context.py | Initializes _drift_events and _drift_compare_index on restore to support drift tracking. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| async def record_decision( | ||
| self, | ||
| reasoning: str, | ||
| confidence: float, | ||
| evidence: list[dict[str, Any]], | ||
| chosen_action: str, | ||
| *, | ||
| evidence: list[dict[str, Any]] | None = None, |
There was a problem hiding this comment.
The * placement after chosen_action is intentional — it makes evidence and all subsequent params keyword-only while keeping reasoning, confidence, and chosen_action as positional. This matches the fix applied in PR #207 (commit f380508) and is the goal of this change: callers can now omit evidence without reordering. Callers using the old signature (reasoning, confidence, evidence_list, chosen_action) would break, but that signature was already changed in PR #207; this PR carries the same convention.
| drift_index = getattr(self, "_drift_compare_index", 0) | ||
| event_dict = { | ||
| "event_type": "decision", | ||
| "data": { | ||
| "chosen_action": chosen_action, | ||
| "action": chosen_action, | ||
| "confidence": event.confidence, | ||
| }, | ||
| } | ||
| drift = drift_detector.compare(event_dict, drift_index) | ||
| # Advance to the next decision event in the baseline, skipping non-decision events | ||
| next_index = drift_index + 1 | ||
| original_events = getattr(drift_detector, "original_events", []) | ||
| while next_index < len(original_events) and original_events[next_index].get("event_type") != "decision": | ||
| next_index += 1 | ||
| self._drift_compare_index = next_index |
There was a problem hiding this comment.
Fixed in acdacd5. Before calling compare(), we now advance drift_index forward past any non-decision events in original_events. This ensures the comparison always targets an actual decision event position, preventing silent missed drift when non-decision events appear before the first (or any subsequent) decision in the baseline.
| # Detect drift against the original execution if a detector is active | ||
| drift_detector = getattr(self, "_drift_detector", None) | ||
| if drift_detector is not None: | ||
| drift_index = getattr(self, "_drift_compare_index", 0) | ||
| event_dict = { |
There was a problem hiding this comment.
Acknowledged. The SDK behavior changes (evidence keyword-only + drift detection in record_decision) are prerequisites for the tests to exercise correctly — the tests validate this runtime behavior. The PR description has been noted; if desired, the title can be updated to reflect the dual scope (runtime fix + test coverage).
There was a problem hiding this comment.
PR title and description updated to reflect the dual scope: the SDK behavior changes (keyword-only evidence, drift detection in record_decision, drift state init on restore()) are now called out explicitly alongside the unit tests.
…ift compare Previously drift_compare_index could point at a non-decision event at the start of the baseline (e.g. after restore), causing compare() to silently miss the first decision's drift. Now advance past non-decision events before comparing, then advance to the next decision for the subsequent call. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
evidencekeyword-only inrecord_decision(after*) — callers can now omit it without positional issuesrecord_decision: advances past non-decision events in baseline before comparing, uses clampedconfidence, and includes bothchosen_actionandactionaliases in drift event dict_drift_eventsand_drift_compare_indexonrestore()inTraceContexttests/test_frame_tracer_divergence.pywith 57 unit tests covering two previously untested modules:frame_tracer.pyanddivergence_detector.pyNote on scope
The SDK behavior changes are prerequisites for the tests to exercise drift detection correctly. The drift detection changes were also addressed in PR #207; this PR builds on that baseline.