feat: live session monitoring with circuit breakers (v1.0.0) #17

Merged
Siddhant-K-code merged 2 commits into main from feat/issue-8-watch on Apr 6, 2026

@Siddhant-K-code
Owner

Closes #8

What

Adds agent-strace watch — real-time session monitoring that tails the active session's events.ndjson and fires alerts when configurable thresholds are exceeded.
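
As a rough illustration of what tailing an NDJSON file involves, here is a minimal sketch. It is not the code in watch.py: the function name `read_new_events` and the byte-offset approach are assumptions made for this example. It reads only complete lines appended since the last call and leaves a half-written trailing line for the next poll.

```python
import json
from typing import List, Tuple


def read_new_events(path: str, offset: int) -> Tuple[List[dict], int]:
    """Read complete NDJSON lines appended after `offset`.

    Returns the parsed events and the offset to resume from next time.
    (Hypothetical helper; the real tailer may work differently.)
    """
    events: List[dict] = []
    with open(path, "rb") as fh:
        fh.seek(offset)
        while True:
            line = fh.readline()
            # Stop at EOF, or at a line the writer has not finished yet
            # (no trailing newline), so it is re-read on the next poll.
            if not line or not line.endswith(b"\n"):
                break
            offset += len(line)
            text = line.strip()
            if text:
                events.append(json.loads(text))
    return events, offset
```

A caller would invoke this in a loop with a short sleep between polls, passing the returned offset back in each time.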

Usage

# Watch with defaults
agent-strace watch

# Custom thresholds
agent-strace watch --max-retries 3 --max-cost 5 --max-duration 600

# Kill the agent on violation
agent-strace watch --max-retries 3 --on-violation kill

# Config file
agent-strace watch --config .agent-watch.json

Watchers

| Watcher | Trigger | Default |
|---|---|---|
| RetryWatcher | Same command runs more than N times | 5 |
| CostWatcher | Estimated cost exceeds threshold | $10 |
| DurationWatcher | Session exceeds time limit | 30 minutes |
| LoopWatcher | Same sequence of events repeats N times | 3 |
| ScopeWatcher | File write denied by .agent-scope.json policy | any violation |
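
To make the watcher idea concrete, the following is a small sketch of a retry watcher that reports each violation once. The class name `RetryWatcherSketch`, the `observe` method, and the event shape (`{"type": "command", "command": ...}`) are all assumptions for illustration; the actual event schema and class layout in watch.py are not shown in this PR description.

```python
from collections import Counter
from typing import Optional


class RetryWatcherSketch:
    """Flags a command once it has been seen more than `max_retries` times."""

    def __init__(self, max_retries: int = 5) -> None:
        self.max_retries = max_retries
        self._counts: Counter = Counter()
        self._fired: set = set()  # dedup keys of violations already reported

    def observe(self, event: dict) -> Optional[str]:
        # Assumed event shape; the real events.ndjson schema may differ.
        if event.get("type") != "command":
            return None
        command = event.get("command", "")
        self._counts[command] += 1
        key = f"retry:{command}"
        if self._counts[command] > self.max_retries and key not in self._fired:
            # Record the key so later repeats of this command stay silent.
            self._fired.add(key)
            return (
                f"command ran {self._counts[command]} times "
                f"(limit {self.max_retries}): {command}"
            )
        return None
```

Returning a message (or `None`) keeps the watcher independent of how alerts are delivered.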

Alert actions

  • terminal — print warning to stderr (default)
  • file — append to .agent-traces/alerts.log
  • webhook — POST JSON to a URL (Slack, PagerDuty, etc.) via stdlib urllib.request
  • kill — send SIGTERM to the agent process
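
For reference, a webhook alert using only `urllib.request` could look roughly like the sketch below. The function names and the `{"text": ...}` payload are assumptions for this example (Slack incoming webhooks accept that shape; other services such as PagerDuty expect different fields), so this is not necessarily what the PR sends.

```python
import json
import urllib.request


def build_webhook_request(url: str, message: str) -> urllib.request.Request:
    """Build a JSON POST request for an alert (hypothetical helper)."""
    # Payload shape is an assumption; adjust it to the receiving service.
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_webhook(url: str, message: str, timeout: float = 5.0) -> int:
    """POST the alert and return the HTTP status code."""
    request = build_webhook_request(url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Splitting request construction from sending makes the payload easy to check in tests without any network access.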

Config file (.agent-watch.json)

{
  "watchers": {
    "retry": { "max": 3, "alert": "terminal" },
    "cost": { "max_dollars": 5.0 },
    "duration": { "max_minutes": 15 },
    "loop": { "sequence_length": 3, "max_repeats": 3 }
  },
  "webhook": {
    "url": "https://hooks.slack.com/services/..."
  }
}
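
One plausible way to combine such a file with the built-in defaults is sketched here. `load_watch_config` and `DEFAULTS` are hypothetical names; the default values are taken from the watcher descriptions in this PR, except the loop `sequence_length`, whose default is not stated and is assumed to be 3.

```python
import json
from pathlib import Path

# Defaults as listed in the PR description; sequence_length is an assumption.
DEFAULTS = {
    "retry": {"max": 5},
    "cost": {"max_dollars": 10.0},
    "duration": {"max_minutes": 30},
    "loop": {"sequence_length": 3, "max_repeats": 3},
}


def load_watch_config(path: str) -> dict:
    """Overlay the "watchers" section of a config file on the defaults."""
    # Copy each inner dict so the module-level defaults are never mutated.
    merged = {name: dict(values) for name, values in DEFAULTS.items()}
    config_file = Path(path)
    if config_file.exists():
        user = json.loads(config_file.read_text(encoding="utf-8"))
        for name, overrides in user.get("watchers", {}).items():
            merged.setdefault(name, {}).update(overrides)
    return merged
```

With the example file above, this sketch would yield a retry limit of 3 and a cost limit of 5.0, while any watcher left out of the file keeps its default.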

Implementation

  • src/agent_trace/watch.py — watcher state machines, alert dispatcher, file tailer
  • Each violation fires exactly once (deduped by key) to avoid alert spam
  • Loop detection uses a sliding-window sequence comparison over a bounded deque
  • Cost estimation reuses cost.py token estimation (_event_tokens, _dollars)
  • Scope checking reuses audit.py policy loading
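
The loop-detection bullet above can be illustrated with a short sketch. This is one possible reading of "sliding-window sequence comparison over a bounded deque", not the PR's code: `LoopDetectorSketch` and the idea of reducing each event to a string signature are assumptions. The deque holds the last `sequence_length * max_repeats` signatures and reports a loop when it consists of one sequence repeated back to back.

```python
from collections import deque


class LoopDetectorSketch:
    """Detects `max_repeats` consecutive copies of a length-n sequence."""

    def __init__(self, sequence_length: int = 3, max_repeats: int = 3) -> None:
        self.n = sequence_length
        self.max_repeats = max_repeats
        # Bounded window: old signatures fall off the left automatically.
        self._window: deque = deque(maxlen=sequence_length * max_repeats)

    def observe(self, signature: str) -> bool:
        """Add one event signature; return True if the window is a loop."""
        self._window.append(signature)
        if len(self._window) < self._window.maxlen:
            return False
        items = list(self._window)
        first = items[: self.n]
        # Compare every n-sized chunk of the window against the first chunk.
        return all(
            items[i : i + self.n] == first
            for i in range(0, len(items), self.n)
        )
```

The detector keeps returning True while the loop continues, so a caller that wants a single alert would still need the dedup-by-key step described above.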

Zero new dependencies. 19 new tests covering retry detection, cost threshold, duration limit, loop detection, alert actions (terminal + file), and config loading.

Siddhant-K-code and others added 2 commits April 6, 2026 17:19
Add `agent-strace watch` — tails the active session's events.ndjson
and fires alerts when configurable thresholds are exceeded.

Watchers:
- RetryWatcher: same command fails N times (default: 5)
- CostWatcher: estimated cost exceeds threshold (default: $10)
- DurationWatcher: session exceeds time limit (default: 30m)
- LoopWatcher: same event sequence repeats N times (default: 3x)
- ScopeWatcher: file write denied by .agent-scope.json policy

Alert actions:
- terminal: print warning to stderr (default)
- file: append to .agent-traces/alerts.log
- webhook: POST JSON via stdlib urllib.request
- kill: SIGTERM to agent process

Config via .agent-watch.json or CLI flags. Each violation fires
exactly once (deduped by key). Zero new dependencies.

19 new tests covering retry detection, cost threshold, duration limit,
loop detection, alert actions, and config loading.

Co-authored-by: Ona <no-reply@ona.com>
watch.py:
- Remove unused Callable import
- Remove dead meta load block (PID not stored in meta)
- _dispatch_alert: always write to terminal regardless of action;
  also write to file when action is 'file' or 'kill'
- Cost dedup key: use single 'cost' key so violation fires exactly
  once, not once per dollar crossed above the threshold
- Idle timeout comment: clarify it triggers on the next event after
  the idle window, not mid-sleep

tests/test_watch.py:
- Add test: cost violation fires exactly once across multiple events
- Add test: _dispatch_alert always calls _alert_terminal regardless
  of configured action

Co-authored-by: Ona <no-reply@ona.com>
@Siddhant-K-code Siddhant-K-code merged commit d33dff4 into main Apr 6, 2026
4 checks passed
@Siddhant-K-code Siddhant-K-code deleted the feat/issue-8-watch branch April 6, 2026 17:21
Siddhant-K-code added a commit that referenced this pull request Apr 6, 2026
agent-strace share      -- self-contained HTML replay (#15)
agent-strace postmortem -- failure analysis (#15)
agent-strace eval       -- scoring and regression testing (#16)
agent-strace watch      -- live monitoring with circuit breakers (#17)

Co-authored-by: Ona <no-reply@ona.com>

Development

Successfully merging this pull request may close these issues.

v1.0.0: Live monitoring + circuit breakers