feat: safety hardening + exception hierarchy + replay harness with CI gate #9

Merged
ArielB1980 merged 1 commit into main from safety-hardening-replay-harness
Feb 14, 2026

@ArielB1980
Owner

Summary

Comprehensive safety hardening of the live trading stack, structured exception handling across all critical paths, and a production-faithful event-driven replay backtest harness with CI integration.

Capital Safety (P0)

  • DRY_RUN at transport boundary: KrakenClient.place_futures_order() refuses real orders when dry_run=True — no silent simulation
  • Global order rate limiter: 60 orders/min, 10 orders/10s in ExecutionGateway — catches runaway loops, recursion bugs
  • max_loss_per_trade_usd: Risk manager rejects trades where stop distance × size exceeds the configurable maximum dollar loss
  • Deadman switch: Heartbeat file at runtime/heartbeat.txt updated each tick — external watchdog detects hung loops, DNS hangs, event loop starvation
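
For reviewers, here is a minimal sketch of how a two-window order limiter like the one described above could behave. The class name, method name, exception type, and injectable clock are illustrative assumptions, not the actual ExecutionGateway code; only the 60 orders/min and 10 orders/10s limits come from this PR.

```python
import time
from collections import deque
from typing import Callable, Deque, Sequence, Tuple


class OrderRateLimitExceeded(RuntimeError):
    """Hypothetical error raised when an order would exceed a window limit."""


class OrderRateLimiter:
    """Sliding-window limiter enforcing several (max_orders, window_seconds) limits."""

    def __init__(
        self,
        limits: Sequence[Tuple[int, float]] = ((60, 60.0), (10, 10.0)),
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limits = tuple(limits)
        self._clock = clock  # injectable so tests can control time
        self._longest = max(window for _, window in self._limits)
        self._stamps: Deque[float] = deque()

    def check_and_record(self) -> None:
        """Record one order, or raise if any window is already full."""
        now = self._clock()
        # Forget orders that have aged out of the longest window.
        while self._stamps and now - self._stamps[0] >= self._longest:
            self._stamps.popleft()
        for max_orders, window in self._limits:
            recent = sum(1 for stamp in self._stamps if now - stamp < window)
            if recent >= max_orders:
                raise OrderRateLimitExceeded(
                    f"{recent} orders in the last {window:g}s (limit {max_orders})"
                )
        # Only accepted orders are recorded; refused attempts do not count.
        self._stamps.append(now)
```

In this sketch the caller would invoke check_and_record() immediately before handing an order to the transport layer, so a runaway loop is refused at the gateway rather than at the exchange.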

Exception Hierarchy (Tier 1)

  • Replaced all bare except Exception: pass in kill switch, live trading, and safety integration with structured handling:
    • OperationalError -> retry/backoff (bounded)
    • InvariantError -> halt (fail-fast)
    • DataError -> log + skip
    • Unknown -> log + crash (systemd restarts)
  • Narrowed circuit breaker classification: whitelist of networky errors only (no more string-matching "500")
  • Routed stop self-heal + ShockGuard through ExecutionGateway (single choke point, consistent WAL/breaker/logging)
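
The handling policy above can be sketched roughly as follows. The three exception names come from this PR, but their definitions here, the run_guarded helper, its parameters, and the halt callback are assumptions made for illustration and may not match the real code paths.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger("safety")


class OperationalError(Exception):
    """Transient failure (timeout, rate limit): safe to retry."""


class InvariantError(Exception):
    """A safety invariant is broken: trading must halt."""


class DataError(Exception):
    """Bad or missing market data: skip this unit of work."""


def run_guarded(
    step: Callable[[], T],
    *,
    halt: Callable[[str], None],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run step() under the retry / halt / skip / crash policy."""
    attempt = 0
    while True:
        try:
            return step()
        except OperationalError as exc:
            attempt += 1
            if attempt > max_retries:
                log.error("retries exhausted: %s", exc)
                raise  # bounded: give up after max_retries
            delay = base_delay * 2 ** (attempt - 1)  # exponential backoff
            log.warning("operational error, retry %d in %.1fs: %s", attempt, delay, exc)
            sleep(delay)
        except DataError as exc:
            log.warning("data error, skipping: %s", exc)
            return None
        except InvariantError as exc:
            log.critical("invariant violated, halting: %s", exc)
            halt(str(exc))
            raise  # fail fast after signalling the halt
        except Exception:
            # Unknown errors are logged and re-raised so the process
            # crashes and the supervisor (systemd in this PR) restarts it.
            log.exception("unclassified error, crashing")
            raise
```

The point of the final branch is that nothing is swallowed: an unclassified exception always leaves a log entry and always propagates.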

Replay Backtest Harness

  • ReplayKrakenClient: Drop-in simulated Kraken exchange with:
    • Stop entered_book lifecycle with vol/depth-dependent delay + seeded jitter
    • Maker/taker via mid-crossing at placement (not fixed 80/20)
    • reduceOnly caps at flat; non-reduce can flip with two logical fills
    • Order rejections: min size, reduceOnly conflict, insufficient margin
    • Layer 1 visibility quirk toggle (hide_entered_book_from_open_orders)
    • Per-symbol funding rate curves with vol-spike multiplier
    • Deterministic seeded jitter on fills, delays, slippage (--seed N)
    • Per-API-call latency model (50-200ms, seeded)
  • FaultInjector: Scripted API timeouts, rate limits, data errors, AttributeError at specific timestamps
  • 6 episodes: Normal market, high-vol spike, liquidity drought, API outage, restart/split-brain, bug injection
  • Safety-first pass/fail: Episodes fail on invariant violations, kill switch, rate limiter trips, breaker opens (context-dependent)
  • make replay, make replay-sweep (seeds 1-5)
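
To illustrate the "reduceOnly caps at flat; non-reduce can flip with two logical fills" rule, here is a small sketch of the fill-splitting logic. The Fill type, the split_fills function, and the OrderRejected error are hypothetical names for this example and are not taken from ReplayKrakenClient; treating a reduce-only order with nothing to reduce as a rejection is also an assumption about how the "reduceOnly conflict" case applies.

```python
from dataclasses import dataclass
from typing import List


class OrderRejected(ValueError):
    """Hypothetical rejection raised by the simulated exchange."""


@dataclass(frozen=True)
class Fill:
    qty: float  # signed: positive = buy, negative = sell
    kind: str   # "close" reduces the position, "open" adds exposure


def split_fills(position: float, order_qty: float, reduce_only: bool) -> List[Fill]:
    """Split a signed order against a signed position into logical fills."""
    if order_qty == 0:
        return []
    adds_exposure = position == 0 or (position > 0) == (order_qty > 0)
    if adds_exposure:
        if reduce_only:
            # Nothing to reduce in this direction.
            raise OrderRejected("reduceOnly conflict")
        return [Fill(order_qty, "open")]

    sign = 1.0 if order_qty > 0 else -1.0
    close_qty = min(abs(order_qty), abs(position))
    fills = [Fill(sign * close_qty, "close")]  # first fill: back toward flat
    remainder = abs(order_qty) - close_qty
    if remainder > 0 and not reduce_only:
        # Second logical fill: the position flips to the other side.
        fills.append(Fill(sign * remainder, "open"))
    # With reduce_only the remainder is dropped, so the position ends flat.
    return fills
```

For example, selling 5 against a long position of 3 yields a single closing fill of 3 when reduce_only is set, and a closing fill of 3 followed by an opening fill of 2 when it is not.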

CI Gate

  • .github/workflows/replay-gate.yml triggers on PRs touching execution/, risk/, safety/, kraken_client.py, live/, circuit_breaker.py, replay_harness/
  • Matrix: unit tests -> replay across 3 jitter seeds (42, 1, 7)
  • Artifacts uploaded per seed (14-day retention)

Docs and Cleanup

  • Archived 28 obsolete docs and 20 legacy scripts to docs/archive/ and scripts/archive/
  • Updated FORAI.md with lessons learned
  • New operational tools: pre_flight_check, sync_positions, recover_sl_order_ids, check_tp_coverage, monitor_trade_execution

Test plan

  • 500/500 unit tests passing (includes 49 replay harness tests)
  • No linter errors on changed files
  • No secrets in commit
  • make smoke with .env.local on local machine (requires credentials)
  • make replay passes all 6 episodes
  • CI replay gate passes on this PR (3 seeds)
  • After merge: make deploy -> verify systemd restart + heartbeat file appears

Made with Cursor

… gate

Capital safety (P0):
- DRY_RUN enforced at KrakenClient transport boundary
- Global order rate limiter in ExecutionGateway (60/min, 10/10s)
- max_loss_per_trade_usd risk check in risk manager
- Trading activity heartbeat/deadman switch via runtime/heartbeat.txt

Exception hierarchy (Tier 1):
- Replace bare except/pass in kill switch, live trading, safety integration
- OperationalError → retry/backoff, InvariantError → halt, unknown → crash
- Narrow circuit breaker classification (whitelist networky errors only)
- Route stop self-heal + ShockGuard through ExecutionGateway

Replay backtest harness:
- Event-driven harness replaying real LiveTrading._tick() against simulated
  Kraken exchange (ReplayKrakenClient) with deterministic SimClock
- Faithful exchange modeling: stop entered_book lifecycle, maker/taker via
  mid-crossing, reduceOnly caps at flat, position reversal as two fills
- Order rejection realism (min size, reduceOnly conflict, insufficient margin)
- Layer 1 visibility quirk toggle (entered_book hidden from open orders)
- Per-symbol funding rate curves with vol-spike variability
- Deterministic seeded jitter on fills, delays, slippage (--seed N)
- Per-API-call latency model (50-200ms seeded)
- FaultInjector for scripted outages, rate limits, data errors
- 6 episodes: normal, high-vol, drought, outage, restart/split-brain, bug
- Safety-first pass/fail criteria per episode
- CI gate (.github/workflows/replay-gate.yml) runs on PRs touching
  execution/risk/safety/client/live paths, matrix across 3 jitter seeds
- make replay, make replay-episode, make replay-sweep targets

Docs & cleanup:
- Archive obsolete docs and scripts
- Update FORAI.md with lessons learned
- 500/500 unit tests passing (49 replay harness tests)

Co-authored-by: Cursor <cursoragent@cursor.com>
@ArielB1980 ArielB1980 merged commit 13e816b into main Feb 14, 2026
1 check failed