feat(replay): per-step session replay with hash-chained journal #1810

Merged: chernistry merged 2 commits into main from feat/1799-step-replay-merkle on May 21, 2026

@chernistry

Collaborator

Summary

Adds a per-step replay surface on top of lineage v1. Each agent step is written to a hash-chained journal under .sdd/runtime/journal/<agent_id>/ where step_hash = SHA256(canonical_json({prev_hash, input_hash, model, prompt, tool_call, tool_result})). The chain head is the run's verifiable identity.
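As a rough illustration of the step_hash formula above, here is a minimal Python sketch. It is not the code in journal.py: the canonical encoding (sorted keys, compact separators, UTF-8) and the all-zero GENESIS_HASH used as the first prev_hash are assumptions made for the example.

```python
import hashlib
import json
from typing import Any

# Hypothetical sentinel used as prev_hash for the first step (not from the PR).
GENESIS_HASH = "0" * 64


def canonical_json(obj: Any) -> bytes:
    """Encode obj deterministically: sorted keys, compact separators, UTF-8."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def step_hash(prev_hash: str, input_hash: str, model: str, prompt: str,
              tool_call: Any, tool_result: Any) -> str:
    """SHA-256 over the canonical encoding of the six chained fields."""
    payload = {
        "prev_hash": prev_hash,
        "input_hash": input_hash,
        "model": model,
        "prompt": prompt,
        "tool_call": tool_call,
        "tool_result": tool_result,
    }
    return hashlib.sha256(canonical_json(payload)).hexdigest()


def chain_head(steps: list[dict[str, Any]]) -> str:
    """Fold the steps into one head hash; each step commits to its predecessor."""
    head = GENESIS_HASH
    for step in steps:
        head = step_hash(head, **step)
    return head
```

Because every step hash includes the previous one, changing any field of any earlier step changes the head, which is what lets the head act as the run's identity.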

Closes #1799.

Changes

New modules:

  • src/bernstein/core/persistence/journal.py - chained journal + canonical step encoding (load-bearing contract documented inline)
  • src/bernstein/core/persistence/journal_diff.py - precise per-field divergence detector
  • src/bernstein/core/persistence/journal_export.py - portable, offline-verifiable receipt format
  • src/bernstein/core/persistence/journal_publish.py - privacy-redacted publish with chain re-anchoring
  • src/bernstein/cli/commands/replay_cmd.py - CLI helpers for the new verbs
  • docs/operations/replay.md - operator-facing documentation
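One way to picture the "chain re-anchoring" in journal_publish.py: once payload fields are redacted, the original step hashes no longer match the visible content, so the chain is recomputed over the redacted rows. The sketch below works under that assumption; the helper names, the "[REDACTED]" marker, and the choice of prompt as the secret field are all hypothetical.

```python
import hashlib
import json
from typing import Any

CHAINED_FIELDS = ("input_hash", "model", "prompt", "tool_call", "tool_result")
GENESIS_HASH = "0" * 64  # hypothetical prev_hash for the first step


def _hash_row(prev_hash: str, row: dict[str, Any]) -> str:
    """Hash the chained fields of a row together with its predecessor's hash."""
    payload = {"prev_hash": prev_hash, **{k: row[k] for k in CHAINED_FIELDS}}
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def build_chain(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of rows with prev_hash and step_hash filled in."""
    chained, prev = [], GENESIS_HASH
    for row in rows:
        digest = _hash_row(prev, row)
        chained.append({**{k: row[k] for k in CHAINED_FIELDS},
                        "prev_hash": prev, "step_hash": digest})
        prev = digest
    return chained


def verify_chain(rows: list[dict[str, Any]]) -> bool:
    """Recompute every hash and check the links between steps."""
    prev = GENESIS_HASH
    for row in rows:
        if row["prev_hash"] != prev or row["step_hash"] != _hash_row(prev, row):
            return False
        prev = row["step_hash"]
    return True


def redact_and_reanchor(rows: list[dict[str, Any]],
                        secret_fields: tuple[str, ...] = ("prompt",)) -> list[dict[str, Any]]:
    """Replace secret fields with a marker, then rebuild the chain over the result."""
    redacted = [{**row, **{f: "[REDACTED]" for f in secret_fields}} for row in rows]
    return build_chain(redacted)
```

In this sketch the published chain has a different head from the private one; how the real module links the two heads is not shown here.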

Modified:

  • src/bernstein/cli/commands/advanced_cmd.py - extended replay to dispatch the new verbs without breaking the legacy replay <run_id> shape
  • src/bernstein/cli/commands/session_cmd.py - added --from-step and --prompt to session fork
  • src/bernstein/core/sessions/fork.py - fork_session now supports from_step; seeds the fork journal with the parent chain prefix
  • src/bernstein/core/security/audit.py - additive new event-type constants (replay.step, replay.fork, replay.export, replay.publish)
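The fork_session change can be sketched as follows. This is an illustration, not the code in fork.py: it assumes the prefix is inclusive of step n, and append_step and fork_from_step are hypothetical helpers operating on an in-memory list rather than a worktree journal.

```python
import hashlib
import json
from typing import Any

GENESIS_HASH = "0" * 64  # hypothetical root for an unforked run


def append_step(journal: list[dict[str, Any]], fields: dict[str, Any]) -> None:
    """Append a step whose hash commits to the previous step's hash."""
    prev_hash = journal[-1]["step_hash"] if journal else GENESIS_HASH
    payload = {"prev_hash": prev_hash, **fields}
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    journal.append({**payload, "step_hash": hashlib.sha256(encoded).hexdigest()})


def fork_from_step(parent: list[dict[str, Any]], n: int) -> list[dict[str, Any]]:
    """Seed a fork journal with copies of parent steps 0..n (inclusive)."""
    if not 0 <= n < len(parent):
        raise IndexError(f"step {n} is outside the parent journal")
    return [dict(row) for row in parent[: n + 1]]
```

Steps appended to the fork chain from the parent's step n, so parent and fork share a verifiable prefix and diverge only afterwards.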

Tests:

  • tests/unit/test_journal_chain.py - 19 cases for step-hash determinism, chain integrity, atomic append, reconstruction
  • tests/unit/test_journal_divergence.py - precise field-level diff
  • tests/unit/test_journal_export.py - receipt format + tamper detection
  • tests/unit/test_journal_publish.py - opt-in publish + redaction
  • tests/unit/test_replay_journal_cli.py - CLI helper exit-code contract
  • tests/integration/test_fork_from_step.py - fork-from-step end-to-end + backward-compat regression net
  • tests/integration/test_replay_divergence.py - forced non-determinism in a test adapter -> precise diff
  • tests/integration/test_replay_receipt_roundtrip.py - export+verify offline + signed receipt + redacted publish
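For readers unfamiliar with the receipt idea that these tests exercise, here is a deliberately simplified sketch: a gzip tarball holding a manifest and the chain, where the manifest pins the chain's SHA-256. The member names, manifest key, and function names are assumptions for the example; the real format also carries CAS blobs and walks the chain step by step, which this does not.

```python
import hashlib
import io
import json
import tarfile


def _add_member(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Add an in-memory file to the archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def pack_receipt(chain_jsonl: bytes, manifest: dict) -> bytes:
    """Bundle a manifest and a chain file into one gzip-compressed tarball."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        _add_member(tar, "manifest.json", json.dumps(manifest, sort_keys=True).encode("utf-8"))
        _add_member(tar, "chain.jsonl", chain_jsonl)
    return buf.getvalue()


def export_receipt(chain_jsonl: bytes) -> bytes:
    """Build a receipt whose manifest records the digest of the chain file."""
    manifest = {"chain_sha256": hashlib.sha256(chain_jsonl).hexdigest()}
    return pack_receipt(chain_jsonl, manifest)


def check_receipt(receipt: bytes) -> bool:
    """Offline check: the chain bytes must match the digest in the manifest."""
    with tarfile.open(fileobj=io.BytesIO(receipt), mode="r:gz") as tar:
        manifest_file = tar.extractfile("manifest.json")
        chain_file = tar.extractfile("chain.jsonl")
        if manifest_file is None or chain_file is None:
            return False
        manifest = json.loads(manifest_file.read())
        chain = chain_file.read()
    return hashlib.sha256(chain).hexdigest() == manifest.get("chain_sha256")
```

A digest in the manifest only detects accidental or unsophisticated tampering; anyone who can rewrite the tarball can rewrite the manifest too, which is presumably why the integration tests also cover a signed receipt.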

Acceptance criteria

  • Per-step journal under .sdd/runtime/journal/<agent_id>/<bucket>.jsonl with atomic append (one line per call under a lock)
  • step_hash = SHA256(canonical_json({prev_hash, input_hash, model, prompt, tool_call, tool_result})); head hash is the run identity
  • bernstein replay <agent_id> verifies the chain matches the recorded head before rendering
  • bernstein session fork <session_id> --from-step <n> materialises a sibling worktree and records the parent step hash as the chain root
  • Non-determinism surfaces as a precise field diff via replay diff-journal and the StepDivergence dataclass; orchestrator never silently accepts a divergent replay
  • bernstein replay export <agent_id> produces an offline-verifiable receipt (tarball with canonical manifest + chain + CAS blobs)
  • Local-only default; bernstein replay publish requires explicit --opt-in; re-anchors the chain to redacted payloads so verification still works post-redaction
  • Existing bernstein git undo, plain bernstein session fork (no --from-step), and the audit-slice extractor work unchanged
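The first three criteria (one JSONL line per append under a lock, a hash chain, and head verification before rendering) can be sketched together. This is an assumption-laden illustration: it uses an in-process threading.Lock where the real journal may use a cross-process file lock, hashes the whole row rather than the six named fields, and the function names are hypothetical.

```python
import hashlib
import json
import threading
from pathlib import Path
from typing import Any

GENESIS_HASH = "0" * 64  # hypothetical prev_hash for the first step
_APPEND_LOCK = threading.Lock()  # in-process only; a stand-in for the real lock


def _digest(row: dict[str, Any]) -> str:
    """SHA-256 over a deterministic JSON encoding of the row."""
    encoded = json.dumps(row, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def _last_hash(path: Path) -> str:
    """Return the step_hash of the last journal line, or the genesis value."""
    if not path.exists():
        return GENESIS_HASH
    lines = path.read_text(encoding="utf-8").splitlines()
    return json.loads(lines[-1])["step_hash"] if lines else GENESIS_HASH


def append_step(path: Path, fields: dict[str, Any]) -> str:
    """Append exactly one line per call while holding the lock."""
    with _APPEND_LOCK:
        row = {**fields, "prev_hash": _last_hash(path)}
        row["step_hash"] = _digest(row)  # digest is taken before the key is added
        with path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(row, sort_keys=True) + "\n")
        return row["step_hash"]


def verify_head(path: Path, recorded_head: str) -> bool:
    """Walk the journal, recompute every hash, and compare the final head."""
    prev_hash = GENESIS_HASH
    for line in path.read_text(encoding="utf-8").splitlines():
        row = json.loads(line)
        claimed = row.pop("step_hash")
        if row["prev_hash"] != prev_hash or _digest(row) != claimed:
            return False
        prev_hash = claimed
    return prev_hash == recorded_head
```

Reading the previous hash and writing the new line inside the same critical section is what keeps concurrent writers from producing two steps with the same prev_hash.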

Test plan

  • uv run pytest tests/unit -q --no-cov --timeout=120 -k "journal or replay or fork" (290 passed locally)
  • uv run pytest tests/integration/test_fork_from_step.py tests/integration/test_replay_divergence.py tests/integration/test_replay_receipt_roundtrip.py -q --no-cov --timeout=120 (13 passed locally)
  • uv run ruff check src/ tests/ (clean)
  • uv run ruff format --check src/bernstein/core/persistence/journal*.py src/bernstein/cli/commands/replay_cmd.py (clean)
  • uv run pyright --project pyrightconfig.strict.json (strict zone clean)
  • CI green

Adds a per-step replay surface on top of lineage v1. Each agent step
is written to a hash-chained journal under .sdd/runtime/journal/
where step_hash = SHA256(canonical_json({prev_hash, input_hash,
model, prompt, tool_call, tool_result})). The chain head is the
run's verifiable identity.

CLI verbs (additive on top of the existing replay command):

- bernstein replay <agent_id> renders the per-step view and verifies
  the chain matches the recorded head before any rendering.
- bernstein session fork <session_id> --from-step <n> seeds the fork
  worktree journal with the parent prefix [0..n] and records the
  parent step_hash so the chain becomes a tree.
- bernstein replay export <agent_id> writes a portable tarball
  receipt; verify_receipt walks the chain offline.
- bernstein replay publish <agent_id> --opt-in runs redaction and
  re-anchors the chain so the published receipt still verifies.
- bernstein replay diff-journal A B surfaces the first divergent
  field rather than a flaky-test signature.
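To make "first divergent field" concrete, here is a small sketch of a field-level journal comparison. The Divergence dataclass and first_divergence function are hypothetical stand-ins; the PR's own StepDivergence dataclass may carry different fields.

```python
from dataclasses import dataclass
from typing import Any, Optional

COMPARED_FIELDS = ("input_hash", "model", "prompt", "tool_call", "tool_result")


@dataclass(frozen=True)
class Divergence:
    """Where two journals first disagree (hypothetical shape)."""
    step: int
    field: str
    expected: Any
    actual: Any


def first_divergence(a: list[dict[str, Any]], b: list[dict[str, Any]]) -> Optional[Divergence]:
    """Return the first differing (step, field) pair, or None if the journals agree."""
    for index, (row_a, row_b) in enumerate(zip(a, b)):
        for name in COMPARED_FIELDS:
            if row_a.get(name) != row_b.get(name):
                return Divergence(index, name, row_a.get(name), row_b.get(name))
    if len(a) != len(b):
        # One journal is a strict prefix of the other.
        return Divergence(min(len(a), len(b)), "<length>", len(a), len(b))
    return None
```

Comparing fields in a fixed order and stopping at the first mismatch gives a stable, reportable location instead of a bare "hashes differ".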

New audit event-type entries (replay.step, replay.fork,
replay.export, replay.publish) ride the existing HMAC-chained audit
log; existing entries are untouched.
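For context on what "HMAC-chained" means for these entries, a generic sketch follows. It is not the code in audit.py: the entry layout, the empty-string initial prev_mac, and the function names are assumptions; only the event-type strings come from this PR.

```python
import hashlib
import hmac
import json
from typing import Any


def _mac(key: bytes, body: dict[str, Any]) -> str:
    """HMAC-SHA256 over a deterministic JSON encoding of the entry body."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, encoded, hashlib.sha256).hexdigest()


def sign_entry(key: bytes, prev_mac: str, event_type: str, details: dict[str, Any]) -> dict[str, Any]:
    """Create an entry whose MAC covers its content and the previous entry's MAC."""
    body = {"prev_mac": prev_mac, "event_type": event_type, "details": details}
    return {**body, "mac": _mac(key, body)}


def verify_log(key: bytes, entries: list[dict[str, Any]]) -> bool:
    """Check every MAC and every link; an empty prev_mac marks the first entry."""
    prev_mac = ""
    for entry in entries:
        body = {k: entry[k] for k in ("prev_mac", "event_type", "details")}
        if entry["prev_mac"] != prev_mac:
            return False
        if not hmac.compare_digest(_mac(key, body), entry["mac"]):
            return False
        prev_mac = entry["mac"]
    return True
```

Because a new event type is just another value in the MACed body, adding replay.step and the other constants does not disturb entries already in the log, which matches the "existing entries are untouched" claim.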

Backward compatibility: bernstein git undo, plain session fork
(no --from-step), and the audit-slice extractor work unchanged.

Closes #1799
@chernistry chernistry enabled auto-merge (squash) May 21, 2026 20:27

@github-actions

Contributor

Sonar insights (advisory, no merge-block)

Snapshot of bernstein on the configured Sonar instance:

| Metric | Value |
| --- | --- |
| Coverage | 13.4 |
| Code smells | 126 |
| Bugs | 11 |
| Vulnerabilities | 2 |
| Security hotspots | 87 |

Run bernstein doctor sonar locally for the full surface.

This comment is a soft signal. The Sonar scan runs on push to main; the PR check itself never fails on smells.

@github-actions

Contributor

Review-bot acknowledgement summary

  • Must-address findings: 0 (0 acknowledged, 0 open)
  • Informational findings: 0

All must-address findings are resolved or acknowledged.

@github-actions

Contributor

bernstein doctor observe for PR #1810 (feat/1799-step-replay-merkle): ok=0, warn=2, fail=0, error=0, skipped=2

sonar -- WARN (project bernstein)

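| metric | value | delta | threshold | status |
| --- | --- | --- | --- | --- |
| coverage_pct | 13.4% | new | 80.0% | fail |
| code_smells | 126 | new | 50 | warn |
| bugs | 11 | new | 0 | fail |
| vulnerabilities | 2 | new | 0 | warn |
| security_hotspots | 87 | new | 0 | fail |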

code-scanning -- WARN (22 open alert(s))

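| metric | value | delta | threshold | status |
| --- | --- | --- | --- | --- |
| open_alerts | 22 | new | 0 | fail |
| critical_alerts | 0 | new | 0 | ok |
| high_alerts | 2 | new | 0 | warn |
| medium_alerts | 0 | new | - | ok |
| low_alerts | 0 | new | - | ok |

Skipped backends (credentials not configured)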
  • glitchtip: BERNSTEIN_GLITCHTIP_TOKEN not set
  • dt: DTRACK_URL/TOKEN/PROJECT not set

See docs/observability/unified-doctor.md for backend setup notes.

@chernistry chernistry merged commit 6160ea6 into main May 21, 2026
25 of 26 checks passed
@chernistry chernistry deleted the feat/1799-step-replay-merkle branch May 21, 2026 20:28
@github-actions

github-actions Bot commented May 21, 2026

Contributor

Mutation gate (fixed critical paths)

| Module | Kill rate | Threshold | Killed/Total | Status | Notes |
| --- | --- | --- | --- | --- | --- |
| audit_integrity | 100.0% | 70% | 38/38 | | - |
| audit_log | 100.0% | 70% | 57/57 | | - |
| claim_next | 84.9% | 70% | 45/53 | | 8 survivors |
| config_seed_parser | 96.5% | 70% | 56/58 | | budget exceeded; 2 survivors |
| lineage_gate | 80.0% | 75% | 28/35 | | 7 survivors |
| lineage_merge | 75.0% | 75% | 6/8 | | 2 survivors |
| lineage_tips | 93.8% | 75% | 15/16 | | 1 survivor |

Gate is advisory while thresholds stabilise. To kill survivors locally:
uv run python scripts/mutmut_critical.py --only <module>

"tool_result": self.tool_result,
"step_hash": self.step_hash,
"ts": self.ts,
"blob_refs": list(self.blob_refs),
"steps": self.steps,
"bernstein_version": self.bernstein_version,
"created_at": self.created_at,
"blob_digests": list(self.blob_digests),
"steps": self.steps,
"bernstein_version": self.bernstein_version,
"created_at": self.created_at,
"blob_digests": list(self.blob_digests),


# Re-export so callers don't need a separate journal import for the symbol.
JournalError = JournalError

def _redact_row(row: dict[str, Any], policy: RedactionPolicy) -> dict[str, Any]:
    """Return a copy of *row* with redaction policy applied."""
    redacted = dict(row)
@coderabbitai

coderabbitai Bot commented May 21, 2026


Warning

Rate limit exceeded

@chernistry has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 26 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c4ab9640-0878-4b17-9297-fae95e001a54

📥 Commits

Reviewing files that changed from the base of the PR and between da8e804 and 0265d4e.

📒 Files selected for processing (18)
  • docs/operations/replay.md
  • src/bernstein/cli/commands/advanced_cmd.py
  • src/bernstein/cli/commands/replay_cmd.py
  • src/bernstein/cli/commands/session_cmd.py
  • src/bernstein/core/persistence/journal.py
  • src/bernstein/core/persistence/journal_diff.py
  • src/bernstein/core/persistence/journal_export.py
  • src/bernstein/core/persistence/journal_publish.py
  • src/bernstein/core/security/audit.py
  • src/bernstein/core/sessions/fork.py
  • tests/integration/test_fork_from_step.py
  • tests/integration/test_replay_divergence.py
  • tests/integration/test_replay_receipt_roundtrip.py
  • tests/unit/test_journal_chain.py
  • tests/unit/test_journal_divergence.py
  • tests/unit/test_journal_export.py
  • tests/unit/test_journal_publish.py
  • tests/unit/test_replay_journal_cli.py
