feat: v0.22.0 — semantic session diff (compare outcomes between two runs) #38

Merged: Siddhant-K-code merged 1 commit into main from feat/v0.22.0-semantic-diff on Apr 11, 2026

Closes #28

What

Extends diff.py with a --semantic mode that compares two sessions by their outcomes (cost, duration, errors, retries, files touched, commands run) rather than by their phase structure.

Output

Semantic diff: a1b2c3d4e5f6 vs b7c8d9e0f1a2
─────────────────────────────────────────────────────────────────────
                                 Session A    Session B    Change
─────────────────────────────────────────────────────────────────────
  Duration                          3m22s        2m14s      -33%
  Cost                            $0.0067      $0.0041      -39%
  Errors                                2            0      -100%
  Tool calls                           18           14       -22%
  LLM requests                          6            5       -17%
  Retries                               3            1       -67%
─────────────────────────────────────────────────────────────────────
  Files read (both)    src/main.py, tests/test_foo.py
  Files written (A only)  dist/bundle.js
  Files written (B only)  dist/bundle.min.js
  Commands (both)      pytest
─────────────────────────────────────────────────────────────────────
  Verdict: B is better
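
For reference, the Change column is consistent with a plain relative change from Session A to Session B, rounded to a whole percent. The sketch below is an illustration only: `percent_change` and `format_change` are hypothetical names, not functions confirmed to exist in diff.py, and the handling of a zero baseline is an assumption.

```python
from typing import Optional


def percent_change(a: float, b: float) -> Optional[int]:
    """Relative change from a to b, as a whole-number percentage.

    Returns None when a is 0, because the change is undefined
    without a non-zero baseline (assumed behaviour).
    """
    if a == 0:
        return None
    return round((b - a) / a * 100)


def format_change(a: float, b: float) -> str:
    """Render the change the way the table shows it, e.g. '-39%'."""
    pct = percent_change(a, b)
    if pct is None:
        return "n/a"  # hypothetical placeholder for an undefined change
    return f"{pct:+d}%"


# Rows taken from the sample output above.
rows = {
    "Cost": (0.0067, 0.0041),
    "Errors": (2, 0),
    "Tool calls": (18, 14),
    "LLM requests": (6, 5),
    "Retries": (3, 1),
}
changes = {name: format_change(a, b) for name, (a, b) in rows.items()}
```

The Duration row cannot be checked the same way from the displayed values, which are shown only to whole seconds.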

Verdict logic

The verdict considers four metrics: errors, cost, duration, and retries. B is better when it wins on more of them than A and regresses on none; the same rule, mirrored, yields "A is better". Ties → inconclusive.
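
One way to read this rule, as a minimal sketch. It assumes that lower is better for all four metrics and that any mixed result (each side wins at least one metric) counts as inconclusive. The `verdict` function and the dictionary keys are hypothetical, not the actual diff.py implementation.

```python
# Metrics the verdict is based on, per the description above.
VERDICT_METRICS = ("errors", "cost", "duration", "retries")


def verdict(a: dict, b: dict) -> str:
    """Compare two sessions' metrics; lower is assumed better for each."""
    a_wins = sum(1 for m in VERDICT_METRICS if a[m] < b[m])
    b_wins = sum(1 for m in VERDICT_METRICS if b[m] < a[m])

    # A side is "better" only if it wins somewhere and loses nowhere.
    if b_wins > 0 and a_wins == 0:
        return "B is better"
    if a_wins > 0 and b_wins == 0:
        return "A is better"
    # Full ties and mixed results both fall through (assumption).
    return "inconclusive"


# Values from the sample output; durations converted to seconds.
session_a = {"errors": 2, "cost": 0.0067, "duration": 202, "retries": 3}
session_b = {"errors": 0, "cost": 0.0041, "duration": 134, "retries": 1}
result = verdict(session_a, session_b)
```

With the sample sessions, B wins all four metrics, which matches the "B is better" verdict shown in the output.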

Eval integration

When --eval-config points to a .agent-evals.yaml, eval scores for both sessions are included in the diff table.

CLI

agent-strace diff <session-a> <session-b> --semantic [--eval-config .agent-evals.yaml]
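
For illustration, a command line of this shape could be declared with argparse roughly as follows. This is a hypothetical reconstruction using only the subcommand and flags listed above; `build_parser`, the destination names, and the help strings are assumptions, and the real parser in agent-strace may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented `diff` invocation."""
    parser = argparse.ArgumentParser(prog="agent-strace")
    sub = parser.add_subparsers(dest="command", required=True)

    diff = sub.add_parser("diff", help="compare two sessions")
    diff.add_argument("session_a", metavar="<session-a>")
    diff.add_argument("session_b", metavar="<session-b>")
    # --semantic switches from the structural diff to the outcome-level diff.
    diff.add_argument("--semantic", action="store_true",
                      help="compare outcomes instead of phase structure")
    # Optional path to an eval config such as .agent-evals.yaml.
    diff.add_argument("--eval-config", metavar="PATH", default=None,
                      help="include eval scores for both sessions")
    return parser


args = build_parser().parse_args(
    ["diff", "a1b2c3d4e5f6", "b7c8d9e0f1a2", "--semantic"]
)
```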

Tests

tests/test_semantic_diff.py — 10 tests.

diff.py gains a --semantic mode that compares two sessions at the
outcome level: cost, duration, errors, retries, files read/written,
commands run, and optional eval scores. Reports which files/commands
were unique to each session and gives a verdict (A is better / B is
better / inconclusive) based on errors, cost, duration, and retries.
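
The "both / A only / B only" grouping of files and commands amounts to set intersection and difference. A minimal sketch, where `partition` is a hypothetical helper rather than a function confirmed to exist in diff.py:

```python
from typing import Dict, Iterable, List


def partition(a_items: Iterable[str], b_items: Iterable[str]) -> Dict[str, List[str]]:
    """Split items into those seen in both sessions and those unique to one."""
    a, b = set(a_items), set(b_items)
    return {
        "both": sorted(a & b),    # present in both sessions
        "A only": sorted(a - b),  # unique to session A
        "B only": sorted(b - a),  # unique to session B
    }


# Files written in each session, taken from the sample output.
written = partition(["dist/bundle.js"], ["dist/bundle.min.js"])

# Files read in each session, taken from the sample output.
read = partition(
    ["src/main.py", "tests/test_foo.py"],
    ["tests/test_foo.py", "src/main.py"],
)
```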

CLI: agent-strace diff <session-a> <session-b> --semantic [--eval-config]

Closes #28

Co-authored-by: Ona <no-reply@ona.com>
Siddhant-K-code merged commit 13a916d into main on Apr 11, 2026
4 checks passed
Siddhant-K-code deleted the feat/v0.22.0-semantic-diff branch on April 11, 2026 at 17:37

Linked issue: v0.22.0: Session diff - compare what changed between two runs of the same task