Commit ea774c6

feat: strengthen timeline claim checks
1 parent 7eb1d9f commit ea774c6

8 files changed

Lines changed: 233 additions & 11 deletions

PROJECT_STATUS.md

Lines changed: 2 additions & 2 deletions
@@ -21,7 +21,7 @@ CodeVetter is a local-first desktop workbench for checking agent-generated code.
 - Evidence Pattern Search has its first vertical slice: deterministic risk candidate packets are generated from changed files, sensitive paths, optional `ast-grep` structural matches, blast/history context, and verification signals; the top ranked candidates and procedure gates are injected into CLI review prompts, returned in review metadata, shown in the Review sidebar, can be resolved per review, persist candidate outcomes locally, write durable procedure events for QA/fix/test-evidence runs, suggest scored verification commands from prior pass/fail recency, package-manager-aware repo scripts, changed/finding file affinity, and artifacts, run explicit local verification commands with cancelable timeout-bounded stdout/stderr log artifacts, derive additional execution links from finding/browser evidence, show a fuller procedure event timeline, and are included in copied reviewer proof with outcomes, artifacts, gates, linked execution events, and blocked-on reasons.
 - The catch-rate benchmark harness can now compare stored review outputs with and without deterministic evidence search using `--evidence-comparison=with:without`, including catch-rate, precision/F1, false-positive, redundant-match, newly-caught, and regressed-ID deltas in JSON and Markdown reports.
 - Review Memory Graph has a repo-level and review-scoped first slice: Repo Unpacked now persists a deterministic local `repo_graph` artifact with package scripts, routes, Tauri commands, DB tables, tests, and decision markers; exports local graph JSON plus agent-context markdown sidecars for optional Graphify/Hunk-style interop without adding either dependency; imports graph JSON only through an explicit Repo Unpacked file action, validates CodeVetter or loose graph-shaped JSON, and renders imported graphs as non-mutating previews; CLI review results also carry a bounded local graph over changed files, evidence candidates, procedure gates, blast/history context, the review prompt includes that graph neighborhood, Review shows a compact graph panel plus selected-finding graph focus, copied reviewer proof includes both full graph and focused finding graph nodes/edges, and Review can copy a selected finding as a Hunk-style agent-context note with file/line, evidence status, local history, focused graph, and next verification actions.
-- Agent Verification Timeline has a first normalized spine: Review now builds a shared task/review/QA/evidence/claim-check/fix/worktree timeline contract, attaches bounded raw-session command anchors to evidence rows with transcript excerpts, adds a dedicated Claim check row for failed/stale command claims, explicit extracted agent claims, positive test/check claims contradicted by failed/stale command evidence, unchecked findings, unresolved post-fix QA, and fixes without same-flow reruns, attaches edit-origin anchors for fix changed files to worktree rows, renders timeline stages and anchors in the sidebar, exposes first-class jump targets for findings, files, QA artifacts, fix worktrees, command source anchors, and edited files, shows same-flow post-fix QA before/after deltas with artifact anchors on the QA row, can copy segment-scoped fix packets directly from Review/Evidence/QA/Fix/Worktree timeline rows, includes clicked-row timeline replay metadata and transcript snippets in those packets, and includes the same timeline plus source/event/artifact/jump/edit/transcript anchors in copied reviewer proof.
+- Agent Verification Timeline has a first normalized spine: Review now builds a shared task/review/QA/evidence/claim-check/fix/worktree timeline contract, attaches bounded raw-session command anchors to evidence rows with transcript excerpts, adds a dedicated Claim check row for failed/stale command claims, unknown verification-command outcomes, explicit extracted agent claims, positive test/check claims contradicted by failed/stale command evidence, unchecked findings, latest QA failures without post-fix comparison, unresolved post-fix QA, fixes without same-flow reruns, evidence-count-only loops with no executable proof, and clean loops with passed command/QA proof counts, attaches edit-origin anchors for fix changed files to worktree rows, renders timeline stages and anchors in the sidebar, exposes first-class jump targets for findings, files, QA artifacts, fix worktrees, command source anchors, and edited files, shows same-flow post-fix QA before/after deltas with artifact anchors on the QA row, can copy segment-scoped fix packets directly from Review/Evidence/QA/Fix/Worktree timeline rows, includes clicked-row timeline replay metadata and transcript snippets in those packets, and includes the same timeline plus source/event/artifact/jump/edit/transcript anchors in copied reviewer proof.
 - Codebase History Explainer has a first file-level slice: Review now builds bounded, cited "why this code exists" explanations from local commits, decision markers, recurring findings, agent notes, and command anchors, renders them in the sidebar, and includes them in copied reviewer proof.
 - AI Session Intelligence has a first local scorecard slice: indexed sessions produce a schema-versioned six-dimension scorecard with cited evidence refs, anti-gaming notes, recommendations, normalized Claude/Codex/Cursor adapter coverage summaries, production and scorecard adapter run metadata/parse warnings in `session_adapter_runs`, a compact `session_message_archive` for normalized adapter messages/tool calls, local backfill for previously indexed Claude/Codex sessions missing archive rows, FTS-backed local archive search over messages/tool calls, startup/periodic/manual archive update events, a shared raw parser adapter contract with Claude/Codex/Cursor fixtures, Claude/Codex/Cursor production indexing wired through that contract, a Tauri IPC contract, compact Roadmap dashboard panels for scorecard and source health, visible adapter run status, per-adapter run trends, and recent-run drilldowns.
 - Home now opens directly into a usage dashboard panel with Today/Week/Month/Year counters above error or roadmap noise. The Verification Workbench launcher and latest build banner live on the Roadmap page, so Evidence Search, Agent Timeline, Synthetic QA, Memory Graph, History Brief, AI Sessions, transcript replay, claim checks, replay packets, timeline fix packets, edit origins, and post-fix QA deltas remain visible without pushing usage down on app launch.
@@ -31,7 +31,7 @@ CodeVetter is a local-first desktop workbench for checking agent-generated code.
 ## Planned Next
 
 1. Continue the AI Session Intelligence PRD in `docs/PRD-AI-SESSION-INTELLIGENCE.md`: add direct filesystem-watch tailing for currently open transcript files if periodic indexing proves too coarse; Claude, Codex, and Cursor production session rows now use the normalized raw parser adapter contract, production index passes persist adapter run metadata/parse warnings plus compact message/tool-call archive rows, local backfill repairs older Claude/Codex sessions with missing archive rows, Roadmap shows latest source-health status with per-adapter trends and recent-run drilldowns, Roadmap can search the normalized archive locally, and startup/periodic/manual indexes emit archive update events.
-2. Continue the Agent Verification Timeline PRD in `docs/PRD-AGENT-VERIFICATION-TIMELINE.md`: add richer claim-vs-evidence discrepancy parsing beyond literal positive test/check claim contradictions, command/QA/evidence-count signals, and fuller multi-turn transcript replay; raw session command anchors with bounded transcript excerpts, explicit agent-claim anchors, dedicated claim-check rows, edit-origin anchors, timeline-specific jump targets, timeline-segment replay packets, same-flow post-fix QA deltas, archive search, and the Roadmap latest-build banner are now attached to visible Review/Roadmap actions and proof export.
+2. Continue the Agent Verification Timeline PRD in `docs/PRD-AGENT-VERIFICATION-TIMELINE.md`: add scope-drift/repeated-edit discrepancy parsing and fuller multi-turn transcript replay; raw session command anchors with bounded transcript excerpts, explicit agent-claim anchors, command/QA/evidence-count claim-check signals, edit-origin anchors, timeline-specific jump targets, timeline-segment replay packets, same-flow post-fix QA deltas, archive search, and the Roadmap latest-build banner are now attached to visible Review/Roadmap actions and proof export.
 3. Continue the Codebase History Explainer PRD in `docs/PRD-CODEBASE-HISTORY-EXPLAINER.md`: turn the persisted `history_brief` slice into a queryable local history graph; Repo Unpacked history brief integration and agent-context sidecar export are now implemented.
 4. Curate 20-30 real public agent-generated PR benchmark cases with hand-labeled ground truth before making external catch-rate claims.
 5. Add benchmark fields for unverified-fix count and time/cost impact once review artifacts capture those values consistently.
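
As a rough illustration of the Claim check row described in the status notes above, the sketch below tallies gap anchors into a "N blocking, M need proof" summary. The `ClaimAnchor` shape and the `summarizeClaimCheck` helper are hypothetical simplifications, not the project's code. Treating `failed` anchors as blocking and `unknown` anchors as needing proof is an assumption, and the sketch ignores the separate "No claims checked yet" state.

```typescript
// Hypothetical, simplified anchor shape; the real timeline anchor carries more fields.
type GapStatus = "failed" | "unknown";

interface ClaimAnchor {
  label: string;
  status: GapStatus;
}

interface ClaimCheckSummary {
  status: "blocked" | "active" | "done";
  detail: string;
}

// Assumption: failed anchors block the row, unknown anchors only need proof.
function summarizeClaimCheck(anchors: ClaimAnchor[]): ClaimCheckSummary {
  if (anchors.length === 0) {
    // No gaps collected: treat the row as clean in this sketch.
    return { status: "done", detail: "No claim/evidence gaps detected" };
  }
  const blocking = anchors.filter((anchor) => anchor.status === "failed").length;
  const needProof = anchors.filter((anchor) => anchor.status === "unknown").length;
  return {
    status: blocking > 0 ? "blocked" : "active",
    detail: `${blocking} blocking, ${needProof} need proof`,
  };
}

const unknownCommand = summarizeClaimCheck([
  { label: "Unverified command outcome: npm run test:checkout", status: "unknown" },
]);
console.log(unknownCommand.status, "/", unknownCommand.detail);

const failingQa = summarizeClaimCheck([
  { label: "Latest QA still failing: /checkout", status: "failed" },
]);
console.log(failingQa.status, "/", failingQa.detail);
```

A single blocking anchor is enough to mark the row `blocked` here, which keeps a failing QA run from being hidden behind softer "need proof" gaps.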

apps/desktop/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@code-reviewer/desktop",
-  "version": "1.1.48",
+  "version": "1.1.49",
   "private": true,
   "scripts": {
     "dev": "lsof -ti:1420 | xargs kill -9 2>/dev/null; vite",

apps/desktop/src-tauri/tauri.conf.json

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
   "$schema": "https://raw.githubusercontent.com/tauri-apps/tauri/dev/crates/tauri-utils/schema.json",
   "identifier": "com.codevetter.desktop",
   "productName": "CodeVetter",
-  "version": "1.1.48",
+  "version": "1.1.49",
   "build": {
     "beforeDevCommand": "npm run dev",
     "beforeBuildCommand": "npm run build",

apps/desktop/src/lib/review-proof.test.ts

Lines changed: 132 additions & 0 deletions
@@ -440,6 +440,138 @@ describe("buildReviewerProofMarkdown", () => {
     assert.equal(contradictedClaim?.jump?.kind, "command_source");
   });
 
+  it("flags unknown verification command outcomes as claim-check proof gaps", () => {
+    const timeline = buildVerificationTimeline({
+      runId: "review-unknown-command",
+      review: {
+        findingsCount: 1,
+      },
+      evidenceCounts: {
+        fixed: 0,
+        reproduced: 1,
+        notReproduced: 0,
+      },
+      history: {
+        command_signals: [
+          {
+            agent: "codex",
+            date: "2026-06-12T00:00:00Z",
+            command: "npm run test:checkout",
+            status: "unknown",
+            source: "raw_session",
+            source_path: "/tmp/session.jsonl",
+            source_line: 22,
+            event_id: "cmd-unknown",
+            session_id: "session-unknown-command",
+          },
+        ],
+      },
+    });
+
+    const claimCheckStep = timeline.find((item) => item.id === "claim-check");
+    assert.equal(claimCheckStep?.status, "active");
+    assert.match(claimCheckStep?.detail ?? "", /0 blocking, 1 need proof/);
+    assert.equal(
+      claimCheckStep?.anchors?.[0]?.label,
+      "Unverified command outcome: npm run test:checkout",
+    );
+    assert.equal(claimCheckStep?.anchors?.[0]?.source, "raw_session");
+    assert.equal(claimCheckStep?.anchors?.[0]?.jump?.kind, "command_source");
+  });
+
+  it("blocks claim checks when latest QA is still failing without a comparison", () => {
+    const timeline = buildVerificationTimeline({
+      runId: "review-latest-qa-failed",
+      review: {
+        findingsCount: 1,
+      },
+      qa: {
+        latest: {
+          pass: false,
+          runnerType: "repo_playwright",
+          route: "/checkout",
+          goal: "Complete checkout",
+          durationMs: 900,
+          screenshotPath: "artifacts/latest-fail.png",
+          artifacts: ["artifacts/latest-fail.log"],
+        },
+      },
+      evidenceCounts: {
+        fixed: 1,
+        reproduced: 0,
+        notReproduced: 0,
+      },
+    });
+
+    const claimCheckStep = timeline.find((item) => item.id === "claim-check");
+    assert.equal(claimCheckStep?.status, "blocked");
+    assert.match(claimCheckStep?.detail ?? "", /1 blocking, 0 need proof/);
+    assert.equal(claimCheckStep?.anchors?.[0]?.label, "Latest QA still failing: /checkout");
+    assert.equal(claimCheckStep?.anchors?.[0]?.artifact, "artifacts/latest-fail.png");
+    assert.equal(claimCheckStep?.anchors?.[0]?.jump?.kind, "artifact");
+  });
+
+  it("flags evidence-count-only loops that lack executable proof", () => {
+    const timeline = buildVerificationTimeline({
+      runId: "review-thin-proof",
+      review: {
+        findingsCount: 2,
+      },
+      evidenceCounts: {
+        fixed: 1,
+        reproduced: 1,
+        notReproduced: 0,
+      },
+    });
+
+    const claimCheckStep = timeline.find((item) => item.id === "claim-check");
+    assert.equal(claimCheckStep?.status, "active");
+    assert.match(claimCheckStep?.detail ?? "", /0 blocking, 1 need proof/);
+    assert.equal(
+      claimCheckStep?.anchors?.[0]?.label,
+      "Executable proof missing: 2 evidence statuses for 2 findings",
+    );
+    assert.equal(claimCheckStep?.anchors?.[0]?.source, "review:evidence-strength");
+    assert.match(claimCheckStep?.anchors?.[0]?.contextExcerpt?.join("\n") ?? "", /0 passed verification commands/);
+  });
+
+  it("recognizes passed command proof when claim gaps are clean", () => {
+    const timeline = buildVerificationTimeline({
+      runId: "review-good-loop",
+      review: {
+        findingsCount: 1,
+      },
+      evidenceCounts: {
+        fixed: 0,
+        reproduced: 1,
+        notReproduced: 0,
+      },
+      history: {
+        command_signals: [
+          {
+            agent: "codex",
+            date: "2026-06-12T00:00:00Z",
+            command: "npm run test:checkout",
+            status: "passed",
+            source: "raw_session",
+            source_path: "/tmp/session.jsonl",
+            source_line: 30,
+            event_id: "cmd-passed",
+            session_id: "session-good-loop",
+          },
+        ],
+      },
+    });
+
+    const claimCheckStep = timeline.find((item) => item.id === "claim-check");
+    assert.equal(claimCheckStep?.status, "done");
+    assert.match(
+      claimCheckStep?.detail ?? "",
+      /No claim\/evidence gaps detected · 1 passed verification command/,
+    );
+    assert.equal(claimCheckStep?.anchors?.length, 0);
+  });
+
   it("copies concrete command evidence into finding handoff proof", () => {
     const history = new Map<number, HistoryFindingSummary>();
     history.set(0, {

apps/desktop/src/lib/review-proof.ts

Lines changed: 91 additions & 1 deletion
@@ -449,6 +449,20 @@ function isPositiveVerificationClaim(claim: string): boolean {
   ].some((pattern) => pattern.test(normalized));
 }
 
+function isVerificationCommandLabel(label: string): boolean {
+  const normalized = label.trim().toLowerCase();
+  return [
+    /\b(?:npm|pnpm|yarn|bun)\s+(?:run\s+)?(?:test|lint|build|typecheck|check|e2e|playwright)\b/,
+    /\b(?:cargo\s+(?:test|clippy|build)|go\s+test|pytest|vitest|jest|tsc|eslint|playwright|cypress)\b/,
+    /\b(?:test|lint|build|typecheck|check|e2e|qa|ci)\b/,
+  ].some((pattern) => pattern.test(normalized));
+}
+
+function latestQaArtifact(run: NonNullable<VerificationTimelineInput["qa"]>["latest"]): string | null {
+  if (!run) return null;
+  return run.screenshotPath ?? run.artifacts?.[0] ?? null;
+}
+
 function buildClaimCheckTimelineAnchors(
   input: VerificationTimelineInput,
   commandAnchors: VerificationTimelineAnchor[],
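
To see which labels the new `isVerificationCommandLabel` helper accepts, the standalone copy below can be run against a few sample labels. The sample labels are made up for illustration and do not come from the commit. Note that the third pattern is broad: any label containing a bare word such as `test` counts as a verification command.

```typescript
// Standalone copy of the helper added in the hunk above.
function isVerificationCommandLabel(label: string): boolean {
  const normalized = label.trim().toLowerCase();
  return [
    /\b(?:npm|pnpm|yarn|bun)\s+(?:run\s+)?(?:test|lint|build|typecheck|check|e2e|playwright)\b/,
    /\b(?:cargo\s+(?:test|clippy|build)|go\s+test|pytest|vitest|jest|tsc|eslint|playwright|cypress)\b/,
    /\b(?:test|lint|build|typecheck|check|e2e|qa|ci)\b/,
  ].some((pattern) => pattern.test(normalized));
}

// Hypothetical sample labels, chosen only to exercise each pattern.
const sampleLabels: string[] = [
  "npm run test:checkout", // package-manager script (first pattern)
  "cargo clippy",          // known tool invocation (second pattern)
  "cat test.txt",          // bare keyword "test" (third pattern)
  "git checkout main",     // "checkout" is not the whole word "check"
  "git status",            // no verification keyword at all
];

for (const label of sampleLabels) {
  console.log(`${label} -> ${isVerificationCommandLabel(label)}`);
}
```

The first three labels match and the last two do not. Whether a label like `cat test.txt` should count as verification proof is a design question this sketch only surfaces.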
@@ -459,6 +473,12 @@ function buildClaimCheckTimelineAnchors(
   const runId = input.runId?.trim() || "active-review";
   const findingsCount = Math.max(0, input.review?.findingsCount ?? 0);
   const uncheckedCount = Math.max(0, findingsCount - evidenceTotal);
+  const passedVerificationCommandCount = commandAnchors.filter(
+    (anchor) => anchor.status === "passed" && isVerificationCommandLabel(anchor.label),
+  ).length;
+  const successfulQaProofCount =
+    (input.qa?.latest?.pass ? 1 : 0) +
+    (qaComparison && qaComparisonStatusToTimelineStatus(qaComparison.status) === "done" ? 1 : 0);
 
   commandAnchors
     .filter((anchor) => anchor.status === "failed" || anchor.status === "stale")
@@ -473,6 +493,21 @@ function buildClaimCheckTimelineAnchors(
       });
     });
 
+  commandAnchors
+    .filter((anchor) => anchor.status === "unknown" && isVerificationCommandLabel(anchor.label))
+    .slice(0, 2)
+    .forEach((anchor) => {
+      anchors.push({
+        ...anchor,
+        id: `claim:unknown-command:${anchor.id}`,
+        label: `Unverified command outcome: ${anchor.label}`,
+        status: "unknown",
+        contextExcerpt: anchor.contextExcerpt?.length
+          ? anchor.contextExcerpt
+          : ["Command was observed without a pass/fail status; rerun it or attach its log before trusting the claim."],
+      });
+    });
+
   const contradictingCommand = commandAnchors.find(
     (anchor) => anchor.status === "failed" || anchor.status === "stale",
   );
@@ -513,6 +548,27 @@ function buildClaimCheckTimelineAnchors(
     });
   }
 
+  if (input.qa?.latest && !input.qa.latest.pass && !qaComparison) {
+    const artifact = latestQaArtifact(input.qa.latest);
+    anchors.push({
+      id: `${runId}:claim:latest-qa-failed`,
+      label: `Latest QA still failing: ${input.qa.latest.route ?? input.qa.latest.goal}`,
+      source: `qa:${input.qa.latest.runnerType}`,
+      status: "failed",
+      sourcePath: artifact,
+      eventId: `${runId}:claim:latest-qa-failed`,
+      sessionId: runId,
+      artifact,
+      jump: artifact
+        ? {
+            kind: "artifact",
+            label: "Open latest QA artifact",
+            path: artifact,
+          }
+        : null,
+    });
+  }
+
   if (qaComparison) {
     const status = qaComparisonStatusToTimelineStatus(qaComparison.status);
     if (status !== "done") {
@@ -558,6 +614,26 @@ function buildClaimCheckTimelineAnchors(
     });
   }
 
+  if (
+    anchors.length === 0 &&
+    findingsCount > 0 &&
+    evidenceTotal >= findingsCount &&
+    passedVerificationCommandCount + successfulQaProofCount === 0
+  ) {
+    anchors.push({
+      id: `${runId}:claim:executable-proof-missing`,
+      label: `Executable proof missing: ${evidenceTotal} evidence status${evidenceTotal === 1 ? "" : "es"} for ${findingsCount} finding${findingsCount === 1 ? "" : "s"}`,
+      source: "review:evidence-strength",
+      status: "unknown",
+      contextExcerpt: [
+        `${input.evidenceCounts.reproduced} reproduced, ${input.evidenceCounts.fixed} fixed, ${input.evidenceCounts.notReproduced} not reproduced`,
+        "0 passed verification commands, 0 passing QA proofs",
+      ],
+      eventId: `${runId}:claim:executable-proof-missing`,
+      sessionId: runId,
+    });
+  }
+
   return anchors
     .sort((a, b) => statusRank(a.status) - statusRank(b.status))
     .slice(0, 4);
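
The "executable proof missing" guard in the hunk above can be read as a small pure function. The sketch below restates the same four conditions in isolation; `ProofInputs` and `executableProofGapLabel` are hypothetical names introduced for illustration, and the real code pushes a full anchor object rather than returning a label.

```typescript
// Hypothetical input bundle; in the commit these are local variables inside
// buildClaimCheckTimelineAnchors.
interface ProofInputs {
  existingGapCount: number;          // anchors already collected by earlier checks
  findingsCount: number;
  evidenceTotal: number;             // total evidence statuses recorded so far
  passedVerificationCommands: number;
  successfulQaProofs: number;
}

// Returns the gap label when every finding has an evidence status but nothing
// executable (command or QA run) backs those statuses; otherwise returns null.
function executableProofGapLabel(input: ProofInputs): string | null {
  const hasExecutableProof =
    input.passedVerificationCommands + input.successfulQaProofs > 0;
  if (
    input.existingGapCount > 0 ||
    input.findingsCount <= 0 ||
    input.evidenceTotal < input.findingsCount ||
    hasExecutableProof
  ) {
    return null;
  }
  const statuses = `${input.evidenceTotal} evidence status${input.evidenceTotal === 1 ? "" : "es"}`;
  const findings = `${input.findingsCount} finding${input.findingsCount === 1 ? "" : "s"}`;
  return `Executable proof missing: ${statuses} for ${findings}`;
}

console.log(
  executableProofGapLabel({
    existingGapCount: 0,
    findingsCount: 2,
    evidenceTotal: 2,
    passedVerificationCommands: 0,
    successfulQaProofs: 0,
  }),
);
```

Because the guard only fires when no other gap anchor exists, it acts as a fallback signal: a loop with a failed command or failing QA is reported through those anchors instead.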
@@ -722,6 +798,20 @@ export function buildVerificationTimeline(
     evidenceTotal,
   );
   const failedCommandCount = commandAnchors.filter((anchor) => anchor.status === "failed").length;
+  const passedVerificationCommandCount = commandAnchors.filter(
+    (anchor) => anchor.status === "passed" && isVerificationCommandLabel(anchor.label),
+  ).length;
+  const successfulQaProofCount =
+    (latestQa?.pass ? 1 : 0) +
+    (qaComparison && qaComparisonStatusToTimelineStatus(qaComparison.status) === "done" ? 1 : 0);
+  const proofSignalDetail = [
+    passedVerificationCommandCount > 0
+      ? `${passedVerificationCommandCount} passed verification command${passedVerificationCommandCount === 1 ? "" : "s"}`
+      : null,
+    successfulQaProofCount > 0
+      ? `${successfulQaProofCount} QA proof${successfulQaProofCount === 1 ? "" : "s"}`
+      : null,
+  ].filter(Boolean).join(", ");
   const selectedFindingIndex = input.review?.selectedFindingIndex ?? null;
   const firstFindingPath = input.review?.firstFindingPath?.trim();
   const firstFindingLine = input.review?.firstFindingLine ?? null;
@@ -796,7 +886,7 @@ export function buildVerificationTimeline(
   const claimCheckDetail = claimCheckAnchors.length > 0
     ? `${blockedClaimCount} blocking, ${pendingClaimCount} need proof`
     : claimCheckStatus === "done"
-      ? "No claim/evidence gaps detected"
+      ? `No claim/evidence gaps detected${proofSignalDetail ? ` · ${proofSignalDetail}` : ""}`
       : "No claims checked yet";
   const claimCheckJump = claimCheckAnchors.find((anchor) => anchor.jump)?.jump ?? null;
 
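
The clean-loop detail string built in the last two hunks can be sketched in isolation. `formatCleanClaimDetail` and `pluralize` are hypothetical helper names used only for this illustration; the commit builds the same string inline from `passedVerificationCommandCount` and `successfulQaProofCount`.

```typescript
// Hypothetical helper: appends "s" unless the count is exactly 1.
function pluralize(count: number, singular: string): string {
  return `${count} ${singular}${count === 1 ? "" : "s"}`;
}

// Mirrors the proofSignalDetail logic: only non-zero counts are listed,
// and the suffix is omitted entirely when there is no executable proof.
function formatCleanClaimDetail(passedCommands: number, qaProofs: number): string {
  const parts: string[] = [];
  if (passedCommands > 0) parts.push(pluralize(passedCommands, "passed verification command"));
  if (qaProofs > 0) parts.push(pluralize(qaProofs, "QA proof"));
  const proofSignalDetail = parts.join(", ");
  return `No claim/evidence gaps detected${proofSignalDetail ? ` · ${proofSignalDetail}` : ""}`;
}

console.log(formatCleanClaimDetail(1, 0));
// prints "No claim/evidence gaps detected · 1 passed verification command"
console.log(formatCleanClaimDetail(2, 1));
// prints "No claim/evidence gaps detected · 2 passed verification commands, 1 QA proof"
console.log(formatCleanClaimDetail(0, 0));
// prints "No claim/evidence gaps detected"
```

The first call reproduces the string asserted by the "recognizes passed command proof" test in this commit.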