Reconciliation: unify public surface to canonical ground truth for Paper 1.1 citation prerequisite (minimum scope) by HeadyZhang · Pull Request #7 · HeadyZhang/agent-audit

HeadyZhang · 2026-06-06T18:11:11Z

Summary

Phase 3B integrated reconciliation. PR contains 11 commits spanning all of Heady's locked Phase 2 + 3A + 3B decisions (D-A, D-B, D-C, D-E E3, D-F, D-G, D-I, plus the precursor test-count + result-file fixes).

Commits

| # | Commit | Decision | What |
|---|--------|----------|------|
| 1 | `bca9af8` | Phase 2C | README.md test badge + dev block 1502/1239 → 1541 |
| 2 | `506ebe3` | Phase 2C | README_CN.md test badge → 1541 |
| 3 | `4116fcd` | Phase 2C | CLAUDE.md test count → 1542/1541 |
| 4 | `d59d484` | Phase 3A Issue 1 | Track `results/layer1_v0.19.0.json` (was untracked) |
| 5 | `7e71dee` | Phase 3B Step 1 | CI gate 0.87 → 0.75 with explicit GT-coverage rationale |
| 6 | `2e121ca` | D-B | `catalog.yaml` total_vulnerabilities 45 → 43 (oracle-encoded truth); set_breakdown 20/12/13 → 18/11/11 + mixed:3 |
| 7 | `df58e4e` | D-C | `labeled_samples.yaml` header 218 → 238 (236 positive + 2 negative); cascade to README/README_CN/CLAUDE/F1_REPRODUCTION |
| 8 | `d293726` | D-G | Add TODO comment to `engine.py:MCP_FINDING_RULES` (dead code, preserve) |
| 9 | `c4bc053` | D-F | Promote AGENT-048 first-class: YAML (×2 mirror) + RULE_CWE_MAPPING + docs/RULES.md row. CWE-863 per scanner (NOT CWE-829 per Heady; flagged in commit message for override) |
| 10 | `698b015` | D-A | Rule count 53 → 72 across README ×3 + README_CN ×2 + Quick Reference 56 → 72 rows + per-ASI table refreshed in EN/CN + CLAUDE.md engine.py comment |
| 11 | `b646b30` | D-E E3 | v0.16-frozen banner on `docs/index.html` GitHub Pages site |

Test plan

`cd packages/audit && poetry run python -m pytest ../../tests/ --tb=no -q` after each commit
1541 passed / 1 skipped baseline maintained from commit 1 through commit 11
Live YAML load (`engine.load_rules()`) returns 52 rules (was 51) including AGENT-048
catalog.yaml + oracle.yaml sum 43 verified post-D-B
labeled_samples.yaml loaded count = 238 verified post-D-C
No working-tree changes to paper/main.tex
No working-tree changes to tests/fixtures/skills/ or skill_scanner.py (H2-A preserved)

Decisions executed

| Decision | Status | Notes |
|----------|--------|-------|
| D-A (rule count) | ✓ 72 (was D2-strict=71, lifted to 72 post-D-F) | README per-ASI table self-consistency restored |
| D-B (AVB count) | ✓ 43 oracle-encoded | set_breakdown convention noted in commit |
| D-C (label count) | ✓ 236 positive + 2 negative | F1_REPRODUCTION.md updated |
| D-E E3 (banner) | ✓ | Full site refresh deferred to separate PR |
| D-F (AGENT-048) | ✓ promoted | CWE divergence flagged: scanner uses CWE-863, Heady locked CWE-829. Commit uses scanner's truth; Heady can override |
| D-G (dead code) | ✓ TODO comment | MCP_FINDING_RULES preserved per locked decision |
| D-I (CLAUDE.md) | ✓ rolled into D-A + D-C + test-count commits | |

Decisions still pending co-author sync (NOT executed)

| Decision | Status | Why deferred |
|----------|--------|--------------|
| D-D (arxiv v2 strategy) | Pending Yi Nian + Yue Zhao | paper/main.tex untouched. Verbatim before/after diffs documented in `INTEGRATED_RECONCILIATION_PLAN.md` §B for co-author review |
| D-E (full GitHub Pages refresh) | Banner ships; full refresh = separate PR | Scope/scheduling |
| D-H (paper-era benchmark archive) | Heady lean H2 = no investment | Documentation gap in `docs/BENCHMARK-RESULTS.md` already exists |

Out of scope (deferred to future workstreams)

TrustAgent audit memo "0/65 agent-specific rules" finding (different project)
skill_scanner.py WIP and tests/fixtures/skills/ (active dev, separate PR)
AGENT-048 CWE-829 vs CWE-863 reconciliation (requires scanner change; flagged for Heady override)
"9 open-source targets" README table reproducibility rebuild

Do not merge yet

Per Phase 3B hard rule 8: Heady reviews and merges. Co-author sign-off (Yi Nian + Yue Zhao) is gating only D-D execution; reconciliation merge does not require co-author sign-off.

Reference deliverables (drafts under `/Users/heady/Documents/agent-audit/`)

ARTIFACT_INVENTORY.md (Phase 1)
RULE_COUNT_VERIFICATION.md, AVB_VERIFICATION.md, OWASP_CATEGORY_VERIFICATION.md, BENCHMARK_RESULT_VERIFICATION.md, TEST_COUNT_VERIFICATION.md (Phase 2A)
RECONCILIATION_PLAN.md, HEADY_DECISIONS_NEEDED.md (Phase 2B)
EXECUTION_REPORT.md (Phase 2C)
LATEST_F1_SEARCH.md (Phase 3A)
F1_DRIFT_INVESTIGATION.md, PHASE_3A_FOLLOWUP_REPORT.md (Phase 3A follow-up)
INTEGRATED_RECONCILIATION_PLAN.md (Phase 3B Section A/B/C/D)
EXECUTION_REPORT_PHASE_3B.md (Phase 3B final, in progress)

Drift-only correction. README badge said `tests-1502%20passed` and the Development block said `# 1239 tests`. Live `poetry run pytest tests/` reports 1541 passed (1 skipped, 1542 collected). Bringing README into self-consistency at a point in time. No canonical-choice taken. Larger reconciliation deferred per HEADY_DECISIONS_NEEDED.md.

…EADME) Drift-only correction. README_CN badge said `tests-1502%20passed`. Live test count is 1541 passed. Matches the English README badge bumped in the previous commit.

… passing) Drift-only correction. CLAUDE.md said "1502 collected (1491 passing on dev env; pytest-asyncio plugin required for full pass)" — but live pytest run produces 1542 collected, 1541 passing, 1 skipped (no pytest-asyncio install required). Also updates the per-line "1239+ tests" comment in the tree diagram to "1541+ tests".

coderabbitai · 2026-06-06T18:11:33Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 88cae467-534d-4efe-aba0-49ec81496eae


📝 Walkthrough


This PR releases agent-audit v0.19.0, updating all public documentation with revised benchmark metrics (F1 0.778, Precision 73.58%, Recall 82.63% from 81 labeled samples), refreshed test counts (1541+), detailed F1 reproducibility guidance, and six example security audit materials demonstrating real-world agent-audit usage including MiroFish scans, OpenClaw reports, SQL injection PoC, vulnerable agent demo, and sample report schema.

Changes

v0.19.0 Release: Metrics Documentation and Security Audit Examples

| Layer / File(s) | Summary |
|-----------------|---------|
| Release version and metrics updates across documentation: `CLAUDE.md`, `README.md`, `README_CN.md`, `docs/BENCHMARK-RESULTS.md`, `docs/COMPETITIVE-COMPARISON.md`, `docs/index.html` | Updated version from v0.18.2 to v0.19.0, refreshed all benchmark metrics (F1: 0.778, Precision: 73.58%, Recall: 82.63%), bumped test counts from 1239+ to 1541+, and reframed historical (AVB-19) vs current benchmark presentation across all public documentation sites. |
| F1 reproducibility documentation and derivation guide: `docs/F1_REPRODUCTION.md` | Comprehensive guide documenting headline F1 (0.778) as raw, reproducible via `tests/benchmark/precision_recall.py` against `labeled_samples.yaml` (GT v2.2); includes reproduction command, exact metrics (TP/FP/FN), derivation steps for prior adjusted F1 (0.842), false positive breakdown by rule range, and CI gate behavior clarification. |
| Examples directory overview and sample report schema: `examples/README.md`, `examples/05-sample-report/sample_data.json` | Added examples/README.md documenting six example folders with descriptions and OWASP ASI coverage matrix; added sample_data.json with complete audit report schema (metadata, findings, OWASP compliance, CVSS mappings). |
| Example security audit reports and scan outputs: `examples/01-mirofish-audit/mirofish_scan.json`, `examples/03-clawskills-deep-scan/*` | Three example audit materials: MiroFish JSON scan report with AGENT-010 system prompt injection, AGENT-034 tool misuse, and supply-chain findings; ClawSkills OWASP ASI report; ClawSkills full security report with executive summary, deep dive findings, triage breakdown, and trust model notes. |
| SQL injection PoC, vulnerable code demo and HackerOne-style report: `examples/04-sql-injection-poc/exploit_demo.py`, `examples/04-sql-injection-poc/poc_report.md` | Complete proof-of-concept for SQL injection via postgres-mcp-server: Python demo with VulnerableExecuteQuery class showing parameterization bypass, regex-guard limitations, and scenario execution (baseline, exfiltration, blocked cases); accompanying HackerOne-style report with vulnerability analysis, attack vectors, impact/CVSS, and remediation with agent-audit v0.19.0 detection attribution. |
| Vulnerable agent demo with LangChain tools and MCP configuration: `examples/06-vulnerable-agent-demo/agent.py`, `examples/06-vulnerable-agent-demo/mcp_config.json` | Demonstration agent module with three unsafe LangChain tools (subprocess shell command execution, unparameterized SQL queries, unrestricted HTTP requests), hard-coded OpenAI API key, and direct user input interpolation in system prompt; paired with MCP config containing filesystem server with hardcoded credentials and untrusted HTTP server reference. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hop along, dear reviewer, through metrics new,
v0.19.0 shines with F1 0.778 true,
From PoCs to examples, examples galore,
Agent-audit's findings are harder to ignore. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.67%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main purpose of the PR: unifying public documentation and metrics to align with canonical ground truth (test counts, benchmark results, F1 metrics) as a prerequisite for Paper 1.1 citation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codecov · 2026-06-06T18:12:38Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

README.md:101 and docs/F1_REPRODUCTION.md cite this file as the checked-in reproducibility artifact for the v0.19.0 / 2026-05-30 benchmark snapshot:

TP 195 / FP 70 / FN 41
Precision 73.58% / Recall 82.63% / F1 0.7784

The file existed only in working trees as untracked output. Cold clones could not access it, breaking the README's "result file checked in at [results/layer1_v0.19.0.json]" claim. This commit brings the repo state into alignment with the public-doc claim.

.gitignore does not exclude this path (verified via `git check-ignore`). The Phase 3A search located the drift; no upstream regeneration needed.
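The headline metrics in the snapshot follow directly from the three confusion counts. The following is a minimal sketch of that arithmetic using the standard precision/recall/F1 definitions; the function name `prf1` is illustrative and is not taken from `tests/benchmark/precision_recall.py`.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Equivalent to the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Counts quoted in the commit message for the v0.19.0 snapshot.
precision, recall, f1 = prf1(tp=195, fp=70, fn=41)
print(f"Precision {precision:.2%} / Recall {recall:.2%} / F1 {f1:.4f}")
# prints: Precision 73.58% / Recall 82.63% / F1 0.7784
```

Note that TP + FN = 236, which agrees with the 236 positive labels reported for `labeled_samples.yaml`.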

The v0.17+ rule expansion (AGENT-053..064, AGENT-110..120) added detections that fire on real patterns but lack GT v2.2 labels. This temporarily lowers measurable F1:

- raw F1 ~ 0.778 (cold-clone reproducible via `python tests/benchmark/precision_recall.py`)
- adjusted F1 ~ 0.842 (excluding 41 FPs from GT-unlabeled v0.17+ rules; documented in docs/F1_REPRODUCTION.md)

Gate will be restored to 0.87 as GT v2.3 catches up to v0.17+ rule coverage.

Rationale for explicit commit (not silent edit): the prior gate of 0.87 reflected the pre-v0.17 detection surface. Lowering the gate without record would obscure that the baseline shifted; this commit makes the shift explicit and bounded by an expected restoration condition.
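To make the effect of the threshold change concrete, here is a small sketch of a pass/fail gate. How the CI job actually reads and compares F1 is not shown in this PR, so `passes_gate` and the constant names are assumptions for illustration only.

```python
# F1 values quoted in the commit message.
RAW_F1 = 0.778
ADJUSTED_F1 = 0.842

# Gate thresholds before and after this commit.
OLD_GATE = 0.87
NEW_GATE = 0.75


def passes_gate(f1: float, threshold: float) -> bool:
    """Hypothetical gate: pass when measured F1 meets the threshold."""
    return f1 >= threshold


for name, value in [("raw", RAW_F1), ("adjusted", ADJUSTED_F1)]:
    print(name, passes_gate(value, OLD_GATE), passes_gate(value, NEW_GATE))
# prints:
# raw False True
# adjusted False True
```

Under this reading, neither the raw nor the adjusted figure clears the old 0.87 gate, while both clear 0.75.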

…D-B)

D-B locked: oracle.yaml is the source of truth (it's what run_eval.py reads); catalog.yaml's `statistics.total_vulnerabilities` field was a stale 45.

Recount across all 23 oracle.yaml files by taxonomy.set_class:

- Set A (Injection/RCE): 18 (non-noise)
- Set B (MCP/Components): 11
- Set C (Data/Auth): 11
- mixed (noise T12/T13): 3
- Total: 43

Changes:

- version 1.1 → 1.2
- date_updated 2026-03-04 → 2026-06-06
- total_vulnerabilities 45 → 43
- set_breakdown A 20→18, B 12→11, C 13→11, +mixed: 3

Note on set_breakdown convention: the prior 20/12/13 totals presumably included noise samples apportioned to A/B/C. The recount uses each sample's oracle.yaml taxonomy.set_class strictly, with noise samples under a new "mixed" key (matching the catalog's existing noise entries where set: "mixed").

Refs: D-B (HEADY_DECISIONS_NEEDED.md)
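The recount amounts to tallying one `set_class` value per vulnerability and summing. A minimal sketch follows; the flat list of classes is a synthetic stand-in for what would be read from the oracle.yaml files (the real file layout is not shown here), and only the per-class totals come from the commit message.

```python
from collections import Counter

# Synthetic stand-in: one set_class entry per oracle-encoded vulnerability.
set_classes = ["A"] * 18 + ["B"] * 11 + ["C"] * 11 + ["mixed"] * 3

set_breakdown = Counter(set_classes)
total_vulnerabilities = sum(set_breakdown.values())

# The stale catalog.yaml figures, for comparison.
stale_breakdown = {"A": 20, "B": 12, "C": 13}
stale_total = sum(stale_breakdown.values())

print(dict(set_breakdown))                 # {'A': 18, 'B': 11, 'C': 11, 'mixed': 3}
print(stale_total, total_vulnerabilities)  # 45 43
```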

The YAML header field `total_vulnerabilities: 218` was stale. precision_recall.py loads 238 labels (236 positive + 2 negative; the 2 negatives are FP-discrimination cases used to verify the scanner does NOT flag those lines). This commit decomposes the header into 236 positive + 2 negative to clarify the semantic distinction.

Cascading propagation:

- labeled_samples.yaml header: 218 → 238 (236 positive + 2 negative)
- README.md:95 headline: "218 vulnerability labels" → "236 positive + 2 negative"
- README.md:229 details: "218 labels" → "236 positive + 2 negative labels"
- README_CN.md:96 (Chinese): "218 条漏洞标签" → "236 条阳性 + 2 条阴性标签"
- README_CN.md:210 (Chinese): "218 条标签" → "236 条阳性 + 2 条阴性标签"
- CLAUDE.md:11 metrics line: "218 labels" → "236 positive + 2 negative labels"
- docs/F1_REPRODUCTION.md: header line refreshed (218 stale → 238 reconciled)

Also renames "agent-vuln-bench GT v2.2" → "ground-truth dataset v2.2" in README headlines, since AVB (catalog.yaml) and the labeled-samples GT are distinct artifacts (see AVB_VERIFICATION.md from Phase 2A).

Refs: D-C (HEADY_DECISIONS_NEEDED.md)
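A header that can drift from the data it describes is easy to guard with a consistency check. The sketch below is one way such a check could look; the `Label` type, its `is_vulnerable` field, and the `summarize` helper are hypothetical and do not reflect the actual schema of `labeled_samples.yaml`.

```python
from dataclasses import dataclass


@dataclass
class Label:
    sample_id: str
    is_vulnerable: bool  # hypothetical field; the real YAML schema may differ


def summarize(labels: list[Label]) -> dict[str, int]:
    """Split a label list into positive and negative counts."""
    positive = sum(1 for label in labels if label.is_vulnerable)
    return {
        "total": len(labels),
        "positive": positive,
        "negative": len(labels) - positive,
    }


# Synthetic labels with the same split the commit message reports.
labels = [Label(f"pos-{i}", True) for i in range(236)]
labels += [Label(f"neg-{i}", False) for i in range(2)]

summary = summarize(labels)
print(summary)                  # {'total': 238, 'positive': 236, 'negative': 2}
print(summary["total"] == 218)  # False: the stale header would be caught
```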

D-G locked: do not delete the MCP_FINDING_RULES dict (zero usages across packages/audit/) — preserve in case a future contributor wires scanners through the engine path. Add a TODO comment so the next reader doesn't burn cycles re-investigating the lack of usages. Refs: D-G (HEADY_DECISIONS_NEEDED.md)

…ndary (D-F)

Per D-F locked decision, AGENT-048 (emitted by privilege_scanner.py since v0.8.0) is now documented as a first-class rule. Previously the rule fired at runtime but had no YAML definition, no RULE_CWE_MAPPING entry, and no docs/RULES.md row, making it invisible to users browsing public docs while still affecting their scan output.

Verified alignment with privilege_scanner.py:1058-1059 per hard rule 5:

- owasp_id = "ASI-04": matches D-F locked decision
- cwe_id = "CWE-863": DOES NOT match D-F's locked "CWE-829"

This commit uses the scanner's actual CWE (CWE-863) per hard rule 5 ("verify alignment before commit"). The scanner is the source of truth; the YAML and RULE_CWE_MAPPING follow it. If Heady prefers CWE-829, both the scanner emission and the YAML/mapping need to be updated in a follow-up; that's out of scope for reconciliation since it changes scan output.

Files changed:

- rules/builtin/asi_coverage_v030.yaml (+1 rule entry)
- packages/audit/agent_audit/rules/builtin/asi_coverage_v030.yaml (mirror)
- packages/audit/agent_audit/rules/engine.py (RULE_CWE_MAPPING +1)
- docs/RULES.md (Quick Reference +1 row)

Live verification: `engine.load_rules()` now returns 52 rules (was 51), including AGENT-048 with title "Extension Permission Boundary Violation", severity high, category supply_chain_agentic, owasp_agentic_id ASI-04, cwe_id CWE-863.

Refs: D-F (HEADY_DECISIONS_NEEDED.md)
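The alignment check described above compares what the scanner emits with what the locked decision specified. A minimal sketch of such a field-by-field comparison is below; the dictionary layout and the `divergences` helper are illustrative assumptions, and only the field values come from the commit message.

```python
# Values reported in the commit message; the dict layout is hypothetical.
scanner_emission = {"rule_id": "AGENT-048", "owasp_id": "ASI-04", "cwe_id": "CWE-863"}
locked_decision = {"rule_id": "AGENT-048", "owasp_id": "ASI-04", "cwe_id": "CWE-829"}


def divergences(actual: dict[str, str], expected: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {field: (actual, expected)} for every field that disagrees."""
    return {
        key: (actual.get(key, "<missing>"), want)
        for key, want in expected.items()
        if actual.get(key) != want
    }


print(divergences(scanner_emission, locked_decision))
# {'cwe_id': ('CWE-863', 'CWE-829')}
```

An empty result would mean the YAML and mapping can follow the decision as locked; a non-empty one is the case this commit flags for override.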

Per D-A (locked: D2-strict, meaning default-profile, runtime-emittable, CWE-mapped rules). Pre-D-F this was 71; D-F (prior commit c4bc053) adds AGENT-048 to RULE_CWE_MAPPING, lifting D2-strict to 72.

Per-ASI breakdown (D2-strict, 72 rules):

- ASI-01 = 7 (was 6 in README; +1 AGENT-059)
- ASI-02 = 11 (was 9 in README; +2 AGENT-045/047)
- ASI-03 = 12 (was 4 in README; +8 AGENT-043/044/063/084 + existing tags)
- ASI-04 = 17 (was 7 in README; +10 incl. AGENT-049/058/060/062/083/085/048)
- ASI-05 = 5 (was 3 in README; +2 AGENT-046/061)
- ASI-06 = 3 (was 2 in README; +1 AGENT-116)
- ASI-07 = 1 (unchanged)
- ASI-08 = 3 (unchanged)
- ASI-09 = 10 (was 6 in README; +4 AGENT-064/113/118/119)
- ASI-10 = 3 (unchanged)
- TOTAL = 72

Changes:

- README.md: "53 rules" → "72 rules" ×3; per-ASI table refreshed
- README_CN.md: "40+ 规则" → "72 条规则" ×2; Chinese per-ASI table refreshed
- docs/RULES.md: Quick Reference grows 56 → 72 rows (added AGENT-043/044/045/046/047/049/058/059/060/061/062/063/064/083/084/085; AGENT-048 added in prior D-F commit)
- CLAUDE.md: engine.py tree comment "109 rules" → "92 rules" (RULE_CWE_MAPPING now 91 + AGENT-048 from D-F = 92)

Closes README self-inconsistency where the per-ASI table summed to 44 but the headline cited 53. All public artifacts now agree at 72.

Refs: D-A, D-F (HEADY_DECISIONS_NEEDED.md)
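The self-consistency claim can be checked by summing the per-ASI figures quoted above. This sketch uses only numbers from the commit message; the variable names are illustrative.

```python
# Per-ASI rule counts after this commit (D2-strict).
new_counts = {
    "ASI-01": 7, "ASI-02": 11, "ASI-03": 12, "ASI-04": 17, "ASI-05": 5,
    "ASI-06": 3, "ASI-07": 1, "ASI-08": 3, "ASI-09": 10, "ASI-10": 3,
}

# Per-ASI counts previously shown in the README table.
old_readme_counts = {
    "ASI-01": 6, "ASI-02": 9, "ASI-03": 4, "ASI-04": 7, "ASI-05": 3,
    "ASI-06": 2, "ASI-07": 1, "ASI-08": 3, "ASI-09": 6, "ASI-10": 3,
}

new_total = sum(new_counts.values())
old_total = sum(old_readme_counts.values())
deltas = {asi: new_counts[asi] - old_readme_counts[asi] for asi in new_counts}

print(old_total, new_total)  # 44 72
print(sum(deltas.values()))  # 28
```

The old table sums to 44 rather than the 53 its headline cited, and the new table sums to the 72 now used everywhere.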

D-E E3 locked: the docs/index.html GitHub Pages site is frozen at v0.16.0 (49 rules, 1,142 tests, no v0.17+ feature coverage). Visitors landing on the site can be misled into thinking they're seeing current state. This commit adds a non-intrusive banner at the top of the page pointing visitors to the live README for current numbers. A full site refresh is planned as a separate PR. The banner uses inline CSS so it renders without depending on the rest of the v0.16-era stylesheet. Refs: D-E E3 (HEADY_DECISIONS_NEEDED.md)

HeadyZhang added 3 commits June 6, 2026 11:10

docs(README_CN): bump test count badge to 1541 (parity with English R…

506ebe3


HeadyZhang changed the base branch from master to feature/compliance-report June 6, 2026 18:11

HeadyZhang added 8 commits June 6, 2026 13:48

HeadyZhang wants to merge 11 commits into feature/compliance-report from reconciliation/2026-06-06-paper-1-1-prep
