Reconciliation: unify public surface to canonical ground truth for Paper 1.1 citation prerequisite (minimum scope) #7
Drift-only correction. README badge said `tests-1502%20passed` and the Development block said `# 1239 tests`. Live `poetry run pytest tests/` reports 1541 passed (1 skipped, 1542 collected). Bringing README into self-consistency at a point in time. No canonical-choice taken. Larger reconciliation deferred per HEADY_DECISIONS_NEEDED.md.
…EADME) Drift-only correction. README_CN badge said `tests-1502%20passed`. Live test count is 1541 passed. Matches the English README badge bumped in the previous commit.
… passing) Drift-only correction. CLAUDE.md said "1502 collected (1491 passing on dev env; pytest-asyncio plugin required for full pass)" — but live pytest run produces 1542 collected, 1541 passing, 1 skipped (no pytest-asyncio install required). Also updates the per-line "1239+ tests" comment in the tree diagram to "1541+ tests".
Walkthrough
This PR releases agent-audit v0.19.0, updating all public documentation with revised benchmark metrics (F1 0.778, Precision 73.58%, Recall 82.63% from 81 labeled samples), refreshed test counts (1541+), detailed F1 reproducibility guidance, and six example security audit materials demonstrating real-world agent-audit usage including MiroFish scans, OpenClaw reports, SQL injection PoC, vulnerable agent demo, and sample report schema.
README.md:101 and docs/F1_REPRODUCTION.md cite this file as the checked-in reproducibility artifact for the v0.19.0 / 2026-05-30 benchmark snapshot:
  TP 195 / FP 70 / FN 41
  Precision 73.58% / Recall 82.63% / F1 0.7784
The file existed only in working trees as untracked output. Cold clones could not access it, breaking the README's "result file checked in at [results/layer1_v0.19.0.json]" claim. This commit brings the repo state into alignment with the public-doc claim.
.gitignore does not exclude this path (verified via `git check-ignore`). The Phase 3A search located the drift; no upstream regeneration needed.
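The headline metrics follow directly from the confusion counts quoted in this commit message. A minimal sketch of that arithmetic, using the standard definitions; the function name `precision_recall_f1` is illustrative and is not taken from `tests/benchmark/precision_recall.py`:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions computed from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return precision, recall, f1


# Counts quoted for the v0.19.0 snapshot: TP 195 / FP 70 / FN 41
precision, recall, f1 = precision_recall_f1(tp=195, fp=70, fn=41)
print(f"Precision {precision:.2%} / Recall {recall:.2%} / F1 {f1:.4f}")
# Precision 73.58% / Recall 82.63% / F1 0.7784
```

Note that TP + FN = 236, which agrees with the 236 positive labels cited for the ground-truth dataset.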
The v0.17+ rule expansion (AGENT-053..064, AGENT-110..120) added
detections that fire on real patterns but lack GT v2.2 labels.
This temporarily lowers measurable F1:
- raw F1 ~ 0.778 (cold-clone reproducible via
`python tests/benchmark/precision_recall.py`)
- adjusted F1 ~ 0.842 (excluding 41 FPs from GT-unlabeled v0.17+
rules; documented in docs/F1_REPRODUCTION.md)
Gate will be restored to 0.87 as GT v2.3 catches up to v0.17+
rule coverage.
Rationale for explicit commit (not silent edit): the prior gate
of 0.87 reflected the pre-v0.17 detection surface. Lowering the
gate without record would obscure that the baseline shifted; this
commit makes the shift explicit and bounded by an expected
restoration condition.
…D-B)
D-B locked: oracle.yaml is the source of truth (it's what run_eval.py reads); catalog.yaml's `statistics.total_vulnerabilities` field was a stale 45.
Recount across all 23 oracle.yaml files by taxonomy.set_class:
  Set A (Injection/RCE)   : 18 (non-noise)
  Set B (MCP/Components)  : 11
  Set C (Data/Auth)       : 11
  mixed (noise T12/T13)   : 3
  Total                   : 43
Changes:
  version               1.1 → 1.2
  date_updated          2026-03-04 → 2026-06-06
  total_vulnerabilities 45 → 43
  set_breakdown         A 20→18, B 12→11, C 13→11, +mixed: 3
Note on set_breakdown convention: the prior 20/12/13 totals presumably included noise samples apportioned to A/B/C. The recount uses each sample's oracle.yaml taxonomy.set_class strictly, with noise samples under a new "mixed" key (matching the catalog's existing noise entries where set: "mixed").
Refs: D-B (HEADY_DECISIONS_NEEDED.md)
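A recount of this kind can be sketched as a tally keyed on `taxonomy.set_class`. The layout assumed below (a sample-level `taxonomy.set_class` plus a `vulnerabilities` list per parsed oracle) is a guess for illustration, not the confirmed oracle.yaml schema, and the three records are synthetic stand-ins rather than real benchmark samples:

```python
from collections import Counter


def tally_by_set_class(oracles: list[dict]) -> Counter:
    """Count vulnerabilities per set_class across already-parsed oracle documents.

    Assumed (hypothetical) shape of each document:
        {"taxonomy": {"set_class": "A"}, "vulnerabilities": [...]}
    """
    counts: Counter = Counter()
    for doc in oracles:
        counts[doc["taxonomy"]["set_class"]] += len(doc["vulnerabilities"])
    return counts


# Synthetic records, only to show the mechanics of the tally.
synthetic = [
    {"taxonomy": {"set_class": "A"}, "vulnerabilities": ["v1", "v2"]},
    {"taxonomy": {"set_class": "B"}, "vulnerabilities": ["v3"]},
    {"taxonomy": {"set_class": "mixed"}, "vulnerabilities": ["v4"]},
]
demo_counts = tally_by_set_class(synthetic)

# Consistency check on the numbers stated in the commit message.
set_breakdown = {"A": 18, "B": 11, "C": 11, "mixed": 3}
total_vulnerabilities = 43
breakdown_matches_total = sum(set_breakdown.values()) == total_vulnerabilities
print(dict(demo_counts), breakdown_matches_total)
```

Keeping noise samples under their own key, as this commit does, means the per-set numbers can be summed back to the header total without any apportioning rule.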
The YAML header field `total_vulnerabilities: 218` was stale. precision_recall.py loads 238 labels (236 positive + 2 negative; the 2 negatives are FP-discrimination cases used to verify the scanner does NOT flag those lines). This commit decomposes the header into 236 positive + 2 negative to clarify the semantic distinction.
Cascading propagation:
  labeled_samples.yaml header : 218 → 238 (236 positive + 2 negative)
  README.md:95 headline       : "218 vulnerability labels" → "236 positive + 2 negative"
  README.md:229 details       : "218 labels" → "236 positive + 2 negative labels"
  README_CN.md:96 (Chinese)   : "218 条漏洞标签" → "236 条阳性 + 2 条阴性标签"
  README_CN.md:210 (Chinese)  : "218 条标签" → "236 条阳性 + 2 条阴性标签"
  CLAUDE.md:11 metrics line   : "218 labels" → "236 positive + 2 negative labels"
  docs/F1_REPRODUCTION.md     : header line refreshed (218 stale → 238 reconciled)
Also renames "agent-vuln-bench GT v2.2" → "ground-truth dataset v2.2" in README headlines, since AVB (catalog.yaml) and the labeled-samples GT are distinct artifacts (see AVB_VERIFICATION.md from Phase 2A).
Refs: D-C (HEADY_DECISIONS_NEEDED.md)
D-G locked: do not delete the MCP_FINDING_RULES dict (zero usages across packages/audit/) — preserve in case a future contributor wires scanners through the engine path. Add a TODO comment so the next reader doesn't burn cycles re-investigating the lack of usages. Refs: D-G (HEADY_DECISIONS_NEEDED.md)
…ndary (D-F)
Per D-F locked decision, AGENT-048 (emitted by privilege_scanner.py
since v0.8.0) is now documented as a first-class rule. Previously
the rule fired at runtime but had no YAML definition, no
RULE_CWE_MAPPING entry, and no docs/RULES.md row — making it
invisible to users browsing public docs while still affecting their
scan output.
Verified alignment with privilege_scanner.py:1058-1059 per hard rule 5:
owasp_id = "ASI-04" matches D-F locked decision
cwe_id = "CWE-863" DOES NOT match D-F's locked "CWE-829"
This commit uses the scanner's actual CWE (CWE-863) per hard rule 5
("verify alignment before commit"). The scanner is the source of
truth; the YAML and RULE_CWE_MAPPING follow it. If Heady prefers
CWE-829, both the scanner emission and the YAML/mapping need to be
updated in a follow-up — that's out of scope for reconciliation
since it changes scan output.
Files changed:
rules/builtin/asi_coverage_v030.yaml (+1 rule entry)
packages/audit/agent_audit/rules/builtin/asi_coverage_v030.yaml (mirror)
packages/audit/agent_audit/rules/engine.py (RULE_CWE_MAPPING +1)
docs/RULES.md (Quick Reference +1 row)
Live verification: `engine.load_rules()` now returns 52 rules (was 51),
including AGENT-048 with title "Extension Permission Boundary Violation",
severity high, category supply_chain_agentic, owasp_agentic_id ASI-04,
cwe_id CWE-863.
Refs: D-F (HEADY_DECISIONS_NEEDED.md)
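The alignment check described in this commit message amounts to a field-by-field comparison between what the scanner emits and what the decision record says. A minimal sketch, with the two dictionaries filled in from the values quoted above; the helper `find_mismatches` is hypothetical and does not exist in the repository:

```python
def find_mismatches(emitted: dict[str, str], documented: dict[str, str]) -> dict[str, tuple]:
    """Return {field: (emitted_value, documented_value)} for every field that disagrees."""
    return {
        field: (value, documented.get(field))
        for field, value in emitted.items()
        if documented.get(field) != value
    }


# Values quoted in the commit message for AGENT-048.
scanner_emission = {"owasp_id": "ASI-04", "cwe_id": "CWE-863"}
locked_decision = {"owasp_id": "ASI-04", "cwe_id": "CWE-829"}

mismatches = find_mismatches(scanner_emission, locked_decision)
print(mismatches)
# {'cwe_id': ('CWE-863', 'CWE-829')}
```

Treating the scanner emission as the left-hand side mirrors the commit's stance that the scanner is the source of truth and the YAML and mapping follow it.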
Per D-A (locked: D2-strict, meaning default-profile, runtime-emittable, CWE-mapped rules). Pre-D-F this was 71; D-F (prior commit c4bc053) adds AGENT-048 to RULE_CWE_MAPPING, lifting D2-strict to 72.
Per-ASI breakdown (D2-strict, 72 rules):
  ASI-01 = 7   (was 6 in README; +1 AGENT-059)
  ASI-02 = 11  (was 9 in README; +2 AGENT-045/047)
  ASI-03 = 12  (was 4 in README; +8 AGENT-043/044/063/084 + existing tags)
  ASI-04 = 17  (was 7 in README; +10 incl. AGENT-049/058/060/062/083/085/048)
  ASI-05 = 5   (was 3 in README; +2 AGENT-046/061)
  ASI-06 = 3   (was 2 in README; +1 AGENT-116)
  ASI-07 = 1   (unchanged)
  ASI-08 = 3   (unchanged)
  ASI-09 = 10  (was 6 in README; +4 AGENT-064/113/118/119)
  ASI-10 = 3   (unchanged)
  TOTAL  = 72
Changes:
  README.md     : "53 rules" → "72 rules" ×3; per-ASI table refreshed
  README_CN.md  : "40+ 规则" → "72 条规则" ×2; Chinese per-ASI table refreshed
  docs/RULES.md : Quick Reference grows 56 → 72 rows (added AGENT-043/044/045/046/047/049/058/059/060/061/062/063/064/083/084/085; AGENT-048 added in prior D-F commit)
  CLAUDE.md     : engine.py tree comment "109 rules" → "92 rules" (RULE_CWE_MAPPING now 91 + AGENT-048 from D-F = 92)
Closes README self-inconsistency where the per-ASI table summed to 44 but the headline cited 53. All public artifacts now agree at 72.
Refs: D-A, D-F (HEADY_DECISIONS_NEEDED.md)
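The self-inconsistency this commit closes can be checked mechanically by summing the per-ASI table and comparing it with the headline. A short sketch using only the counts listed in the commit message; the variable names are illustrative:

```python
# Per-ASI counts as previously shown in the README, and after the D2-strict recount.
readme_before = {
    "ASI-01": 6, "ASI-02": 9, "ASI-03": 4, "ASI-04": 7, "ASI-05": 3,
    "ASI-06": 2, "ASI-07": 1, "ASI-08": 3, "ASI-09": 6, "ASI-10": 3,
}
d2_strict = {
    "ASI-01": 7, "ASI-02": 11, "ASI-03": 12, "ASI-04": 17, "ASI-05": 5,
    "ASI-06": 3, "ASI-07": 1, "ASI-08": 3, "ASI-09": 10, "ASI-10": 3,
}

before_total = sum(readme_before.values())
after_total = sum(d2_strict.values())

# Per-category deltas, keeping only the categories that changed.
deltas = {asi: d2_strict[asi] - readme_before[asi] for asi in d2_strict}
changed = {asi: d for asi, d in deltas.items() if d != 0}

print(before_total, after_total)
# 44 72
print(changed)
# {'ASI-01': 1, 'ASI-02': 2, 'ASI-03': 8, 'ASI-04': 10, 'ASI-05': 2, 'ASI-06': 1, 'ASI-09': 4}
```

The old table sums to 44 against a headline of 53, while the refreshed table sums to the same 72 that the headline now cites.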
D-E E3 locked: the docs/index.html GitHub Pages site is frozen at v0.16.0 (49 rules, 1,142 tests, no v0.17+ feature coverage). Visitors landing on the site can be misled into thinking they're seeing current state. This commit adds a non-intrusive banner at the top of the page pointing visitors to the live README for current numbers. A full site refresh is planned as a separate PR. The banner uses inline CSS so it renders without depending on the rest of the v0.16-era stylesheet. Refs: D-E E3 (HEADY_DECISIONS_NEEDED.md)
Summary
Phase 3B integrated reconciliation. PR contains 11 commits spanning all of Heady's locked Phase 2 + 3A + 3B decisions (D-A, D-B, D-C, D-E E3, D-F, D-G, D-I, plus the precursor test-count + result-file fixes).
Commits
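- `bca9af8`: README test count (drift-only)
- `506ebe3`: README_CN test count (drift-only)
- `4116fcd`: CLAUDE.md test count (drift-only)
- `d59d484`: `results/layer1_v0.19.0.json` (was untracked)
- `7e71dee`: F1 gate adjustment for the v0.17+ rule expansion
- `2e121ca`: `catalog.yaml` total_vulnerabilities 45 → 43 (oracle-encoded truth); set_breakdown 20/12/13 → 18/11/11 + mixed:3
- `df58e4e`: `labeled_samples.yaml` header 218 → 238 (236 positive + 2 negative); cascade to README/README_CN/CLAUDE/F1_REPRODUCTION
- `d293726`: `engine.py:MCP_FINDING_RULES` (dead code, preserve)
- `c4bc053`: AGENT-048 documented as a first-class rule (D-F)
- `698b015`: rule count and per-ASI table unified at 72 (D-A)
- `b646b30`: banner on the `docs/index.html` GitHub Pages site
Test plan
- `cd packages/audit && poetry run python -m pytest ../../tests/ --tb=no -q` after each commit
- `engine.load_rules()` returns 52 rules (was 51), including AGENT-048
- … `paper/main.tex`
- … `tests/fixtures/skills/` or `skill_scanner.py` (H2-A preserved)
Decisions executed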
Decisions still pending co-author sync (NOT executed)
- … `INTEGRATED_RECONCILIATION_PLAN.md` §B for co-author review
- … `docs/BENCHMARK-RESULTS.md` already exists
Out of scope (deferred to future workstreams)
- `skill_scanner.py` WIP and `tests/fixtures/skills/` (active dev, separate PR)
Do not merge yet
Per Phase 3B hard rule 8: Heady reviews and merges. Co-author sign-off (Yi Nian + Yue Zhao) is gating only D-D execution; reconciliation merge does not require co-author sign-off.
Reference deliverables (drafts under `/Users/heady/Documents/agent-audit/`)
- `ARTIFACT_INVENTORY.md` (Phase 1)
- `RULE_COUNT_VERIFICATION.md`, `AVB_VERIFICATION.md`, `OWASP_CATEGORY_VERIFICATION.md`, `BENCHMARK_RESULT_VERIFICATION.md`, `TEST_COUNT_VERIFICATION.md` (Phase 2A)
- `RECONCILIATION_PLAN.md`, `HEADY_DECISIONS_NEEDED.md` (Phase 2B)
- `EXECUTION_REPORT.md` (Phase 2C)
- `LATEST_F1_SEARCH.md` (Phase 3A)
- `F1_DRIFT_INVESTIGATION.md`, `PHASE_3A_FOLLOWUP_REPORT.md` (Phase 3A follow-up)
- `INTEGRATED_RECONCILIATION_PLAN.md` (Phase 3B Section A/B/C/D)
- `EXECUTION_REPORT_PHASE_3B.md` (Phase 3B final, in progress)