Reconciliation: unify public surface to canonical ground truth for Paper 1.1 citation prerequisite (minimum scope) #7

Open
HeadyZhang wants to merge 11 commits into feature/compliance-report from reconciliation/2026-06-06-paper-1-1-prep

@HeadyZhang HeadyZhang commented Jun 6, 2026

Owner

Summary

Phase 3B integrated reconciliation. PR contains 11 commits spanning all of Heady's locked Phase 2 + 3A + 3B decisions (D-A, D-B, D-C, D-E E3, D-F, D-G, D-I, plus the precursor test-count + result-file fixes).

Commits

| # | Commit | Decision | What |
|---|---|---|---|
| 1 | bca9af8 | Phase 2C | README.md test badge + dev block 1502/1239 → 1541 |
| 2 | 506ebe3 | Phase 2C | README_CN.md test badge → 1541 |
| 3 | 4116fcd | Phase 2C | CLAUDE.md test count → 1542/1541 |
| 4 | d59d484 | Phase 3A Issue 1 | Track results/layer1_v0.19.0.json (was untracked) |
| 5 | 7e71dee | Phase 3B Step 1 | CI gate 0.87 → 0.75 with explicit GT-coverage rationale |
| 6 | 2e121ca | D-B | catalog.yaml total_vulnerabilities 45 → 43 (oracle-encoded truth); set_breakdown 20/12/13 → 18/11/11 + mixed:3 |
| 7 | df58e4e | D-C | labeled_samples.yaml header 218 → 238 (236 positive + 2 negative); cascade to README/README_CN/CLAUDE/F1_REPRODUCTION |
| 8 | d293726 | D-G | Add TODO comment to engine.py:MCP_FINDING_RULES (dead code, preserve) |
| 9 | c4bc053 | D-F | Promote AGENT-048 first-class: YAML (×2 mirror) + RULE_CWE_MAPPING + docs/RULES.md row. CWE-863 per scanner (NOT CWE-829 per Heady; flagged in commit message for override) |
| 10 | 698b015 | D-A | Rule count 53 → 72 across README ×3 + README_CN ×2 + Quick Reference 56 → 72 rows + per-ASI table refreshed in EN/CN + CLAUDE.md engine.py comment |
| 11 | b646b30 | D-E E3 | v0.16-frozen banner on docs/index.html GitHub Pages site |

Test plan

  • `cd packages/audit && poetry run python -m pytest ../../tests/ --tb=no -q` after each commit
  • 1541 passed / 1 skipped baseline maintained from commit 1 through commit 11
  • Live YAML load (`engine.load_rules()`) returns 52 rules (was 51) including AGENT-048
  • catalog.yaml + oracle.yaml sum 43 verified post-D-B
  • labeled_samples.yaml loaded count = 238 verified post-D-C
  • No working-tree changes to paper/main.tex
  • No working-tree changes to tests/fixtures/skills/ or skill_scanner.py (H2-A preserved)

Decisions executed

| Decision | Status | Notes |
|---|---|---|
| D-A (rule count) | ✓ 72 (was D2-strict=71, lifted to 72 post-D-F) | README per-ASI table self-consistency restored |
| D-B (AVB count) | ✓ 43 oracle-encoded | set_breakdown convention noted in commit |
| D-C (label count) | ✓ 236 positive + 2 negative | F1_REPRODUCTION.md updated |
| D-E E3 (banner) | | Full site refresh deferred to separate PR |
| D-F (AGENT-048) | ✓ promoted | CWE divergence flagged: scanner uses CWE-863, Heady locked CWE-829. Commit uses scanner's truth; Heady can override |
| D-G (dead code) | ✓ TODO comment | MCP_FINDING_RULES preserved per locked decision |
| D-I (CLAUDE.md) | ✓ | rolled into D-A + D-C + test-count commits |

Decisions still pending co-author sync (NOT executed)

| Decision | Status | Why deferred |
|---|---|---|
| D-D (arxiv v2 strategy) | Pending Yi Nian + Yue Zhao | paper/main.tex untouched. Verbatim before/after diffs documented in INTEGRATED_RECONCILIATION_PLAN.md §B for co-author review |
| D-E (full GitHub Pages refresh) | Banner ships; full refresh = separate PR | Scope/scheduling |
| D-H (paper-era benchmark archive) | Heady lean H2 = no investment | Documentation gap in docs/BENCHMARK-RESULTS.md already exists |

Out of scope (deferred to future workstreams)

  • TrustAgent audit memo "0/65 agent-specific rules" finding (different project)
  • skill_scanner.py WIP and tests/fixtures/skills/ (active dev, separate PR)
  • AGENT-048 CWE-829 vs CWE-863 reconciliation (requires scanner change; flagged for Heady override)
  • "9 open-source targets" README table reproducibility rebuild

Do not merge yet

Per Phase 3B hard rule 8: Heady reviews and merges. Co-author sign-off (Yi Nian + Yue Zhao) is gating only D-D execution; reconciliation merge does not require co-author sign-off.

Reference deliverables (drafts under /Users/heady/Documents/agent-audit/)

  • ARTIFACT_INVENTORY.md (Phase 1)
  • RULE_COUNT_VERIFICATION.md, AVB_VERIFICATION.md, OWASP_CATEGORY_VERIFICATION.md, BENCHMARK_RESULT_VERIFICATION.md, TEST_COUNT_VERIFICATION.md (Phase 2A)
  • RECONCILIATION_PLAN.md, HEADY_DECISIONS_NEEDED.md (Phase 2B)
  • EXECUTION_REPORT.md (Phase 2C)
  • LATEST_F1_SEARCH.md (Phase 3A)
  • F1_DRIFT_INVESTIGATION.md, PHASE_3A_FOLLOWUP_REPORT.md (Phase 3A follow-up)
  • INTEGRATED_RECONCILIATION_PLAN.md (Phase 3B Section A/B/C/D)
  • EXECUTION_REPORT_PHASE_3B.md (Phase 3B final, in progress)

Drift-only correction. README badge said `tests-1502%20passed` and
the Development block said `# 1239 tests`. Live `poetry run pytest
tests/` reports 1541 passed (1 skipped, 1542 collected). This brings
the README into self-consistency as of this commit.

No canonical choice taken. Larger reconciliation deferred per
HEADY_DECISIONS_NEEDED.md.
…EADME)

Drift-only correction. README_CN badge said `tests-1502%20passed`.
Live test count is 1541 passed. Matches the English README badge
bumped in the previous commit.
… passing)

Drift-only correction. CLAUDE.md said "1502 collected (1491 passing
on dev env; pytest-asyncio plugin required for full pass)" — but
live pytest run produces 1542 collected, 1541 passing, 1 skipped
(no pytest-asyncio install required). Also updates the per-line
"1239+ tests" comment in the tree diagram to "1541+ tests".

coderabbitai Bot commented Jun 6, 2026


Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 88cae467-534d-4efe-aba0-49ec81496eae

Walkthrough

This PR releases agent-audit v0.19.0, updating all public documentation with revised benchmark metrics (F1 0.778, Precision 73.58%, Recall 82.63% from 81 labeled samples), refreshed test counts (1541+), detailed F1 reproducibility guidance, and six example security audit materials demonstrating real-world agent-audit usage including MiroFish scans, OpenClaw reports, SQL injection PoC, vulnerable agent demo, and sample report schema.

Changes

v0.19.0 Release: Metrics Documentation and Security Audit Examples

| Layer | File(s) | Summary |
|---|---|---|
| Release version and metrics updates across documentation | CLAUDE.md, README.md, README_CN.md, docs/BENCHMARK-RESULTS.md, docs/COMPETITIVE-COMPARISON.md, docs/index.html | Updated version from v0.18.2 to v0.19.0, refreshed all benchmark metrics (F1: 0.778, Precision: 73.58%, Recall: 82.63%), bumped test counts from 1239+ to 1541+, and reframed historical (AVB-19) vs current benchmark presentation across all public documentation sites. |
| F1 reproducibility documentation and derivation guide | docs/F1_REPRODUCTION.md | Comprehensive guide documenting headline F1 (0.778) as raw, reproducible via tests/benchmark/precision_recall.py against labeled_samples.yaml (GT v2.2); includes reproduction command, exact metrics (TP/FP/FN), derivation steps for prior adjusted F1 (0.842), false positive breakdown by rule range, and CI gate behavior clarification. |
| Examples directory overview and sample report schema | examples/README.md, examples/05-sample-report/sample_data.json | Added examples/README.md documenting six example folders with descriptions and OWASP ASI coverage matrix; added sample_data.json with complete audit report schema (metadata, findings, OWASP compliance, CVSS mappings). |
| Example security audit reports and scan outputs | examples/01-mirofish-audit/mirofish_scan.json, examples/03-clawskills-deep-scan/* | Three example audit materials: MiroFish JSON scan report with AGENT-010 system prompt injection, AGENT-034 tool misuse, and supply-chain findings; ClawSkills OWASP ASI report; ClawSkills full security report with executive summary, deep dive findings, triage breakdown, and trust model notes. |
| SQL injection PoC: vulnerable code demo and HackerOne-style report | examples/04-sql-injection-poc/exploit_demo.py, examples/04-sql-injection-poc/poc_report.md | Complete proof-of-concept for SQL injection via postgres-mcp-server: Python demo with VulnerableExecuteQuery class showing parameterization bypass, regex-guard limitations, and scenario execution (baseline, exfiltration, blocked cases); accompanying HackerOne-style report with vulnerability analysis, attack vectors, impact/CVSS, and remediation with agent-audit v0.19.0 detection attribution. |
| Vulnerable agent demo with LangChain tools and MCP configuration | examples/06-vulnerable-agent-demo/agent.py, examples/06-vulnerable-agent-demo/mcp_config.json | Demonstration agent module with three unsafe LangChain tools (subprocess shell command execution, unparameterized SQL queries, unrestricted HTTP requests), hard-coded OpenAI API key, and direct user input interpolation in system prompt; paired with MCP config containing filesystem server with hardcoded credentials and untrusted HTTP server reference. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hop along, dear reviewer, through metrics new,
v0.19.0 shines with F1 0.778 true,
From PoCs to examples, examples galore,
Agent-audit's findings are harder to ignore.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main purpose of the PR: unifying public documentation and metrics to align with canonical ground truth (test counts, benchmark results, F1 metrics) as a prerequisite for Paper 1.1 citation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@HeadyZhang HeadyZhang changed the base branch from master to feature/compliance-report June 6, 2026 18:11

README.md:101 and docs/F1_REPRODUCTION.md cite this file as the
checked-in reproducibility artifact for the v0.19.0 / 2026-05-30
benchmark snapshot:

  TP 195 / FP 70 / FN 41
  Precision 73.58% / Recall 82.63% / F1 0.7784

The file existed only in working trees as untracked output. Cold
clones could not access it, breaking the README's "result file
checked in at [results/layer1_v0.19.0.json]" claim. This commit
brings the repo state into alignment with the public-doc claim.

.gitignore does not exclude this path (verified via
`git check-ignore`). The Phase 3A search located the drift; no
upstream regeneration needed.
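
The headline figures quoted in this commit message can be re-derived from the TP/FP/FN counts alone. The following is a minimal sketch using the standard precision/recall/F1 definitions; the function name `precision_recall_f1` is hypothetical and is not taken from `tests/benchmark/precision_recall.py`.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts quoted in the commit message for the v0.19.0 snapshot.
precision, recall, f1 = precision_recall_f1(tp=195, fp=70, fn=41)
print(f"Precision {precision:.2%} / Recall {recall:.2%} / F1 {f1:.4f}")
# Precision 73.58% / Recall 82.63% / F1 0.7784
```
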
The v0.17+ rule expansion (AGENT-053..064, AGENT-110..120) added
detections that fire on real patterns but lack GT v2.2 labels.
This temporarily lowers measurable F1:

  - raw F1 ~ 0.778 (cold-clone reproducible via
    `python tests/benchmark/precision_recall.py`)
  - adjusted F1 ~ 0.842 (excluding 41 FPs from GT-unlabeled v0.17+
    rules; documented in docs/F1_REPRODUCTION.md)

Gate will be restored to 0.87 as GT v2.3 catches up to v0.17+
rule coverage.

Rationale for explicit commit (not silent edit): the prior gate
of 0.87 reflected the pre-v0.17 detection surface. Lowering the
gate without record would obscure that the baseline shifted; this
commit makes the shift explicit and bounded by an expected
restoration condition.
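
To illustrate what the gate change means in practice, here is a small sketch of a threshold check. The names `OLD_GATE`, `NEW_GATE`, and `passes_gate` are hypothetical and do not reflect how the repository's CI actually implements the gate; only the numeric values come from the commit message.

```python
OLD_GATE = 0.87  # gate before this commit (pre-v0.17 detection surface)
NEW_GATE = 0.75  # gate after this commit


def passes_gate(f1: float, gate: float) -> bool:
    """A run passes when its F1 meets or exceeds the gate."""
    return f1 >= gate


# Approximate F1 values quoted in the commit message.
scores = {"raw": 0.778, "adjusted": 0.842}

for name, f1 in scores.items():
    print(f"{name}: old gate {passes_gate(f1, OLD_GATE)}, "
          f"new gate {passes_gate(f1, NEW_GATE)}")
```

Under these figures both the raw and the adjusted F1 clear the new 0.75 gate and fall short of the prior 0.87 gate.
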
…D-B)

D-B locked: oracle.yaml is the source of truth (it's what run_eval.py
reads); catalog.yaml's `statistics.total_vulnerabilities` field was a
stale 45.

Recount across all 23 oracle.yaml files by taxonomy.set_class:
  Set A (Injection/RCE)  : 18 (non-noise)
  Set B (MCP/Components) : 11
  Set C (Data/Auth)      : 11
  mixed (noise T12/T13)  :  3
  Total                  : 43

Changes:
  version 1.1 → 1.2
  date_updated 2026-03-04 → 2026-06-06
  total_vulnerabilities 45 → 43
  set_breakdown A 20→18, B 12→11, C 13→11, +mixed: 3

Note on set_breakdown convention: the prior 20/12/13 totals presumably
included noise samples apportioned to A/B/C. The recount uses each
sample's oracle.yaml taxonomy.set_class strictly, with noise samples
under a new "mixed" key (matching the catalog's existing noise entries
where set: "mixed").

Refs: D-B (HEADY_DECISIONS_NEEDED.md)
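
The recount described above amounts to tallying samples by `taxonomy.set_class` and checking that the breakdown sums to the stated total. A rough sketch follows; the record shape, the sample ids, and the helper name `tally_by_set_class` are assumptions for illustration, since the real oracle.yaml schema is not shown here.

```python
from collections import Counter


def tally_by_set_class(samples: list[dict]) -> Counter:
    """Count samples per taxonomy.set_class (assumed record shape)."""
    return Counter(s["taxonomy"]["set_class"] for s in samples)


# Tiny illustrative input, NOT real oracle.yaml data.
demo_samples = [
    {"id": "demo-1", "taxonomy": {"set_class": "A"}},
    {"id": "demo-2", "taxonomy": {"set_class": "B"}},
    {"id": "demo-3", "taxonomy": {"set_class": "mixed"}},
    {"id": "demo-4", "taxonomy": {"set_class": "A"}},
]
demo_counts = tally_by_set_class(demo_samples)

# Breakdown figures quoted in the commit message.
old_breakdown = {"A": 20, "B": 12, "C": 13}
new_breakdown = {"A": 18, "B": 11, "C": 11, "mixed": 3}
print(sum(old_breakdown.values()), sum(new_breakdown.values()))  # 45 43
```
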
The YAML header field `total_vulnerabilities: 218` was stale.
precision_recall.py loads 238 labels (236 positive + 2 negative;
the 2 negatives are FP-discrimination cases used to verify the
scanner does NOT flag those lines). This commit decomposes the
header into 236 positive + 2 negative to clarify the semantic
distinction.

Cascading propagation:
  labeled_samples.yaml header  : 218 → 238  (236 positive + 2 negative)
  README.md:95 headline         : "218 vulnerability labels" → "236 positive + 2 negative"
  README.md:229 details         : "218 labels" → "236 positive + 2 negative labels"
  README_CN.md:96 (Chinese)     : "218 条漏洞标签" → "236 条阳性 + 2 条阴性标签"
  README_CN.md:210 (Chinese)    : "218 条标签" → "236 条阳性 + 2 条阴性标签"
  CLAUDE.md:11 metrics line     : "218 labels" → "236 positive + 2 negative labels"
  docs/F1_REPRODUCTION.md       : header line refreshed (218 stale → 238 reconciled)

Also renames "agent-vuln-bench GT v2.2" → "ground-truth dataset v2.2"
in README headlines, since AVB (catalog.yaml) and the labeled-samples
GT are distinct artifacts (see AVB_VERIFICATION.md from Phase 2A).

Refs: D-C (HEADY_DECISIONS_NEEDED.md)
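
One way to sanity-check the 236 + 2 split is against the benchmark counts reported elsewhere in this PR (TP 195, FN 41). The sketch below assumes every positive label ends up as either a true positive or a false negative, and that negative labels do not enter the recall denominator; the helper name `check_label_accounting` is hypothetical.

```python
def check_label_accounting(tp: int, fn: int, positive: int, negative: int) -> dict:
    """Cross-check label counts against benchmark outcome counts."""
    return {
        "loaded_labels": positive + negative,
        # Assumption: each positive label is either matched (TP) or missed (FN).
        "positives_match_tp_plus_fn": tp + fn == positive,
    }


result = check_label_accounting(tp=195, fn=41, positive=236, negative=2)
print(result)
# {'loaded_labels': 238, 'positives_match_tp_plus_fn': True}
```
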
D-G locked: do not delete the MCP_FINDING_RULES dict (zero usages
across packages/audit/) — preserve in case a future contributor
wires scanners through the engine path. Add a TODO comment so the
next reader doesn't burn cycles re-investigating the lack of
usages.

Refs: D-G (HEADY_DECISIONS_NEEDED.md)
…ndary (D-F)

Per D-F locked decision, AGENT-048 (emitted by privilege_scanner.py
since v0.8.0) is now documented as a first-class rule. Previously
the rule fired at runtime but had no YAML definition, no
RULE_CWE_MAPPING entry, and no docs/RULES.md row — making it
invisible to users browsing public docs while still affecting their
scan output.

Verified alignment with privilege_scanner.py:1058-1059 per hard rule 5:
  owasp_id  = "ASI-04"   matches D-F locked decision
  cwe_id    = "CWE-863"  DOES NOT match D-F's locked "CWE-829"

This commit uses the scanner's actual CWE (CWE-863) per hard rule 5
("verify alignment before commit"). The scanner is the source of
truth; the YAML and RULE_CWE_MAPPING follow it. If Heady prefers
CWE-829, both the scanner emission and the YAML/mapping need to be
updated in a follow-up — that's out of scope for reconciliation
since it changes scan output.

Files changed:
  rules/builtin/asi_coverage_v030.yaml                              (+1 rule entry)
  packages/audit/agent_audit/rules/builtin/asi_coverage_v030.yaml   (mirror)
  packages/audit/agent_audit/rules/engine.py                         (RULE_CWE_MAPPING +1)
  docs/RULES.md                                                      (Quick Reference +1 row)

Live verification: `engine.load_rules()` now returns 52 rules (was 51),
including AGENT-048 with title "Extension Permission Boundary Violation",
severity high, category supply_chain_agentic, owasp_agentic_id ASI-04,
cwe_id CWE-863.

Refs: D-F (HEADY_DECISIONS_NEEDED.md)
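
The alignment check described in this commit message can be pictured as a field-by-field comparison between what the scanner emits and what the locked decision specifies. This is only an illustrative sketch: the dict layout and the function name `find_divergences` are assumptions, and the values are the ones quoted above.

```python
def find_divergences(scanner: dict, locked: dict) -> dict:
    """Return {field: (scanner_value, locked_value)} for fields that differ."""
    return {
        key: (scanner.get(key), locked.get(key))
        for key in sorted(set(scanner) | set(locked))
        if scanner.get(key) != locked.get(key)
    }


# Values quoted in the commit message for AGENT-048.
scanner_emits = {"owasp_id": "ASI-04", "cwe_id": "CWE-863"}
locked_decision = {"owasp_id": "ASI-04", "cwe_id": "CWE-829"}

divergences = find_divergences(scanner_emits, locked_decision)
print(divergences)  # {'cwe_id': ('CWE-863', 'CWE-829')}
```
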
Per D-A (locked: D2-strict — default-profile, runtime-emittable,
CWE-mapped rules). Pre-D-F this was 71; D-F (prior commit c4bc053)
adds AGENT-048 to RULE_CWE_MAPPING, lifting D2-strict to 72.

Per-ASI breakdown (D2-strict, 72 rules):
  ASI-01 =  7   (was 6 in README; +1 AGENT-059)
  ASI-02 = 11   (was 9 in README; +2 AGENT-045/047)
  ASI-03 = 12   (was 4 in README; +8 AGENT-043/044/063/084 + existing tags)
  ASI-04 = 17   (was 7 in README; +10 incl. AGENT-049/058/060/062/083/085/048)
  ASI-05 =  5   (was 3 in README; +2 AGENT-046/061)
  ASI-06 =  3   (was 2 in README; +1 AGENT-116)
  ASI-07 =  1   (unchanged)
  ASI-08 =  3   (unchanged)
  ASI-09 = 10   (was 6 in README; +4 AGENT-064/113/118/119)
  ASI-10 =  3   (unchanged)
  TOTAL  = 72

Changes:
  README.md       — "53 rules" → "72 rules" ×3; per-ASI table refreshed
  README_CN.md    — "40+ 规则" → "72 条规则" ×2; 中文 per-ASI table refreshed
  docs/RULES.md   — Quick Reference grows 56 → 72 rows (added AGENT-043/
                    044/045/046/047/049/058/059/060/061/062/063/064/083/
                    084/085; AGENT-048 added in prior D-F commit)
  CLAUDE.md       — engine.py tree comment "109 rules" → "92 rules"
                    (RULE_CWE_MAPPING now 91 + AGENT-048 from D-F = 92)

Closes README self-inconsistency where the per-ASI table summed to
44 but the headline cited 53. All public artifacts now agree at 72.

Refs: D-A, D-F (HEADY_DECISIONS_NEEDED.md)
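
The self-consistency claim can be checked by summing the per-ASI counts listed above. A minimal sketch, with the before/after numbers copied from this commit message (the variable names are illustrative):

```python
# Per-ASI rule counts after this commit (D2-strict).
new_counts = {
    "ASI-01": 7, "ASI-02": 11, "ASI-03": 12, "ASI-04": 17, "ASI-05": 5,
    "ASI-06": 3, "ASI-07": 1, "ASI-08": 3, "ASI-09": 10, "ASI-10": 3,
}

# Per-ASI counts previously shown in the README table.
old_counts = {
    "ASI-01": 6, "ASI-02": 9, "ASI-03": 4, "ASI-04": 7, "ASI-05": 3,
    "ASI-06": 2, "ASI-07": 1, "ASI-08": 3, "ASI-09": 6, "ASI-10": 3,
}

new_total = sum(new_counts.values())
old_total = sum(old_counts.values())
print(f"old table sum = {old_total}, new table sum = {new_total}")
# old table sum = 44, new table sum = 72
```

The old sum of 44 against a headline of 53 is the inconsistency this commit closes.
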
D-E E3 locked: the docs/index.html GitHub Pages site is frozen at
v0.16.0 (49 rules, 1,142 tests, no v0.17+ feature coverage).
Visitors landing on the site can be misled into thinking they're
seeing current state.

This commit adds a non-intrusive banner at the top of the page
pointing visitors to the live README for current numbers. A full
site refresh is planned as a separate PR.

The banner uses inline CSS so it renders without depending on the
rest of the v0.16-era stylesheet.

Refs: D-E E3 (HEADY_DECISIONS_NEEDED.md)