
feat: adversarial benchmark harness: 50 payloads, 27 tests, policy matrix coverage #12

Open
Pranjal0410 wants to merge 1 commit into c2siorg:main from Pranjal0410:feat/benchmark-harness

Conversation


@Pranjal0410 commented on Mar 20, 2026

What this PR adds

A benchmark harness that evaluates scanner performance against curated adversarial and benign datasets, producing precision/recall/F1 metrics, per-category breakdowns, and latency statistics.

This gives the team a regression safety net — any scanner or policy change can be validated against the benchmark before merging.

Why this matters

Scanners are being built right now (#10, #5), but we have no way to measure how well they actually perform. This harness answers: "What's the detection rate? What's the false positive rate? Which attack categories are we missing?"

Policy matrix alignment

The dataset is mapped to the v1 detection policies from the taxonomy matrix:

| Policy (taxonomy matrix) | Scope | Dataset category |
| --- | --- | --- |
| Instruction override detection | on_prompt | instruction_override (6 payloads) |
| Role escalation detection | on_prompt | role_hijack (4 payloads) |
| Jailbreak pattern library | on_prompt | data_exfiltration (5 payloads) |
| Embedded instruction detection | on_context | context_manipulation (3 payloads) |
| Parameter injection scan | on_tool_call | tool_abuse (3 payloads) |
| Write-time content scan | on_memory | memory_poisoning (2 payloads) |

Tests explicitly validate this coverage (`TestPolicyMatrixCoverage`).

Components

Benchmark Runner (`acf_sdk/benchmarks/harness.py`)

  • Scanner-agnostic — accepts any callable that takes `ScanInput` and returns an action (see the usage sketch after this list)
  • Confusion matrix, precision, recall, F1, accuracy
  • Per-category breakdown with recall and average risk scores
  • Latency percentiles (p50/p95/p99)
  • JSON export for CI integration
  • Quality gate — exits non-zero if F1 drops below threshold
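To make the contract concrete, here is a minimal usage sketch. The `load_dataset` accessor, the `BenchmarkRunner` constructor argument, the `ScanInput.text` field, and the report attribute shown here are assumptions based on this description, not the verbatim `harness.py` API:

```python
# Hedged sketch: plugging a toy scanner into the harness.
# `load_dataset`, the `scanner=` constructor argument, and the report
# field names are assumptions, not the verbatim acf_sdk API.
from acf_sdk.benchmarks.harness import BenchmarkRunner
from acf_sdk.benchmarks.dataset import load_dataset  # accessor name assumed

def keyword_scanner(scan_input):
    """Toy scanner: block anything that mentions 'ignore previous'."""
    return "block" if "ignore previous" in scan_input.text.lower() else "allow"

runner = BenchmarkRunner(scanner=keyword_scanner)
report = runner.run(load_dataset())
print(report.f1)  # attribute name assumed
```

Because the runner only needs a callable, the same harness can score the TF-IDF backend, the sentence-transformer backend, or any future scanner without modification.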

Curated Dataset (`acf_sdk/benchmarks/dataset.py`)

  • 25 malicious payloads across 7 categories
  • 25 benign payloads including 10 hard negatives ("ignore previous commits", "system prompt debugging", "override a method in Python")
  • Multiple input types: prompts, RAG documents, tool outputs, memory writes
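For orientation, each payload record presumably carries at least an ID, text, category, input type, and a ground-truth label. The field names below are illustrative, not the actual `dataset.py` schema:

```python
# Illustrative record shape only; not the actual dataset.py schema.
from dataclasses import dataclass

@dataclass
class Payload:
    id: str             # unique ID (checked by the dataset integrity tests)
    text: str           # the prompt / RAG document / tool output / memory write
    category: str       # e.g. "instruction_override", "role_hijack", "benign"
    input_type: str     # e.g. "prompt", "rag_document", "tool_output", "memory_write"
    is_malicious: bool  # ground-truth label for the confusion matrix

# A hard negative from the benign set: superficially attack-like, must not trigger.
hard_negative = Payload(
    id="benign-017",
    text="How do I override a method in Python?",
    category="benign",
    input_type="prompt",
    is_malicious=False,
)
```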

CLI Runner (`acf_sdk/benchmarks/run_benchmark.py`)

```
python -m acf_sdk.benchmarks.run_benchmark
python -m acf_sdk.benchmarks.run_benchmark --backend sentence-transformer
python -m acf_sdk.benchmarks.run_benchmark --output results.json
```
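Since the report exports to JSON, a CI step can also gate on it directly. A hedged sketch, assuming the export contains an `f1` key (the actual `BenchmarkReport` schema may name things differently, and the threshold here is illustrative):

```python
# Hedged sketch: fail a CI job if the exported F1 drops below a floor.
# The "f1" key is an assumption about the BenchmarkReport JSON schema.
import json
import sys

with open("results.json") as f:
    report = json.load(f)

if report.get("f1", 0.0) < 0.50:  # threshold illustrative
    print("F1 below threshold; failing the build")
    sys.exit(1)
```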

Sample Output (TF-IDF backend)

```
============================================================
ACF-SDK BENCHMARK REPORT
Payloads: 50 (25 malicious, 25 benign)
TP: 12  FP: 8  TN: 17  FN: 13
Precision: 0.6000
Recall:    0.4800
F1 Score:  0.5333
Latency — mean: 0.53ms  p50: 0.56ms  p95: 0.62ms  p99: 0.78ms

Category               Total  TP  FN  Recall  Avg Score
instruction_override       6   2   4  0.3333     0.6537
data_exfiltration          5   5   0  1.0000     0.9374
role_hijack                4   4   0  1.0000     0.9075
context_manipulation       3   0   3  0.0000     0.6441
tool_abuse                 3   1   2  0.3333     0.6815
memory_poisoning           2   0   2  0.0000     0.5344
encoding_evasion           2   0   2  0.0000     0.5756
```

The TF-IDF backend catches data exfiltration and role hijack well (100% recall) but misses context manipulation and encoding evasion — expected, since TF-IDF relies on keyword overlap. The sentence-transformer backend will improve these numbers. The harness makes this gap visible and measurable.
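As a sanity check, the headline numbers follow directly from the confusion matrix in the report:

```python
# Recomputing the headline metrics from the reported confusion matrix.
tp, fp, tn, fn = 12, 8, 17, 13

precision = tp / (tp + fp)   # 12 / 20 = 0.6000
recall = tp / (tp + fn)      # 12 / 25 = 0.4800
f1 = 2 * precision * recall / (precision + recall)  # 0.5333
accuracy = (tp + tn) / (tp + fp + tn + fn)          # 29 / 50 = 0.58

assert round(f1, 4) == 0.5333
```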

Tests — 27 passing

```
pytest tests/test_benchmark.py -v   # 27 passed
```
  • Dataset integrity (unique IDs, required fields, balance)
  • Report correctness (confusion matrix sums, metric bounds)
  • Per-category breakdown
  • Latency stats
  • JSON export validity
  • F1 quality gate
  • Policy matrix coverage (validates the dataset covers the v1 detection policies; sketched after this list)
  • Dataset balance + hard negative coverage
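For illustration, the policy coverage check can be as simple as the following. The `load_dataset` accessor and the `category`/`is_malicious` field names are hypothetical; the actual `TestPolicyMatrixCoverage` in `tests/test_benchmark.py` may be structured differently:

```python
# Hypothetical sketch; field and accessor names are assumptions.
from collections import Counter

from acf_sdk.benchmarks.dataset import load_dataset  # accessor name assumed

# Expected malicious payload counts, taken from the policy matrix table above.
EXPECTED = {
    "instruction_override": 6,
    "role_hijack": 4,
    "data_exfiltration": 5,
    "context_manipulation": 3,
    "tool_abuse": 3,
    "memory_poisoning": 2,
}

def test_policy_matrix_coverage():
    counts = Counter(
        p.category for p in load_dataset() if p.is_malicious
    )
    for category, expected in EXPECTED.items():
        assert counts[category] == expected, category
```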

Files

```
acf_sdk/benchmarks/
├── __init__.py
├── harness.py          # BenchmarkRunner + BenchmarkReport
├── dataset.py          # 50 curated payloads
└── run_benchmark.py    # CLI entry point
tests/
└── test_benchmark.py   # 27 tests
```

Next steps
