feat: Step 15 — benchmark campaign tooling and results structure

PSchmitz-Valckenberg · claude · PSchmitz-Valckenberg · commit 74697be88736 · 2026-04-26T00:46:01.000+02:00
Sets up everything needed to run and record a reproducible 4-strategy
benchmark campaign, without changing the evaluation data model.

- scripts/run_benchmark.sh — automates the full campaign:
    1. creates a benchmark with NONE + INPUT_FILTER + INPUT_OUTPUT +
       PROMPT_HARDENING for a configurable model (default gemini-2.0-flash,
       --model flag to switch to Anthropic)
    2. executes it synchronously (curl timeout of 10 minutes)
    3. fetches the report and saves all three JSON responses to a
       timestamped results/<timestamp>_<model>/ subdirectory
    4. prints a human-readable summary table to stdout
- results/README.md — explains the directory layout and how to reproduce
- .gitignore — excludes results/*/ from version control so raw API
  responses (which may contain prompt text) stay local; the summary
  numbers go into README.md manually
- README.md — adds a "Benchmark Results" section with the table structure
  ready to fill in after the first live run; includes the one-liner to
  reproduce
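
The Δ columns of the README table could be derived from the summary
table the script prints instead of being computed by hand. A minimal
sketch, assuming the report contains a row labelled NONE and that the
rate columns are plain numbers; add_deltas is a hypothetical helper
and is not part of this commit:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this commit): reads the summary table
# printed by run_benchmark.sh on stdin and prints, per strategy, the change
# in AttackSuccess and FalsePositive relative to the NONE baseline row.
add_deltas() {
  awk '
    NR == 1 || NF < 3 { next }        # skip the header row and blank lines
    { n++; name[n] = $1; asr[n] = $2; fpr[n] = $3 }
    $1 == "NONE" { base_asr = $2; base_fpr = $3; have_base = 1 }
    END {
      if (!have_base) {
        # Assumption: the undefended baseline is reported as strategy NONE.
        print "no NONE baseline row found" > "/dev/stderr"
        exit 1
      }
      for (i = 1; i <= n; i++)
        printf "%s\t%+.2f\t%+.2f\n", name[i], asr[i] - base_asr, fpr[i] - base_fpr
    }'
}
```

Piping only the table portion of the script output into add_deltas
yields one tab-separated line per strategy; a negative value means the
rate dropped relative to the baseline.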

The campaign is intentionally run once per model / case-suite version.
caseSuiteFingerprint on each EvaluationRun guarantees that a stored
benchmark report is only compared against runs with the same fingerprint,
so results remain methodologically sound even if the case suite grows later.
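
When comparing saved reports by hand, the same rule can be checked
locally before any numbers are put side by side. A rough sketch,
assuming the fingerprints have already been extracted one per line;
same_fingerprint is a hypothetical helper, and the JSON path in the
usage comment is a guess at the report layout, not something this
commit defines:

```shell
#!/usr/bin/env bash
# Hypothetical guard (not part of this commit): succeeds only when every
# non-empty line on stdin carries the same fingerprint value.
same_fingerprint() {
  awk 'NF { seen[$0] = 1 }
       END { n = 0; for (k in seen) n++; exit (n == 1 ? 0 : 1) }'
}

# Possible usage; the .runs[].caseSuiteFingerprint path is an assumption:
#   jq -r '.runs[].caseSuiteFingerprint' a/03_report.json b/03_report.json \
#     | same_fingerprint || echo "case suites differ, do not compare" >&2
```

An empty input also fails the check, so a report without fingerprints
is treated as not comparable.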

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
diff --git a/.gitignore b/.gitignore
@@ -27,3 +27,6 @@ target/
 # Local secrets — never commit
 src/main/resources/application-local.yml
 .env
+
+# Benchmark result files — raw JSON stays local, numbers go into README
+results/*/
diff --git a/README.md b/README.md
@@ -125,6 +125,23 @@ curl -s http://localhost:8080/api/runs/RUN_ID/results | jq .
 curl -s http://localhost:8080/api/runs/RUN_ID/report | jq .
 ```
 
+## Benchmark Results
+
+Results of a full 4-strategy evaluation campaign on `gemini-2.0-flash` against the 10-case attack suite.
+Δ columns show change relative to the undefended baseline (negative = improvement).
+
+| Strategy | Attack Success ↓ | Δ | False Positive ↓ | Δ | Refusal Rate | Avg Latency (ms) |
+|---|---|---|---|---|---|---|
+| `NONE` (baseline) | — | — | — | — | — | — |
+| `INPUT_FILTER` | — | — | — | — | — | — |
+| `INPUT_OUTPUT` | — | — | — | — | — | — |
+| `PROMPT_HARDENING` | — | — | — | — | — | — |
+
+> Results will be filled in after the first live benchmark run. To reproduce:
+> ```bash
+> ./scripts/run_benchmark.sh
+> ```
+
 ## Metrics Explained
 
 | Metric | Definition |
diff --git a/results/README.md b/results/README.md
@@ -0,0 +1,28 @@
+# Benchmark Results
+
+This directory stores the raw JSON outputs from SentinelCore benchmark campaigns.
+
+Each campaign run creates a timestamped subdirectory:
+
+```
+results/
+  20260426_143022_gemini-2.0-flash/
+    01_create.json    — benchmark creation response
+    02_execute.json   — execution summary
+    03_report.json    — full metrics report with per-strategy breakdown
+```
+
+## Running a campaign
+
+```bash
+# Make sure PostgreSQL and the app are running, then:
+./scripts/run_benchmark.sh
+
+# To run against Anthropic Claude instead:
+./scripts/run_benchmark.sh --model claude-haiku-4-5-20251001
+```
+
+## Result files are gitignored
+
+Actual result JSON files are excluded from version control (see `.gitignore`).
+The summary numbers are published in the main [README](../README.md).
diff --git a/scripts/run_benchmark.sh b/scripts/run_benchmark.sh
@@ -0,0 +1,91 @@
+#!/usr/bin/env bash
+# run_benchmark.sh — runs a full 4-strategy benchmark campaign against a
+# locally running SentinelCore instance and saves the report to results/.
+#
+# Prerequisites:
+#   - docker compose up -d          (PostgreSQL)
+#   - mvn spring-boot:run -Dspring-boot.run.profiles=local   (app on :8080)
+#   - jq installed (brew install jq)
+#
+# Usage:
+#   ./scripts/run_benchmark.sh
+#   ./scripts/run_benchmark.sh --model gemini-2.0-flash
+#   ./scripts/run_benchmark.sh --model claude-haiku-4-5-20251001
+
+set -euo pipefail
+
+BASE_URL="http://localhost:8080"
+MODEL="gemini-2.0-flash"
+RESULTS_DIR="$(dirname "$0")/../results"
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --model) MODEL="${2:?--model requires a value}"; shift 2 ;;
+    *) echo "Unknown argument: $1" >&2; exit 1 ;;
+  esac
+done
+
+command -v jq >/dev/null 2>&1 || { echo "jq is required but not installed. Run: brew install jq" >&2; exit 1; }
+curl -sf "$BASE_URL/actuator/health" >/dev/null 2>&1 \
+  || curl -sf "$BASE_URL/v3/api-docs" >/dev/null 2>&1 \
+  || { echo "SentinelCore does not appear to be running on $BASE_URL" >&2; exit 1; }
+
+TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
+OUT_DIR="$RESULTS_DIR/${TIMESTAMP}_${MODEL}"
+mkdir -p "$OUT_DIR"
+
+echo "=== SentinelCore Benchmark Campaign ==="
+echo "Model:      $MODEL"
+echo "Output dir: $OUT_DIR"
+echo ""
+
+# ── 1. Create benchmark ──────────────────────────────────────────────────────
+echo "[1/3] Creating benchmark..."
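+PAYLOAD=$(jq -n --arg model "$MODEL" \
+  '{model: $model, strategyTypes: ["INPUT_FILTER", "INPUT_OUTPUT", "PROMPT_HARDENING"]}')
+# jq builds the request body so a model name containing quotes cannot break the JSON.
+CREATE_RESPONSE=$(curl -sf -X POST "$BASE_URL/api/benchmarks" \
+  -H "Content-Type: application/json" \
+  -d "$PAYLOAD")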
+
+BENCHMARK_ID=$(echo "$CREATE_RESPONSE" | jq -r '.benchmarkId')
+echo "      Benchmark ID: $BENCHMARK_ID"
+echo "$CREATE_RESPONSE" | jq . > "$OUT_DIR/01_create.json"
+
+# ── 2. Execute benchmark (synchronous — may take several minutes) ─────────────
+echo "[2/3] Executing benchmark (runs all cases for each strategy — please wait)..."
+EXECUTE_RESPONSE=$(curl -sf -X POST "$BASE_URL/api/benchmarks/$BENCHMARK_ID/execute" \
+  --max-time 600)
+
+STATUS=$(echo "$EXECUTE_RESPONSE" | jq -r '.status')
+echo "      Status: $STATUS"
+echo "$EXECUTE_RESPONSE" | jq . > "$OUT_DIR/02_execute.json"
+
+if [[ "$STATUS" != "COMPLETED" ]]; then
+  echo "Benchmark did not complete successfully (status=$STATUS). Check $OUT_DIR/02_execute.json."
+  exit 1
+fi
+
+# ── 3. Fetch report ───────────────────────────────────────────────────────────
+echo "[3/3] Fetching report..."
+REPORT=$(curl -sf "$BASE_URL/api/benchmarks/$BENCHMARK_ID/report")
+echo "$REPORT" | jq . > "$OUT_DIR/03_report.json"
+
+# ── Summary table ─────────────────────────────────────────────────────────────
+echo ""
+echo "=== Results ==="
+echo "$REPORT" | jq -r '
+  ["Strategy", "AttackSuccess", "FalsePositive", "Refusal", "AvgLatencyMs"],
+  (.runs[] | [
+    .strategyType,
+    (.metrics.metrics.attackSuccessRate | tostring),
+    (.metrics.metrics.falsePositiveRate | tostring),
+    (.metrics.metrics.refusalRate | tostring),
+    (.metrics.metrics.avgLatencyMs | tostring)
+  ]) | @tsv' | column -t
+
+echo ""
+echo "Full report saved to: $OUT_DIR/03_report.json"
+echo ""
+echo "To add these results to the README, copy the numbers from the table above"
+echo "into the 'Benchmark Results' section in README.md."