Commit 74697be

feat: Step 15 — benchmark campaign tooling and results structure
Sets up everything needed to run and record a reproducible 4-strategy benchmark campaign, without changing the evaluation data model.

- scripts/run_benchmark.sh: automates the full campaign:
  1. creates a benchmark with NONE + INPUT_FILTER + INPUT_OUTPUT + PROMPT_HARDENING for a configurable model (default gemini-2.0-flash, --model flag to switch to Anthropic)
  2. executes it synchronously (up to 10min timeout)
  3. fetches the report and saves all three JSON responses to a timestamped results/<timestamp>_<model>/ subdirectory
  4. prints a human-readable summary table to stdout
- results/README.md: explains the directory layout and how to reproduce
- .gitignore: excludes results/*/ from version control so raw API responses (which may contain prompt text) stay local; the summary numbers go into README.md manually
- README.md: adds a "Benchmark Results" section with the table structure ready to fill in after the first live run; includes the one-liner to reproduce

The campaign is intentionally run once per model / case-suite version. caseSuiteFingerprint on each EvaluationRun guarantees that a stored benchmark report is only compared against runs with the same fingerprint, so results remain methodologically sound even if the case suite grows later.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 1cacd4e commit 74697be
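
The fingerprint rule from the commit message can also be checked by hand before comparing saved reports. The sketch below assumes each report exposes the fingerprint as `.runs[].caseSuiteFingerprint`; that JSON path is a guess, since the commit only says the field lives on each EvaluationRun, and both function names are hypothetical.

```shell
# Hypothetical helpers. The JSON path .runs[].caseSuiteFingerprint is an
# assumption about the report layout, not something the commit confirms.

# Print the distinct case-suite fingerprints found in the given report files.
distinct_fingerprints() {
  jq -r '.runs[].caseSuiteFingerprint' "$@" | sort -u
}

# Succeed only if every run in every given report shares one fingerprint.
reports_comparable() {
  local count
  count=$(distinct_fingerprints "$@" | wc -l | tr -d ' ')
  [[ "$count" -eq 1 ]]
}
```

For example, `reports_comparable results/*/03_report.json || echo "case suite changed between runs"` would flag reports that should not be placed side by side.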

4 files changed

Lines changed: 139 additions & 0 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -27,3 +27,6 @@ target/
 # Local secrets — never commit
 src/main/resources/application-local.yml
 .env
+
+# Benchmark result files — raw JSON stays local, numbers go into README
+results/*/

README.md

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -125,6 +125,23 @@ curl -s http://localhost:8080/api/runs/RUN_ID/results | jq .
125125
curl -s http://localhost:8080/api/runs/RUN_ID/report | jq .
126126
```
127127

128+
## Benchmark Results
129+
130+
Results of a full 4-strategy evaluation campaign on `gemini-2.0-flash` against the 10-case attack suite.
131+
Δ columns show change relative to the undefended baseline (negative = improvement).
132+
133+
| Strategy | Attack Success ↓ | Δ | False Positive ↑ | Δ | Refusal Rate | Avg Latency (ms) |
134+
|---|---|---|---|---|---|---|
135+
| `NONE` (baseline) |||||||
136+
| `INPUT_FILTER` |||||||
137+
| `INPUT_OUTPUT` |||||||
138+
| `PROMPT_HARDENING` |||||||
139+
140+
> Results will be filled in after the first live benchmark run. To reproduce:
141+
> ```bash
142+
> ./scripts/run_benchmark.sh
143+
> ```
144+
128145
## Metrics Explained
129146
130147
| Metric | Definition |

results/README.md

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
# Benchmark Results
2+
3+
This directory stores the raw JSON outputs from SentinelCore benchmark campaigns.
4+
5+
Each campaign run creates a timestamped subdirectory:
6+
7+
```
8+
results/
9+
20260426_143022_gemini-2.0-flash/
10+
01_create.json — benchmark creation response
11+
02_execute.json — execution summary
12+
03_report.json — full metrics report with per-strategy breakdown
13+
```
14+
15+
## Running a campaign
16+
17+
```bash
18+
# Make sure PostgreSQL and the app are running, then:
19+
./scripts/run_benchmark.sh
20+
21+
# To run against Anthropic Claude instead:
22+
./scripts/run_benchmark.sh --model claude-haiku-4-5-20251001
23+
```
24+
25+
## Result files are gitignored
26+
27+
Actual result JSON files are excluded from version control (see `.gitignore`).
28+
The summary numbers are published in the main [README](../README.md).
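
Because the subdirectory names begin with a `YYYYMMDD_HHMMSS` timestamp, they sort chronologically, so the newest campaign for a model can be found with a plain glob. This is a minimal sketch; the helper name `latest_run_dir` is hypothetical and not part of the commit.

```shell
# Hypothetical helper: print the newest results/<timestamp>_<model> directory.
# Relies on the shell expanding globs in sorted order, so the last match is
# the most recent timestamp.
latest_run_dir() {
  local results_dir="$1" model="$2" latest="" dir
  for dir in "$results_dir"/*_"$model"/; do
    [[ -d "$dir" ]] || continue   # an unmatched glob stays literal; skip it
    latest="${dir%/}"
  done
  [[ -n "$latest" ]] || return 1  # no campaign recorded for this model
  printf '%s\n' "$latest"
}
```

Usage might look like `jq . "$(latest_run_dir results gemini-2.0-flash)/03_report.json"`.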

scripts/run_benchmark.sh

Lines changed: 91 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,91 @@
1+
#!/usr/bin/env bash
2+
# run_benchmark.sh — runs a full 4-strategy benchmark campaign against a
3+
# locally running SentinelCore instance and saves the report to results/.
4+
#
5+
# Prerequisites:
6+
# - docker compose up -d (PostgreSQL)
7+
# - mvn spring-boot:run -Dspring-boot.run.profiles=local (app on :8080)
8+
# - jq installed (brew install jq)
9+
#
10+
# Usage:
11+
# ./scripts/run_benchmark.sh
12+
# ./scripts/run_benchmark.sh --model gemini-2.0-flash
13+
# ./scripts/run_benchmark.sh --model claude-haiku-4-5-20251001
14+
15+
set -euo pipefail
16+
17+
BASE_URL="http://localhost:8080"
18+
MODEL="gemini-2.0-flash"
19+
RESULTS_DIR="$(dirname "$0")/../results"
20+
21+
while [[ $# -gt 0 ]]; do
22+
case "$1" in
23+
--model) MODEL="$2"; shift 2 ;;
24+
*) echo "Unknown argument: $1"; exit 1 ;;
25+
esac
26+
done
27+
28+
command -v jq >/dev/null 2>&1 || { echo "jq is required but not installed. Run: brew install jq"; exit 1; }
29+
curl -sf "$BASE_URL/actuator/health" >/dev/null 2>&1 \
30+
|| curl -sf "$BASE_URL/v3/api-docs" >/dev/null 2>&1 \
31+
|| { echo "SentinelCore does not appear to be running on $BASE_URL"; exit 1; }
32+
33+
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
34+
OUT_DIR="$RESULTS_DIR/${TIMESTAMP}_${MODEL}"
35+
mkdir -p "$OUT_DIR"
36+
37+
echo "=== SentinelCore Benchmark Campaign ==="
38+
echo "Model: $MODEL"
39+
echo "Output dir: $OUT_DIR"
40+
echo ""
41+
42+
# ── 1. Create benchmark ──────────────────────────────────────────────────────
43+
echo "[1/3] Creating benchmark..."
44+
CREATE_RESPONSE=$(curl -sf -X POST "$BASE_URL/api/benchmarks" \
45+
-H "Content-Type: application/json" \
46+
-d "{
47+
\"model\": \"$MODEL\",
48+
\"strategyTypes\": [\"INPUT_FILTER\", \"INPUT_OUTPUT\", \"PROMPT_HARDENING\"]
49+
}")
50+
51+
BENCHMARK_ID=$(echo "$CREATE_RESPONSE" | jq -r '.benchmarkId')
52+
echo " Benchmark ID: $BENCHMARK_ID"
53+
echo "$CREATE_RESPONSE" | jq . > "$OUT_DIR/01_create.json"
54+
55+
# ── 2. Execute benchmark (synchronous — may take several minutes) ─────────────
56+
echo "[2/3] Executing benchmark (runs all cases for each strategy — please wait)..."
57+
EXECUTE_RESPONSE=$(curl -sf -X POST "$BASE_URL/api/benchmarks/$BENCHMARK_ID/execute" \
58+
--max-time 600)
59+
60+
STATUS=$(echo "$EXECUTE_RESPONSE" | jq -r '.status')
61+
echo " Status: $STATUS"
62+
echo "$EXECUTE_RESPONSE" | jq . > "$OUT_DIR/02_execute.json"
63+
64+
if [[ "$STATUS" != "COMPLETED" ]]; then
65+
echo "Benchmark did not complete successfully (status=$STATUS). Check $OUT_DIR/02_execute.json."
66+
exit 1
67+
fi
68+
69+
# ── 3. Fetch report ───────────────────────────────────────────────────────────
70+
echo "[3/3] Fetching report..."
71+
REPORT=$(curl -sf "$BASE_URL/api/benchmarks/$BENCHMARK_ID/report")
72+
echo "$REPORT" | jq . > "$OUT_DIR/03_report.json"
73+
74+
# ── Summary table ─────────────────────────────────────────────────────────────
75+
echo ""
76+
echo "=== Results ==="
77+
echo "$REPORT" | jq -r '
78+
["Strategy", "AttackSuccess", "FalsePositive", "Refusal", "AvgLatencyMs"],
79+
(.runs[] | [
80+
.strategyType,
81+
(.metrics.metrics.attackSuccessRate | tostring),
82+
(.metrics.metrics.falsePositiveRate | tostring),
83+
(.metrics.metrics.refusalRate | tostring),
84+
(.metrics.metrics.avgLatencyMs | tostring)
85+
]) | @tsv' | column -t
86+
87+
echo ""
88+
echo "Full report saved to: $OUT_DIR/03_report.json"
89+
echo ""
90+
echo "To add these results to the README, copy the numbers from the table above"
91+
echo "into the 'Benchmark Results' section in README.md."
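
The manual copy step could be scripted. The sketch below turns the tab-separated summary (jq's `@tsv` output, before it is piped through `column -t`) into rows for the README table, computing each Δ as the strategy's value minus the `NONE` baseline. The function name `tsv_to_readme_rows` is hypothetical, and it assumes the report contains a `NONE` row; if it does not, the Δ cells are left as `-`.

```shell
# Hypothetical helper: read the TSV summary on stdin (header row, then one row
# per strategy) and print Markdown rows matching the README table columns:
# Strategy | Attack Success | Δ | False Positive | Δ | Refusal Rate | Avg Latency
tsv_to_readme_rows() {
  awk -F'\t' '
    NR == 1 { next }   # skip the header row
    {
      name[NR] = $1; asr[NR] = $2; fpr[NR] = $3; ref[NR] = $4; lat[NR] = $5
      if ($1 == "NONE") { base_asr = $2; base_fpr = $3; have_base = 1 }
    }
    END {
      for (i = 2; i <= NR; i++) {
        if (have_base && name[i] != "NONE") {
          # Delta relative to the undefended baseline; negative means lower.
          d_asr = sprintf("%+.2f", asr[i] - base_asr)
          d_fpr = sprintf("%+.2f", fpr[i] - base_fpr)
        } else {
          d_asr = "-"; d_fpr = "-"
        }
        printf "| `%s` | %s | %s | %s | %s | %s | %s |\n",
          name[i], asr[i], d_asr, fpr[i], d_fpr, ref[i], lat[i]
      }
    }'
}
```

Rounding the deltas to two decimals is a presentation choice made for this sketch; the README does not specify a precision.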
