feat: Step 17 — repetitions and confidence intervals for benchmarks by PSchmitz-Valckenberg · Pull Request #17 · PSchmitz-Valckenberg/sentinelcore

PSchmitz-Valckenberg · 2026-04-27T00:16:06Z

Adds statistical rigor to the benchmark pipeline. Each strategy can now be run N times; the report exposes mean and stddev per metric so that noise-floor effects (e.g. a 10% attack-success delta measured at N=1) can be distinguished from real differences.

Schema (V6):

- benchmarks.repetitions INTEGER NOT NULL DEFAULT 1
- benchmark_runs PK changed from (benchmark_id, strategy_type) to (benchmark_id, strategy_type, repetition_index), which allows N rows per strategy per benchmark
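
To illustrate why the key had to grow, here is a minimal in-memory sketch, not the migration itself. The record names, the `long` id type, and the map standing in for the table are all assumptions made for illustration. A real database would reject the duplicate key rather than overwrite the row, but the effect is the same: the old two-column key leaves room for only one row per strategy.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the benchmark_runs primary key before and after V6. */
final class RunKeySketch {

    // Old key: one row per (benchmark, strategy).
    record OldKey(long benchmarkId, String strategyType) {}

    // New key: one row per (benchmark, strategy, repetition).
    record NewKey(long benchmarkId, String strategyType, int repetitionIndex) {}

    /** Counts how many distinct rows survive when N repetitions use the old key. */
    static int distinctOldKeys(long benchmarkId, String strategy, int repetitions) {
        Map<OldKey, Integer> rows = new HashMap<>();
        for (int rep = 0; rep < repetitions; rep++) {
            rows.put(new OldKey(benchmarkId, strategy), rep); // same key every time
        }
        return rows.size();
    }

    /** Counts how many distinct rows survive when N repetitions use the new key. */
    static int distinctNewKeys(long benchmarkId, String strategy, int repetitions) {
        Map<NewKey, Integer> rows = new HashMap<>();
        for (int rep = 0; rep < repetitions; rep++) {
            rows.put(new NewKey(benchmarkId, strategy, rep), rep); // key differs per repetition
        }
        return rows.size();
    }
}
```

With three repetitions of one strategy, the old key yields a single entry while the new key yields three.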

API:

- BenchmarkCreateRequest: optional repetitions field (1–10, default 1)
- BenchmarkReportResponse: includes repetitions count
- RunComparisonEntry: runId → runIds list, plus new aggregated field
- AggregatedStrategyMetrics: mean + stddev (null when N=1) for attackSuccessRate, falsePositiveRate, refusalRate, avgLatencyMs
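
The DTO changes listed above might look roughly like the following sketch. It uses plain Java records and an explicit range check instead of whatever validation annotations the project actually uses. `MetricSummary`, the `String` run-id type, the `strategyType` field, and the helper name `repetitionsOrDefault` are assumptions for illustration and are not taken from the PR.

```java
import java.util.List;

/** Hypothetical shapes for the DTOs described in the API list; names are illustrative. */
final class BenchmarkDtoSketch {

    /** Mean and stddev of one metric; stddev is null when only one repetition ran. */
    record MetricSummary(double mean, Double stddev) {}

    /** One summary per metric named in the PR description. */
    record AggregatedStrategyMetrics(
            MetricSummary attackSuccessRate,
            MetricSummary falsePositiveRate,
            MetricSummary refusalRate,
            MetricSummary avgLatencyMs) {}

    /** Replaces the single runId with a list and carries the aggregated payload. */
    record RunComparisonEntry(
            String strategyType,
            List<String> runIds,
            AggregatedStrategyMetrics aggregated) {}

    /** repetitions is optional: null means "use the default of 1". */
    record BenchmarkCreateRequest(Integer repetitions) {
        BenchmarkCreateRequest {
            if (repetitions != null && (repetitions < 1 || repetitions > 10)) {
                throw new IllegalArgumentException("repetitions must be between 1 and 10");
            }
        }

        int repetitionsOrDefault() {
            return repetitions == null ? 1 : repetitions;
        }
    }
}
```

Using the boxed `Double` for stddev keeps "not computable at N=1" distinct from a genuine spread of 0.0.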

Service:

- BenchmarkService.executeBenchmark: inner loop runs each strategy N times, recording a BenchmarkRun per repetition
- BenchmarkService.getReport: groups runs by strategy, aggregates using population stddev; delta vs baseline uses first NONE run as representative (unchanged semantics for N=1)
- mean/stddev helpers are package-visible statics for unit testing
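
The grouping and aggregation step can be sketched as below. This is an assumption-laden illustration rather than the code in `BenchmarkService`: the `Run` record, the method `attackSuccessByStrategy`, and the empty-list behaviour are hypothetical, and the presentation rounding applied by the real `mean()` is omitted. The stddev is the population form, dividing by N rather than N-1, and returns null when fewer than two values exist.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of per-strategy aggregation with a population standard deviation. */
final class AggregationSketch {

    /** Hypothetical minimal view of a benchmark run. */
    record Run(String strategyType, int repetitionIndex, double attackSuccessRate) {}

    /** Arithmetic mean; rejects empty input (an assumption, not confirmed by the PR). */
    static double mean(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Population stddev (divide by N); null when there is only one value. */
    static Double stddev(List<Double> values) {
        if (values.size() < 2) {
            return null;
        }
        double m = mean(values);
        double squared = 0.0;
        for (double v : values) {
            squared += (v - m) * (v - m);
        }
        return Math.sqrt(squared / values.size());
    }

    /** Groups one metric by strategy, preserving first-seen strategy order. */
    static Map<String, List<Double>> attackSuccessByStrategy(List<Run> runs) {
        Map<String, List<Double>> grouped = new LinkedHashMap<>();
        for (Run r : runs) {
            grouped.computeIfAbsent(r.strategyType(), k -> new ArrayList<>())
                   .add(r.attackSuccessRate());
        }
        return grouped;
    }
}
```

For the classic data set 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the population stddev is 2, which makes a convenient known-value check.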

Script:

- --repetitions N flag (default 3); summary table shows mean + stddev columns; N=1 stddev shown as "n/a"

Tests:

- 8 new unit tests covering mean, stddev edge cases, N=1 null stddev, N=3 aggregation with known values

Copilot

Pull request overview

Adds repetitions and aggregated statistics (mean/stddev) to the benchmark pipeline so benchmark comparisons can distinguish noise from real strategy deltas.

Changes:

Introduces benchmarks.repetitions and updates benchmark_runs PK to include repetition_index (V6 migration).
Updates benchmark execution/reporting to run each strategy N times and return per-strategy aggregated metrics in the report API.
Updates CLI script output formatting and adds unit tests for mean/stddev/aggregation helpers.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/test/java/com/sentinelcore/service/BenchmarkStatisticsTest.java | Adds unit tests for mean/stddev and aggregation behavior (including N=1 null stddev). |
| src/main/resources/db/migration/V6__add_repetitions.sql | Adds `benchmarks.repetitions` and extends `benchmark_runs` PK with `repetition_index`. |
| src/main/java/com/sentinelcore/service/BenchmarkService.java | Executes each strategy multiple times; groups runs by strategy; aggregates metrics; exposes mean/stddev helpers. |
| src/main/java/com/sentinelcore/dto/RunComparisonEntry.java | Changes `runId` to `runIds` and adds `aggregated` metrics payload. |
| src/main/java/com/sentinelcore/dto/BenchmarkReportResponse.java | Includes `repetitions` in the report response. |
| src/main/java/com/sentinelcore/dto/BenchmarkCreateRequest.java | Adds optional validated `repetitions` field and a defaulting helper. |
| src/main/java/com/sentinelcore/dto/AggregatedStrategyMetrics.java | New DTO carrying mean/stddev per metric (stddev nullable for N=1). |
| src/main/java/com/sentinelcore/domain/entity/BenchmarkRun.java | Adds `repetitionIndex` to persist repetition identity per run. |
| src/main/java/com/sentinelcore/domain/entity/Benchmark.java | Adds persisted `repetitions` field (default 1). |
| src/main/java/com/sentinelcore/controller/BenchmarkController.java | Passes repetitions through to `BenchmarkService.createBenchmark`. |
| scripts/run_benchmark.sh | Adds `--repetitions` flag and updates summary output to use aggregated fields. |

Four issues from the PR #17 review:

1. Script LABEL default mismatched with the usage examples: changed the default from gemini-2.0-flash to gemini-2.5-flash to align with the documented examples.
2. Script showed stddev only for ASR: the summary table is now split into two sections (mean / stddev) so all four metrics (ASR, FPR, refusal, latency) show both values. stddev prints "null" when N=1.
3. BenchmarkRun list retrieval order not guaranteed: added @OrderBy("repetitionIndex ASC") on Benchmark.runs so the list always comes back from the DB in ascending repetition order. getReport() now tracks rep-0 explicitly per strategy via a separate map and uses it as the stable representative for delta computation, instead of relying on List.get(0).
4. stddev computed from rounded mean: introduced rawMean() as a private helper that returns the unrounded average; stddev() now uses rawMean() for the variance calculation and only rounds the final result. mean() still returns a rounded value for presentation.

Tests unchanged: they test mean() and stddev() at their public contracts.
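
Two of these fixes lend themselves to a small sketch: computing the variance from an unrounded mean, and choosing the repetition-0 run per strategy without depending on list order. The code below is an illustration under assumptions, not the project's implementation: the `Run` record, `round`, `stddevAround`, `representatives`, and the two-decimal rounding precision are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the rawMean and explicit rep-0 fixes; names and precision are assumptions. */
final class ReviewFixSketch {

    /** Hypothetical minimal view of a benchmark run. */
    record Run(String strategyType, int repetitionIndex, double attackSuccessRate) {}

    /** Unrounded average, intended for the variance calculation. */
    static double rawMean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Rounds to a fixed number of decimals, as a presentation step. */
    static double round(double value, int decimals) {
        double factor = Math.pow(10, decimals);
        return Math.round(value * factor) / factor;
    }

    /** Population stddev around a given centre, so both variants can be compared. */
    static double stddevAround(List<Double> values, double centre) {
        double squared = 0.0;
        for (double v : values) {
            squared += (v - centre) * (v - centre);
        }
        return Math.sqrt(squared / values.size());
    }

    /** Picks the repetition-0 run per strategy, independent of the list's order. */
    static Map<String, Run> representatives(List<Run> runs) {
        Map<String, Run> repZero = new HashMap<>();
        for (Run r : runs) {
            if (r.repetitionIndex() == 0) {
                repZero.put(r.strategyType(), r);
            }
        }
        return repZero;
    }
}
```

For small values such as 0.001, 0.002, 0.003 with two-decimal rounding, the rounded mean collapses to 0.0, and measuring spread around that point overstates the stddev compared with measuring around `rawMean`. Deviation measured around any point other than the true mean is always at least as large, which is why the rounding belongs only on the final result.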

Copilot AI review requested due to automatic review settings April 27, 2026 00:16

Copilot started reviewing on behalf of PSchmitz-Valckenberg April 27, 2026 00:16

Copilot AI reviewed Apr 27, 2026

Comment thread scripts/run_benchmark.sh Outdated

Comment thread scripts/run_benchmark.sh

Comment thread src/main/java/com/sentinelcore/service/BenchmarkService.java Outdated

Comment thread src/main/java/com/sentinelcore/service/BenchmarkService.java Outdated

PSchmitz-Valckenberg merged commit cd15be3 into main Apr 27, 2026
1 check passed

PSchmitz-Valckenberg deleted the feat/step-17-repetitions-ci branch April 27, 2026 03:05

PSchmitz-Valckenberg merged 2 commits into main from feat/step-17-repetitions-ci
