feat: Step 17 — repetitions and confidence intervals for benchmarks #17
Merged
Adds statistical rigor to the benchmark pipeline. Each strategy can now be run N times; the report exposes mean and stddev per metric so noise-floor effects (e.g. 10% attack-success delta on N=1) are distinguishable from real differences.

Schema (V6):
- benchmarks.repetitions INTEGER NOT NULL DEFAULT 1
- benchmark_runs PK changed from (benchmark_id, strategy_type) to (benchmark_id, strategy_type, repetition_index), allowing N rows per strategy per benchmark

API:
- BenchmarkCreateRequest: optional repetitions field (1–10, default 1)
- BenchmarkReportResponse: includes repetitions count
- RunComparisonEntry: runId → runIds list, plus new aggregated field
- AggregatedStrategyMetrics: mean + stddev (null when N=1) for attackSuccessRate, falsePositiveRate, refusalRate, avgLatencyMs

Service:
- BenchmarkService.executeBenchmark: inner loop runs each strategy N times, recording a BenchmarkRun per repetition
- BenchmarkService.getReport: groups runs by strategy, aggregates using population stddev; delta vs baseline uses first NONE run as representative (unchanged semantics for N=1)
- mean/stddev helpers are package-visible statics for unit testing

Script:
- --repetitions N flag (default 3); summary table shows mean + stddev columns; N=1 stddev shown as "n/a"

Tests:
- 8 new unit tests covering mean, stddev edge cases, N=1 null stddev, N=3 aggregation with known values
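For readers unfamiliar with the aggregation described above, here is a minimal sketch of a mean and a population standard deviation that returns null for a single sample. The class name `RepetitionStats` is hypothetical, and the sketch omits any rounding the real helpers may apply; it only illustrates the "population stddev, null when N=1" behaviour the description states.

```java
import java.util.List;

// Hypothetical helper class; not the code from this PR.
final class RepetitionStats {

    /** Arithmetic mean of the samples; assumes a non-empty list. */
    static double mean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /**
     * Population standard deviation (divides by N, not N - 1).
     * Returns null when fewer than two samples exist, since a single
     * run carries no information about spread.
     */
    static Double stddev(List<Double> values) {
        if (values.size() < 2) {
            return null;
        }
        double m = mean(values);
        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumSquares / values.size());
    }
}
```

With the samples 2, 4, 4, 4, 5, 5, 7, 9 this gives a mean of 5 and a population stddev of 2. Dividing by N rather than N - 1 treats the repetitions as the whole population, which understates the spread for small N; that trade-off is worth keeping in mind when reading reports with only a few repetitions.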
Pull request overview
Adds repetitions and aggregated statistics (mean/stddev) to the benchmark pipeline so benchmark comparisons can distinguish noise from real strategy deltas.
Changes:
- Introduces benchmarks.repetitions and updates benchmark_runs PK to include repetition_index (V6 migration).
- Updates benchmark execution/reporting to run each strategy N times and return per-strategy aggregated metrics in the report API.
- Updates CLI script output formatting and adds unit tests for mean/stddev/aggregation helpers.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/test/java/com/sentinelcore/service/BenchmarkStatisticsTest.java | Adds unit tests for mean/stddev and aggregation behavior (including N=1 null stddev). |
| src/main/resources/db/migration/V6__add_repetitions.sql | Adds benchmarks.repetitions and extends benchmark_runs PK with repetition_index. |
| src/main/java/com/sentinelcore/service/BenchmarkService.java | Executes each strategy multiple times; groups runs by strategy; aggregates metrics; exposes mean/stddev helpers. |
| src/main/java/com/sentinelcore/dto/RunComparisonEntry.java | Changes runId to runIds and adds aggregated metrics payload. |
| src/main/java/com/sentinelcore/dto/BenchmarkReportResponse.java | Includes repetitions in the report response. |
| src/main/java/com/sentinelcore/dto/BenchmarkCreateRequest.java | Adds optional validated repetitions field and a defaulting helper. |
| src/main/java/com/sentinelcore/dto/AggregatedStrategyMetrics.java | New DTO carrying mean/stddev per metric (stddev nullable for N=1). |
| src/main/java/com/sentinelcore/domain/entity/BenchmarkRun.java | Adds repetitionIndex to persist repetition identity per run. |
| src/main/java/com/sentinelcore/domain/entity/Benchmark.java | Adds persisted repetitions field (default 1). |
| src/main/java/com/sentinelcore/controller/BenchmarkController.java | Passes repetitions through to BenchmarkService.createBenchmark. |
| scripts/run_benchmark.sh | Adds --repetitions flag and updates summary output to use aggregated fields. |
Four issues from PR #17 review:

1. Script LABEL default mismatched with usage examples: changed default from gemini-2.0-flash to gemini-2.5-flash to align with the documented examples.
2. Script showed stddev only for ASR: summary table now split into two sections (mean / stddev) so all four metrics (ASR, FPR, refusal, latency) show both values. stddev prints "null" when N=1.
3. BenchmarkRun list retrieval order not guaranteed: added @OrderBy("repetitionIndex ASC") on Benchmark.runs so the list is always in insertion order from the DB. getReport() now tracks rep-0 explicitly per strategy via a separate map and uses it as the stable representative for delta computation, instead of relying on List.get(0).
4. stddev computed from rounded mean: introduced rawMean() as a private helper that returns the unrounded average; stddev() now uses rawMean() for the variance calculation and only rounds the final result. mean() still returns a rounded value for presentation.

Tests unchanged: they test mean() and stddev() at their public contracts.
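Fixes 3 and 4 can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: the `ReportFixSketch` class, the `Run` record, `repZeroByStrategy`, and `round4` are hypothetical names, the four-decimal rounding precision is assumed, and repetition indices are assumed to start at 0 (as "rep-0" suggests). Only `rawMean()`, `mean()`, and `stddev()` are names taken from the commit message.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; see the assumptions listed in the lead-in.
final class ReportFixSketch {

    // Minimal stand-in for the BenchmarkRun entity (hypothetical shape).
    record Run(String strategyType, int repetitionIndex) {}

    /** Fix 3: pick the repetition-0 run per strategy without relying on list order. */
    static Map<String, Run> repZeroByStrategy(List<Run> runs) {
        Map<String, Run> repZero = new LinkedHashMap<>();
        for (Run run : runs) {
            if (run.repetitionIndex() == 0) {
                repZero.put(run.strategyType(), run);
            }
        }
        return repZero;
    }

    // Assumption: presentation values are rounded to four decimal places.
    static double round4(double v) {
        return Math.round(v * 10_000.0) / 10_000.0;
    }

    // Unrounded average, used internally so rounding error does not leak into the variance.
    private static double rawMean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Rounded mean, intended for presentation. */
    static double mean(List<Double> values) {
        return round4(rawMean(values));
    }

    /** Fix 4: population stddev from the unrounded mean; only the final result is rounded. */
    static Double stddev(List<Double> values) {
        if (values.size() < 2) {
            return null;
        }
        double m = rawMean(values);
        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += (v - m) * (v - m);
        }
        return round4(Math.sqrt(sumSquares / values.size()));
    }
}
```

Looking the representative up by `repetitionIndex` keeps the baseline delta stable even if the persistence layer returns runs in a different order, and deferring rounding to the last step avoids compounding a rounded mean into every squared deviation.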