feat: Step 17 — repetitions and confidence intervals for benchmarks #17

Merged: PSchmitz-Valckenberg merged 2 commits into main from feat/step-17-repetitions-ci on Apr 27, 2026

@PSchmitz-Valckenberg (Owner)

Adds statistical rigor to the benchmark pipeline. Each strategy can now be run N times; the report exposes mean and stddev per metric so noise-floor effects (e.g. 10% attack-success delta on N=1) are distinguishable from real differences.

Schema (V6):

  • benchmarks.repetitions INTEGER NOT NULL DEFAULT 1
  • benchmark_runs PK changed from (benchmark_id, strategy_type) to (benchmark_id, strategy_type, repetition_index) — allows N rows per strategy per benchmark

API:

  • BenchmarkCreateRequest: optional repetitions field (1–10, default 1)
  • BenchmarkReportResponse: includes repetitions count
  • RunComparisonEntry: runId → runIds list, plus new aggregated field
  • AggregatedStrategyMetrics: mean + stddev (null when N=1) for attackSuccessRate, falsePositiveRate, refusalRate, avgLatencyMs
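The optional `repetitions` field with its 1–10 range and default of 1 could be modelled roughly as below. This is a minimal sketch in plain Java: the class name `RepetitionsSketch`, the record `CreateRequest`, and the method `repetitionsOrDefault` are hypothetical, and the real DTO may well rely on framework validation annotations instead of a manual check.

```java
/** Hypothetical sketch of an optional, range-checked repetitions field. */
final class RepetitionsSketch {
    static final int MIN = 1;
    static final int MAX = 10;
    static final int DEFAULT = 1;

    /** A null value stands for "field omitted from the request". */
    record CreateRequest(Integer repetitions) {
        int repetitionsOrDefault() {
            if (repetitions == null) {
                return DEFAULT; // omitted field falls back to a single run
            }
            if (repetitions < MIN || repetitions > MAX) {
                throw new IllegalArgumentException(
                        "repetitions must be between " + MIN + " and " + MAX);
            }
            return repetitions;
        }
    }

    public static void main(String[] args) {
        System.out.println(new CreateRequest(null).repetitionsOrDefault());
        System.out.println(new CreateRequest(3).repetitionsOrDefault());
    }
}
```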

Service:

  • BenchmarkService.executeBenchmark: inner loop runs each strategy N times, recording a BenchmarkRun per repetition
  • BenchmarkService.getReport: groups runs by strategy, aggregates using population stddev; delta vs baseline uses first NONE run as representative (unchanged semantics for N=1)
  • mean/stddev helpers are package-visible statics for unit testing
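To illustrate the grouping and aggregation described above, here is a minimal sketch. It is not the code from `BenchmarkService`: the types `Run` and `MeanStddev`, the class `AggregationSketch`, and the strategy name `STRATEGY_A` are hypothetical, and only one metric is aggregated to keep it short. Population stddev divides by N rather than N-1, and the sketch returns null for a single run to mirror the "null when N=1" behaviour.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-strategy aggregation; names are illustrative. */
final class AggregationSketch {

    /** One repetition's result for one strategy (assumed shape). */
    record Run(String strategy, int repetitionIndex, double attackSuccessRate) {}

    /** Mean plus a nullable stddev. */
    record MeanStddev(double mean, Double stddev) {}

    static double mean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Population standard deviation: divides by N, not N-1. */
    static Double populationStddev(List<Double> values) {
        if (values.size() < 2) {
            return null; // a single run carries no spread information
        }
        double m = mean(values);
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - m) * (v - m);
        }
        return Math.sqrt(sumSq / values.size());
    }

    /** Groups runs by strategy, then aggregates one metric per group. */
    static Map<String, MeanStddev> aggregate(List<Run> runs) {
        Map<String, List<Double>> byStrategy = new LinkedHashMap<>();
        for (Run r : runs) {
            byStrategy.computeIfAbsent(r.strategy(), k -> new ArrayList<>())
                      .add(r.attackSuccessRate());
        }
        Map<String, MeanStddev> out = new LinkedHashMap<>();
        byStrategy.forEach((strategy, vals) ->
                out.put(strategy, new MeanStddev(mean(vals), populationStddev(vals))));
        return out;
    }

    public static void main(String[] args) {
        List<Run> runs = List.of(
                new Run("NONE", 0, 0.40),
                new Run("NONE", 1, 0.50),
                new Run("NONE", 2, 0.60),
                new Run("STRATEGY_A", 0, 0.20));
        System.out.println(aggregate(runs));
    }
}
```

With N=3 the spread of the three NONE runs becomes visible next to the mean, while the single STRATEGY_A run reports no stddev at all, which is the distinction the report is meant to surface.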

Script:

  • --repetitions N flag (default 3); summary table shows mean + stddev columns; N=1 stddev shown as "n/a"

Tests:

  • 8 new unit tests covering mean, stddev edge cases, N=1 null stddev, N=3 aggregation with known values

Copilot AI review requested due to automatic review settings April 27, 2026 00:16

Copilot AI left a comment

Pull request overview

Adds repetitions and aggregated statistics (mean/stddev) to the benchmark pipeline so benchmark comparisons can distinguish noise from real strategy deltas.

Changes:

  • Introduces benchmarks.repetitions and updates benchmark_runs PK to include repetition_index (V6 migration).
  • Updates benchmark execution/reporting to run each strategy N times and return per-strategy aggregated metrics in the report API.
  • Updates CLI script output formatting and adds unit tests for mean/stddev/aggregation helpers.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/test/java/com/sentinelcore/service/BenchmarkStatisticsTest.java | Adds unit tests for mean/stddev and aggregation behavior (including N=1 null stddev). |
| src/main/resources/db/migration/V6__add_repetitions.sql | Adds benchmarks.repetitions and extends benchmark_runs PK with repetition_index. |
| src/main/java/com/sentinelcore/service/BenchmarkService.java | Executes each strategy multiple times; groups runs by strategy; aggregates metrics; exposes mean/stddev helpers. |
| src/main/java/com/sentinelcore/dto/RunComparisonEntry.java | Changes runId to runIds and adds aggregated metrics payload. |
| src/main/java/com/sentinelcore/dto/BenchmarkReportResponse.java | Includes repetitions in the report response. |
| src/main/java/com/sentinelcore/dto/BenchmarkCreateRequest.java | Adds optional validated repetitions field and a defaulting helper. |
| src/main/java/com/sentinelcore/dto/AggregatedStrategyMetrics.java | New DTO carrying mean/stddev per metric (stddev nullable for N=1). |
| src/main/java/com/sentinelcore/domain/entity/BenchmarkRun.java | Adds repetitionIndex to persist repetition identity per run. |
| src/main/java/com/sentinelcore/domain/entity/Benchmark.java | Adds persisted repetitions field (default 1). |
| src/main/java/com/sentinelcore/controller/BenchmarkController.java | Passes repetitions through to BenchmarkService.createBenchmark. |
| scripts/run_benchmark.sh | Adds --repetitions flag and updates summary output to use aggregated fields. |


Comment thread scripts/run_benchmark.sh Outdated
Comment thread scripts/run_benchmark.sh
Comment thread src/main/java/com/sentinelcore/service/BenchmarkService.java Outdated
Comment thread src/main/java/com/sentinelcore/service/BenchmarkService.java Outdated
Four issues from PR #17 review:

1. Script LABEL default mismatched with usage examples — changed
   default from gemini-2.0-flash to gemini-2.5-flash to align with
   the documented examples.

2. Script showed stddev only for ASR — summary table now split into
   two sections (mean / stddev) so all four metrics (ASR, FPR,
   refusal, latency) show both values. stddev prints "null" when N=1.

3. BenchmarkRun list retrieval order not guaranteed — added
   @OrderBy("repetitionIndex ASC") on Benchmark.runs so the list is
   always in insertion order from the DB. getReport() now tracks rep-0
   explicitly per strategy via a separate map and uses it as the
   stable representative for delta computation, instead of relying on
   List.get(0).

4. stddev computed from rounded mean — introduced rawMean() as a
   private helper that returns the unrounded average; stddev() now
   uses rawMean() for the variance calculation and only rounds the
   final result. mean() still returns a rounded value for presentation.
   Tests unchanged — they test mean() and stddev() at their public
   contracts.
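
The rounding fix in item 4 can be sketched as follows. This is an illustration under assumptions, not the merged code: the class name `RoundingSketch`, the `round` helper, and the choice of 4 decimal places are hypothetical; only the split between `rawMean()`, `mean()`, and `stddev()` comes from the description above.

```java
import java.util.List;

/** Hypothetical sketch: variance uses the unrounded mean, rounding happens last. */
final class RoundingSketch {
    // Assumed presentation precision; the real value is not stated in the PR.
    static final int PLACES = 4;

    static double round(double value, int places) {
        double factor = Math.pow(10, places);
        return Math.round(value * factor) / factor;
    }

    /** Unrounded average, used internally for the variance calculation. */
    static double rawMean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Rounded mean for presentation. */
    static double mean(List<Double> values) {
        return round(rawMean(values), PLACES);
    }

    /** Population stddev from rawMean(); only the final result is rounded. */
    static Double stddev(List<Double> values) {
        if (values.size() < 2) {
            return null;
        }
        double m = rawMean(values);
        double sumSq = 0.0;
        for (double v : values) {
            sumSq += (v - m) * (v - m);
        }
        return round(Math.sqrt(sumSq / values.size()), PLACES);
    }

    public static void main(String[] args) {
        List<Double> xs = List.of(1.0, 2.0, 2.0);
        System.out.println("mean=" + mean(xs) + " stddev=" + stddev(xs));
    }
}
```

Feeding the rounded mean into the variance would bias every squared deviation by the rounding error, so keeping the raw value until the final step avoids that drift without changing what `mean()` reports.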
@PSchmitz-Valckenberg PSchmitz-Valckenberg merged commit cd15be3 into main Apr 27, 2026
1 check passed
@PSchmitz-Valckenberg PSchmitz-Valckenberg deleted the feat/step-17-repetitions-ci branch April 27, 2026 03:05