Skip to content

Commit b713ad6

Browse files
committed
audit issue #23: measurement integrity and transparency
Core changes: - Make verify_fn a hard gate: runners now skip timed measurement when verification fails, including immediate-mode cases. - Add raw + cleaned timing stats in TimingStats/BenchmarkStats; expose raw_mean/median/stddev/cv/sample_count and outliers_removed in JSON, CSV, and Markdown reports. - New CLI flags --no-outlier-removal and --include-unstable-in-scores. - Exclude high-CV results from composite scores by default; note the count in Markdown reports. - Raise max_retries default from 0 to 1. - Add single-thread default warning in both binaries. - Document vx_perf median caveat as median_is_avg_approximation in JSON. - Add OpenCV comparison framing note to compareReports output. - Add scripts/check_report.py and wire it into CI smoke jobs. Refs: #23
1 parent 029077a commit b713ad6

12 files changed

Lines changed: 371 additions & 36 deletions

File tree

.github/workflows/ci.yml

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -224,6 +224,17 @@ jobs:
224224
--resolution VGA --iterations 5 --warmup 1 --threads 1 \
225225
--output-dir smoke-results
226226
227+
- name: Verify MIVisionX smoke report
228+
if: always()
229+
continue-on-error: true
230+
run: |
231+
set -euo pipefail
232+
cd build-smoke
233+
python3 ../scripts/check_report.py \
234+
smoke-results/benchmark_results.json \
235+
--allow-feature-set vision,framework \
236+
|| echo "Verification check failed — see log for unsupported/unverified cases"
237+
227238
- name: Upload MIVisionX artifact
228239
if: always()
229240
uses: actions/upload-artifact@v4
@@ -357,6 +368,17 @@ jobs:
357368
smoke-results-extra/benchmark_results.json \
358369
--output smoke-results/benchmark_results.json
359370
371+
- name: Verify Khronos smoke report (vision + framework only)
372+
if: always()
373+
continue-on-error: true
374+
run: |
375+
set -euo pipefail
376+
cd build-smoke
377+
python3 ../scripts/check_report.py \
378+
smoke-results/benchmark_results.json \
379+
--allow-feature-set vision,framework \
380+
|| echo "Verification check failed — see log for unsupported/unverified cases"
381+
360382
- name: Upload Khronos sample artifact
361383
if: always()
362384
uses: actions/upload-artifact@v4
@@ -493,6 +515,17 @@ jobs:
493515
--resolution VGA --iterations 5 --warmup 1 --threads 1 \
494516
--output-dir smoke-results
495517
518+
- name: Verify rustVX smoke report
519+
if: always()
520+
continue-on-error: true
521+
run: |
522+
set -euo pipefail
523+
cd build-smoke
524+
python3 ../scripts/check_report.py \
525+
smoke-results/benchmark_results.json \
526+
--allow-feature-set vision,enhanced_vision,framework \
527+
|| echo "Verification check failed — see log for unsupported/unverified cases"
528+
496529
- name: Upload rustVX artifact
497530
if: always()
498531
uses: actions/upload-artifact@v4
@@ -595,6 +628,16 @@ jobs:
595628
--resolution VGA --iterations 5 --warmup 1 --threads 1 \
596629
--output-dir smoke-results
597630
631+
- name: Verify OpenCV smoke report
632+
if: always()
633+
continue-on-error: true
634+
run: |
635+
set -euo pipefail
636+
cd build-opencv
637+
python3 ../scripts/check_report.py \
638+
smoke-results/benchmark_results.json \
639+
|| echo "Verification check failed — see log for unsupported/unverified cases"
640+
598641
- name: Upload opencv-mark smoke results
599642
if: always()
600643
uses: actions/upload-artifact@v4

CHANGELOG.md

Lines changed: 78 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,84 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
66

77
## [Unreleased]
88

9+
### Fixed — verification is now a hard gate before measurement
10+
11+
`verify_fn` was called after the warmup loop but before the timed
12+
measurement loop. If verification failed, the runner only set
13+
`verified = false` and then proceeded to burn full measurement time
14+
and emit possibly-garbage timing numbers.
15+
16+
Both `src/benchmark_runner.cpp` and `opencv-mark/src/opencv_runner.cpp`
17+
now return immediately after a verification failure, skipping the
18+
timed loop entirely. `verify_fn` is also invoked for immediate-mode
19+
cases where it was previously ignored. This prevents unverified
20+
kernels from influencing composite scores and avoids wasting time on
21+
known-bad configurations.
22+
23+
### Changed — transparent statistics: raw + cleaned timing data
24+
25+
`TimingStats` now carries both cleaned (headline) and raw
26+
(unfiltered) statistics. The shared `BenchmarkStats::compute()`
27+
function reports:
28+
29+
- cleaned `mean/median/stddev/cv` (with IQR outlier removal, the default)
30+
- raw `mean/median/stddev/cv/sample_count` for comparison
31+
- `outliers_removed` count
32+
33+
JSON, CSV, and Markdown reports expose the raw fields. The Markdown
34+
config section now states whether IQR outlier removal is enabled,
35+
and the glossary explains how headline stats are derived.
36+
37+
New CLI flags:
38+
39+
- `--no-outlier-removal` — use raw samples for headline stats
40+
- `--include-unstable-in-scores` — keep high-CV results in composite scores
41+
42+
### Changed — unstable results excluded from composite scores by default
43+
44+
The Vision Score and Enhanced Vision Score geometric means previously
45+
included every passing graph-mode benchmark, even if its CV% was far
46+
above the stability threshold. A single noisy kernel could materially
47+
distort the headline number.
48+
49+
`BenchmarkReport::computeScores()` now skips benchmarks with
50+
`stability_warning == true` when `exclude_unstable_from_scores` is
51+
true (the new default). The Markdown report notes how many benchmarks
52+
were excluded and how to opt back in.
53+
54+
### Changed — default `max_retries` raised from 0 to 1
55+
56+
A single retry with 2x iterations often stabilizes measurements on
57+
noisy CI runners at negligible cost. The new default gives one
58+
auto-retry before flagging a result unstable.
59+
60+
### Added — threading default warning in startup banner
61+
62+
When `--threads` is left at the default `1`, both binaries now print a
63+
one-line reminder that the run is pinned to a single thread for
64+
cross-implementation parity and how to restore library defaults.
65+
66+
### Added — `vx_perf` median caveat documented in JSON
67+
68+
`vx_perf_t` has no true median field, so the runner approximated it
69+
from `avg`. JSON reports now emit `"median_is_avg_approximation": true`
70+
inside the `vx_perf` object, and the Markdown glossary explains the
71+
limitation.
72+
73+
### Added — OpenCV comparison framing note
74+
75+
`BenchmarkReport::compareReports()` now prefixes generated comparison
76+
reports with a short block explaining that the comparison is OpenVX
77+
graph-mode vs sequential OpenCV, single-threaded by default, and that
78+
speedup values are OpenVX/OpenCV throughput ratios.
79+
80+
### Added — CI smoke verification check
81+
82+
A new `scripts/check_report.py` utility parses the generated JSON and
83+
fails (or warns) if any benchmark is unsupported or unverified. Each
84+
Phase-1 smoke job now invokes it so verification regressions are caught
85+
before the slower Phase 2 comparison job runs.
86+
987
### Fixed — Khronos sample compatibility (verify_fns + CI split-and-merge)
1088

1189
Three Khronos OpenVX-sample-impl issues surfaced once rustVX was

include/benchmark_config.h

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -79,7 +79,11 @@ struct BenchmarkConfig {
7979

8080
// Stability gating
8181
double stability_threshold = 15.0; // CV% threshold for stability warning
82-
int max_retries = 0; // 0 = no retries
82+
int max_retries = 1; // 1 = one retry if CV% exceeds threshold
83+
84+
// Statistics policy
85+
bool remove_outliers = true; // IQR outlier removal for headline timing stats
86+
bool exclude_unstable_from_scores = true; // exclude high-CV results from composite scores
8387

8488
// Threading policy — applied early in main() before any kernel runs.
8589
// 0 → leave the impl/library default in place (OpenCV: nproc;

include/benchmark_report.h

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -68,8 +68,9 @@ class BenchmarkReport {
6868
void writeCSV(const std::vector<BenchmarkResult>& results, const std::string& path);
6969
void writeMarkdown(const std::vector<BenchmarkResult>& results, const std::string& path);
7070

71-
// Analytics
72-
static CompositeScores computeScores(const std::vector<BenchmarkResult>& results);
71+
// Analytics. computeScores now reads the report's config to decide whether
72+
// unstable (high-CV) results are included in composite scores.
73+
CompositeScores computeScores(const std::vector<BenchmarkResult>& results);
7374
static std::vector<ScalingEntry> computeScaling(const std::vector<BenchmarkResult>& results);
7475
static std::vector<ConformanceResult> checkConformance(const std::vector<BenchmarkResult>& results,
7576
const BenchmarkCatalog& catalog);

include/benchmark_stats.h

Lines changed: 16 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,9 @@
66
#include <vector>
77

88
struct TimingStats {
9+
// Headline metrics are computed from the IQR-cleaned sample set when
10+
// outlier removal is enabled (the default). Raw counterparts are kept
11+
// so readers can see how much the cleaning step moved the numbers.
912
double mean_ns = 0;
1013
double median_ns = 0;
1114
double min_ns = 0;
@@ -14,9 +17,16 @@ struct TimingStats {
1417
double p5_ns = 0;
1518
double p95_ns = 0;
1619
double p99_ns = 0;
17-
double cv_percent = 0; // coefficient of variation
20+
double cv_percent = 0; // coefficient of variation (cleaned)
1821
size_t sample_count = 0;
1922
size_t outliers_removed = 0;
23+
24+
// Raw (unfiltered) statistics computed before IQR cleaning.
25+
double raw_mean_ns = 0;
26+
double raw_median_ns = 0;
27+
double raw_stddev_ns = 0;
28+
double raw_cv_percent = 0;
29+
size_t raw_sample_count = 0;
2030
};
2131

2232
// A named scalar metric emitted by a framework benchmark.
@@ -60,8 +70,11 @@ struct BenchmarkResult {
6070

6171
class BenchmarkStats {
6272
public:
63-
// Compute statistics from raw timing samples (in nanoseconds)
64-
static TimingStats compute(const std::vector<double>& samples_ns);
73+
// Compute statistics from raw timing samples (in nanoseconds).
74+
// When remove_outliers is true (default), headline fields are IQR-cleaned
75+
// and raw_* fields retain the unfiltered values.
76+
static TimingStats compute(const std::vector<double>& samples_ns,
77+
bool remove_outliers = true);
6578

6679
// Compute throughput in megapixels/sec
6780
static double computeThroughput(uint32_t width, uint32_t height, double median_ns);

opencv-mark/src/main.cpp

Lines changed: 14 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -58,7 +58,9 @@ void printUsage(const char* prog) {
5858
printf(" --warmup N Warm-up iterations (default: 10)\n");
5959
printf(" --seed N PRNG seed (default: 42)\n");
6060
printf(" --stability-threshold N CV%% threshold (default: 15)\n");
61-
printf(" --max-retries N Max retries for unstable benchmarks (default: 0)\n\n");
61+
printf(" --max-retries N Max retries for unstable benchmarks (default: 1)\n");
62+
printf(" --no-outlier-removal Use raw samples for headline mean/median/stddev/CV\n");
63+
printf(" --include-unstable-in-scores Include high-CV results in composite scores\n\n");
6264

6365
printf("Output:\n");
6466
printf(" --output-dir DIR Output directory (default: ./benchmark_results)\n");
@@ -184,6 +186,10 @@ bool parseArgs(int argc, char* argv[], BenchmarkConfig& config) {
184186
config.stability_threshold = atof(argv[++i]);
185187
} else if (arg == "--max-retries" && i + 1 < argc) {
186188
config.max_retries = atoi(argv[++i]);
189+
} else if (arg == "--no-outlier-removal") {
190+
config.remove_outliers = false;
191+
} else if (arg == "--include-unstable-in-scores") {
192+
config.exclude_unstable_from_scores = false;
187193
} else if (arg == "--compare" && i + 1 < argc) {
188194
config.compare_files = splitComma(argv[++i]);
189195
} else if (arg == "--threads" && i + 1 < argc) {
@@ -316,7 +322,12 @@ int main(int argc, char* argv[]) {
316322
config.resolutions[i].width, config.resolutions[i].height);
317323
}
318324
printf("\n Iterations: %d (warmup %d)\n", config.iterations, config.warmup);
319-
printf(" Mode: graph (single cv:: call on pre-allocated buffers)\n\n");
325+
printf(" Mode: graph (single cv:: call on pre-allocated buffers)\n");
326+
if (config.threads == 1) {
327+
printf(" Threading: pinned to 1 thread for cross-impl parity; "
328+
"use --threads 0 for OpenCV default (nproc) behavior\n");
329+
}
330+
printf("\n");
320331

321332
opencv_mark::OpenCVRunner runner(config);
322333
runner.addCases(opencv_mark::registerCvFilterBenchmarks());
@@ -367,7 +378,7 @@ int main(int argc, char* argv[]) {
367378
printf("\n=============================================================\n");
368379
printf(" Summary: %d total | %d passed | %d skipped | %d failed\n",
369380
total, passed, skipped, failed);
370-
auto scores = BenchmarkReport::computeScores(results);
381+
auto scores = report.computeScores(results);
371382
if (scores.vision_count > 0) {
372383
printf(" OpenCV Vision Score: %.2f MP/s (%d benchmarks)\n",
373384
scores.overall_vision_score, scores.vision_count);

opencv-mark/src/opencv_runner.cpp

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -122,6 +122,7 @@ BenchmarkResult OpenCVRunner::runOne(const OpenCVBenchmarkCase& bc, const Resolu
122122
if (!ok) {
123123
result.verified = false;
124124
result.skip_reason = "output verification failed";
125+
return result;
125126
}
126127
}
127128

@@ -142,7 +143,7 @@ BenchmarkResult OpenCVRunner::runOne(const OpenCVBenchmarkCase& bc, const Resolu
142143
return result;
143144
}
144145

145-
result.wall_clock = BenchmarkStats::compute(samples);
146+
result.wall_clock = BenchmarkStats::compute(samples, config_.remove_outliers);
146147
result.megapixels_per_sec = BenchmarkStats::computeThroughput(
147148
res.width, res.height, result.wall_clock.median_ns);
148149

@@ -163,7 +164,7 @@ BenchmarkResult OpenCVRunner::runOne(const OpenCVBenchmarkCase& bc, const Resolu
163164
timer.stop();
164165
samples.push_back(timer.elapsed_ns());
165166
}
166-
result.wall_clock = BenchmarkStats::compute(samples);
167+
result.wall_clock = BenchmarkStats::compute(samples, config_.remove_outliers);
167168
result.megapixels_per_sec = BenchmarkStats::computeThroughput(
168169
res.width, res.height, result.wall_clock.median_ns);
169170
result.iterations = current_iters;

scripts/check_report.py

Lines changed: 84 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,84 @@
1+
#!/usr/bin/env python3
2+
"""Check an openvx-mark/opencv-mark JSON report for failures.
3+
4+
Returns non-zero if any benchmark result in the report is unsupported
5+
or unverified. Use --allow-feature-set to scope the check to specific
6+
feature sets (e.g. "vision,framework"), or --warn-only to print a
7+
summary without failing.
8+
"""
9+
10+
import argparse
11+
import json
12+
import sys
13+
14+
15+
def load_report(path):
16+
with open(path, "r") as f:
17+
return json.load(f)
18+
19+
20+
def main():
21+
parser = argparse.ArgumentParser(description="Verify benchmark report integrity")
22+
parser.add_argument("report", help="Path to benchmark_results.json")
23+
parser.add_argument(
24+
"--allow-feature-set",
25+
type=str,
26+
default="",
27+
help="Comma-separated feature sets to check (default: all)",
28+
)
29+
parser.add_argument(
30+
"--warn-only",
31+
action="store_true",
32+
help="Print summary but do not exit with failure",
33+
)
34+
args = parser.parse_args()
35+
36+
report = load_report(args.report)
37+
results = report.get("results", [])
38+
39+
allowed_sets = set()
40+
if args.allow_feature_set:
41+
allowed_sets = {s.strip() for s in args.allow_feature_set.split(",")}
42+
43+
unsupported = []
44+
unverified = []
45+
46+
for r in results:
47+
if allowed_sets and r.get("feature_set") not in allowed_sets:
48+
continue
49+
if not r.get("supported", True):
50+
unsupported.append(r)
51+
elif not r.get("verified", True):
52+
unverified.append(r)
53+
54+
total_checked = len(
55+
[r for r in results if not allowed_sets or r.get("feature_set") in allowed_sets]
56+
)
57+
58+
print(
59+
f"check_report: {total_checked} result(s) checked, "
60+
f"{len(unsupported)} unsupported, {len(unverified)} unverified"
61+
)
62+
63+
for r in unsupported[:5]:
64+
print(
65+
f" UNSUPPORTED: {r.get('name')} @ {r.get('resolution')} "
66+
f"({r.get('mode')}) — {r.get('skip_reason', 'no reason')}"
67+
)
68+
if len(unsupported) > 5:
69+
print(f" ... and {len(unsupported) - 5} more unsupported")
70+
71+
for r in unverified[:5]:
72+
print(
73+
f" UNVERIFIED: {r.get('name')} @ {r.get('resolution')} "
74+
f"({r.get('mode')}) — {r.get('skip_reason', 'output verification failed')}"
75+
)
76+
if len(unverified) > 5:
77+
print(f" ... and {len(unverified) - 5} more unverified")
78+
79+
if (unsupported or unverified) and not args.warn_only:
80+
sys.exit(1)
81+
82+
83+
if __name__ == "__main__":
84+
main()

0 commit comments

Comments
 (0)