Skip to content

Commit 6388c02

Browse files
authored
Enhancement: measurement integrity and transparency (#26)
Co-authored-by: simonCatBot <simoncatbot@users.noreply.github.com>
1 parent c97d894 commit 6388c02

12 files changed

Lines changed: 387 additions & 36 deletions

File tree

.github/workflows/ci.yml

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -224,6 +224,15 @@ jobs:
224224
--resolution VGA --iterations 5 --warmup 1 --threads 1 \
225225
--output-dir smoke-results
226226
227+
- name: Verify MIVisionX smoke report
228+
if: always()
229+
run: |
230+
set -euo pipefail
231+
cd build-smoke
232+
python3 ../scripts/check_report.py \
233+
smoke-results/benchmark_results.json \
234+
--allow-feature-set vision,framework
235+
227236
- name: Upload MIVisionX artifact
228237
if: always()
229238
uses: actions/upload-artifact@v4
@@ -357,6 +366,21 @@ jobs:
357366
smoke-results-extra/benchmark_results.json \
358367
--output smoke-results/benchmark_results.json
359368
369+
- name: Verify Khronos smoke report (vision + framework only)
370+
if: always()
371+
run: |
372+
set -euo pipefail
373+
cd build-smoke
374+
# The Khronos reference sample has known vision-profile gaps
375+
# (e.g. S16 LaplacianPyramid is rejected at vxVerifyGraph with
376+
# VX_ERROR_INVALID_PARAMETERS), so unsupported rows are expected
377+
# here. Use --warn-only so the smoke still surfaces the count
378+
# without failing the job for impl-side limitations.
379+
python3 ../scripts/check_report.py \
380+
smoke-results/benchmark_results.json \
381+
--allow-feature-set vision,framework \
382+
--warn-only
383+
360384
- name: Upload Khronos sample artifact
361385
if: always()
362386
uses: actions/upload-artifact@v4
@@ -493,6 +517,15 @@ jobs:
493517
--resolution VGA --iterations 5 --warmup 1 --threads 1 \
494518
--output-dir smoke-results
495519
520+
- name: Verify rustVX smoke report
521+
if: always()
522+
run: |
523+
set -euo pipefail
524+
cd build-smoke
525+
python3 ../scripts/check_report.py \
526+
smoke-results/benchmark_results.json \
527+
--allow-feature-set vision,enhanced_vision,framework
528+
496529
- name: Upload rustVX artifact
497530
if: always()
498531
uses: actions/upload-artifact@v4
@@ -595,6 +628,14 @@ jobs:
595628
--resolution VGA --iterations 5 --warmup 1 --threads 1 \
596629
--output-dir smoke-results
597630
631+
- name: Verify OpenCV smoke report
632+
if: always()
633+
run: |
634+
set -euo pipefail
635+
cd build-opencv
636+
python3 ../scripts/check_report.py \
637+
smoke-results/benchmark_results.json
638+
598639
- name: Upload opencv-mark smoke results
599640
if: always()
600641
uses: actions/upload-artifact@v4

CHANGELOG.md

Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,100 @@ All notable changes to **openvx-mark** are documented here.
55
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project follows semantic versioning where the major version tracks backward compatibility of the JSON report schema.
66

77
## [Unreleased]
8+
### Breaking — JSON/CSV report schema expanded
9+
10+
This release adds transparency fields to timing reports. Downstream
11+
parsers that read CSV by fixed column index or JSON by a strict field
12+
list will need updating:
13+
14+
- `TimingStats` now exposes `raw_mean_ms`, `raw_median_ms`,
15+
`raw_stddev_ms`, `raw_cv_percent`, and `raw_sample_count` in both
16+
JSON and CSV.
17+
- JSON `config` gains `remove_outliers` and `exclude_unstable_from_scores`.
18+
- JSON `vx_perf` gains `median_is_avg_approximation`.
19+
- The CSV header grew by 5 columns (`raw_*`) and the skipped-row comma
20+
count was adjusted accordingly.
21+
22+
### Fixed — verification is now a hard gate before measurement
23+
24+
`verify_fn` was called after the warmup loop but before the timed
25+
measurement loop. If verification failed, the runner only set
26+
`verified = false` and then proceeded to burn full measurement time
27+
and emit possibly-garbage timing numbers.
28+
29+
Both `src/benchmark_runner.cpp` and `opencv-mark/src/opencv_runner.cpp`
30+
now return immediately after a verification failure, skipping the
31+
timed loop entirely. `verify_fn` is also invoked for immediate-mode
32+
cases where it was previously ignored. This prevents unverified
33+
kernels from influencing composite scores and avoids wasting time on
34+
known-bad configurations.
35+
36+
### Changed — transparent statistics: raw + cleaned timing data
37+
38+
`TimingStats` now carries both cleaned (headline) and raw
39+
(unfiltered) statistics. The shared `BenchmarkStats::compute()`
40+
function reports:
41+
42+
- cleaned `mean/median/stddev/cv` (with IQR outlier removal, the default)
43+
- raw `mean/median/stddev/cv/sample_count` for comparison
44+
- `outliers_removed` count
45+
46+
JSON, CSV, and Markdown reports expose the raw fields. The Markdown
47+
config section now states whether IQR outlier removal is enabled,
48+
and the glossary explains how headline stats are derived.
49+
50+
New CLI flags:
51+
52+
- `--no-outlier-removal` — use raw samples for headline stats
53+
- `--include-unstable-in-scores` — keep high-CV results in composite scores
54+
55+
### Changed — unstable results excluded from composite scores by default
56+
57+
The Vision Score and Enhanced Vision Score geometric means previously
58+
included every passing graph-mode benchmark, even if its CV% was far
59+
above the stability threshold. A single noisy kernel could materially
60+
distort the headline number.
61+
62+
`BenchmarkReport::computeScores()` now skips benchmarks with
63+
`stability_warning == true` when `exclude_unstable_from_scores` is
64+
true (the new default). The Markdown report notes how many benchmarks
65+
were excluded and how to opt back in.
66+
67+
### Changed — default `max_retries` raised from 0 to 1
68+
69+
A single retry with 2x iterations often stabilizes measurements on
70+
noisy CI runners at negligible cost. The new default gives one
71+
auto-retry before flagging a result unstable.
72+
73+
### Added — threading default warning in startup banner
74+
75+
When `--threads` is left at the default `1`, both binaries now print a
76+
one-line reminder that the run is pinned to a single thread for
77+
cross-implementation parity and how to restore library defaults.
78+
79+
### Added — `vx_perf` median caveat documented in JSON
80+
81+
`vx_perf_t` has no true median field, so the runner approximated it
82+
from `avg`. JSON reports now emit `"median_is_avg_approximation": true`
83+
inside the `vx_perf` object, and the Markdown glossary explains the
84+
limitation.
85+
86+
### Added — OpenCV comparison framing note
87+
88+
`BenchmarkReport::compareReports()` now prefixes generated comparison
89+
reports with a short block explaining that the comparison is OpenVX
90+
graph-mode vs sequential OpenCV, single-threaded by default, and that
91+
speedup values are OpenVX/OpenCV throughput ratios.
92+
93+
### Added — CI smoke verification check
94+
95+
A new `scripts/check_report.py` utility parses the generated JSON and
96+
fails if any benchmark in the checked scope is unsupported or
97+
unverified. Each Phase-1 smoke job now invokes it so verification
98+
regressions are caught before the slower Phase 2 comparison job runs.
99+
Use `--warn-only` for a non-failing audit or `--allow-feature-set` to
100+
limit the checked scope.
101+
8102

9103
### Fixed — opencv-mark duplicate `OpticalFlowPyrLK` benchmark name
10104

include/benchmark_config.h

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -79,7 +79,11 @@ struct BenchmarkConfig {
7979

8080
// Stability gating
8181
double stability_threshold = 15.0; // CV% threshold for stability warning
82-
int max_retries = 0; // 0 = no retries
82+
int max_retries = 1; // 1 = one retry if CV% exceeds threshold
83+
84+
// Statistics policy
85+
bool remove_outliers = true; // IQR outlier removal for headline timing stats
86+
bool exclude_unstable_from_scores = true; // exclude high-CV results from composite scores
8387

8488
// Threading policy — applied early in main() before any kernel runs.
8589
// 0 → leave the impl/library default in place (OpenCV: nproc;

include/benchmark_report.h

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -68,8 +68,9 @@ class BenchmarkReport {
6868
void writeCSV(const std::vector<BenchmarkResult>& results, const std::string& path);
6969
void writeMarkdown(const std::vector<BenchmarkResult>& results, const std::string& path);
7070

71-
// Analytics
72-
static CompositeScores computeScores(const std::vector<BenchmarkResult>& results);
71+
// Analytics. computeScores now reads the report's config to decide whether
72+
// unstable (high-CV) results are included in composite scores.
73+
CompositeScores computeScores(const std::vector<BenchmarkResult>& results);
7374
static std::vector<ScalingEntry> computeScaling(const std::vector<BenchmarkResult>& results);
7475
static std::vector<ConformanceResult> checkConformance(const std::vector<BenchmarkResult>& results,
7576
const BenchmarkCatalog& catalog);

include/benchmark_stats.h

Lines changed: 16 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,9 @@
66
#include <vector>
77

88
struct TimingStats {
9+
// Headline metrics are computed from the IQR-cleaned sample set when
10+
// outlier removal is enabled (the default). Raw counterparts are kept
11+
// so readers can see how much the cleaning step moved the numbers.
912
double mean_ns = 0;
1013
double median_ns = 0;
1114
double min_ns = 0;
@@ -14,9 +17,16 @@ struct TimingStats {
1417
double p5_ns = 0;
1518
double p95_ns = 0;
1619
double p99_ns = 0;
17-
double cv_percent = 0; // coefficient of variation
20+
double cv_percent = 0; // coefficient of variation (cleaned)
1821
size_t sample_count = 0;
1922
size_t outliers_removed = 0;
23+
24+
// Raw (unfiltered) statistics computed before IQR cleaning.
25+
double raw_mean_ns = 0;
26+
double raw_median_ns = 0;
27+
double raw_stddev_ns = 0;
28+
double raw_cv_percent = 0;
29+
size_t raw_sample_count = 0;
2030
};
2131

2232
// A named scalar metric emitted by a framework benchmark.
@@ -60,8 +70,11 @@ struct BenchmarkResult {
6070

6171
class BenchmarkStats {
6272
public:
63-
// Compute statistics from raw timing samples (in nanoseconds)
64-
static TimingStats compute(const std::vector<double>& samples_ns);
73+
// Compute statistics from raw timing samples (in nanoseconds).
74+
// When remove_outliers is true (default), headline fields are IQR-cleaned
75+
// and raw_* fields retain the unfiltered values.
76+
static TimingStats compute(const std::vector<double>& samples_ns,
77+
bool remove_outliers = true);
6578

6679
// Compute throughput in megapixels/sec
6780
static double computeThroughput(uint32_t width, uint32_t height, double median_ns);

opencv-mark/src/main.cpp

Lines changed: 14 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -58,7 +58,9 @@ void printUsage(const char* prog) {
5858
printf(" --warmup N Warm-up iterations (default: 10)\n");
5959
printf(" --seed N PRNG seed (default: 42)\n");
6060
printf(" --stability-threshold N CV%% threshold (default: 15)\n");
61-
printf(" --max-retries N Max retries for unstable benchmarks (default: 0)\n\n");
61+
printf(" --max-retries N Max retries for unstable benchmarks (default: 1)\n");
62+
printf(" --no-outlier-removal Use raw samples for headline mean/median/stddev/CV\n");
63+
printf(" --include-unstable-in-scores Include high-CV results in composite scores\n\n");
6264

6365
printf("Output:\n");
6466
printf(" --output-dir DIR Output directory (default: ./benchmark_results)\n");
@@ -184,6 +186,10 @@ bool parseArgs(int argc, char* argv[], BenchmarkConfig& config) {
184186
config.stability_threshold = atof(argv[++i]);
185187
} else if (arg == "--max-retries" && i + 1 < argc) {
186188
config.max_retries = atoi(argv[++i]);
189+
} else if (arg == "--no-outlier-removal") {
190+
config.remove_outliers = false;
191+
} else if (arg == "--include-unstable-in-scores") {
192+
config.exclude_unstable_from_scores = false;
187193
} else if (arg == "--compare" && i + 1 < argc) {
188194
config.compare_files = splitComma(argv[++i]);
189195
} else if (arg == "--threads" && i + 1 < argc) {
@@ -316,7 +322,12 @@ int main(int argc, char* argv[]) {
316322
config.resolutions[i].width, config.resolutions[i].height);
317323
}
318324
printf("\n Iterations: %d (warmup %d)\n", config.iterations, config.warmup);
319-
printf(" Mode: graph (single cv:: call on pre-allocated buffers)\n\n");
325+
printf(" Mode: graph (single cv:: call on pre-allocated buffers)\n");
326+
if (config.threads == 1) {
327+
printf(" Threading: pinned to 1 thread for cross-impl parity; "
328+
"use --threads 0 for OpenCV default (nproc) behavior\n");
329+
}
330+
printf("\n");
320331

321332
opencv_mark::OpenCVRunner runner(config);
322333
runner.addCases(opencv_mark::registerCvFilterBenchmarks());
@@ -367,7 +378,7 @@ int main(int argc, char* argv[]) {
367378
printf("\n=============================================================\n");
368379
printf(" Summary: %d total | %d passed | %d skipped | %d failed\n",
369380
total, passed, skipped, failed);
370-
auto scores = BenchmarkReport::computeScores(results);
381+
auto scores = report.computeScores(results);
371382
if (scores.vision_count > 0) {
372383
printf(" OpenCV Vision Score: %.2f MP/s (%d benchmarks)\n",
373384
scores.overall_vision_score, scores.vision_count);

opencv-mark/src/opencv_runner.cpp

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -122,6 +122,7 @@ BenchmarkResult OpenCVRunner::runOne(const OpenCVBenchmarkCase& bc, const Resolu
122122
if (!ok) {
123123
result.verified = false;
124124
result.skip_reason = "output verification failed";
125+
return result;
125126
}
126127
}
127128

@@ -142,7 +143,7 @@ BenchmarkResult OpenCVRunner::runOne(const OpenCVBenchmarkCase& bc, const Resolu
142143
return result;
143144
}
144145

145-
result.wall_clock = BenchmarkStats::compute(samples);
146+
result.wall_clock = BenchmarkStats::compute(samples, config_.remove_outliers);
146147
result.megapixels_per_sec = BenchmarkStats::computeThroughput(
147148
res.width, res.height, result.wall_clock.median_ns);
148149

@@ -163,7 +164,7 @@ BenchmarkResult OpenCVRunner::runOne(const OpenCVBenchmarkCase& bc, const Resolu
163164
timer.stop();
164165
samples.push_back(timer.elapsed_ns());
165166
}
166-
result.wall_clock = BenchmarkStats::compute(samples);
167+
result.wall_clock = BenchmarkStats::compute(samples, config_.remove_outliers);
167168
result.megapixels_per_sec = BenchmarkStats::computeThroughput(
168169
res.width, res.height, result.wall_clock.median_ns);
169170
result.iterations = current_iters;

scripts/check_report.py

Lines changed: 84 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,84 @@
1+
#!/usr/bin/env python3
2+
"""Check an openvx-mark/opencv-mark JSON report for failures.
3+
4+
Returns non-zero if any benchmark result in the report is unsupported
5+
or unverified. Use --allow-feature-set to scope the check to specific
6+
feature sets (e.g. "vision,framework"), or --warn-only to print a
7+
summary without failing.
8+
"""
9+
10+
import argparse
11+
import json
12+
import sys
13+
14+
15+
def load_report(path):
16+
with open(path, "r") as f:
17+
return json.load(f)
18+
19+
20+
def main():
21+
parser = argparse.ArgumentParser(description="Verify benchmark report integrity")
22+
parser.add_argument("report", help="Path to benchmark_results.json")
23+
parser.add_argument(
24+
"--allow-feature-set",
25+
type=str,
26+
default="",
27+
help="Comma-separated feature sets to check (default: all)",
28+
)
29+
parser.add_argument(
30+
"--warn-only",
31+
action="store_true",
32+
help="Print summary but do not exit with failure",
33+
)
34+
args = parser.parse_args()
35+
36+
report = load_report(args.report)
37+
results = report.get("results", [])
38+
39+
allowed_sets = set()
40+
if args.allow_feature_set:
41+
allowed_sets = {s.strip() for s in args.allow_feature_set.split(",")}
42+
43+
unsupported = []
44+
unverified = []
45+
46+
for r in results:
47+
if allowed_sets and r.get("feature_set") not in allowed_sets:
48+
continue
49+
if not r.get("supported", True):
50+
unsupported.append(r)
51+
elif not r.get("verified", True):
52+
unverified.append(r)
53+
54+
total_checked = len(
55+
[r for r in results if not allowed_sets or r.get("feature_set") in allowed_sets]
56+
)
57+
58+
print(
59+
f"check_report: {total_checked} result(s) checked, "
60+
f"{len(unsupported)} unsupported, {len(unverified)} unverified"
61+
)
62+
63+
for r in unsupported[:5]:
64+
print(
65+
f" UNSUPPORTED: {r.get('name')} @ {r.get('resolution')} "
66+
f"({r.get('mode')}) — {r.get('skip_reason', 'no reason')}"
67+
)
68+
if len(unsupported) > 5:
69+
print(f" ... and {len(unsupported) - 5} more unsupported")
70+
71+
for r in unverified[:5]:
72+
print(
73+
f" UNVERIFIED: {r.get('name')} @ {r.get('resolution')} "
74+
f"({r.get('mode')}) — {r.get('skip_reason', 'output verification failed')}"
75+
)
76+
if len(unverified) > 5:
77+
print(f" ... and {len(unverified) - 5} more unverified")
78+
79+
if (unsupported or unverified) and not args.warn_only:
80+
sys.exit(1)
81+
82+
83+
if __name__ == "__main__":
84+
main()

0 commit comments

Comments
 (0)