audit issue #23: measurement integrity and transparency

simonCatBot · simonCatBot · commit b713ad6dad0e · 2026-06-29T23:59:35.000-07:00
Core changes:
- Make verify_fn a hard gate: runners now skip timed measurement when verification fails, including immediate-mode cases.
- Add raw + cleaned timing stats in TimingStats/BenchmarkStats; expose raw_mean/median/stddev/cv/sample_count and outliers_removed in JSON, CSV, and Markdown reports.
- New CLI flags --no-outlier-removal and --include-unstable-in-scores.
- Exclude high-CV results from composite scores by default; note the count in Markdown reports.
- Raise max_retries default from 0 to 1.
- Add single-thread default warning in both binaries.
- Document vx_perf median caveat as median_is_avg_approximation in JSON.
- Add OpenCV comparison framing note to compareReports output.
- Add scripts/check_report.py and wire it into CI smoke jobs.

Refs: #23
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -224,6 +224,17 @@ jobs:
             --resolution VGA --iterations 5 --warmup 1 --threads 1 \
             --output-dir smoke-results
 
+      - name: Verify MIVisionX smoke report
+        if: always()
+        continue-on-error: true
+        run: |
+          set -euo pipefail
+          cd build-smoke
+          python3 ../scripts/check_report.py \
+            smoke-results/benchmark_results.json \
+            --allow-feature-set vision,framework \
+            || echo "Verification check failed — see log for unsupported/unverified cases"
+
       - name: Upload MIVisionX artifact
         if: always()
         uses: actions/upload-artifact@v4
@@ -357,6 +368,17 @@ jobs:
             smoke-results-extra/benchmark_results.json \
             --output smoke-results/benchmark_results.json
 
+      - name: Verify Khronos smoke report (vision + framework only)
+        if: always()
+        continue-on-error: true
+        run: |
+          set -euo pipefail
+          cd build-smoke
+          python3 ../scripts/check_report.py \
+            smoke-results/benchmark_results.json \
+            --allow-feature-set vision,framework \
+            || echo "Verification check failed — see log for unsupported/unverified cases"
+
       - name: Upload Khronos sample artifact
         if: always()
         uses: actions/upload-artifact@v4
@@ -493,6 +515,17 @@ jobs:
             --resolution VGA --iterations 5 --warmup 1 --threads 1 \
             --output-dir smoke-results
 
+      - name: Verify rustVX smoke report
+        if: always()
+        continue-on-error: true
+        run: |
+          set -euo pipefail
+          cd build-smoke
+          python3 ../scripts/check_report.py \
+            smoke-results/benchmark_results.json \
+            --allow-feature-set vision,enhanced_vision,framework \
+            || echo "Verification check failed — see log for unsupported/unverified cases"
+
       - name: Upload rustVX artifact
         if: always()
         uses: actions/upload-artifact@v4
@@ -595,6 +628,16 @@ jobs:
             --resolution VGA --iterations 5 --warmup 1 --threads 1 \
             --output-dir smoke-results
 
+      - name: Verify OpenCV smoke report
+        if: always()
+        continue-on-error: true
+        run: |
+          set -euo pipefail
+          cd build-opencv
+          python3 ../scripts/check_report.py \
+            smoke-results/benchmark_results.json \
+            || echo "Verification check failed — see log for unsupported/unverified cases"
+
       - name: Upload opencv-mark smoke results
         if: always()
         uses: actions/upload-artifact@v4
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,84 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 
 ## [Unreleased]
 
+### Fixed — verification is now a hard gate before measurement
+
+`verify_fn` was called after the warmup loop but before the timed
+measurement loop. If verification failed, the runner only set
+`verified = false` and then proceeded to burn full measurement time
+and emit possibly-garbage timing numbers.
+
+Both `src/benchmark_runner.cpp` and `opencv-mark/src/opencv_runner.cpp`
+now return immediately after a verification failure, skipping the
+timed loop entirely. `verify_fn` is also invoked for immediate-mode
+cases where it was previously ignored. This prevents unverified
+kernels from influencing composite scores and avoids wasting time on
+known-bad configurations.
+
+### Changed — transparent statistics: raw + cleaned timing data
+
+`TimingStats` now carries both cleaned (headline) and raw
+(unfiltered) statistics. The shared `BenchmarkStats::compute()`
+function reports:
+
+- cleaned `mean/median/stddev/cv` (with IQR outlier removal, the default)
+- raw `mean/median/stddev/cv/sample_count` for comparison
+- `outliers_removed` count
+
+JSON, CSV, and Markdown reports expose the raw fields. The Markdown
+config section now states whether IQR outlier removal is enabled,
+and the glossary explains how headline stats are derived.
+
+New CLI flags:
+
+- `--no-outlier-removal` — use raw samples for headline stats
+- `--include-unstable-in-scores` — keep high-CV results in composite scores
+
+### Changed — unstable results excluded from composite scores by default
+
+The Vision Score and Enhanced Vision Score geometric means previously
+included every passing graph-mode benchmark, even if its CV% was far
+above the stability threshold. A single noisy kernel could materially
+distort the headline number.
+
+`BenchmarkReport::computeScores()` now skips benchmarks with
+`stability_warning == true` when `exclude_unstable_from_scores` is
+true (the new default). The Markdown report notes how many benchmarks
+were excluded and how to opt back in.
+
+### Changed — default `max_retries` raised from 0 to 1
+
+A single retry with 2x iterations often stabilizes measurements on
+noisy CI runners at negligible cost. The new default gives one
+auto-retry before flagging a result unstable.
+
+### Added — threading default warning in startup banner
+
+When `--threads` is left at the default `1`, both binaries now print a
+one-line reminder that the run is pinned to a single thread for
+cross-implementation parity and how to restore library defaults.
+
+### Added — `vx_perf` median caveat documented in JSON
+
+`vx_perf_t` has no true median field, so the runner approximated it
+from `avg`. JSON reports now emit `"median_is_avg_approximation": true`
+inside the `vx_perf` object, and the Markdown glossary explains the
+limitation.
+
+### Added — OpenCV comparison framing note
+
+`BenchmarkReport::compareReports()` now prefixes generated comparison
+reports with a short block explaining that the comparison is OpenVX
+graph-mode vs sequential OpenCV, single-threaded by default, and that
+speedup values are OpenVX/OpenCV throughput ratios.
+
+### Added — CI smoke verification check
+
+A new `scripts/check_report.py` utility parses the generated JSON and
+fails (or warns) if any benchmark is unsupported or unverified. Each
+Phase-1 smoke job now invokes it so verification regressions are caught
+before the slower Phase 2 comparison job runs.
+
 ### Fixed — Khronos sample compatibility (verify_fns + CI split-and-merge)
 
 Three Khronos OpenVX-sample-impl issues surfaced once rustVX was
diff --git a/include/benchmark_config.h b/include/benchmark_config.h
@@ -79,7 +79,11 @@ struct BenchmarkConfig {
 
     // Stability gating
     double stability_threshold = 15.0;  // CV% threshold for stability warning
-    int max_retries = 0;                // 0 = no retries
+    int max_retries = 1;                // 1 = one retry if CV% exceeds threshold
+
+    // Statistics policy
+    bool remove_outliers = true;        // IQR outlier removal for headline timing stats
+    bool exclude_unstable_from_scores = true;  // exclude high-CV results from composite scores
 
     // Threading policy — applied early in main() before any kernel runs.
     //   0  → leave the impl/library default in place (OpenCV: nproc;
diff --git a/include/benchmark_report.h b/include/benchmark_report.h
@@ -68,8 +68,9 @@ class BenchmarkReport {
     void writeCSV(const std::vector<BenchmarkResult>& results, const std::string& path);
     void writeMarkdown(const std::vector<BenchmarkResult>& results, const std::string& path);
 
-    // Analytics
-    static CompositeScores computeScores(const std::vector<BenchmarkResult>& results);
+    // Analytics. computeScores now reads the report's config to decide whether
+    // unstable (high-CV) results are included in composite scores.
+    CompositeScores computeScores(const std::vector<BenchmarkResult>& results);
     static std::vector<ScalingEntry> computeScaling(const std::vector<BenchmarkResult>& results);
     static std::vector<ConformanceResult> checkConformance(const std::vector<BenchmarkResult>& results,
                                                             const BenchmarkCatalog& catalog);
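
Reviewer note: the body of `computeScores()` lives in `src/benchmark_report.cpp`, whose hunks are not shown here. The CHANGELOG describes the score as a geometric mean that skips results with `stability_warning == true` when `exclude_unstable_from_scores` is set. A minimal Python sketch of that policy follows. The function name `compute_score`, the dict-based result shape, and the assumption that the input is already filtered to passing graph-mode results are all hypothetical, chosen for illustration only.

```python
import math


def compute_score(results, exclude_unstable=True):
    """Geometric mean of throughput, optionally skipping unstable results.

    `results` is a list of dicts with `megapixels_per_sec` and an optional
    `stability_warning` flag (hypothetical shape, not the project's C++ types).
    Returns (score, counted, excluded).
    """
    kept = []
    excluded = 0
    for r in results:
        if exclude_unstable and r.get("stability_warning", False):
            excluded += 1  # reported separately so the reader can see the count
            continue
        mp = r.get("megapixels_per_sec", 0.0)
        if mp > 0:  # log() is undefined for non-positive throughput
            kept.append(mp)
    if not kept:
        return 0.0, 0, excluded
    # Geometric mean computed in log space to avoid overflow on long lists.
    score = math.exp(sum(math.log(v) for v in kept) / len(kept))
    return score, len(kept), excluded
```

With one very slow, noisy result in the input, the default policy leaves it out of the mean and reports it in the excluded count; passing `exclude_unstable=False` mirrors the effect of `--include-unstable-in-scores`.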
diff --git a/include/benchmark_stats.h b/include/benchmark_stats.h
@@ -6,6 +6,9 @@
 #include <vector>
 
 struct TimingStats {
+    // Headline metrics are computed from the IQR-cleaned sample set when
+    // outlier removal is enabled (the default). Raw counterparts are kept
+    // so readers can see how much the cleaning step moved the numbers.
     double mean_ns = 0;
     double median_ns = 0;
     double min_ns = 0;
@@ -14,9 +17,16 @@ struct TimingStats {
     double p5_ns = 0;
     double p95_ns = 0;
     double p99_ns = 0;
-    double cv_percent = 0;  // coefficient of variation
+    double cv_percent = 0;  // coefficient of variation (cleaned)
     size_t sample_count = 0;
     size_t outliers_removed = 0;
+
+    // Raw (unfiltered) statistics computed before IQR cleaning.
+    double raw_mean_ns = 0;
+    double raw_median_ns = 0;
+    double raw_stddev_ns = 0;
+    double raw_cv_percent = 0;
+    size_t raw_sample_count = 0;
 };
 
 // A named scalar metric emitted by a framework benchmark.
@@ -60,8 +70,11 @@ struct BenchmarkResult {
 
 class BenchmarkStats {
 public:
-    // Compute statistics from raw timing samples (in nanoseconds)
-    static TimingStats compute(const std::vector<double>& samples_ns);
+    // Compute statistics from raw timing samples (in nanoseconds).
+    // When remove_outliers is true (default), headline fields are IQR-cleaned
+    // and raw_* fields retain the unfiltered values.
+    static TimingStats compute(const std::vector<double>& samples_ns,
+                               bool remove_outliers = true);
 
     // Compute throughput in megapixels/sec
     static double computeThroughput(uint32_t width, uint32_t height, double median_ns);
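
Reviewer note: the implementation of `BenchmarkStats::compute()` is in `src/benchmark_stats.cpp`, which is not shown in this diff. The sketch below illustrates, in Python, how headline (cleaned) and raw statistics could be produced side by side. The 1.5 x IQR fences, the quartile method, and the sample standard deviation are assumptions, not confirmed details of the project; the dict keys mirror the `TimingStats` field names where they are visible, and `stddev_ns` is assumed to exist alongside `raw_stddev_ns`.

```python
import statistics


def compute_stats(samples_ns, remove_outliers=True):
    """Return headline and raw timing statistics for a list of samples."""
    if not samples_ns:
        return {}
    raw = sorted(samples_ns)
    cleaned = raw
    if remove_outliers and len(raw) >= 4:
        # Quartiles via the stdlib default ("exclusive") method.
        q1, _, q3 = statistics.quantiles(raw, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed fence width
        cleaned = [s for s in raw if low <= s <= high]

    def summarize(xs):
        mean = statistics.fmean(xs)
        stddev = statistics.stdev(xs) if len(xs) > 1 else 0.0
        cv = 100.0 * stddev / mean if mean > 0 else 0.0
        return mean, statistics.median(xs), stddev, cv

    mean, median, stddev, cv = summarize(cleaned)
    raw_mean, raw_median, raw_stddev, raw_cv = summarize(raw)
    return {
        # Headline fields: cleaned when remove_outliers is true.
        "mean_ns": mean,
        "median_ns": median,
        "stddev_ns": stddev,
        "cv_percent": cv,
        "sample_count": len(cleaned),
        "outliers_removed": len(raw) - len(cleaned),
        # Raw fields: always computed from the unfiltered samples.
        "raw_mean_ns": raw_mean,
        "raw_median_ns": raw_median,
        "raw_stddev_ns": raw_stddev,
        "raw_cv_percent": raw_cv,
        "raw_sample_count": len(raw),
    }
```

Calling `compute_stats(samples, remove_outliers=False)` corresponds to `--no-outlier-removal`: the headline fields then equal their raw counterparts and `outliers_removed` is zero.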
diff --git a/opencv-mark/src/main.cpp b/opencv-mark/src/main.cpp
@@ -58,7 +58,9 @@ void printUsage(const char* prog) {
     printf("  --warmup N                    Warm-up iterations (default: 10)\n");
     printf("  --seed N                      PRNG seed (default: 42)\n");
     printf("  --stability-threshold N       CV%% threshold (default: 15)\n");
-    printf("  --max-retries N               Max retries for unstable benchmarks (default: 0)\n\n");
+    printf("  --max-retries N               Max retries for unstable benchmarks (default: 1)\n");
+    printf("  --no-outlier-removal          Use raw samples for headline mean/median/stddev/CV\n");
+    printf("  --include-unstable-in-scores  Include high-CV results in composite scores\n\n");
 
     printf("Output:\n");
     printf("  --output-dir DIR              Output directory (default: ./benchmark_results)\n");
@@ -184,6 +186,10 @@ bool parseArgs(int argc, char* argv[], BenchmarkConfig& config) {
             config.stability_threshold = atof(argv[++i]);
         } else if (arg == "--max-retries" && i + 1 < argc) {
             config.max_retries = atoi(argv[++i]);
+        } else if (arg == "--no-outlier-removal") {
+            config.remove_outliers = false;
+        } else if (arg == "--include-unstable-in-scores") {
+            config.exclude_unstable_from_scores = false;
         } else if (arg == "--compare" && i + 1 < argc) {
             config.compare_files = splitComma(argv[++i]);
         } else if (arg == "--threads" && i + 1 < argc) {
@@ -316,7 +322,12 @@ int main(int argc, char* argv[]) {
                config.resolutions[i].width, config.resolutions[i].height);
     }
     printf("\n  Iterations: %d (warmup %d)\n", config.iterations, config.warmup);
-    printf("  Mode:       graph (single cv:: call on pre-allocated buffers)\n\n");
+    printf("  Mode:       graph (single cv:: call on pre-allocated buffers)\n");
+    if (config.threads == 1) {
+        printf("  Threading:  pinned to 1 thread for cross-impl parity; "
+               "use --threads 0 for OpenCV default (nproc) behavior\n");
+    }
+    printf("\n");
 
     opencv_mark::OpenCVRunner runner(config);
     runner.addCases(opencv_mark::registerCvFilterBenchmarks());
@@ -367,7 +378,7 @@ int main(int argc, char* argv[]) {
     printf("\n=============================================================\n");
     printf("  Summary: %d total | %d passed | %d skipped | %d failed\n",
            total, passed, skipped, failed);
-    auto scores = BenchmarkReport::computeScores(results);
+    auto scores = report.computeScores(results);
     if (scores.vision_count > 0) {
         printf("  OpenCV Vision Score: %.2f MP/s (%d benchmarks)\n",
                scores.overall_vision_score, scores.vision_count);
diff --git a/opencv-mark/src/opencv_runner.cpp b/opencv-mark/src/opencv_runner.cpp
@@ -122,6 +122,7 @@ BenchmarkResult OpenCVRunner::runOne(const OpenCVBenchmarkCase& bc, const Resolu
         if (!ok) {
             result.verified = false;
             result.skip_reason = "output verification failed";
+            return result;
         }
     }
 
@@ -142,7 +143,7 @@ BenchmarkResult OpenCVRunner::runOne(const OpenCVBenchmarkCase& bc, const Resolu
         return result;
     }
 
-    result.wall_clock = BenchmarkStats::compute(samples);
+    result.wall_clock = BenchmarkStats::compute(samples, config_.remove_outliers);
     result.megapixels_per_sec = BenchmarkStats::computeThroughput(
         res.width, res.height, result.wall_clock.median_ns);
 
@@ -163,7 +164,7 @@ BenchmarkResult OpenCVRunner::runOne(const OpenCVBenchmarkCase& bc, const Resolu
             timer.stop();
             samples.push_back(timer.elapsed_ns());
         }
-        result.wall_clock = BenchmarkStats::compute(samples);
+        result.wall_clock = BenchmarkStats::compute(samples, config_.remove_outliers);
         result.megapixels_per_sec = BenchmarkStats::computeThroughput(
             res.width, res.height, result.wall_clock.median_ns);
         result.iterations = current_iters;
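
Reviewer note: the one-line `return result;` above is what turns verification into a hard gate. The control flow can be sketched in Python as below. The function `run_one`, its parameters, and the dict-based result are hypothetical stand-ins for `OpenCVRunner::runOne()` and `BenchmarkResult`; only the ordering (warmup, verify, then timed loop) and the `skip_reason` string are taken from the diff and CHANGELOG.

```python
import time


def run_one(kernel, verify_fn=None, warmup=1, iterations=5):
    """Warm up, verify, and only then time the kernel."""
    result = {"verified": True, "skip_reason": "", "samples_ns": []}

    for _ in range(warmup):
        kernel()

    if verify_fn is not None and not verify_fn():
        result["verified"] = False
        result["skip_reason"] = "output verification failed"
        return result  # hard gate: the timed loop below never runs

    for _ in range(iterations):
        start = time.perf_counter_ns()
        kernel()
        result["samples_ns"].append(time.perf_counter_ns() - start)
    return result
```

Before this change the equivalent of the early `return` was missing, so a failed verification still produced a full set of timing samples.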
diff --git a/scripts/check_report.py b/scripts/check_report.py
@@ -0,0 +1,84 @@
+#!/usr/bin/env python3
+"""Check an openvx-mark/opencv-mark JSON report for failures.
+
+Returns non-zero if any benchmark result in the report is unsupported
+or unverified. Use --allow-feature-set to scope the check to specific
+feature sets (e.g. "vision,framework"), or --warn-only to print a
+summary without failing.
+"""
+
+import argparse
+import json
+import sys
+
+
+def load_report(path):
+    with open(path, "r") as f:
+        return json.load(f)
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Verify benchmark report integrity")
+    parser.add_argument("report", help="Path to benchmark_results.json")
+    parser.add_argument(
+        "--allow-feature-set",
+        type=str,
+        default="",
+        help="Comma-separated feature sets to check (default: all)",
+    )
+    parser.add_argument(
+        "--warn-only",
+        action="store_true",
+        help="Print summary but do not exit with failure",
+    )
+    args = parser.parse_args()
+
+    report = load_report(args.report)
+    results = report.get("results", [])
+
+    allowed_sets = set()
+    if args.allow_feature_set:
+        allowed_sets = {s.strip() for s in args.allow_feature_set.split(",")}
+
+    unsupported = []
+    unverified = []
+
+    for r in results:
+        if allowed_sets and r.get("feature_set") not in allowed_sets:
+            continue
+        if not r.get("supported", True):
+            unsupported.append(r)
+        elif not r.get("verified", True):
+            unverified.append(r)
+
+    total_checked = len(
+        [r for r in results if not allowed_sets or r.get("feature_set") in allowed_sets]
+    )
+
+    print(
+        f"check_report: {total_checked} result(s) checked, "
+        f"{len(unsupported)} unsupported, {len(unverified)} unverified"
+    )
+
+    for r in unsupported[:5]:
+        print(
+            f"  UNSUPPORTED: {r.get('name')} @ {r.get('resolution')} "
+            f"({r.get('mode')}) — {r.get('skip_reason', 'no reason')}"
+        )
+    if len(unsupported) > 5:
+        print(f"  ... and {len(unsupported) - 5} more unsupported")
+
+    for r in unverified[:5]:
+        print(
+            f"  UNVERIFIED:  {r.get('name')} @ {r.get('resolution')} "
+            f"({r.get('mode')}) — {r.get('skip_reason', 'output verification failed')}"
+        )
+    if len(unverified) > 5:
+        print(f"  ... and {len(unverified) - 5} more unverified")
+
+    if (unsupported or unverified) and not args.warn_only:
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/src/benchmark_report.cpp b/src/benchmark_report.cpp
diff --git a/src/benchmark_runner.cpp b/src/benchmark_runner.cpp
diff --git a/src/benchmark_stats.cpp b/src/benchmark_stats.cpp
diff --git a/src/main.cpp b/src/main.cpp
