Audit: actionable gaps/issues in benchmark execution and reporting
Summary
The openvx-mark suite is solid and well-engineered, but an audit surfaced several concrete gaps where correctness, fairness, or transparency can be improved. The most impactful are: (1) verification failures do not abort measurement, (2) many verify_fn checks are weak constant-input sanity tests, (3) immediate-mode coverage is incomplete, and (4) statistics/reporting hide important caveats from readers.
I’ve grouped findings by area and tried to make each actionable with a recommended fix.
1. Verification runs after warmup but before measurement — failures still burn measurement time
Location: src/benchmark_runner.cpp (runGraphMode / runImmediateMode)
Current behavior: bc.verify_fn(ctx) is called after the warmup loop and before the timed measurement loop. If verification fails, the runner sets result.verified = false and then proceeds to run the full measurement block anyway.
Problem:
- Verify failures still consume full measurement time.
- Reported timing numbers may be for garbage outputs.
- The
VERIFY FAILED line is visible but not a hard gate.
Recommended fix: Move verification to a hard gate before measurement. For graph mode this should happen right after vxVerifyGraph() and one representative execution; for immediate mode, after one representative execution. If verification fails, set supported = false or add a dedicated verify_failed terminal state and skip the timed loop.
2. Many verify_fn implementations are weak correctness signals
Location: opencv-mark/src/benchmarks/*.cpp, several src/benchmarks/node_*.cpp cases
Current behavior: Many verification functions use constant-input sanity checks (e.g., all-100 input → still 100 after a normalized filter) or identity transforms. These catch crashes but miss real numerical or semantic bugs.
Problem: A conformant-but-wrong implementation can pass most of these checks.
Recommended fix: Promote a subset of benchmarks to golden-output verification:
- Dump expected outputs from a reference implementation (Khronos sample / rustVX) into
test/golden/.
- Compare actual outputs against golden data with PSNR / max-abs-diff tolerances.
- Re-use the existing
--dump-outputs-dir and scripts/cross_verify_outputs.py plumbing, but make it part of the per-run verification path, not an optional post-hoc script.
3. immediate_func coverage is incomplete
Location: src/benchmarks/node_filters.cpp, node_color.cpp, node_extraction.cpp, node_tensor.cpp, node_statistical.cpp, node_pixelwise.cpp, pipeline_vision.cpp, pipeline_feature.cpp, etc.
Current behavior: Most BenchmarkCase registrations leave immediate_func = nullptr. Only src/benchmarks/immediate_benchmarks.cpp registers immediate-mode functions.
Problem: When running --mode both, only a tiny fraction of kernels actually exercise the immediate API, so graph-vs-immediate comparisons are lopsided.
Recommended fix: Add immediate-mode lambdas to each BenchmarkCase, using the vxu* immediate API (vxuBox3x3, vxuGaussian3x3, etc.) with the same parameters and resource tracking as the graph setup.
4. Statistics silently remove outliers before computing mean/median/CV
Location: src/benchmark_stats.cpp
Current behavior: BenchmarkStats::compute() removes outliers via IQR and then uses the cleaned set for mean, median, stddev, and CV. p5/p95/p99 are computed from the original sorted data, which is good, but the headline numbers come from the cleaned set.
Problem: Default outlier removal can hide real jitter and make the reported CV artificially low. Readers may not realize the primary metrics have been filtered.
Recommended fix:
- Option A: Report two stat blocks (raw vs. cleaned) and use raw samples for the headline mean/median.
- Option B: Add a CLI flag
--no-outlier-removal and document the cleaning step in generated reports.
- Option C: Only use outlier removal for the stability-warning heuristic, not for reported timing metrics.
5. Framework Score aggregates ratios of averages in a potentially misleading way
Location: src/benchmark_report.cpp, scripts/compare_reports.py
Current behavior: Framework benchmarks compute per-scenario ratios (e.g., vx_perf.avg / wall-clock median), then geomean those ratios into the Framework Score.
Problem: The geomean of ratios is not equivalent to the ratio of geomeans. Scenarios with very different sample counts or scales are weighted equally by ratio, which can distort the composite.
Recommended fix: Compute the Framework Score from aggregated raw timing totals where possible, or add a confidence interval / stderr to the score so readers can judge its reliability.
6. Stability gating is off by default and warnings do not affect scores
Location: include/benchmark_config.h, src/benchmark_runner.cpp
Current behavior: stability_threshold = 15.0 but max_retries = 0 by default. Even when stability_warning is true, the result is still fed into composite scores.
Problem: High-CV results are flagged but still influence the final score. A single unstable kernel can distort the headline number.
Recommended fix:
- Raise
max_retries default to at least 1.
- Add
--exclude-unstable-from-score and exclude stability_warning / retry_count > 0 results from composite scores by default.
- Make stability warnings visually prominent in Markdown/JSON reports.
7. --threads 1 default may surprise users expecting peak performance
Location: include/benchmark_config.h, src/main.cpp
Current behavior: threads = 1 by default for apples-to-apples comparison.
Problem: Users running the tool without reading the README may interpret single-threaded numbers as peak implementation performance.
Recommended fix: Print a one-line summary at startup: “Threading pinned to 1 for cross-impl parity; use --threads 0 for default library behavior.” Also update CLI help to flag this more prominently.
8. OpenCV-vs-OpenVX comparison framing is ambiguous
Location: scripts/compare_reports.py, README.md
Current behavior: Reports compare an OpenVX graph execution against sequential OpenCV per-kernel calls.
Problem: The comparison is valid for “replace my OpenCV code with an OpenVX graph” but not for “OpenVX graph framework vs. an equivalent OpenCV graph framework.” Readers may misinterpret the numbers.
Recommended fix: Add an interpretation note in every generated comparison report, e.g.:
“This compares OpenVX graph-mode execution against sequential OpenCV function calls. It measures the benefit of adopting an OpenVX graph, not a graph-framework-vs-graph-framework contest.”
Optionally provide an OpenCV pipeline mode (e.g., cv::G-API or cv::UMat graph) for a fairer framework comparison.
9. omp_get_max_threads detection via dlsym(RTLD_DEFAULT) is fragile
Location: src/bench_runtime.cpp
Current behavior: OpenMP detection uses dlsym(RTLD_DEFAULT, "omp_get_max_threads").
Problem: In static builds or when OpenMP symbols are hidden, this returns 0 even if OpenMP is present.
Recommended fix: Document the limitation, and/or add a compile-time __has_include(<omp.h>) fallback for static builds.
10. CI does not exercise the in-process verification path
Location: .github/workflows/ci.yml
Current behavior: CI runs split halves and merges reports; cross-impl output verification is done by scripts/cross_verify_outputs.py on dumped outputs.
Problem: The per-run verify_fn success/failure path is not explicitly checked in CI.
Recommended fix: Add a CI step that runs a small --kernels subset and asserts verified == true and supported == true for every case in the JSON report.
11. Impl-specific numerical issues are handled by workarounds, not root-cause fixes
Location: CHANGELOG.md, src/benchmarks/node_multiscale.cpp, tensor benchmarks
Known issues:
- Khronos sample tensor kernels crash.
- S16
LaplacianPyramid has error-code differences.
MatchTemplate INT16 saturation differs across implementations.
Recommended fix: Track root-cause fixes separately:
- Conformance issue for Khronos tensor implementation.
- MIVisionX behavior for S16 Laplacian pyramid error codes.
- Clarify/specify INT16 saturation semantics for MatchTemplate.
12. test_data_generator.cpp uses the same seed for every benchmark
Location: src/test_data_generator.cpp, src/benchmark_runner.cpp
Current behavior: Every case uses config_.seed (default 42) with no per-case mixing.
Problem: Different kernels receive identical random inputs, which can accidentally hide correlation bugs.
Recommended fix: Derive a per-case seed, e.g.:
uint64_t case_seed = config_.seed + std::hash<std::string>{}(bc.name + "_" + res.name);
TestDataGenerator gen(case_seed);
13. ResourceTracker silently ignores release errors
Location: include/resource_tracker.h / src/resource_tracker.cpp
Current behavior: Objects are released in destructor order without checking return codes.
Problem: A release failure during teardown is invisible.
Recommended fix: Add debug-only assertions or verbose logging when vxRelease* calls return non-success.
14. “Vision 42/42” means benchmark coverage, not CTS conformance
Location: README.md, commit messages
Current behavior: The project advertises “Vision 42/42.”
Problem: This is sometimes read as “passes CTS for 42 kernels,” but it means the benchmark registers a case for each conformance-profile kernel.
Recommended fix: Clarify wording: “42/42 Vision-profile benchmark coverage” vs. “CTS-conformant implementation.”
15. opencv-mark still has 4 unmatchable vision variants
Location: opencv-mark/src/benchmarks/*.cpp
Current behavior: ColorConvert_RGB2NV12, ChannelExtract_NV12_Y, ChannelExtract_IYUV_U, and TableLookup_S16 have no OpenCV equivalent and are silently skipped in comparison reports.
Problem: Readers may assume OpenVX won by default in those rows.
Recommended fix: Add an explicit “uncomparable by spec/library limitation” section in comparison reports.
16. No built-in regression baseline comparison
Location: scripts/compare_reports.py
Current behavior: Reports are standalone; there is no way to compare against a stored baseline.
Recommended fix: Add a --baseline <json> flag and a regression summary section in the Markdown output.
17. vx_perf.median_ns is actually avg
Location: src/benchmark_runner.cpp
Current behavior: result.vx_perf.median_ns = static_cast<double>(perf.avg) because vx_perf_t has no median.
Problem: Wall-clock median vs. vx_perf comparisons are asymmetric and may mislead.
Recommended fix: Document the approximation explicitly in reports, or rename the field to avoid calling it a median.
Top 5 recommended fixes (in order of impact)
- Stop timing after verify failure — make
verify_fn a hard pre-measurement gate.
- Strengthen verification with golden outputs / PSNR instead of constant-input checks.
- Complete immediate-mode coverage across all
node_*.cpp files.
- Be transparent about outlier removal and stats defaults in generated reports.
- Clarify OpenCV comparison framing in every generated comparison report.
I’m happy to help implement any of these if there’s interest.
Audit: actionable gaps/issues in benchmark execution and reporting
Summary
The
openvx-marksuite is solid and well-engineered, but an audit surfaced several concrete gaps where correctness, fairness, or transparency can be improved. The most impactful are: (1) verification failures do not abort measurement, (2) manyverify_fnchecks are weak constant-input sanity tests, (3) immediate-mode coverage is incomplete, and (4) statistics/reporting hide important caveats from readers.I’ve grouped findings by area and tried to make each actionable with a recommended fix.
1. Verification runs after warmup but before measurement — failures still burn measurement time
Location:
src/benchmark_runner.cpp(runGraphMode/runImmediateMode)Current behavior:
bc.verify_fn(ctx)is called after the warmup loop and before the timed measurement loop. If verification fails, the runner setsresult.verified = falseand then proceeds to run the full measurement block anyway.Problem:
VERIFY FAILEDline is visible but not a hard gate.Recommended fix: Move verification to a hard gate before measurement. For graph mode this should happen right after
vxVerifyGraph()and one representative execution; for immediate mode, after one representative execution. If verification fails, setsupported = falseor add a dedicatedverify_failedterminal state and skip the timed loop.2. Many
verify_fnimplementations are weak correctness signalsLocation:
opencv-mark/src/benchmarks/*.cpp, severalsrc/benchmarks/node_*.cppcasesCurrent behavior: Many verification functions use constant-input sanity checks (e.g., all-100 input → still 100 after a normalized filter) or identity transforms. These catch crashes but miss real numerical or semantic bugs.
Problem: A conformant-but-wrong implementation can pass most of these checks.
Recommended fix: Promote a subset of benchmarks to golden-output verification:
test/golden/.--dump-outputs-dirandscripts/cross_verify_outputs.pyplumbing, but make it part of the per-run verification path, not an optional post-hoc script.3.
immediate_funccoverage is incompleteLocation:
src/benchmarks/node_filters.cpp,node_color.cpp,node_extraction.cpp,node_tensor.cpp,node_statistical.cpp,node_pixelwise.cpp,pipeline_vision.cpp,pipeline_feature.cpp, etc.Current behavior: Most
BenchmarkCaseregistrations leaveimmediate_func = nullptr. Onlysrc/benchmarks/immediate_benchmarks.cppregisters immediate-mode functions.Problem: When running
--mode both, only a tiny fraction of kernels actually exercise the immediate API, so graph-vs-immediate comparisons are lopsided.Recommended fix: Add immediate-mode lambdas to each
BenchmarkCase, using thevxu*immediate API (vxuBox3x3,vxuGaussian3x3, etc.) with the same parameters and resource tracking as the graph setup.4. Statistics silently remove outliers before computing mean/median/CV
Location:
src/benchmark_stats.cppCurrent behavior:
BenchmarkStats::compute()removes outliers via IQR and then uses the cleaned set for mean, median, stddev, and CV. p5/p95/p99 are computed from the original sorted data, which is good, but the headline numbers come from the cleaned set.Problem: Default outlier removal can hide real jitter and make the reported CV artificially low. Readers may not realize the primary metrics have been filtered.
Recommended fix:
--no-outlier-removaland document the cleaning step in generated reports.5. Framework Score aggregates ratios of averages in a potentially misleading way
Location:
src/benchmark_report.cpp,scripts/compare_reports.pyCurrent behavior: Framework benchmarks compute per-scenario ratios (e.g.,
vx_perf.avg/ wall-clock median), then geomean those ratios into the Framework Score.Problem: The geomean of ratios is not equivalent to the ratio of geomeans. Scenarios with very different sample counts or scales are weighted equally by ratio, which can distort the composite.
Recommended fix: Compute the Framework Score from aggregated raw timing totals where possible, or add a confidence interval / stderr to the score so readers can judge its reliability.
6. Stability gating is off by default and warnings do not affect scores
Location:
include/benchmark_config.h,src/benchmark_runner.cppCurrent behavior:
stability_threshold = 15.0butmax_retries = 0by default. Even whenstability_warningis true, the result is still fed into composite scores.Problem: High-CV results are flagged but still influence the final score. A single unstable kernel can distort the headline number.
Recommended fix:
max_retriesdefault to at least 1.--exclude-unstable-from-scoreand excludestability_warning/retry_count > 0results from composite scores by default.7.
--threads 1default may surprise users expecting peak performanceLocation:
include/benchmark_config.h,src/main.cppCurrent behavior:
threads = 1by default for apples-to-apples comparison.Problem: Users running the tool without reading the README may interpret single-threaded numbers as peak implementation performance.
Recommended fix: Print a one-line summary at startup: “Threading pinned to 1 for cross-impl parity; use
--threads 0for default library behavior.” Also update CLI help to flag this more prominently.8. OpenCV-vs-OpenVX comparison framing is ambiguous
Location:
scripts/compare_reports.py,README.mdCurrent behavior: Reports compare an OpenVX graph execution against sequential OpenCV per-kernel calls.
Problem: The comparison is valid for “replace my OpenCV code with an OpenVX graph” but not for “OpenVX graph framework vs. an equivalent OpenCV graph framework.” Readers may misinterpret the numbers.
Recommended fix: Add an interpretation note in every generated comparison report, e.g.:
Optionally provide an OpenCV pipeline mode (e.g.,
cv::G-APIorcv::UMatgraph) for a fairer framework comparison.9.
omp_get_max_threadsdetection viadlsym(RTLD_DEFAULT)is fragileLocation:
src/bench_runtime.cppCurrent behavior: OpenMP detection uses
dlsym(RTLD_DEFAULT, "omp_get_max_threads").Problem: In static builds or when OpenMP symbols are hidden, this returns 0 even if OpenMP is present.
Recommended fix: Document the limitation, and/or add a compile-time
__has_include(<omp.h>)fallback for static builds.10. CI does not exercise the in-process verification path
Location:
.github/workflows/ci.ymlCurrent behavior: CI runs split halves and merges reports; cross-impl output verification is done by
scripts/cross_verify_outputs.pyon dumped outputs.Problem: The per-run
verify_fnsuccess/failure path is not explicitly checked in CI.Recommended fix: Add a CI step that runs a small
--kernelssubset and assertsverified == trueandsupported == truefor every case in the JSON report.11. Impl-specific numerical issues are handled by workarounds, not root-cause fixes
Location:
CHANGELOG.md,src/benchmarks/node_multiscale.cpp, tensor benchmarksKnown issues:
LaplacianPyramidhas error-code differences.MatchTemplateINT16 saturation differs across implementations.Recommended fix: Track root-cause fixes separately:
12.
test_data_generator.cppuses the same seed for every benchmarkLocation:
src/test_data_generator.cpp,src/benchmark_runner.cppCurrent behavior: Every case uses
config_.seed(default 42) with no per-case mixing.Problem: Different kernels receive identical random inputs, which can accidentally hide correlation bugs.
Recommended fix: Derive a per-case seed, e.g.:
13.
ResourceTrackersilently ignores release errorsLocation:
include/resource_tracker.h/src/resource_tracker.cppCurrent behavior: Objects are released in destructor order without checking return codes.
Problem: A release failure during teardown is invisible.
Recommended fix: Add debug-only assertions or verbose logging when
vxRelease*calls return non-success.14. “Vision 42/42” means benchmark coverage, not CTS conformance
Location:
README.md, commit messagesCurrent behavior: The project advertises “Vision 42/42.”
Problem: This is sometimes read as “passes CTS for 42 kernels,” but it means the benchmark registers a case for each conformance-profile kernel.
Recommended fix: Clarify wording: “42/42 Vision-profile benchmark coverage” vs. “CTS-conformant implementation.”
15.
opencv-markstill has 4 unmatchable vision variantsLocation:
opencv-mark/src/benchmarks/*.cppCurrent behavior:
ColorConvert_RGB2NV12,ChannelExtract_NV12_Y,ChannelExtract_IYUV_U, andTableLookup_S16have no OpenCV equivalent and are silently skipped in comparison reports.Problem: Readers may assume OpenVX won by default in those rows.
Recommended fix: Add an explicit “uncomparable by spec/library limitation” section in comparison reports.
16. No built-in regression baseline comparison
Location:
scripts/compare_reports.pyCurrent behavior: Reports are standalone; there is no way to compare against a stored baseline.
Recommended fix: Add a
--baseline <json>flag and a regression summary section in the Markdown output.17.
vx_perf.median_nsis actuallyavgLocation:
src/benchmark_runner.cppCurrent behavior:
result.vx_perf.median_ns = static_cast<double>(perf.avg)becausevx_perf_thas no median.Problem: Wall-clock median vs.
vx_perfcomparisons are asymmetric and may mislead.Recommended fix: Document the approximation explicitly in reports, or rename the field to avoid calling it a median.
Top 5 recommended fixes (in order of impact)
verify_fna hard pre-measurement gate.node_*.cppfiles.I’m happy to help implement any of these if there’s interest.