Audit: actionable gaps/issues in benchmark execution and reporting

# Audit: actionable gaps/issues in benchmark execution and reporting

## Summary

The `openvx-mark` suite is solid and well-engineered, but an audit surfaced several concrete gaps where correctness, fairness, or transparency can be improved. The most impactful are: (1) verification failures do not abort measurement, (2) many `verify_fn` checks are weak constant-input sanity tests, (3) immediate-mode coverage is incomplete, and (4) statistics/reporting hide important caveats from readers.

I’ve grouped findings by area and tried to make each actionable with a recommended fix.

---

## 1. Verification runs after warmup but before measurement — failures still burn measurement time

**Location:** `src/benchmark_runner.cpp` (`runGraphMode` / `runImmediateMode`)

**Current behavior:** `bc.verify_fn(ctx)` is called after the warmup loop and before the timed measurement loop. If verification fails, the runner sets `result.verified = false` and then proceeds to run the full measurement block anyway.

**Problem:**
- Verify failures still consume full measurement time.
- Reported timing numbers may be for garbage outputs.
- The `VERIFY FAILED` line is visible but not a hard gate.

**Recommended fix:** Move verification to a hard gate before measurement. For graph mode this should happen right after `vxVerifyGraph()` and one representative execution; for immediate mode, after one representative execution. If verification fails, set `supported = false` or add a dedicated `verify_failed` terminal state and skip the timed loop.

---

## 2. Many `verify_fn` implementations are weak correctness signals

**Location:** `opencv-mark/src/benchmarks/*.cpp`, several `src/benchmarks/node_*.cpp` cases

**Current behavior:** Many verification functions use constant-input sanity checks (e.g., all-100 input → still 100 after a normalized filter) or identity transforms. These catch crashes but miss real numerical or semantic bugs.

**Problem:** A conformant-but-wrong implementation can pass most of these checks.

**Recommended fix:** Promote a subset of benchmarks to golden-output verification:
- Dump expected outputs from a reference implementation (Khronos sample / rustVX) into `test/golden/`.
- Compare actual outputs against golden data with PSNR / max-abs-diff tolerances.
- Re-use the existing `--dump-outputs-dir` and `scripts/cross_verify_outputs.py` plumbing, but make it part of the per-run verification path, not an optional post-hoc script.

---

## 3. `immediate_func` coverage is incomplete

**Location:** `src/benchmarks/node_filters.cpp`, `node_color.cpp`, `node_extraction.cpp`, `node_tensor.cpp`, `node_statistical.cpp`, `node_pixelwise.cpp`, `pipeline_vision.cpp`, `pipeline_feature.cpp`, etc.

**Current behavior:** Most `BenchmarkCase` registrations leave `immediate_func = nullptr`. Only `src/benchmarks/immediate_benchmarks.cpp` registers immediate-mode functions.

**Problem:** When running `--mode both`, only a tiny fraction of kernels actually exercise the immediate API, so graph-vs-immediate comparisons are lopsided.

**Recommended fix:** Add immediate-mode lambdas to each `BenchmarkCase`, using the `vxu*` immediate API (`vxuBox3x3`, `vxuGaussian3x3`, etc.) with the same parameters and resource tracking as the graph setup.

---

## 4. Statistics silently remove outliers before computing mean/median/CV

**Location:** `src/benchmark_stats.cpp`

**Current behavior:** `BenchmarkStats::compute()` removes outliers via IQR and then uses the cleaned set for mean, median, stddev, and CV. p5/p95/p99 are computed from the original sorted data, which is good, but the headline numbers come from the cleaned set.

**Problem:** Default outlier removal can hide real jitter and make the reported CV artificially low. Readers may not realize the primary metrics have been filtered.

**Recommended fix:**
- Option A: Report two stat blocks (raw vs. cleaned) and use raw samples for the headline mean/median.
- Option B: Add a CLI flag `--no-outlier-removal` and document the cleaning step in generated reports.
- Option C: Only use outlier removal for the stability-warning heuristic, not for reported timing metrics.

---

## 5. Framework Score aggregates ratios of averages in a potentially misleading way

**Location:** `src/benchmark_report.cpp`, `scripts/compare_reports.py`

**Current behavior:** Framework benchmarks compute per-scenario ratios (e.g., `vx_perf.avg` / wall-clock median), then geomean those ratios into the Framework Score.

**Problem:** The geomean of ratios is not equivalent to the ratio of geomeans. Scenarios with very different sample counts or scales are weighted equally by ratio, which can distort the composite.

**Recommended fix:** Compute the Framework Score from aggregated raw timing totals where possible, or add a confidence interval / stderr to the score so readers can judge its reliability.

---

## 6. Stability gating is off by default and warnings do not affect scores

**Location:** `include/benchmark_config.h`, `src/benchmark_runner.cpp`

**Current behavior:** `stability_threshold = 15.0` but `max_retries = 0` by default. Even when `stability_warning` is true, the result is still fed into composite scores.

**Problem:** High-CV results are flagged but still influence the final score. A single unstable kernel can distort the headline number.

**Recommended fix:**
- Raise `max_retries` default to at least 1.
- Add `--exclude-unstable-from-score` and exclude `stability_warning` / `retry_count > 0` results from composite scores by default.
- Make stability warnings visually prominent in Markdown/JSON reports.

---

## 7. `--threads 1` default may surprise users expecting peak performance

**Location:** `include/benchmark_config.h`, `src/main.cpp`

**Current behavior:** `threads = 1` by default for apples-to-apples comparison.

**Problem:** Users running the tool without reading the README may interpret single-threaded numbers as peak implementation performance.

**Recommended fix:** Print a one-line summary at startup: “Threading pinned to 1 for cross-impl parity; use `--threads 0` for default library behavior.” Also update CLI help to flag this more prominently.

---

## 8. OpenCV-vs-OpenVX comparison framing is ambiguous

**Location:** `scripts/compare_reports.py`, `README.md`

**Current behavior:** Reports compare an OpenVX graph execution against sequential OpenCV per-kernel calls.

**Problem:** The comparison is valid for “replace my OpenCV code with an OpenVX graph” but not for “OpenVX graph framework vs. an equivalent OpenCV graph framework.” Readers may misinterpret the numbers.

**Recommended fix:** Add an interpretation note in every generated comparison report, e.g.:
> “This compares OpenVX graph-mode execution against sequential OpenCV function calls. It measures the benefit of adopting an OpenVX graph, not a graph-framework-vs-graph-framework contest.”

Optionally provide an OpenCV pipeline mode (e.g., `cv::G-API` or `cv::UMat` graph) for a fairer framework comparison.

---

## 9. `omp_get_max_threads` detection via `dlsym(RTLD_DEFAULT)` is fragile

**Location:** `src/bench_runtime.cpp`

**Current behavior:** OpenMP detection uses `dlsym(RTLD_DEFAULT, "omp_get_max_threads")`.

**Problem:** In static builds or when OpenMP symbols are hidden, this returns 0 even if OpenMP is present.

**Recommended fix:** Document the limitation, and/or add a compile-time `__has_include(<omp.h>)` fallback for static builds.

---

## 10. CI does not exercise the in-process verification path

**Location:** `.github/workflows/ci.yml`

**Current behavior:** CI runs split halves and merges reports; cross-impl output verification is done by `scripts/cross_verify_outputs.py` on dumped outputs.

**Problem:** The per-run `verify_fn` success/failure path is not explicitly checked in CI.

**Recommended fix:** Add a CI step that runs a small `--kernels` subset and asserts `verified == true` and `supported == true` for every case in the JSON report.

---

## 11. Impl-specific numerical issues are handled by workarounds, not root-cause fixes

**Location:** `CHANGELOG.md`, `src/benchmarks/node_multiscale.cpp`, tensor benchmarks

**Known issues:**
- Khronos sample tensor kernels crash.
- S16 `LaplacianPyramid` has error-code differences.
- `MatchTemplate` INT16 saturation differs across implementations.

**Recommended fix:** Track root-cause fixes separately:
- Conformance issue for Khronos tensor implementation.
- MIVisionX behavior for S16 Laplacian pyramid error codes.
- Clarify/specify INT16 saturation semantics for MatchTemplate.

---

## 12. `test_data_generator.cpp` uses the same seed for every benchmark

**Location:** `src/test_data_generator.cpp`, `src/benchmark_runner.cpp`

**Current behavior:** Every case uses `config_.seed` (default 42) with no per-case mixing.

**Problem:** Different kernels receive identical random inputs, which can accidentally hide correlation bugs.

**Recommended fix:** Derive a per-case seed, e.g.:
```cpp
uint64_t case_seed = config_.seed + std::hash<std::string>{}(bc.name + "_" + res.name);
TestDataGenerator gen(case_seed);
```

---

## 13. `ResourceTracker` silently ignores release errors

**Location:** `include/resource_tracker.h` / `src/resource_tracker.cpp`

**Current behavior:** Objects are released in destructor order without checking return codes.

**Problem:** A release failure during teardown is invisible.

**Recommended fix:** Add debug-only assertions or verbose logging when `vxRelease*` calls return non-success.

---

## 14. “Vision 42/42” means benchmark coverage, not CTS conformance

**Location:** `README.md`, commit messages

**Current behavior:** The project advertises “Vision 42/42.”

**Problem:** This is sometimes read as “passes CTS for 42 kernels,” but it means the benchmark registers a case for each conformance-profile kernel.

**Recommended fix:** Clarify wording: “42/42 Vision-profile benchmark coverage” vs. “CTS-conformant implementation.”

---

## 15. `opencv-mark` still has 4 unmatchable vision variants

**Location:** `opencv-mark/src/benchmarks/*.cpp`

**Current behavior:** `ColorConvert_RGB2NV12`, `ChannelExtract_NV12_Y`, `ChannelExtract_IYUV_U`, and `TableLookup_S16` have no OpenCV equivalent and are silently skipped in comparison reports.

**Problem:** Readers may assume OpenVX won by default in those rows.

**Recommended fix:** Add an explicit “uncomparable by spec/library limitation” section in comparison reports.

---

## 16. No built-in regression baseline comparison

**Location:** `scripts/compare_reports.py`

**Current behavior:** Reports are standalone; there is no way to compare against a stored baseline.

**Recommended fix:** Add a `--baseline <json>` flag and a regression summary section in the Markdown output.

---

## 17. `vx_perf.median_ns` is actually `avg`

**Location:** `src/benchmark_runner.cpp`

**Current behavior:** `result.vx_perf.median_ns = static_cast<double>(perf.avg)` because `vx_perf_t` has no median.

**Problem:** Wall-clock median vs. `vx_perf` comparisons are asymmetric and may mislead.

**Recommended fix:** Document the approximation explicitly in reports, or rename the field to avoid calling it a median.

---

## Top 5 recommended fixes (in order of impact)

1. **Stop timing after verify failure** — make `verify_fn` a hard pre-measurement gate.
2. **Strengthen verification** with golden outputs / PSNR instead of constant-input checks.
3. **Complete immediate-mode coverage** across all `node_*.cpp` files.
4. **Be transparent about outlier removal and stats defaults** in generated reports.
5. **Clarify OpenCV comparison framing** in every generated comparison report.

I’m happy to help implement any of these if there’s interest.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Audit: actionable gaps/issues in benchmark execution and reporting #23

Audit: actionable gaps/issues in benchmark execution and reporting

Summary

1. Verification runs after warmup but before measurement — failures still burn measurement time

2. Many `verify_fn` implementations are weak correctness signals

3. `immediate_func` coverage is incomplete

4. Statistics silently remove outliers before computing mean/median/CV

5. Framework Score aggregates ratios of averages in a potentially misleading way

6. Stability gating is off by default and warnings do not affect scores

7. `--threads 1` default may surprise users expecting peak performance

8. OpenCV-vs-OpenVX comparison framing is ambiguous

9. `omp_get_max_threads` detection via `dlsym(RTLD_DEFAULT)` is fragile

10. CI does not exercise the in-process verification path

11. Impl-specific numerical issues are handled by workarounds, not root-cause fixes

12. `test_data_generator.cpp` uses the same seed for every benchmark

13. `ResourceTracker` silently ignores release errors

14. “Vision 42/42” means benchmark coverage, not CTS conformance

15. `opencv-mark` still has 4 unmatchable vision variants

16. No built-in regression baseline comparison

17. `vx_perf.median_ns` is actually `avg`

Top 5 recommended fixes (in order of impact)

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Audit: actionable gaps/issues in benchmark execution and reporting #23

Description

Audit: actionable gaps/issues in benchmark execution and reporting

Summary

1. Verification runs after warmup but before measurement — failures still burn measurement time

2. Many verify_fn implementations are weak correctness signals

3. immediate_func coverage is incomplete

4. Statistics silently remove outliers before computing mean/median/CV

5. Framework Score aggregates ratios of averages in a potentially misleading way

6. Stability gating is off by default and warnings do not affect scores

7. --threads 1 default may surprise users expecting peak performance

8. OpenCV-vs-OpenVX comparison framing is ambiguous

9. omp_get_max_threads detection via dlsym(RTLD_DEFAULT) is fragile

10. CI does not exercise the in-process verification path

11. Impl-specific numerical issues are handled by workarounds, not root-cause fixes

12. test_data_generator.cpp uses the same seed for every benchmark

13. ResourceTracker silently ignores release errors

14. “Vision 42/42” means benchmark coverage, not CTS conformance

15. opencv-mark still has 4 unmatchable vision variants

16. No built-in regression baseline comparison

17. vx_perf.median_ns is actually avg

Top 5 recommended fixes (in order of impact)

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

2. Many `verify_fn` implementations are weak correctness signals

3. `immediate_func` coverage is incomplete

7. `--threads 1` default may surprise users expecting peak performance

9. `omp_get_max_threads` detection via `dlsym(RTLD_DEFAULT)` is fragile

12. `test_data_generator.cpp` uses the same seed for every benchmark

13. `ResourceTracker` silently ignores release errors

15. `opencv-mark` still has 4 unmatchable vision variants

17. `vx_perf.median_ns` is actually `avg`