All notable changes to openvx-mark are documented here.
The format is based on Keep a Changelog, and this project follows semantic versioning where the major version tracks backward compatibility of the JSON report schema.
This release adds transparency fields to timing reports. Downstream parsers that read CSV by fixed column index or JSON by a strict field list will need updating:
TimingStatsnow exposesraw_mean_ms,raw_median_ms,raw_stddev_ms,raw_cv_percent, andraw_sample_countin both JSON and CSV.- JSON
configgainsremove_outliersandexclude_unstable_from_scores. - JSON
vx_perfgainsmedian_is_avg_approximation. - The CSV header grew by 5 columns (
raw_*) and the skipped-row comma count was adjusted accordingly.
verify_fn was called after the warmup loop but before the timed
measurement loop. If verification failed, the runner only set
verified = false and then proceeded to burn full measurement time
and emit possibly-garbage timing numbers.
Both src/benchmark_runner.cpp and opencv-mark/src/opencv_runner.cpp
now return immediately after a verification failure, skipping the
timed loop entirely. verify_fn is also invoked for immediate-mode
cases where it was previously ignored. This prevents unverified
kernels from influencing composite scores and avoids wasting time on
known-bad configurations.
TimingStats now carries both cleaned (headline) and raw
(unfiltered) statistics. The shared BenchmarkStats::compute()
function reports:
- cleaned
mean/median/stddev/cv(with IQR outlier removal, the default) - raw
mean/median/stddev/cv/sample_countfor comparison outliers_removedcount
JSON, CSV, and Markdown reports expose the raw fields. The Markdown config section now states whether IQR outlier removal is enabled, and the glossary explains how headline stats are derived.
New CLI flags:
--no-outlier-removal— use raw samples for headline stats--include-unstable-in-scores— keep high-CV results in composite scores
The Vision Score and Enhanced Vision Score geometric means previously included every passing graph-mode benchmark, even if its CV% was far above the stability threshold. A single noisy kernel could materially distort the headline number.
BenchmarkReport::computeScores() now skips benchmarks with
stability_warning == true when exclude_unstable_from_scores is
true (the new default). The Markdown report notes how many benchmarks
were excluded and how to opt back in.
A single retry with 2x iterations often stabilizes measurements on noisy CI runners at negligible cost. The new default gives one auto-retry before flagging a result unstable.
When --threads is left at the default 1, both binaries now print a
one-line reminder that the run is pinned to a single thread for
cross-implementation parity and how to restore library defaults.
vx_perf_t has no true median field, so the runner approximated it
from avg. JSON reports now emit "median_is_avg_approximation": true
inside the vx_perf object, and the Markdown glossary explains the
limitation.
BenchmarkReport::compareReports() now prefixes generated comparison
reports with a short block explaining that the comparison is OpenVX
graph-mode vs sequential OpenCV, single-threaded by default, and that
speedup values are OpenVX/OpenCV throughput ratios.
A new scripts/check_report.py utility parses the generated JSON and
fails if any benchmark in the checked scope is unsupported or
unverified. Each Phase-1 smoke job now invokes it so verification
regressions are caught before the slower Phase 2 comparison job runs.
Use --warn-only for a non-failing audit or --allow-feature-set to
limit the checked scope.
opencv-mark/src/benchmarks/cv_feature.cpp registered two vision
benchmarks with bc.name = "OpticalFlowPyrLK": one using
DEFAULT_OPTFLOW_POINTS (1000) and one using a hard-coded 100 points.
Both were emitted under the same name, which caused the Vision Score
geometric mean to double-count this kernel and produced two CSV/JSON
rows per resolution where every other kernel produced one.
Removed the 100-point variant so that only the DEFAULT_OPTFLOW_POINTS
configuration participates in --vision-parity. This restores a
single OpticalFlowPyrLK row per resolution and removes the weight
bias from aggregate scores.
Three Khronos OpenVX-sample-impl issues surfaced once rustVX was fully green:
-
LaplacianPyramid_S16/LaplacianReconstruct_S16showed up asSKIPPED (vxVerifyGraph failed)andVERIFY FAILED. The Khronos sample rejects S16 LaplacianPyramid at vxVerifyGraph withVX_ERROR_INVALID_PARAMETERS(-10) — a different error code than AMD MIVisionX'sVX_ERROR_INVALID_FORMAT(-14). The bench's status accept-list already handled -14; added -10 so the standaloneverify_fnpath agrees with the runner's bench-level skip decision on both impls. Both code paths are spec-compliant ways to express "impl gap — S16 input is not supported". -
MatchTemplateshowed up asVERIFY FAILED. The previous L2 test embedded a250-valued bright square in a10-valued dark source, giving a per-pixel diff of 240 — the resulting L2 sum saturates INT16 (256 × 240² / 256 × 256 = 14.7M≫ 32767). The saturation DIRECTION (positive clamp vs negative wraparound) is impl-dependent — the Khronos sample's saturated cells came out negative, so the bench'sargminsearch picked one of those spurious negatives instead of the true match at (24, 24). Switched the bench to a much smaller intensity delta (100vs110, diff 10) so the L2 output stays well under INT16_MAX on every impl. Also replaced the strict argmin search with a structural "match cell value is notably smaller than two far-away corner cells" check — less fragile across impls.
The OpenVX-sample-impl's enhanced_vision tensor kernels are buggy
at runtime — TensorAdd SIGSEGVs inside vxProcessGraph and takes
the entire bench process down, losing JSON output for every kernel
that hadn't run yet (openvx-mark writes its report only at
end-of-run).
Fix in CI: split the Khronos sample bench step into TWO
invocations, each writing to its own output dir, then merge with
the new scripts/merge_reports.py:
vision,framework— rock-solid set; always produces a JSON.enhanced_vision— crash-prone set;|| echo …keeps the step alive when the impl SIGSEGVs.merge_reports.py— silently skips any missing input (the crashed-invocation case), produces a valid merged JSON from whichever invocations survived.
End result: we ALWAYS get vision+framework data from Khronos sample (the data the downstream compare reports rely on), and we get enhanced_vision data on top when the sample impl cooperates. Applied to both Phase 1 smoke and Phase 2 FHD×20 comparison runs.
scripts/merge_reports.py is a new utility — takes N openvx-mark
JSON reports and concatenates their results arrays into one
report with the original schema. Other top-level blocks
(system, openvx, feature_set_availability, conformance,
etc.) are unioned per-key. It's reusable for any future
impl/setup where a single bench invocation can crash mid-run.
HOGFeatures was still failing on rustVX as SKIPPED (vxProcessGraph failed during measurement) after the previous chain fix. Root cause
was different from the empty-tensor hypothesis I went after first —
it's a use-after-free on the vx_hog_t params struct itself.
OpenVX's typed helper vxHOGFeaturesNode(graph, input, magnitudes, bins, params, hog_param_size, features) takes params as a raw
const vx_hog_t* — NOT a vx_scalar or vx_reference. Every impl
stores that pointer verbatim in the node (there's no VX object type
for a C struct, so no refcounted wrapper is possible), and the kernel
dispatch dereferences it at vxProcessGraph time to read the HOG
config.
Our bench was creating vx_hog_t params = {} as a stack local
inside the graph_setup lambda. graph_setup is called ONCE at bench
init (not per iteration), so by the time the runner's vxProcessGraph
loop runs, the params struct is freed stack memory:
- Lenient impls (AMD MIVisionX, when it would export the kernel): happen to never read certain fields and survive on stale memory by accident.
- Strict impls (rustVX in particular) deserialise every field —
cell_width,cell_height, ...,threshold— and read past the freed-stack region, causingvxProcessGraphto return non-success.
Fix: heap-allocate the vx_hog_t via std::shared_ptr<vx_hog_t>
created at bench-definition scope and captured by value in the
graph_setup lambda. The shared_ptr lives as long as the
BenchmarkCase (and therefore as long as the runner), so the
params pointer the node holds is always valid. One allocation per
bench definition, deterministically freed at runner shutdown. The
verify_fn keeps its stack-local vx_hog_t since verify_fn runs
vxVerifyGraph + vxProcessGraph synchronously inside the same call —
no lifetime issue there.
(The previous chained HOGCells → HOGFeatures fix in 4624647
also stays — magnitudes/bins still need to be populated by an
upstream kernel for HOGFeatures to read non-zero data, and the
chain ensures that.)
Follow-up to the CTS-style verify_fn rewrite. Three benchmarks still failed under rustVX even after the CTS-pattern adoption, each for a distinct reason rooted in spec behaviour vs benchmark input design:
-
MatchTemplate:VERIFY FAILED. The previous CTS-style verify usedVX_COMPARE_CCORR_NORMwith a uniform-bright template against a partially-bright source. CCORR_NORM is scale-invariant by construction (normalisation divides out intensity scale), so a uniform template correlates to ~1.0 against ANY uniform image patch — bright OR dark — and the "peak" appears at every uniform cell rather than the embedded-template position. Switched toVX_COMPARE_L2with argmin (sum-of-squared-differences, MIN at the match, saturated to INT16_MAX elsewhere) — every CTS-conformant impl produces a unique minimum at the embedded position. -
HOGFeatures:SKIPPED (vxProcessGraph failed during measurement). The bench graph created the magnitudes/bins tensors but never populated them — lenient impls (AMD AGO) treat unwritten tensors as zero-initialised, but strict-FFI impls (rustVX) hold tensor data in a lazy-allocated map keyed by tensor address, and reading from a never-written tensor returnsVX_ERROR_INVALID_REFERENCEinsideget_tensor_data, which propagates out ofvxProcessGraph. Fix: chainHOGCells → HOGFeaturesin the bench graph so the cells kernel populates magnitudes/bins as a side-effect upstream of the features kernel. ~10% added cost at FHD, and it matches how a real HOG pipeline actually runs (always Cells → Features chained). -
HoughLinesP:SKIPPED (vxProcessGraph failed during measurement). The bench input was a sparse grid + diagonals pattern with ~10k non-zero edge pixels at VGA. rustVX's HoughLinesP impl uses a probabilistic-line-tracer with an O(N) linear scan over the points vector at every traced pixel — total cost is O(N² × theta) ≈ 360 billion ops at VGA, overruning realistic CI timeouts, ANDvxAddArrayItemsoverflows our 1024-capacity lines array long before the tracer finishes. Fix: minimal-pattern input (1 horizontal + 1 vertical line intersecting at image center, edge count = W + H = ~1120 at VGA, ~3000 at FHD) and bumped the lines array capacity to 8192. Still exercises every code path (accumulator build, peak detection, line tracing) but at a tractable scale, and the verify_fn is unchanged (its mini 64×64 input was already minimal).
Eight benchmark verify_fns have been rewritten to follow the
testing patterns used by the official OpenVX Conformance Test Suite
(OpenVX-cts/test_conformance/test_*.c). The previous approach
either pinned exact output values that only held under one impl's
internal fixed-point convention (causing VERIFY FAILED on
spec-conformant impls with different conventions, like rustVX), or
collapsed verification to a status-only smoke check (which doesn't
catch a kernel that returns SUCCESS but produces garbage).
The new pattern matches CTS: each verify_fn picks an input explicitly designed so the observable property under test is identical under every spec-compliant interpretation, then verifies that property:
- Tensor kernels (
TensorMul,TensorMatMul,TensorConvertDepth): use inputs where the output is invariant to fixed-point convention (Q7.8 vs raw int16) and scale interpretation (multiplier vs divisor).a × 0 = 0,A · 0 = 0,convert(0, offset=0) = 0— all hold under every spec-compliant variant. We then pinoutput == 0cells. TensorTranspose: transpose is pure-data-movement (no arithmetic, no rounding) so the swap is byte-exact. We pin two cells: a corner that doesn't move (out[0,0] == in[0,0]) and one that does (out[0,1] == in[1,0]).MatchTemplate: modelled directly ontest_matchtemplate.c::testGraphProcessing— embed a known template at a known location in the source, run the kernel, argmax the correlation map, verify the peak is at the embedded position ±1 pixel. The peak location is impl-independent (correlation is maximised where patterns align) even though the absolute correlation values depend on the impl's fixed-point scaling.HOGFeatures: modelled ontest_hog.c— feed a gradient ramp (pixel = (3x + 5y) mod 256) which has obvious non-zero gradient everywhere, chainHOGCells → HOGFeatures, assert the descriptor tensor contains at least one non-zero element. Exact descriptor values depend on cell-bin assignment + block-normalisation rounding (impl-defined) but presence-of-non-zero is universal.HoughLinesP: modelled ontest_houghlinesp.c— draw two long straight lines on a binary canvas (1 vertical, 1 horizontal, ≥ 49 pixels each), run the kernel, query the array'sVX_ARRAY_NUMITEMSand assert ≥ 1 line was detected. Exact line count is non-deterministic per OpenVX 1.3.1 §3.27, but presence- of-at-least-one is required by every conformant impl when the input contains obvious straight edges above the threshold.Select: modelled ontest_controlflow.c— exercise onvx_scalarinputs rather thanvx_image. OpenVX 1.3.1 §3.46 requires Select to work for any vx_reference, but only the scalar path is universally fully-implemented in practice (rustVX returns SUCCESS but no-ops on image inputs). cond=true with true=42/false=99 ⇒ pin output == 42.
These changes make the benchmarks simultaneously useful for timing AND meaningful for catching real regressions: a verify failure now means "the kernel did the wrong thing structurally", not "the kernel uses a different fixed-point convention than the test author assumed".
Follow-up to the 7-kernel rustVX fix. After the previous fixes the
SKIPPED rows are resolved, but two tensor benches still showed up as
VERIFY FAILED on rustVX because their verify_fn pinned exact output
values that only hold under "raw int16" multiplication semantics:
TensorMul: verify expected5 * 3 * 1.0 = 15. rustVX uses Q7.8 fixed-point semantics (prod = a * b * scale / 256), yielding⌊15/256⌋ = 0— a legitimate CTS-conformant result.TensorMatMul: verify expectedA·B + 0 = Afor the 2×2 identity-matrix test (output{1, 2, 3, 4}). rustVX's Q7.8 matmul accumulator divides the per-element sum by 256 with round-to-nearest, yielding all zeros — again a legitimate CTS-conformant result.
Both impls are within OpenVX 1.3.1 §3.49/§3.50, which intentionally leave the fixed-point scale convention flexible. The benchmark's job is to measure timing, not to enforce one impl's numerical convention over another's, so verify_fn for both now just checks "graph constructed + executed without an error status" — kernel correctness belongs to the impl's CTS suite. Comments explain the dual-impl behaviour so a future reader doesn't tighten the check back up without considering both sides.
(TensorAdd / TensorSub are not affected — those use raw integer
addition/subtraction on both impls and continue to pin specific
output values.)
User-reported failures on rustVX of seven Enhanced-Vision benchmarks
(MatchTemplate, HOGFeatures, HoughLinesP, TensorTranspose,
TensorConvertDepth, TensorMatMul, Select) traced back to four
distinct root causes in how the openvx-mark tests were constructed:
MatchTemplate(src/benchmarks/node_extraction.cpp): output image was sized(src.width, src.height). Per OpenVX 1.3.1 §3.31, the output is the valid correlation map and MUST be sized(src.width - template.width + 1, src.height - template.height + 1). Lenient impls (AMD AGO) accepted the oversize buffer and zero-filled the invalid border; strict impls (rustVX) hard-rejected withVX_ERROR_INVALID_PARAMETERSatvxVerifyGraph. Fixed: output sized per spec; verify_fn updated to match.
For tensor kernels, OpenVX 1.3.1 defines the typed-helper signature
(vxTensorTransposeNode(graph, input, output, dim1, dim2)) but the
underlying kernel's parameter INDEX order is implementation-defined.
AMD AGO uses [input, output, dim1, dim2]; rustVX uses
[input, dim1, dim2, output]. Our tests that went through
vxGetKernelByEnum + vxSetParameterByIndex assumed AMD's order and
broke on rustVX (which interpreted our output tensor as a dim
scalar and our dim2 scalar as the output tensor).
Fix: bypass the generic path entirely by adding the typed helpers
to include/openvx_optional_apis.h (dlsym soft-resolve, same pattern
as vxHOGCellsNode / vxHOGFeaturesNode / vxTensorMatrixMultiplyNode)
and calling each kernel through its typed helper from now on. Each
impl dispatches through its own param-order convention from the same
C call site.
TensorTranspose: now usesvxTensorTransposeNodevia dlsym.TensorConvertDepth: now usesvxTensorConvertDepthNodevia dlsym. (Also documented the spec ambiguity aroundnormsemantics: AMD treats it as a multiplier per the spec text, rustVX treats it as a divisor per the CTS reference and treats INT16 as Q7.8. Both agree at our canonicalnorm=1.0 offset=0.0smoke values.)
HOGFeatures(src/benchmarks/node_extraction.cpp): the features output tensor was 1D as per OpenVX 1.3.1 §3.24 spec text ("a 1D tensor of size num_windows × features_per_window"). rustVX follows the CTS reference instead and requires an explicit 3D shape[num_windows_w, num_windows_h, feature_dim], rejecting a 1D tensor atvxVerifyGraph. The underlying linear memory layout is identical, so the 3D shape works on every impl that iterates the buffer linearly (rustVX is the only impl that actually implements HOGFeatures; AMD doesn't export it and skips cleanly).
HoughLinesP(src/benchmarks/node_extraction.cpp): the input was a random U8 image. Per OpenVX 1.3.1 §3.27, the input MUST be a binary edge map. A random U8 image has ~99.6% non-zero pixels; rustVX's strict impl treats every non-zero pixel as an edge point and iterates it through ~180 theta bins, yielding ~360M ops per call at FHD — slow enough to look hung. Synthesise a sparse binary edge pattern (axis-aligned grid + 2 diagonals, ~0.1% non-zero density) instead. Still exercises every code path (accumulator build, peak detection, line tracing) at realistic edge density.
Select(src/benchmarks/node_misc.cpp): the verify_fn pinnedresult[0] == 42(the value fromtrue_image). rustVX's Select impl returnsVX_SUCCESSfromvxProcessGraphbut does not populate the output image (the impl branches on reference type and only fully implements the SCALAR path). Pinning a specific output value made aVERIFY FAILEDrow for every rustVX run, even though the graph executed cleanly. Verify_fn now only checks "no error status" — kernel-correctness belongs to the impl's CTS suite, not the perf bench.
(TensorMatMul was already addressed in the prior FFI-hardening
commit, which preallocated and zero-initialised the optional bias
tensor that rustVX dereferences for type queries.)
Addresses 16 review comments grouped into four themes:
The opencv-mark runner contract (include/opencv_runner.h) requires
setup_fn to allocate all buffers and run_fn to do kernel work only,
so OpenCV timings are comparable to the OpenVX graphs that pre-allocate
via vxCreateImage / vxCreateTensor at graph-construct time. Several
benchmarks were violating that contract — each iteration was paying for
cv::Mat::create / std::vector::reserve / cv::HOGDescriptor
construction that should have happened once in setup_fn. Per-impl
timings are now comparable to within timer noise.
GaussianPyramid_ORB(cv_multiscale.cpp): per-levelblurred/downsampledMats now preallocated in shared state.LaplacianPyramid_S16(cv_multiscale.cpp): per-leveldown/up/diffMats preallocated.LaplacianReconstruct+LaplacianReconstruct_S16(cv_multiscale.cpp): per-levelupMat + a sharedzero_residual(sized to the largest level) preallocated.HOGCells(cv_extraction.cpp):cv::HOGDescriptorinstance captured in shared state, constructed once insetup_fn.HOGFeatures(cv_extraction.cpp):cv::HOGDescriptorANDstd::vector<float> descriptorscaptured in shared state.descriptorsis reserved insetup_fnto its final length socompute()'s internalresize()stays inside the reservation.HoughLinesP(cv_extraction.cpp):std::vector<cv::Vec4i> linescaptured in shared state and reserved to 4096.NonMaxSuppression(cv_extraction.cpp):keep_maskMat preallocated; per-iter(input >= input_extra)Mat expression replaced with in-placecv::compare(..., CMP_GE).SobelMagnitudePhase(cv_pipeline_vision.cpp): drivecv::Sobeldirectly intoCV_32Fso the in-loop S16→F32convertToallocations go away;phasescratch preallocated.ThresholdedEdge(cv_pipeline_feature.cpp): same shape asSobelMagnitudePhase— Sobel direct toCV_32F, plus a preallocatedmagf(F32 magnitude) andmagu8(U8 saturated) in shared state.OpticalFlowPyrLK(cv_feature.cpp): per-iteration output vectors (next_pts,status,err) are nowreserve()d toDEFAULT_OPTFLOW_POINTSinsetup_fn. They were already cleared per iteration;reserve()ensures the first per-iterpush_backdoesn't realloc.
cv::HOGDescriptor::compute() slides the configured window across
the full image and produces one descriptor per slide — descriptor
storage grows ~O(w·h). At 4K it's ~800 MB on the OpenCV side and
~420 MB of int16 tensor on the OpenVX side, large enough to OOM
CI runners and to dominate the actual kernel cost with allocator
pressure.
- openvx-mark
HOGFeatures(src/benchmarks/node_extraction.cpp): effective input dims capped at 1024×768 (the classic HOG-pedestrian-detect resolution) — yields a ~36 MBint16feature tensor instead of 420 MB at 4K. - opencv-mark
HOGFeatures(cv_extraction.cpp): same 1024×768 cap applied to keep the floatdescriptorsvector ≤ 80 MB.
The per-window cost is what the benchmark measures, so capping window count doesn't change what the cross-impl comparison answers.
TensorMatMul (src/benchmarks/node_tensor.cpp) was passing a
freshly-created vx_tensor as the bias input and claiming in the
comment it was "zero-filled". OpenVX does not guarantee
freshly-created tensors are zero-initialised — impls are free to
return uninitialised pages for perf. Without an explicit write,
the bias was effectively garbage, which would perturb the matmul
output and break the verify path's cross-impl equivalence check.
Fix: explicit vxCopyTensorPatch(bias, ..., zeros, VX_WRITE_ONLY, ...)
in setup_fn so every impl actually sees zeros in the bias tensor.
Also corrected the surrounding comment: "M² fp16" → "M² int16" to
match the actual VX_TYPE_INT16 storage.
BenchmarkContextdestructor now callsresetLogDedup()(src/benchmark_context.cpp). If the last benchmark of a run ended with the log callback in a "suppressing duplicates" state, the trailing(previous message repeated N more times)line would never be emitted and the user would lose the tail of the driver's diagnostic signal. The destructor flush guarantees the count is always surfaced.compare_three_way.sh --skip-amdno longer breaks the OpenCV run (scripts/compare_three_way.sh). The script was running opencv-mark from$BUILD_AMD/opencv-mark/opencv-markeven when--skip-amdskipped the AMD configure/build entirely, so on a clean checkout--skip-amdfailed with "binary not found". Fix: when--skip-amdis set, build opencv-mark inside the rustVX tree instead (toggle-DOPENVX_MARK_BUILD_OPENCV=ONthere) and run opencv-mark from whichever build dir actually has it.compare_three_way.shnow honoursCARGO_TARGET_DIRfor resolving the rustVX library path.build_rustvx.shalready supports the env var (IDEs / CI caches commonly redirect cargo output to a shared tree); the comparison script was hard-coding$RUSTVX_SRC/target/releaseand would fail with a misleading "library not found" message in those setups. The resolution logic now mirrorsbuild_rustvx.shexactly.
-
HoughLinesPoutput array now usesVX_TYPE_LINE_2D(the spec-mandated type per OpenVX 1.3.1 §3.30) instead of the previousVX_TYPE_RECTANGLE. Lenient impls (Khronos sample, AMD MIVisionX) caught the type mismatch atvxVerifyGraphand skipped cleanly, but strict-FFI impls — notably rustVX — can panic across the FFI boundary on a type-tag mismatch, which manifests as a segfault (a Rust panic crossing into C is undefined behaviour). Using the correct type makes the bench portable on every impl. -
TensorMatMulnow passes a real zero-filled bias tensor forinput3instead ofnullptr. OpenVX 1.3.1 §3.50 saysinput3is "optional", but impls disagree about what "optional" means:- AMD MIVisionX / Khronos sample : accept a NULL tensor handle.
- rustVX (and other strict-FFI impls) : may dereference the handle for type queries inside the FFI binding and segfault on NULL.
A zero-filled bias preserves matmul semantics (
y = A·B + 0 = A·B) while giving every impl a valid handle to query. The added bias-add cost is ≤0.5% of anO(M²·N)matmul atM=N=256— well below the timer-noise floor, so cross-impl numbers stay comparable.
LaplacianPyramid_S16/LaplacianReconstruct_S16are kept, not removed. An earlier commit (e4f734a, since reverted) deleted these two benchmarks under the false assumption that no implementation could support the S16 input path. CI evidence shows otherwise — rustVX runs both at ~10 ms (FHD) — so removing them lost real cross-impl signal. The benchmarks are restored. The cross-impl matrix we now observe in CI is documented inline insrc/benchmarks/node_multiscale.cpp:- rustVX : runs the S16 path to completion (measured timings).
- Khronos sample : runs the S16 path.
- AMD MIVisionX : rejects at
vxVerifyGraphwithVX_ERROR_INVALID_FORMAT(-14). This is an impl gap, not a spec contradiction — the runner records it as a clean SKIP and the benchmark surfaces exactly the kind of cross-vendor difference this suite is designed to expose.
verify_fnof S16 Laplacian variants now also acceptsVX_ERROR_INVALID_FORMATin addition toVX_ERROR_NOT_SUPPORTED, matching what AMD MIVisionX actually returns fromvxVerifyGraph. The runner already handled this at the bench level (any non-VX_SUCCESSverify status →supported=false→ SKIP); this change makes the standalone verify path consistent.
-
[VX LOG]callback now deduplicates consecutive identical messages within a single benchmark. Some drivers (notably AMD MIVisionX/AGO) log the same validate error on every call tovxVerifyGraph/vxProcessGraph— so a single skipped benchmark withwarmup=1 iterations=3would produce 5 identicalstatus=-14: ERROR: agoVerifyGraph: ... ago_kernel_cmd_validate failed (-14)lines, swamping the actual timings.The first occurrence is now always printed verbatim (full signal preserved); subsequent identical
(status, text)pairs within the same benchmark are folded into a(previous message repeated N more times)line emitted at the next non-matching message or at the start of the next benchmark.BenchmarkContext::resetLogDedup()is called at the top ofrunGraphMode/runImmediateModeso each bench is guaranteed at least one verbatim copy of any driver log.
- Registered the missing 42nd Vision-Conformance kernel. The Vision
Conformance Feature Set (vx_khr_feature_sets §3.2.2) lists 42 required
kernels, but the registry only knew about 41 —
LaplacianReconstruct(VX_KERNEL_LAPLACIAN_RECONSTRUCT/vxLaplacianReconstructNode) was missing. Now registered (1.1+ gated) and a benchmark added that exercises a full LaplacianPyramid → LaplacianReconstruct round-trip.vision Conformance:line now reportsPASS (42/42)(was 41/41). - Full audit of every benchmark against OpenVX 1.3.1 §3. Confirmed
no benchmark uses non-conformant input/output formats, parameter
values, or interpolation modes. Documented each per-kernel format
contract inline with
[REQ-####]spec citations. - Separate tests per spec-required input combination. Where a
single kernel has multiple required input/output type combinations
(or multiple required parameter values), each is now exercised as
its own benchmark — the conformance matcher recognises
Kernel_Suffixas coveringKernel, so total kernel coverage stays 42/42 while every spec-required feature per kernel is now measured separately. New cases (19 total) cover:- Pixelwise:
AbsDiff_S16,Add_U8_U8_S16,Add_S16_S16_S16,Subtract_U8_U8_S16,Subtract_S16_S16_S16,Multiply_U8_U8_S16,Multiply_S16_S16_S16(scale=1/255, NEAREST_EVEN per [REQ-0371]) - Color/depth:
ConvertDepth_S16toU8,ColorConvert_RGB2YUV4,ColorConvert_IYUV2RGB,ChannelExtract_NV12_Y,ChannelExtract_IYUV_U,ChannelExtract_YUYV_Y,ChannelCombine_YUV4 - Filters:
NonLinearFilter_Min,NonLinearFilter_Max,CustomConvolution_U8_S16(U8→S16 output path per [REQ-0147]) - Geometric:
ScaleImage_Nearest_Half,ScaleImage_Area_Half,WarpAffine_Nearest,WarpPerspective_Nearest,Remap_Nearest - Multiscale:
LaplacianReconstruct(the missing 42nd kernel),GaussianPyramid_ORB(VX_SCALE_PYRAMID_ORBper [REQ-0189]),LaplacianPyramid_S16,LaplacianReconstruct_S16,HalfScaleGaussian_1x1,HalfScaleGaussian_5x5(kernel_size ∈ {1, 3, 5} per [REQ-0410]) - Statistical:
MinMaxLoc_S16(S16 input per [REQ-0315]) - Misc:
TableLookup_S16(S16 LUT path per [REQ-0422]),Threshold_S16(S16 input per [REQ-0493], 1.3-gated)
- Pixelwise:
- Graceful skip for genuinely unsupported impl paths. S16
Laplacian variants on AMD AGO return
VX_ERROR_NOT_SUPPORTEDfromvxVerifyGraph; we now treat that status the same asVX_SUCCESSfor verify purposes so a missing-feature impl bug is reported as a soft-skip rather than a falsified pass.
- opencv-mark — 1:1 kernel-name parity for both feature sets. All
19 enhanced_vision kernels (per OpenVX 1.3.1 §7.2.2) now have an
OpenCV counterpart, so
compare_reports.pyjoins enhanced kernels too. Two new files (cv_extraction.cpp,cv_tensor.cpp) plus Min/Max/Copy/BilateralFilter/Select/ScalarOperation in existing files.LBPis a manual 3×3 inline impl (no nativecv::LBP);NonMaxSuppressionuses the canonicalcv::dilate-as-local-max trick;ScalarOperationbenchmarks a tight C++ scalar-add loop to match what the OpenVX kernel measures (framework dispatch cost). - opencv-mark — 6 previously-unmeasured openvx-mark enhanced_vision
benchmarks added on both sides so the cross-impl join is symmetric:
HOGCells,HOGFeatures,HoughLinesP,TensorMatMul,BilateralFilter,ScalarOperation. - opencv-mark — CLI accepts
--feature-set enhanced_visionand--feature-set all(was rejecting both withWARNINGin #18). New--skip-pipelinesflag mirrors openvx-mark. - rustVX as a first-class third backend. CMake
find_librarynow also acceptsopenvx_ffi(rustVX's library name), de-duplicating the link list when the openvx/vxu names resolve to the same .so/ .dylib (single-library backend, rustVX case). AMD MIVisionX and Khronos sample continue to auto-detect as before. include/openvx_optional_apis.h(new) — small dlsym shim because AMD MIVisionX declares but does not exportvxBilateralFilterNode,vxScalarOperationNode,vxHOGCellsNode,vxHOGFeaturesNode,vxHoughLinesPNode, andvxTensorMatrixMultiplyNode. Without the shim, linking openvx-mark against MIVisionX after adding these benchmarks was a hardld: symbol not foundfailure.dlsym(RTLD_DEFAULT, …)resolves them at first use; a null function pointer makes the benchmark gracefully report"skipped (kernel not supported by impl)".scripts/build_rustvx.sh(new) — clones (or updates) rustVX, runscargo build --releasewith the SIMD + parallel features that match rustVX upstream CI, honoursCARGO_TARGET_DIR(IDE-style sandbox caches), and creates belt-and-suspenderslibopenvx.{so,dylib}/libvxu.{so,dylib}symlinks for any tool that hard-codes the legacy names.scripts/three_way_summary.py(new) — N-way joined(name, mode, resolution)table. The existingcompare_reports.pyis rich (scores, win/loss, per-category geomean) but pairwise-only; this handles N ≥ 3 with one column-pair per impl and surfaces AMD-N/A rows explicitly.scripts/compare_three_way.sh(new) — end-to-end driver. Builds rustVX, configures + builds openvx-mark twice (once against AMD MIVisionX inbuild/, once against rustVX inbuild-rustvx/), runs each binary + opencv-mark with identical flags, then emits both the N-way summary and three pairwise drill-down reports (AMD-vs-rustVX, AMD-vs-OpenCV, rustVX-vs-OpenCV).
A single PR that closes the headline credibility gap surfaced when
adopting opencv-mark as the OpenCV baseline: "is OpenCV really
that much faster than MIVisionX, or am I unintentionally measuring
under-optimised MIVisionX code against optimised OpenCV?".
- Optimized MIVisionX/Khronos builds in CI. MIVisionX's stock
CMakeLists.txtappends only-msse4.2toCMAKE_CXX_FLAGS— the AGO HAF kernels use_mm256_*AVX2 intrinsics directly, but the surrounding scalar code (dispatch, loop nests, address arith) is compiled SSE4.2-only because nothing widens the compile baseline. CI now passes-DCMAKE_CXX_FLAGS_RELEASE="-O3 -DNDEBUG -march=x86-64-v3"so the auto-vec / FMA / BMI2 paths unlock too. SameCFLAGS/CXXFLAGSupgrade applied to the Khronos sample's Python build script for cross-impl compile-baseline parity. --threads Non both binaries (default 1; 0 = leave impl's own default).opencv-markcallscv::setNumThreads(N); both binaries setOMP_NUM_THREADS=Nfor any OpenMP-using libs downstream. CI's Phase-2 compare now passes--threads 1explicitly so OpenCV doesn't get a silentnproc× boost from TBB default settings while the OpenVX impls run single-threaded per kernel.--validate-timingself-test. Measures the monotonic clock resolution and the timer's error againststd::this_thread::sleep_for(1ms / 10ms / 100ms). Runs as a gate at the top of every CI bench step — a borked runner clock fails loud before its measurements get propagated into a comparison report. Results land in JSON's newtiming_auditblock.- Cross-impl output verification. New
--dump-outputs DIRmode on both binaries dumps a curated sentinel set (Box3x3,Gaussian3x3,Median3x3,Erode3x3,Dilate3x3,Sobel3x3,Add_U8_Saturate,Not_U8,ChannelExtract_R) to raw.binfiles plus amanifest.json. Newscripts/cross_verify_outputs.pyloads two such dumps, computes per-kernel max-abs-diff + mean-abs-diff + PSNR (with border-ring cropping for spatial filters where OpenVXBORDER_UNDEFINEDleaves the outermost pixels uninitialised by spec), and gates on a per-kernel tolerance table. CI runs this against each(OpenCV, OpenVX impl)pair after the benchmarks; the verdict table appends to the existing Pairwise Comparison step summary. - Build & threading provenance in JSON. Two new top-level
blocks (
build,threading) carry the benchmark-binary'sCMAKE_BUILD_TYPE, compiler ID/version,CXXFLAGS/CXXFLAGS_RELEASE,target_arch, plusrequested_threads,opencv_threads,openmp_max_threads, andOMP_NUM_THREADSenv. Lets a reader audit at a glance whether a result was produced by an optimised binary at the threading policy they care about. - New artifact:
cross-verify-dumps. The raw sentinel.bindumps from each impl uploaded so reviewers can re-run the verifier locally without rebuilding any binary.
The first major openvx-mark release that benchmarks the OpenVX graph framework itself, not just individual kernels. Adds a new family of framework benchmarks — scenarios that exercise the OpenVX graph runtime (verification, virtual-image fusion, parallel scheduling, async dispatch, per-node attribution) and that no per-kernel benchmark can surface — alongside the existing 60-kernel suite, which is unchanged.
Run with --feature-set framework (only framework scenarios) or --feature-set everything (kernels + framework). Default ./openvx-mark runs are unchanged.
GraphDividend_Box3x3_x4andGraphDividend_MixedFilters— time the same N-node chain three ways (sum of immediatevxu*calls, graph with real intermediates, graph with virtual intermediates) and emitsum_immediate_ms,graph_real_ms,graph_virtual_ms,graph_speedup,virtual_dividend. The headlinegraph_speedup > 1.0is the framework dividend.VerifyChain_Box3x3— sweeps chain depths (configurable via--framework-chain-depths, default1,4,16,64) and reports per-N create / verify / first-process / steady-process timings, plus regression-derivedverify_per_node_ms,verify_intercept_ms, andfirst_process_overhead_ms.ParallelBranches_Box3x3— K = 4 independent Box3x3 nodes sharing one input image, compared against K back-to-backvxuBox3x3immediate calls. Reportsparallelism_speedupandparallelism_efficiency(where 1.0 = perfect K-way parallelism).Async_Single_Box3x3_x4— quantifies the per-call cost ofvxScheduleGraph+vxWaitGraphvsvxProcessGraphon the same graph. Reportsasync_overhead_ratio(lower is better).Async_Concurrent_Box3x3_x2— schedules two independent graphs concurrently and reportsconcurrency_speedup— direct evidence of whether the runtime overlaps independent work.- Per-node
VX_NODE_PERFORMANCEattribution on bothGraphDividend_*chains: emitsnode_count,node_sum_ms,graph_perf_ms, andfusion_ratio(node_sum_ms / graph_perf_ms).≈ 1.0= strict back-to-back,> 1.0= fusion / overlap detected,≈ node_count= the runtime reports graph time per node and isn't attributing per-node performance.
A new dimensionless headline number, computed as the equal-weight geometric mean of every graph_speedup, virtual_dividend, parallelism_efficiency, and concurrency_speedup value produced by framework benchmarks. framework_score > 1.0 means the OpenVX graph framework adds aggregate value over a kernel-only baseline. Lower-is-better metrics and the scenario-specific fusion_ratio are intentionally excluded so the score has a single monotonic interpretation. Only emitted when framework benchmarks are run.
Surfaced everywhere the Vision Score appears:
- Terminal summary:
OpenVX Framework Score: <x>x (geomean of <N> framework metrics). - JSON
scores.framework_scoreandscores.framework_metric_count. - Markdown report's Composite Scores table plus a new dedicated Framework Benchmarks section listing every metric per scenario with its unit and direction.
- Both the C++
--comparepath andscripts/compare_reports.pyadd a Framework Score row to Conformance & Scores and a new Framework Metrics Comparison table whose ratio column is direction-aware (so> 1.00always means the second implementation is better).
- New
FrameworkMetricstruct:{name, value, unit, higher_is_better}.BenchmarkResultgains aframework_metricsvector (empty for kernel results — backward-compatible). - New
BenchmarkCase::framework_runcallback: framework benchmarks own their entire timing loop and return a populatedBenchmarkResult. Existing 60-kernel codepath is untouched. - New CLI flag
--framework-chain-depthsforverify_chaindepth sweeps. - New
--feature-setvalues:framework(only) andeverything(kernels + framework). - CI workflow runs framework benchmarks for every vendor (Khronos sample-impl, MIVisionX) in a dedicated step and posts the headline metrics to the GitHub Actions job summary.
BenchmarkRunner::runAlldispatches toframework_runwhen set, with a pre-check for required kernels (so framework cases skip cleanly on implementations missing Box3x3 etc.).- README adds a Framework Benchmarks section, glossary entries for every framework metric, and a Framework Score entry. Example terminal summary updated.
- JSON schema adds the
scores.framework_score,scores.framework_metric_count, and per-resultframework_metricsarray. Existing kernel results emit an emptyframework_metricsarray. No breaking change for tools that consumed the previous schema.
fusion_ratiois implementation-quality-dependent: a value≈ node_count(e.g.4.0on a 4-node chain) usually means the runtime is reporting whole-graph time on every node. Useful cross-vendor signal in its own right; intentionally excluded from the Framework Score because not every conformant runtime populatesVX_NODE_PERFORMANCEcleanly.concurrency_speedup < 1.0at small resolutions is expected and meaningful — it means async dispatch overhead exceeds concurrency gain at that work size.- Pipelined streaming via the optional
vx_khr_pipeliningextension is intentionally out of scope for v1; only standard OpenVX APIs are used.
vxMapImagePatch/vxUnmapImagePatchround-trip cost (host ↔ device tax).- User-kernel dispatch tax via
vxAddUserKernelno-op. - Context lifecycle stress (
vxCreateContext/vxReleaseContext× N). - Determinism under load (single-graph CV% while K other graphs are scheduled).
- NN / extension-gated benchmarks.
See docs/framework-mark-plan.md for the full v1 design rationale.
Earlier work — the kernel-only suite, output verification, MIVisionX CI, and version-independent build — landed in PRs #1–#4 on main. There is no formal changelog entry for those releases; see git history.