Status: v1 shipped. All four v1 scenarios plus the OpenVX Framework Score and the first v2 backlog item (per-node `VX_NODE_PERFORMANCE` attribution) are merged. See `CHANGELOG.md` for the release summary. This document is preserved as the design rationale and tracks remaining v2 backlog items.
A concrete plan for adding a framework benchmark suite to openvx-mark that measures what only a graph framework can do — graph construction/verification cost, the "graph dividend" vs immediate mode, virtual-image savings, parallel/heterogeneous scheduling, async/streaming throughput, per-node attribution, user-kernel dispatch tax, and lifecycle costs.
This complements (not replaces) the existing per-kernel suite.
| PR | Slice | Status | Key artifact |
|---|---|---|---|
| #5 | Plumbing: `FrameworkMetric`, `framework_run`, `--feature-set framework` | ✅ merged | `include/benchmark_stats.h`, `include/benchmark_runner.h` |
| #6 | `graph_dividend` (Box3x3×4 + MixedFilters chains) | ✅ merged | `src/benchmarks/framework_benchmarks.cpp` |
| #7 | `verify_chain` (depth sweep + linear regression) | ✅ merged | `--framework-chain-depths` flag |
| #8 | `parallel_branches` (K = 4 independent Box3x3) | ✅ merged | `parallelism_efficiency` metric |
| #9 | `async_streaming` (single-graph overhead + concurrent multi-graph) | ✅ merged | `concurrency_speedup` metric |
| #10 | OpenVX Framework Score + cross-vendor comparison | ✅ merged | `framework_score` in JSON / Markdown / terminal / comparison |
| #11 | Per-node `VX_NODE_PERFORMANCE` attribution (first v2 item) | ✅ merged | `node_count`, `node_sum_ms`, `graph_perf_ms`, `fusion_ratio` |
| #12 | Rollup → kg/framework-mark | ✅ merged | (no new code) |
Goal. Add a framework benchmark suite that exercises the OpenVX graph runtime itself — the orchestration layer that differentiates OpenVX from a kernel library. The current suite measures vxProcessGraph of single-node graphs and short pipelines, which captures kernel performance but not framework value.
Non-goals.
- Not a replacement for the OpenVX-CTS conformance suite.
- Not a cross-framework benchmark (OpenVX vs OpenCV vs CUDA). That is a separate project.
- v1 will not require any vendor extensions. Anything that needs an extension (pipelining, NN, targets) is gated and clearly labeled.
A framework benchmark is not "single kernel @ resolution → MP/s". It is a scenario that produces one or more named scalar metrics.
| Scenario | Reported metric(s) | Unit |
|---|---|---|
| `verify_chain` (chain of N Box3x3 nodes, N ∈ {1,4,16,64}) | verify time, ms/node slope, first-process overhead | ms, ms/node |
| `graph_dividend` (4-node chain) | sum(immediate), graph latency, speedup, per-node overhead | ms, × ratio |
| `virtual_intermediate` | real-buffer time, virtual-buffer time, dividend | × ratio |
| `parallel_branches` | serial baseline, scheduled time, parallelism efficiency | × ratio |
| `async_streaming` | sync latency, async sustained throughput | ms, FPS |
| `user_kernel_overhead` | per-call dispatch cost above a no-op host function | ns/call |
| `context_lifecycle` | create+release time, build+teardown @ N | ms |
A framework benchmark therefore needs to emit a small set of typed metrics rather than a single MP/s value. That is the only real model change; everything else is additive.
The four scenarios with the highest "value per line of code" and the cleanest cross-vendor story:
- `verify_chain`: graph build + verify cost vs N.
- `graph_dividend`: N-node chain timed three ways (sum of immediate `vxu*` calls, graph with real intermediates, graph with virtual intermediates).
- `parallel_branches`: DAG with K independent branches feeding a join; default scheduler vs all-pinned-to-one-target (when targets are exposed via extension; otherwise just default + serial reference).
- `async_streaming`: sync `vxProcessGraph` loop vs `vxScheduleGraph` + `vxWaitGraph` loop (and pipelined queue if the impl supports `vx_khr_pipelining`).
Each runs across the same --resolution set as kernels, so results scale with image size.
- Per-node: landed in PR #7 as `VX_NODE_PERFORMANCE` attribution (infer fusion when `sum(node_perf) > graph_perf`). `node_count` / `node_sum_ms` / `graph_perf_ms` / `fusion_ratio` on `graph_dividend` results.
- `vxMapImagePatch` / `vxUnmapImagePatch` round-trip cost (host↔device tax).
- User-kernel dispatch tax via `vxAddUserKernel` no-op.
- Context lifecycle stress (`vxCreateContext` / `vxReleaseContext` × N; same graph built/torn down × N).
- Determinism under load (single-graph CV% while K other graphs are scheduled).
- NN / extension-gated benchmarks.
include/benchmark_stats.h — extend BenchmarkResult (additive, won't break existing JSON/CSV/Markdown consumers):
```cpp
struct FrameworkMetric {
    std::string name;   // "verify_ms", "graph_speedup", "parallel_efficiency", ...
    double      value;
    std::string unit;   // "ms", "ms/node", "x", "ns/call", "FPS"
    bool        higher_is_better;
};

struct BenchmarkResult {
    // ... existing fields ...

    // For framework benchmarks: zero or more named scalars.
    // For kernel benchmarks: stays empty.
    std::vector<FrameworkMetric> framework_metrics;

    // For framework benchmarks the existing megapixels_per_sec/wall_clock
    // are interpreted as the "primary" timing if applicable, else 0.
};
```

Pre-existing kernel results emit an empty `framework_metrics` array. No schema break.
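To make the "empty array for kernel results" behavior concrete, here is a minimal serialization sketch. It redeclares `FrameworkMetric` so the block is self-contained, and `frameworkMetricsToJson` is a hypothetical helper, not the project's actual report writer. It assumes metric names and units are plain identifiers that need no JSON escaping.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Local copy of the struct from the design, so the sketch compiles alone.
struct FrameworkMetric {
    std::string name;
    double value;
    std::string unit;
    bool higher_is_better;
};

// Serializes the metrics as a JSON array; an empty vector yields "[]".
// ASSUMPTION: name/unit contain no characters that require escaping, and
// value is finite (inf/nan would not be valid JSON).
inline std::string frameworkMetricsToJson(
        const std::vector<FrameworkMetric>& metrics) {
    std::ostringstream out;
    out << "[";
    for (std::size_t i = 0; i < metrics.size(); ++i) {
        const FrameworkMetric& m = metrics[i];
        if (i > 0) out << ",";
        out << "{\"name\":\"" << m.name << "\","
            << "\"value\":" << m.value << ","
            << "\"unit\":\"" << m.unit << "\","
            << "\"higher_is_better\":"
            << (m.higher_is_better ? "true" : "false") << "}";
    }
    out << "]";
    return out.str();
}
```

Because kernel results carry an empty vector, they serialize to `[]` and consumers that ignore unknown keys keep working.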
Today BenchmarkRunner::runGraphMode and runImmediateMode are hard-coded to "build graph, warmup, time vxProcessGraph × N." Framework benchmarks need their own execution loop because they may run multiple graphs, time graph construction, or compare two execution modes inside one case.
Cleanest extension to include/benchmark_runner.h:
```cpp
struct BenchmarkCase {
    // ... existing fields ...

    // Optional: for framework benchmarks. If set, this is called instead of
    // graph_setup / immediate_func, and is fully responsible for timing.
    using FrameworkRunFn = std::function<BenchmarkResult(
        vx_context ctx, const Resolution& res,
        const BenchmarkConfig& cfg, TestDataGenerator& gen)>;
    FrameworkRunFn framework_run;
};
```

In `BenchmarkRunner::runAll`, add a branch: if `bc.framework_run` is set, call it (skip graph mode / immediate mode); the returned `BenchmarkResult` already carries `framework_metrics`.
This keeps the existing 60-kernel codepath untouched.
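The runner branch can be pictured with the following simplified sketch. The `Case` and `Result` types are stand-ins invented for this example (the real `BenchmarkCase` and `BenchmarkResult` carry OpenVX handles and more fields), and `runAllSketch` is a hypothetical name. The point it illustrates is only the dispatch rule: a non-empty `framework_run` takes over and the kernel path is skipped.

```cpp
#include <functional>
#include <string>
#include <vector>

// Simplified stand-ins for the real openvx-mark types (assumptions).
struct Result {
    std::string name;
    std::string mode;  // which path produced the result
};

struct Case {
    std::string name;
    std::function<Result()> graph_run;      // stands in for the kernel path
    std::function<Result()> framework_run;  // optional; owns its own timing
};

inline std::vector<Result> runAllSketch(const std::vector<Case>& cases) {
    std::vector<Result> results;
    for (const Case& bc : cases) {
        // An empty std::function converts to false, so kernel cases that
        // never set framework_run fall through to the existing path.
        if (bc.framework_run) {
            results.push_back(bc.framework_run());
            continue;
        }
        if (bc.graph_run) results.push_back(bc.graph_run());
    }
    return results;
}
```

Relying on the empty state of `std::function` means existing case registrations need no changes, which matches the additive intent of the design.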
One file with a registerFrameworkBenchmarks() function returning all v1 cases, mirroring pipeline_vision.cpp. Each case implements its own framework_run lambda. Sketch:
```cpp
// graph_dividend: 4-node chain (Gaussian -> Sobel -> Magnitude -> Threshold)
BenchmarkCase bc;
bc.name = "GraphDividend_4node";
bc.category = "framework_dividend";
bc.feature_set = "framework";
bc.required_kernels = { VX_KERNEL_GAUSSIAN_3x3, VX_KERNEL_SOBEL_3x3,
                        VX_KERNEL_MAGNITUDE, VX_KERNEL_THRESHOLD };
bc.framework_run = [](vx_context ctx, const Resolution& res,
                      const BenchmarkConfig& cfg, TestDataGenerator& gen)
                   -> BenchmarkResult {
    BenchmarkResult r = makeFrameworkResult("GraphDividend_4node", res);

    double t_imm   = timeImmediateChain(ctx, res, cfg, gen);  // sum vxu*
    double t_graph = timeGraphChain(ctx, res, cfg, gen, /*virtual=*/false);
    double t_virt  = timeGraphChain(ctx, res, cfg, gen, /*virtual=*/true);

    r.framework_metrics = {
        {"sum_immediate_ms", t_imm   / 1e6,     "ms", false},
        {"graph_real_ms",    t_graph / 1e6,     "ms", false},
        {"graph_virtual_ms", t_virt  / 1e6,     "ms", false},
        {"graph_speedup",    t_imm   / t_graph, "x",  true},
        {"virtual_dividend", t_graph / t_virt,  "x",  true},
    };
    r.wall_clock.median_ns = t_virt;  // primary timing = best graph form
    r.megapixels_per_sec = BenchmarkStats::computeThroughput(
        res.width, res.height, t_virt);
    return r;
};
```

The four v1 scenarios are each one ~50–80 line lambda. Helper functions (`timeImmediateChain`, `timeGraphChain`, `timeAsyncLoop`) live in the same file.
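The timing helpers all follow the same shape: run some warmup iterations, time a fixed number of iterations, and report a median in nanoseconds. The sketch below shows that shape with `std::chrono::steady_clock`. `medianOf` and `medianNs` are hypothetical names, and the real helpers may differ (for instance in how they build the graph or choose iteration counts from `BenchmarkConfig`).

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Median of a sample set; returns 0 for an empty set.
inline double medianOf(std::vector<double> samples) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return (samples.size() % 2 == 1)
               ? samples[mid]
               : 0.5 * (samples[mid - 1] + samples[mid]);
}

// Runs `body` warmup times untimed, then `iterations` times timed,
// and returns the median per-iteration duration in nanoseconds.
inline double medianNs(const std::function<void()>& body,
                       int warmup, int iterations) {
    for (int i = 0; i < warmup; ++i) body();

    std::vector<double> samples;
    if (iterations > 0) samples.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        body();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(t1 - t0).count());
    }
    return medianOf(samples);
}
```

In a scenario lambda, `body` would wrap the call under test, for example one `vxProcessGraph` invocation, so the returned nanosecond value can feed the `t / 1e6` conversions used for the millisecond metrics.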
Minimal additions, no breaking changes:
- New feature-set value: `--feature-set framework`.
- `--all` keeps current meaning (vision, enhanced_vision); introduce `--feature-set everything` (or `--include-framework`) for the kitchen-sink run so default kernel runs aren't perturbed.
- New category strings: `framework_compile`, `framework_dividend`, `framework_parallel`, `framework_async` (so users can `--category framework_dividend`).
- New flag: `--framework-chain-depths 1,4,16,64` (overrides the N-vector for `verify_chain`).
- `--list-framework` prints the framework scenarios with one-line descriptions.
registerFrameworkBenchmarks() is added to runner.addCases(...) in main only when the user opts in (or when everything is selected).
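The `--framework-chain-depths` flag takes a comma-separated list of positive integers. A minimal parsing sketch follows. `parseChainDepths` is a hypothetical helper, and returning an empty vector on malformed input (so the caller can fall back to the default 1,4,16,64) is a design assumption of this example rather than documented behavior.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Parses "1,4,16,64" into {1, 4, 16, 64}.
// Returns an empty vector if any token is not a positive integer.
inline std::vector<int> parseChainDepths(const std::string& arg) {
    std::vector<int> depths;
    std::stringstream stream(arg);
    std::string token;
    while (std::getline(stream, token, ',')) {
        if (token.empty()) continue;  // tolerate stray commas
        std::size_t consumed = 0;
        int value = 0;
        try {
            value = std::stoi(token, &consumed);
        } catch (const std::exception&) {
            return {};  // not a number, or out of int range
        }
        // Reject trailing garbage ("4x") and non-positive depths.
        if (consumed != token.size() || value <= 0) return {};
        depths.push_back(value);
    }
    return depths;
}
```

Rejecting the whole argument on the first bad token avoids silently running a partial depth sweep, which would skew the ms/node regression.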
- JSON. Each result already serializes; just add `"framework_metrics": [{name, value, unit, higher_is_better}, ...]`. Pre-existing kernel results emit an empty array. No schema break.
- CSV. Long-form rows: emit one row per `(benchmark, metric)` for framework benchmarks; existing kernel rows unchanged. Long-form is safer than wide-form with empty columns.
- Markdown. New section "Framework Benchmarks" with one table per scenario showing each metric per resolution, plus a one-line interpretation under each table (e.g., "graph_speedup > 1.0 means the graph form beats summing immediate-mode calls").
- Composite score. Add an OpenVX Framework Score = geomean of `graph_speedup`, `virtual_dividend`, `parallel_efficiency`, `async_speedup` across resolutions (only for those that produced valid values). Print alongside Vision Score in the terminal summary. Do not fold framework numbers into the existing Vision Score; keep the two scoreboards separate so existing comparisons stay valid.
- Comparison (`compareReports`). Extend the diff to print framework metrics side-by-side (vendor A vs vendor B, % delta).
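The composite score described above is a geometric mean over whichever ratio metrics produced valid values. The sketch below computes it in log space. `frameworkScore` is a hypothetical name, and treating "valid" as "finite and strictly positive" is an assumption of this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Equal-weight geometric mean of the valid ratios.
// ASSUMPTION: a value is valid when it is finite and > 0.
// Returns 0 when no valid value is present.
inline double frameworkScore(const std::vector<double>& ratios) {
    double log_sum = 0.0;
    std::size_t count = 0;
    for (double r : ratios) {
        if (!std::isfinite(r) || r <= 0.0) continue;  // skip failed scenarios
        log_sum += std::log(r);
        ++count;
    }
    if (count == 0) return 0.0;
    // exp(mean(log x)) equals the geometric mean and avoids overflow
    // from multiplying many ratios together.
    return std::exp(log_sum / static_cast<double>(count));
}
```

A geometric mean suits ratio metrics because a 2× gain in one scenario and a 2× loss in another cancel to 1.0, whereas an arithmetic mean would report 1.25.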
One line: `src/benchmarks/framework_benchmarks.cpp` added to `BENCHMARK_SOURCES`. No new dependencies.
- New "Framework Benchmarks" section explaining the philosophy: "kernel scores measure the kernel; framework scores measure what OpenVX adds as a framework."
- Glossary entries for: `graph_speedup`, `virtual_dividend`, `parallel_efficiency`, `async_speedup`, `verify_per_node_ms`, OpenVX Framework Score.
- Explicit caveat: framework metrics are most useful intra-vendor / cross-version; cross-vendor framework comparison still requires interpretation since vendors differ in target exposure.
| PR | Scope | Mergeable independently? |
|---|---|---|
| #1: Plumbing | `FrameworkMetric` struct, `framework_run` field, runner branch, `--feature-set framework` flag, JSON `framework_metrics: []` (empty for kernels), CSV/Markdown unchanged. No new benchmarks. | Yes |
| #2: `graph_dividend` | One scenario, helper functions, README section, terminal-summary entry. | Yes |
| #3: `verify_chain` | Adds `--framework-chain-depths`, reports `verify_ms` and ms/node slope. | Yes |
| #4: `parallel_branches` | Optional `VX_NODE_TARGET` pinning where supported. | Yes |
| #5: `async_streaming` | `vxScheduleGraph` / `vxWaitGraph` loop; gated pipelining-extension probe. | Yes |
| #6: Framework Score | Composite scoring, comparison.md extension. | After #2–#5 |
| #7: v2 backlog | per-node perf, map/unmap cost, user-kernel dispatch, context lifecycle, determinism under load. | Each its own PR |
PRs #1 and #2 together are the minimum useful slice — they give the headline "graph dividend" number on day one.
- Default behavior. ✅ Opt-in. Framework benchmarks ship behind `--feature-set framework` (only) or `--feature-set everything` (kernels + framework). Bare `./openvx-mark` is unchanged and produces the same Vision Score it always did.
- Framework Score weight. ✅ Equal-weight geomean of `graph_speedup`, `virtual_dividend`, `parallelism_efficiency`, and `concurrency_speedup`. Lower-is-better metrics (`verify_per_node_ms`, `async_overhead_ratio`) and `fusion_ratio` (graph_dividend-only) are intentionally excluded so the score has a single monotonic interpretation and isn't over-weighted by any one scenario.
- Chain content. ✅ Both. `graph_dividend` ships two chains: `GraphDividend_Box3x3_x4` (4 × Box3x3, pure framework signal) and `GraphDividend_MixedFilters` (Gaussian → Box → Median → Erode, realistic).
- Targets for `parallel_branches`. ✅ Vendor-neutral v1. No `VX_NODE_TARGET` pinning, no vendor-extension target enumeration. Scenario compares default-scheduler graph vs strictly-serial immediate-mode dispatch and lets `parallelism_efficiency` speak.
- Pipelining extension. ✅ Skipped. `vx_khr_pipelining` was deliberately not used in v1; `async_streaming` uses only standard `vxScheduleGraph` / `vxWaitGraph`, so it runs on every conformant implementation.
- Naming. ✅ "OpenVX Framework Score", emitted as `framework_score` in JSON and printed alongside Vision Score in the terminal summary.