Commit a567312

Add async_streaming framework benchmarks
Fourth and final v1 framework scenario family: surface the cost and benefit of the OpenVX asynchronous dispatch API (vxScheduleGraph + vxWaitGraph) compared to synchronous vxProcessGraph.

Two scenarios using only standard OpenVX APIs (no extensions):

  Async_Single_Box3x3_x4
    One verified Box3x3 chain graph timed both with vxProcessGraph (sync) and with vxScheduleGraph + vxWaitGraph (async).
      sync_ms               median vxProcessGraph time
      async_ms              median vxScheduleGraph + vxWaitGraph time
      async_overhead_ratio  async_ms / sync_ms (1.0 = no tax; >1 = async API more expensive)

  Async_Concurrent_Box3x3_x2
    Two independent Box3x3 chain graphs (no shared data) timed both sequentially via vxProcessGraph and concurrently via async dispatch.
      graphs                2
      sync_sequential_ms    sync, one after the other
      async_concurrent_ms   schedule both, then wait for both
      concurrency_speedup   sync_sequential / async_concurrent (>1 = runtime overlapped graphs)

Smoke results on MIVisionX (12-core macOS host) reveal genuinely useful implementation characteristics:

  Async_Single
    VGA  async_overhead_ratio = 4.75  (async path is 5x more expensive at small res -- thread setup dominates)
    FHD  async_overhead_ratio = 0.91  (within noise; no measurable tax)
    4K   async_overhead_ratio = 1.00  (identical)

  Async_Concurrent
    VGA  concurrency_speedup = 0.15  (async overhead so high it's 6.5x SLOWER than sequential)
    FHD  concurrency_speedup = 0.87  (slightly slower than sequential)
    4K   concurrency_speedup = 1.28  (runtime DOES overlap independent graphs; 28% gain at 4K)

This is exactly the kind of nuanced framework characteristic that no per-kernel measurement can surface: MIVisionX's async path has real overhead that pays off only when the per-graph work is large enough.

A new helper buildBoxChainGraph centralizes Box3x3-chain construction for both async scenarios, replacing what would otherwise be a duplicate of timeGraphChain's inner loop.

Pipelined streaming via the optional vx_khr_pipelining extension is intentionally out of scope for v1; both scenarios above use only standard OpenVX APIs and run on every conformant implementation. A pipelining-extension benchmark can be added as a follow-up if useful.

This completes the v1 framework benchmark scenarios planned in PR #1's docs/framework-mark-plan.md (graph_dividend, verify_chain, parallel_branches, async_streaming). The remaining planned work is the composite OpenVX Framework Score and comparison-report extension (plan PR #6) plus the v2 backlog.

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 90a087e commit a567312

2 files changed

Lines changed: 295 additions & 0 deletions

README.md

Lines changed: 23 additions & 0 deletions
@@ -181,6 +181,8 @@ Framework benchmarks are **opt-in** — they are not in the default run and do n
| `GraphDividend_MixedFilters` | Gaussian3x3 → Box3x3 → Median3x3 → Erode3x3 | Realistic 4-stage filter pipeline |
| `VerifyChain_Box3x3` | Box3x3 × N (sweeps `--framework-chain-depths`, default 1, 4, 16, 64) | Graph build / verify cost vs N nodes; first-process lazy-alloc tax |
| `ParallelBranches_Box3x3` | 4 independent Box3x3 branches sharing one input | Whether the graph runtime exploits scheduling parallelism on K branches with no data dependency |
| `Async_Single_Box3x3_x4` | One Box3x3 × 4 chain timed with `vxProcessGraph` and with `vxScheduleGraph`+`vxWaitGraph` | Cost of the async dispatch API on a single graph |
| `Async_Concurrent_Box3x3_x2` | Two independent Box3x3 × 4 chain graphs | Whether the runtime overlaps independent graphs when scheduled concurrently |

Each `GraphDividend_*` case times the same chain three ways and emits five metrics:

@@ -221,6 +223,27 @@ Interpreting `parallelism_efficiency`:
- **> 1.0** at small resolutions — graph framework dispatch savings (the same effect measured by `graph_dividend`) compound with parallelism, since the immediate-mode baseline pays per-call dispatch tax K times.
- **< 1/K** at very large resolutions — memory bandwidth saturates before the cores do; the K branches contend for the same input image and fight for L2/L3.

`Async_Single_Box3x3_x4` runs one verified Box3x3 × 4 chain graph and times it both with synchronous `vxProcessGraph` and with the async pair `vxScheduleGraph` + `vxWaitGraph`. The point is to surface the cost of the async dispatch API itself.

| Metric | Unit | Meaning |
|:---|:---|:---|
| `sync_ms` | ms | Median `vxProcessGraph` time |
| `async_ms` | ms | Median `vxScheduleGraph` + `vxWaitGraph` time |
| `async_overhead_ratio` | × | `async_ms / sync_ms`. **Lower is better; 1.0 = no tax**, > 1 = the async API path is more expensive (typically thread-pool / signaling cost), < 1 = async path actually wins (rare but possible) |
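
As a rough illustration of how the three metrics in the table relate, the sketch below takes raw per-iteration timings in nanoseconds, reduces each set to a median, and divides the two medians. The helper names `medianNs` and `overheadRatio` are hypothetical and are not the benchmark's own `BenchmarkStats` API; returning `0.0` for invalid timings is likewise an assumption made to keep the example small.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of a non-empty set of timing samples (ns).
// Takes the vector by value so the caller's sample order is left untouched.
double medianNs(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return (n % 2 == 1) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Hypothetical helper: async_overhead_ratio = async median / sync median.
// Returns 0.0 as a sentinel when either median is not positive (assumption).
double overheadRatio(double sync_ns, double async_ns) {
    if (sync_ns <= 0.0 || async_ns <= 0.0) return 0.0;
    return async_ns / sync_ns;
}
```

For instance, a sync median of 2.0 ms and an async median of 2.5 ms give `overheadRatio(2.0e6, 2.5e6) == 1.25`, meaning the async path costs 25% more for that graph.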

`Async_Concurrent_Box3x3_x2` builds two independent Box3x3 × 4 chain graphs (no shared data) and times the pair two ways. The async form lets the runtime overlap the two graphs; the sync form does not.

| Metric | Unit | Meaning |
|:---|:---|:---|
| `graphs` | count | Number of independent graphs (2 in v1) |
| `sync_sequential_ms` | ms | `vxProcessGraph(g0); vxProcessGraph(g1)` (strict serial) |
| `async_concurrent_ms` | ms | `vxScheduleGraph(g0); vxScheduleGraph(g1); vxWaitGraph(g0); vxWaitGraph(g1)` (runtime is free to overlap) |
| `concurrency_speedup` | × | `sync_sequential_ms / async_concurrent_ms`. **>1 = the runtime overlapped graphs**, ≈ 1 = it serialized them, < 1 = async overhead exceeded any concurrency gain |
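
The two dispatch orders in the table can be mimicked without OpenVX. The sketch below is only an analogy: `std::async` stands in for `vxScheduleGraph`, `std::future::get` stands in for `vxWaitGraph`, and the function names `runSequential` and `runConcurrent` are hypothetical. It shows the ordering (launch everything, then wait for everything) and says nothing about how any OpenVX runtime schedules graphs internally.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <vector>

// Strict serial form: each job runs to completion before the next starts.
std::vector<int> runSequential(const std::vector<std::function<int()>>& jobs) {
    std::vector<int> results;
    results.reserve(jobs.size());
    for (const auto& job : jobs) results.push_back(job());
    return results;
}

// Schedule-all-then-wait-all form: every job is launched before any result
// is collected, so independent jobs are free to overlap.
std::vector<int> runConcurrent(const std::vector<std::function<int()>>& jobs) {
    std::vector<std::future<int>> pending;
    pending.reserve(jobs.size());
    for (const auto& job : jobs) {
        pending.push_back(std::async(std::launch::async, job));  // "schedule"
    }
    std::vector<int> results;
    results.reserve(jobs.size());
    for (auto& f : pending) results.push_back(f.get());          // "wait"
    assert(results.size() == jobs.size());
    return results;
}
```

Both functions return the same results in the same order; only the opportunity for overlap differs, which is the property the benchmark measures through wall-clock time.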

`concurrency_speedup` < 1 at small resolutions is a real and useful signal: it means the implementation's async dispatch overhead exceeds any concurrency gain at that work size. The metric rises above 1 only when the per-graph work is large enough to amortize the async path.
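
One way to read the metric mechanically is sketched below. The function name `interpretSpeedup` and the ±5% band used for "approximately 1" are assumptions chosen for illustration; the benchmark itself reports only the raw ratio and applies no such threshold.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: classify concurrency_speedup into the three cases
// described above. `tolerance` is an assumed noise band around 1.0.
std::string interpretSpeedup(double sync_sequential_ms,
                             double async_concurrent_ms,
                             double tolerance = 0.05) {
    if (sync_sequential_ms <= 0.0 || async_concurrent_ms <= 0.0) {
        return "invalid timing";
    }
    const double speedup = sync_sequential_ms / async_concurrent_ms;
    if (speedup > 1.0 + tolerance) return "overlapped";
    if (speedup < 1.0 - tolerance) return "async overhead dominates";
    return "serialized";
}
```

Under this assumed band, a ratio of 0.15 is classified as "async overhead dominates" and a ratio of 1.28 as "overlapped". A wider or narrower band may suit a noisier or quieter host.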

> Pipelined streaming via the optional `vx_khr_pipelining` extension is a future enhancement and is intentionally not implemented in this release; the two scenarios above use only standard OpenVX APIs and run on every conformant implementation.

## Output

### Terminal Summary

src/benchmarks/framework_benchmarks.cpp

Lines changed: 272 additions & 0 deletions
@@ -379,6 +379,250 @@ BenchmarkResult runParallelBranches(vx_context ctx, const Resolution& res,
    return r;
}

// A pre-built and verified Box3x3 chain graph plus its observable input /
// output handles. Lifetime is owned by the caller's ResourceTracker.
struct AsyncChainGraph {
    vx_graph graph = nullptr;
    vx_image input = nullptr;
    vx_image output = nullptr;
    bool ok = false;
};

AsyncChainGraph buildBoxChainGraph(vx_context ctx,
                                   uint32_t width, uint32_t height,
                                   int n_nodes,
                                   ResourceTracker& tracker,
                                   TestDataGenerator& gen) {
    AsyncChainGraph ag;
    if (n_nodes < 1) return ag;

    ag.input = gen.createFilledImage(ctx, width, height, VX_DF_IMAGE_U8);
    if (vxGetStatus((vx_reference)ag.input) != VX_SUCCESS) return ag;
    tracker.trackImage(ag.input);

    ag.output = vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8);
    if (vxGetStatus((vx_reference)ag.output) != VX_SUCCESS) return ag;
    tracker.trackImage(ag.output);

    ag.graph = vxCreateGraph(ctx);
    if (vxGetStatus((vx_reference)ag.graph) != VX_SUCCESS) return ag;
    tracker.trackGraph(ag.graph);

    vx_image src = ag.input;
    for (int i = 0; i < n_nodes; i++) {
        bool is_last = (i + 1 == n_nodes);
        vx_image dst = is_last
            ? ag.output
            : vxCreateVirtualImage(ag.graph, width, height, VX_DF_IMAGE_U8);
        if (vxGetStatus((vx_reference)dst) != VX_SUCCESS) return ag;
        if (!is_last) tracker.trackImage(dst);

        vx_node node = vxBox3x3Node(ag.graph, src, dst);
        if (vxGetStatus((vx_reference)node) != VX_SUCCESS) return ag;
        tracker.trackNode(node);

        src = dst;
    }

    if (vxVerifyGraph(ag.graph) != VX_SUCCESS) return ag;
    ag.ok = true;
    return ag;
}

// Single-graph async overhead: for one Box3x3-chain graph, time
// vxProcessGraph (sync) and vxScheduleGraph + vxWaitGraph (async) and
// report the ratio. Ideally async_overhead_ratio is close to 1.0; values
// significantly > 1 mean the implementation pays a real per-call tax for
// the async API. Below 1 is unusual but not impossible (e.g. async path
// avoids extra copies) and is also useful diagnostic data.
BenchmarkResult runAsyncSingle(vx_context ctx, const Resolution& res,
                               const BenchmarkConfig& cfg) {
    BenchmarkResult r;
    r.iterations = cfg.iterations;
    r.warmup = cfg.warmup;

    constexpr int kNodes = 4;
    TestDataGenerator gen(cfg.seed);
    ResourceTracker tracker;

    AsyncChainGraph g = buildBoxChainGraph(ctx, res.width, res.height,
                                           kNodes, tracker, gen);
    if (!g.ok) {
        r.supported = false;
        r.skip_reason = "failed to build async chain graph";
        return r;
    }

    BenchmarkTimer timer;

    for (int i = 0; i < cfg.warmup; i++) vxProcessGraph(g.graph);

    std::vector<double> sync_samples;
    sync_samples.reserve(cfg.iterations);
    for (int i = 0; i < cfg.iterations; i++) {
        timer.start();
        if (vxProcessGraph(g.graph) != VX_SUCCESS) {
            r.supported = false;
            r.skip_reason = "vxProcessGraph failed";
            return r;
        }
        timer.stop();
        sync_samples.push_back(timer.elapsed_ns());
    }
    double sync_ns = BenchmarkStats::compute(sync_samples).median_ns;

    for (int i = 0; i < cfg.warmup; i++) {
        if (vxScheduleGraph(g.graph) != VX_SUCCESS) break;
        vxWaitGraph(g.graph);
    }

    std::vector<double> async_samples;
    async_samples.reserve(cfg.iterations);
    for (int i = 0; i < cfg.iterations; i++) {
        timer.start();
        vx_status s = vxScheduleGraph(g.graph);
        if (s == VX_SUCCESS) s = vxWaitGraph(g.graph);
        timer.stop();
        if (s != VX_SUCCESS) {
            r.supported = false;
            r.skip_reason = "vxScheduleGraph/vxWaitGraph failed";
            return r;
        }
        async_samples.push_back(timer.elapsed_ns());
    }
    double async_ns = BenchmarkStats::compute(async_samples).median_ns;

    if (sync_ns <= 0.0 || async_ns <= 0.0) {
        r.supported = false;
        r.skip_reason = "invalid timing";
        return r;
    }

    r.framework_metrics = {
        {"sync_ms", sync_ns / 1e6, "ms", false},
        {"async_ms", async_ns / 1e6, "ms", false},
        // Ratio close to 1.0 = no tax; >1 = async API costs more than sync.
        {"async_overhead_ratio", async_ns / sync_ns, "x", false},
    };

    r.wall_clock.median_ns = sync_ns;
    r.wall_clock.mean_ns = sync_ns;
    r.wall_clock.min_ns = sync_ns;
    r.wall_clock.max_ns = sync_ns;
    r.wall_clock.sample_count = static_cast<size_t>(cfg.iterations);
    r.megapixels_per_sec = BenchmarkStats::computeThroughput(
        res.width, res.height, sync_ns);
    return r;
}

// Multi-graph concurrency: build N independent Box3x3-chain graphs (no
// shared data) and time them two ways:
//   sync_sequential_ms   vxProcessGraph(g0); vxProcessGraph(g1); ...
//   async_concurrent_ms  vxScheduleGraph(g0); ...; vxWaitGraph(g0); ...
//
// The async form gives the runtime the opportunity to overlap independent
// graphs. concurrency_speedup > 1 means the runtime actually does so;
// speedup ~ 1 means it serializes regardless of the API used.
BenchmarkResult runAsyncConcurrent(vx_context ctx, const Resolution& res,
                                   const BenchmarkConfig& cfg) {
    BenchmarkResult r;
    r.iterations = cfg.iterations;
    r.warmup = cfg.warmup;

    constexpr int kGraphs = 2;
    constexpr int kNodes = 4;
    TestDataGenerator gen(cfg.seed);
    ResourceTracker tracker;

    std::vector<AsyncChainGraph> graphs;
    graphs.reserve(kGraphs);
    for (int i = 0; i < kGraphs; i++) {
        AsyncChainGraph g = buildBoxChainGraph(ctx, res.width, res.height,
                                               kNodes, tracker, gen);
        if (!g.ok) {
            r.supported = false;
            r.skip_reason = "failed to build async chain graph";
            return r;
        }
        graphs.push_back(g);
    }

    BenchmarkTimer timer;

    for (int i = 0; i < cfg.warmup; i++) {
        for (auto& g : graphs) vxProcessGraph(g.graph);
    }

    std::vector<double> sync_samples;
    sync_samples.reserve(cfg.iterations);
    for (int i = 0; i < cfg.iterations; i++) {
        timer.start();
        for (auto& g : graphs) {
            if (vxProcessGraph(g.graph) != VX_SUCCESS) {
                r.supported = false;
                r.skip_reason = "vxProcessGraph failed";
                return r;
            }
        }
        timer.stop();
        sync_samples.push_back(timer.elapsed_ns());
    }
    double sync_ns = BenchmarkStats::compute(sync_samples).median_ns;

    for (int i = 0; i < cfg.warmup; i++) {
        for (auto& g : graphs) vxScheduleGraph(g.graph);
        for (auto& g : graphs) vxWaitGraph(g.graph);
    }

    std::vector<double> async_samples;
    async_samples.reserve(cfg.iterations);
    for (int i = 0; i < cfg.iterations; i++) {
        timer.start();
        for (auto& g : graphs) {
            if (vxScheduleGraph(g.graph) != VX_SUCCESS) {
                r.supported = false;
                r.skip_reason = "vxScheduleGraph failed";
                return r;
            }
        }
        for (auto& g : graphs) {
            if (vxWaitGraph(g.graph) != VX_SUCCESS) {
                r.supported = false;
                r.skip_reason = "vxWaitGraph failed";
                return r;
            }
        }
        timer.stop();
        async_samples.push_back(timer.elapsed_ns());
    }
    double async_ns = BenchmarkStats::compute(async_samples).median_ns;

    if (sync_ns <= 0.0 || async_ns <= 0.0) {
        r.supported = false;
        r.skip_reason = "invalid timing";
        return r;
    }

    double speedup = sync_ns / async_ns;

    r.framework_metrics = {
        {"graphs", static_cast<double>(kGraphs), "count", false},
        {"sync_sequential_ms", sync_ns / 1e6, "ms", false},
        {"async_concurrent_ms", async_ns / 1e6, "ms", false},
        // >1 = runtime overlapped graphs; ~1 = no concurrency exploited.
        {"concurrency_speedup", speedup, "x", true},
    };

    r.wall_clock.median_ns = async_ns;
    r.wall_clock.mean_ns = async_ns;
    r.wall_clock.min_ns = async_ns;
    r.wall_clock.max_ns = async_ns;
    r.wall_clock.sample_count = static_cast<size_t>(cfg.iterations);
    r.megapixels_per_sec = BenchmarkStats::computeThroughput(
        res.width, res.height, async_ns);
    return r;
}

// Per-N timings collected by runVerifyChain; one of these is produced per
// chain depth and feeds both the per-N metrics and the slope regression.
struct VerifySample {
@@ -686,5 +930,33 @@ std::vector<BenchmarkCase> registerFrameworkBenchmarks() {
        cases.push_back(bc);
    }

    {
        BenchmarkCase bc;
        bc.name = "Async_Single_Box3x3_x4";
        bc.category = "framework_async";
        bc.feature_set = "framework";
        bc.kernel_enum = VX_KERNEL_BOX_3x3;
        bc.required_kernels = {VX_KERNEL_BOX_3x3};
        bc.framework_run = [](vx_context ctx, const Resolution& res,
                              const BenchmarkConfig& cfg) -> BenchmarkResult {
            return runAsyncSingle(ctx, res, cfg);
        };
        cases.push_back(bc);
    }

    {
        BenchmarkCase bc;
        bc.name = "Async_Concurrent_Box3x3_x2";
        bc.category = "framework_async";
        bc.feature_set = "framework";
        bc.kernel_enum = VX_KERNEL_BOX_3x3;
        bc.required_kernels = {VX_KERNEL_BOX_3x3};
        bc.framework_run = [](vx_context ctx, const Resolution& res,
                              const BenchmarkConfig& cfg) -> BenchmarkResult {
            return runAsyncConcurrent(ctx, res, cfg);
        };
        cases.push_back(bc);
    }

    return cases;
}
