Add async_streaming framework benchmarks

kiritigowda · cursoragent · kiritigowda · commit a567312e3e2f · 2026-05-15T09:52:57.000-07:00
Fourth and final v1 framework scenario family: surface the cost and benefit of the OpenVX asynchronous dispatch API (vxScheduleGraph + vxWaitGraph) compared to synchronous vxProcessGraph.

Two scenarios using only standard OpenVX APIs (no extensions):

  Async_Single_Box3x3_x4
    One verified Box3x3 chain graph timed both with vxProcessGraph (sync) and with vxScheduleGraph + vxWaitGraph (async).

      sync_ms               median vxProcessGraph time
      async_ms              median vxScheduleGraph + vxWaitGraph time
      async_overhead_ratio  async_ms / sync_ms (1.0 = no tax; >1 = async API more expensive)

  Async_Concurrent_Box3x3_x2
    Two independent Box3x3 chain graphs (no shared data) timed both sequentially via vxProcessGraph and concurrently via async dispatch.

      graphs                2
      sync_sequential_ms    sync, one after the other
      async_concurrent_ms   schedule both, then wait for both
      concurrency_speedup   sync_sequential / async_concurrent (>1 = runtime overlapped graphs)

Smoke results on MIVisionX (12-core macOS host) reveal genuinely useful implementation characteristics:

  Async_Single
    VGA  async_overhead_ratio = 4.75  (async path is 5x more expensive at small res -- thread setup dominates)
    FHD  async_overhead_ratio = 0.91  (within noise; no measurable tax)
    4K   async_overhead_ratio = 1.00  (identical)

  Async_Concurrent
    VGA  concurrency_speedup = 0.15  (async overhead so high it's 6.5x SLOWER than sequential)
    FHD  concurrency_speedup = 0.87  (slightly slower than sequential)
    4K   concurrency_speedup = 1.28  (runtime DOES overlap independent graphs; 28% gain at 4K)

This is exactly the kind of nuanced framework characteristic that no per-kernel measurement can surface: MIVisionX's async path has real overhead that pays off only when the per-graph work is large enough.

A new helper buildBoxChainGraph centralizes Box3x3-chain construction for both async scenarios, replacing what would otherwise be a duplicate of timeGraphChain's inner loop.
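
The two ratio metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's own code: the real implementation takes its medians from BenchmarkStats::compute, and the helper names medianNs, asyncOverheadRatio, and concurrencySpeedup are hypothetical names introduced only for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for BenchmarkStats::compute(...).median_ns:
// sorts a copy of the samples and returns the middle value (or the
// mean of the two middle values for an even sample count).
double medianNs(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return (n % 2 == 1) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// async_overhead_ratio = async_ms / sync_ms.
// 1.0 = no tax; > 1 = the async API path is more expensive.
double asyncOverheadRatio(const std::vector<double>& sync_ns,
                          const std::vector<double>& async_ns) {
    return medianNs(async_ns) / medianNs(sync_ns);
}

// concurrency_speedup = sync_sequential_ms / async_concurrent_ms.
// > 1 = the runtime overlapped the independent graphs.
double concurrencySpeedup(const std::vector<double>& sync_sequential_ns,
                          const std::vector<double>& async_concurrent_ns) {
    return medianNs(sync_sequential_ns) / medianNs(async_concurrent_ns);
}
```

With made-up samples of {90, 100, 110} ns for the sync path and {190, 200, 210} ns for the async path, asyncOverheadRatio returns 2.0 (async twice as expensive), and passing the same two vectors to concurrencySpeedup returns 0.5 (async slower than sequential). Because the two metrics divide in opposite directions, a lower async_overhead_ratio is better while a higher concurrency_speedup is better.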

Pipelined streaming via the optional vx_khr_pipelining extension is intentionally out of scope for v1; both scenarios above use only standard OpenVX APIs and run on every conformant implementation. A pipelining-extension benchmark can be added as a follow-up if useful.

This completes the v1 framework benchmark scenarios planned in PR #1's docs/framework-mark-plan.md (graph_dividend, verify_chain, parallel_branches, async_streaming). The remaining planned work is the composite OpenVX Framework Score and comparison-report extension (plan PR #6) plus the v2 backlog.

Co-authored-by: Cursor <cursoragent@cursor.com>
diff --git a/README.md b/README.md
@@ -181,6 +181,8 @@ Framework benchmarks are **opt-in** — they are not in the default run and do n
 | `GraphDividend_MixedFilters` | Gaussian3x3 → Box3x3 → Median3x3 → Erode3x3 | Realistic 4-stage filter pipeline |
 | `VerifyChain_Box3x3` | Box3x3 × N (sweeps `--framework-chain-depths`, default 1, 4, 16, 64) | Graph build / verify cost vs N nodes; first-process lazy-alloc tax |
 | `ParallelBranches_Box3x3` | 4 independent Box3x3 branches sharing one input | Whether the graph runtime exploits scheduling parallelism on K branches with no data dependency |
+| `Async_Single_Box3x3_x4` | One Box3x3 × 4 chain timed with `vxProcessGraph` and with `vxScheduleGraph`+`vxWaitGraph` | Cost of the async dispatch API on a single graph |
+| `Async_Concurrent_Box3x3_x2` | Two independent Box3x3 × 4 chain graphs | Whether the runtime overlaps independent graphs when scheduled concurrently |
 
 Each `GraphDividend_*` case times the same chain three ways and emits five metrics:
 
@@ -221,6 +223,27 @@ Interpreting `parallelism_efficiency`:
 - **> 1.0** at small resolutions — graph framework dispatch savings (the same effect measured by `graph_dividend`) compound with parallelism, since the immediate-mode baseline pays per-call dispatch tax K times.
 - **< 1/K** at very large resolutions — memory bandwidth saturates before the cores do; the K branches contend for the same input image and fight for L2/L3.
 
+`Async_Single_Box3x3_x4` runs one verified Box3x3 × 4 chain graph and times it both with synchronous `vxProcessGraph` and with the async pair `vxScheduleGraph` + `vxWaitGraph`. The point is to surface the cost of the async dispatch API itself.
+
+| Metric | Unit | Meaning |
+|:---|:---|:---|
+| `sync_ms` | ms | Median `vxProcessGraph` time |
+| `async_ms` | ms | Median `vxScheduleGraph` + `vxWaitGraph` time |
+| `async_overhead_ratio` | × | `async_ms / sync_ms`. **Lower is better; 1.0 = no tax**, > 1 = the async API path is more expensive (typically thread-pool / signaling cost), < 1 = async path actually wins (rare but possible) |
+
+`Async_Concurrent_Box3x3_x2` builds two independent Box3x3 × 4 chain graphs (no shared data) and times the pair two ways. The async form lets the runtime overlap the two graphs; the sync form does not.
+
+| Metric | Unit | Meaning |
+|:---|:---|:---|
+| `graphs` | count | Number of independent graphs (2 in v1) |
+| `sync_sequential_ms` | ms | `vxProcessGraph(g0); vxProcessGraph(g1)` — strict serial |
+| `async_concurrent_ms` | ms | `vxScheduleGraph(g0); vxScheduleGraph(g1); vxWaitGraph(g0); vxWaitGraph(g1)` — runtime is free to overlap |
+| `concurrency_speedup` | × | `sync_sequential_ms / async_concurrent_ms`. **>1 = the runtime overlapped graphs**, ≈ 1 = it serialized them, < 1 = async overhead exceeded any concurrency gain |
+
+`concurrency_speedup` < 1 at small resolutions is a real and useful signal: it means the implementation's async dispatch overhead exceeds any concurrency gain at that work size. The metric only rises above 1 when the per-graph work is large enough to amortize the async path.
+
+> Pipelined streaming via the optional `vx_khr_pipelining` extension is a future enhancement and is intentionally not implemented in this release; the two scenarios above use only standard OpenVX APIs and run on every conformant implementation.
+
 ## Output
 
 ### Terminal Summary
diff --git a/src/benchmarks/framework_benchmarks.cpp b/src/benchmarks/framework_benchmarks.cpp
@@ -379,6 +379,250 @@ BenchmarkResult runParallelBranches(vx_context ctx, const Resolution& res,
     return r;
 }
 
+// A pre-built and verified Box3x3 chain graph plus its observable input /
+// output handles. Lifetime is owned by the caller's ResourceTracker.
+struct AsyncChainGraph {
+    vx_graph graph = nullptr;
+    vx_image input = nullptr;
+    vx_image output = nullptr;
+    bool ok = false;
+};
+
+AsyncChainGraph buildBoxChainGraph(vx_context ctx,
+                                   uint32_t width, uint32_t height,
+                                   int n_nodes,
+                                   ResourceTracker& tracker,
+                                   TestDataGenerator& gen) {
+    AsyncChainGraph ag;
+    if (n_nodes < 1) return ag;
+
+    ag.input = gen.createFilledImage(ctx, width, height, VX_DF_IMAGE_U8);
+    if (vxGetStatus((vx_reference)ag.input) != VX_SUCCESS) return ag;
+    tracker.trackImage(ag.input);
+
+    ag.output = vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8);
+    if (vxGetStatus((vx_reference)ag.output) != VX_SUCCESS) return ag;
+    tracker.trackImage(ag.output);
+
+    ag.graph = vxCreateGraph(ctx);
+    if (vxGetStatus((vx_reference)ag.graph) != VX_SUCCESS) return ag;
+    tracker.trackGraph(ag.graph);
+
+    vx_image src = ag.input;
+    for (int i = 0; i < n_nodes; i++) {
+        bool is_last = (i + 1 == n_nodes);
+        vx_image dst = is_last
+            ? ag.output
+            : vxCreateVirtualImage(ag.graph, width, height, VX_DF_IMAGE_U8);
+        if (vxGetStatus((vx_reference)dst) != VX_SUCCESS) return ag;
+        if (!is_last) tracker.trackImage(dst);
+
+        vx_node node = vxBox3x3Node(ag.graph, src, dst);
+        if (vxGetStatus((vx_reference)node) != VX_SUCCESS) return ag;
+        tracker.trackNode(node);
+
+        src = dst;
+    }
+
+    if (vxVerifyGraph(ag.graph) != VX_SUCCESS) return ag;
+    ag.ok = true;
+    return ag;
+}
+
+// Single-graph async overhead: for one Box3x3-chain graph, time
+// vxProcessGraph (sync) and vxScheduleGraph + vxWaitGraph (async) and
+// report the ratio. Ideally async_overhead_ratio is close to 1.0; values
+// significantly > 1 mean the implementation pays a real per-call tax for
+// the async API. Below 1 is unusual but not impossible (e.g. async path
+// avoids extra copies) and is also useful diagnostic data.
+BenchmarkResult runAsyncSingle(vx_context ctx, const Resolution& res,
+                               const BenchmarkConfig& cfg) {
+    BenchmarkResult r;
+    r.iterations = cfg.iterations;
+    r.warmup = cfg.warmup;
+
+    constexpr int kNodes = 4;
+    TestDataGenerator gen(cfg.seed);
+    ResourceTracker tracker;
+
+    AsyncChainGraph g = buildBoxChainGraph(ctx, res.width, res.height,
+                                           kNodes, tracker, gen);
+    if (!g.ok) {
+        r.supported = false;
+        r.skip_reason = "failed to build async chain graph";
+        return r;
+    }
+
+    BenchmarkTimer timer;
+
+    for (int i = 0; i < cfg.warmup; i++) vxProcessGraph(g.graph);
+
+    std::vector<double> sync_samples;
+    sync_samples.reserve(cfg.iterations);
+    for (int i = 0; i < cfg.iterations; i++) {
+        timer.start();
+        if (vxProcessGraph(g.graph) != VX_SUCCESS) {
+            r.supported = false;
+            r.skip_reason = "vxProcessGraph failed";
+            return r;
+        }
+        timer.stop();
+        sync_samples.push_back(timer.elapsed_ns());
+    }
+    double sync_ns = BenchmarkStats::compute(sync_samples).median_ns;
+
+    for (int i = 0; i < cfg.warmup; i++) {
+        if (vxScheduleGraph(g.graph) != VX_SUCCESS) break;
+        vxWaitGraph(g.graph);
+    }
+
+    std::vector<double> async_samples;
+    async_samples.reserve(cfg.iterations);
+    for (int i = 0; i < cfg.iterations; i++) {
+        timer.start();
+        vx_status s = vxScheduleGraph(g.graph);
+        if (s == VX_SUCCESS) s = vxWaitGraph(g.graph);
+        timer.stop();
+        if (s != VX_SUCCESS) {
+            r.supported = false;
+            r.skip_reason = "vxScheduleGraph/vxWaitGraph failed";
+            return r;
+        }
+        async_samples.push_back(timer.elapsed_ns());
+    }
+    double async_ns = BenchmarkStats::compute(async_samples).median_ns;
+
+    if (sync_ns <= 0.0 || async_ns <= 0.0) {
+        r.supported = false;
+        r.skip_reason = "invalid timing";
+        return r;
+    }
+
+    r.framework_metrics = {
+        {"sync_ms",              sync_ns  / 1e6,     "ms", false},
+        {"async_ms",             async_ns / 1e6,     "ms", false},
+        // Ratio close to 1.0 = no tax; >1 = async API costs more than sync.
+        {"async_overhead_ratio", async_ns / sync_ns, "x",  false},
+    };
+
+    r.wall_clock.median_ns = sync_ns;
+    r.wall_clock.mean_ns = sync_ns;
+    r.wall_clock.min_ns = sync_ns;
+    r.wall_clock.max_ns = sync_ns;
+    r.wall_clock.sample_count = static_cast<size_t>(cfg.iterations);
+    r.megapixels_per_sec = BenchmarkStats::computeThroughput(
+        res.width, res.height, sync_ns);
+    return r;
+}
+
+// Multi-graph concurrency: build N independent Box3x3-chain graphs (no
+// shared data) and time them two ways:
+//   sync_sequential_ms   vxProcessGraph(g0); vxProcessGraph(g1); ...
+//   async_concurrent_ms  vxScheduleGraph(g0); ...; vxWaitGraph(g0); ...
+//
+// The async form gives the runtime the opportunity to overlap independent
+// graphs. concurrency_speedup > 1 means it does so, ~ 1 means it serializes
+// regardless of the API used, and < 1 means async overhead exceeds the gain.
+BenchmarkResult runAsyncConcurrent(vx_context ctx, const Resolution& res,
+                                   const BenchmarkConfig& cfg) {
+    BenchmarkResult r;
+    r.iterations = cfg.iterations;
+    r.warmup = cfg.warmup;
+
+    constexpr int kGraphs = 2;
+    constexpr int kNodes = 4;
+    TestDataGenerator gen(cfg.seed);
+    ResourceTracker tracker;
+
+    std::vector<AsyncChainGraph> graphs;
+    graphs.reserve(kGraphs);
+    for (int i = 0; i < kGraphs; i++) {
+        AsyncChainGraph g = buildBoxChainGraph(ctx, res.width, res.height,
+                                               kNodes, tracker, gen);
+        if (!g.ok) {
+            r.supported = false;
+            r.skip_reason = "failed to build async chain graph";
+            return r;
+        }
+        graphs.push_back(g);
+    }
+
+    BenchmarkTimer timer;
+
+    for (int i = 0; i < cfg.warmup; i++) {
+        for (auto& g : graphs) vxProcessGraph(g.graph);
+    }
+
+    std::vector<double> sync_samples;
+    sync_samples.reserve(cfg.iterations);
+    for (int i = 0; i < cfg.iterations; i++) {
+        timer.start();
+        for (auto& g : graphs) {
+            if (vxProcessGraph(g.graph) != VX_SUCCESS) {
+                r.supported = false;
+                r.skip_reason = "vxProcessGraph failed";
+                return r;
+            }
+        }
+        timer.stop();
+        sync_samples.push_back(timer.elapsed_ns());
+    }
+    double sync_ns = BenchmarkStats::compute(sync_samples).median_ns;
+
+    for (int i = 0; i < cfg.warmup; i++) {
+        for (auto& g : graphs) vxScheduleGraph(g.graph);
+        for (auto& g : graphs) vxWaitGraph(g.graph);
+    }
+
+    std::vector<double> async_samples;
+    async_samples.reserve(cfg.iterations);
+    for (int i = 0; i < cfg.iterations; i++) {
+        timer.start();
+        for (auto& g : graphs) {
+            if (vxScheduleGraph(g.graph) != VX_SUCCESS) {
+                r.supported = false;
+                r.skip_reason = "vxScheduleGraph failed";
+                return r;
+            }
+        }
+        for (auto& g : graphs) {
+            if (vxWaitGraph(g.graph) != VX_SUCCESS) {
+                r.supported = false;
+                r.skip_reason = "vxWaitGraph failed";
+                return r;
+            }
+        }
+        timer.stop();
+        async_samples.push_back(timer.elapsed_ns());
+    }
+    double async_ns = BenchmarkStats::compute(async_samples).median_ns;
+
+    if (sync_ns <= 0.0 || async_ns <= 0.0) {
+        r.supported = false;
+        r.skip_reason = "invalid timing";
+        return r;
+    }
+
+    double speedup = sync_ns / async_ns;
+
+    r.framework_metrics = {
+        {"graphs",              static_cast<double>(kGraphs), "count", false},
+        {"sync_sequential_ms",  sync_ns  / 1e6,               "ms",    false},
+        {"async_concurrent_ms", async_ns / 1e6,               "ms",    false},
+        // >1 = runtime overlapped graphs; ~1 = no concurrency exploited.
+        {"concurrency_speedup", speedup,                      "x",     true},
+    };
+
+    r.wall_clock.median_ns = async_ns;
+    r.wall_clock.mean_ns = async_ns;
+    r.wall_clock.min_ns = async_ns;
+    r.wall_clock.max_ns = async_ns;
+    r.wall_clock.sample_count = static_cast<size_t>(cfg.iterations);
+    r.megapixels_per_sec = BenchmarkStats::computeThroughput(
+        res.width, res.height, async_ns);
+    return r;
+}
+
 // Per-N timings collected by runVerifyChain; one of these is produced per
 // chain depth and feeds both the per-N metrics and the slope regression.
 struct VerifySample {
@@ -686,5 +930,33 @@ std::vector<BenchmarkCase> registerFrameworkBenchmarks() {
         cases.push_back(bc);
     }
 
+    {
+        BenchmarkCase bc;
+        bc.name = "Async_Single_Box3x3_x4";
+        bc.category = "framework_async";
+        bc.feature_set = "framework";
+        bc.kernel_enum = VX_KERNEL_BOX_3x3;
+        bc.required_kernels = {VX_KERNEL_BOX_3x3};
+        bc.framework_run = [](vx_context ctx, const Resolution& res,
+                              const BenchmarkConfig& cfg) -> BenchmarkResult {
+            return runAsyncSingle(ctx, res, cfg);
+        };
+        cases.push_back(bc);
+    }
+
+    {
+        BenchmarkCase bc;
+        bc.name = "Async_Concurrent_Box3x3_x2";
+        bc.category = "framework_async";
+        bc.feature_set = "framework";
+        bc.kernel_enum = VX_KERNEL_BOX_3x3;
+        bc.required_kernels = {VX_KERNEL_BOX_3x3};
+        bc.framework_run = [](vx_context ctx, const Resolution& res,
+                              const BenchmarkConfig& cfg) -> BenchmarkResult {
+            return runAsyncConcurrent(ctx, res, cfg);
+        };
+        cases.push_back(bc);
+    }
+
     return cases;
 }