Add parallel_branches framework benchmark

kiritigowda · cursoragent · kiritigowda · commit 3625bf88ed35 · 2026-05-15T09:37:43.000-07:00
Third framework scenario: build one graph with K = 4 independent Box3x3
nodes that share an input image and write to K independent outputs.
The K nodes have no data dependency on each other, so a competent
scheduler is free to dispatch them concurrently across cores or
targets. The strict-serial baseline is K back-to-back vxuBox3x3
immediate-mode calls, which admit no parallelism.

For each resolution, reports:

  branches                  K (4 in v1)
  serial_immediate_ms       K vxu calls back-to-back
  parallel_graph_ms         one graph, K independent branches
  parallelism_speedup       serial / parallel (K = perfect, 1 = none)
  parallelism_efficiency    speedup / K (1.0 = perfect K-way, 1/K = none)
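
As a worked illustration of the two derived metrics above, here is a minimal C++ sketch. The struct and function names are hypothetical and are not part of the benchmark source; only the arithmetic (serial / parallel, then speedup / K) comes from the list above.

```cpp
#include <cassert>

// Hypothetical container for the two derived metrics.
struct ParallelismMetrics {
    double speedup;     // serial_immediate_ms / parallel_graph_ms
    double efficiency;  // speedup / K
};

// serial_ms:   time for K back-to-back immediate-mode calls
// parallel_ms: time for one graph with K independent branches
// branches:    K, the number of independent branches
ParallelismMetrics computeParallelismMetrics(double serial_ms,
                                             double parallel_ms,
                                             int branches) {
    // Precondition: both timings and K must be positive.
    assert(serial_ms > 0.0 && parallel_ms > 0.0 && branches > 0);
    const double speedup = serial_ms / parallel_ms;
    return {speedup, speedup / static_cast<double>(branches)};
}
```

With hypothetical timings of 8.0 ms serial and 2.0 ms parallel at K = 4, this gives a speedup of 4.0 and an efficiency of 1.0. Equal serial and parallel timings give 1.0 and 0.25 (that is, 1/K).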

This is the v1, vendor-neutral take on the heterogeneous-scheduling
question: it attempts no VX_NODE_TARGET pinning and no vendor-extension
target enumeration. It simply compares default-scheduler graph execution
against strictly serial immediate-mode dispatch and reports the result
as parallelism_efficiency.

Smoke results on MIVisionX (12-core macOS host) tell a clean story:

  VGA   parallelism_efficiency = 5.46  (>1 -- dispatch savings dominate
                                        kernel work at small images)
  FHD   parallelism_efficiency = 0.96  (near-perfect 4x parallelism)
  4K    parallelism_efficiency = 0.41  (memory bandwidth saturates
                                        before cores do)

A per-kernel benchmark times one node in isolation and so cannot
surface these regimes. The README documents all three (super-linear at
small resolutions, near-perfect at FHD, bandwidth-bound at 4K) so the
metric can be read without surprise.
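
Multiplying each efficiency above by K = 4 gives the implied speedups: about 21.8x at VGA, 3.8x at FHD, and 1.6x at 4K. One way to bucket an efficiency value into the regimes is sketched below. The function name, the labels, and the 0.9 cut-off are illustrative assumptions, not something the benchmark or the README defines.

```cpp
#include <string>

// Hypothetical classifier for a parallelism_efficiency value.
// The 0.9 threshold for "near-perfect" is an assumption chosen for
// illustration only.
std::string classifyRegime(double efficiency, int branches) {
    // 1/K is the efficiency of a graph that runs no faster than the
    // serial immediate-mode baseline.
    const double none = 1.0 / static_cast<double>(branches);
    if (efficiency > 1.0)  return "super-linear";
    if (efficiency >= 0.9) return "near-perfect";
    if (efficiency > none) return "partial";
    return "none";
}
```

Under these assumed thresholds, the three smoke results (5.46, 0.96, 0.41) land in "super-linear", "near-perfect", and "partial" respectively, since 0.41 is still above 1/K = 0.25.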

Out of scope:

  - --framework-branches CLI flag (K is fixed at 4 for v1; can be
    promoted later if cross-host comparisons demand it)
  - VX_NODE_TARGET pinning / vendor-extension target enumeration
  - per-branch chains (each branch is a single Box3x3; a multi-stage
    branch could be added without changing the metric set)

Co-authored-by: Cursor <cursoragent@cursor.com>
diff --git a/README.md b/README.md
@@ -180,6 +180,7 @@ Framework benchmarks are **opt-in** — they are not in the default run and do n
 | `GraphDividend_Box3x3_x4` | Box3x3 × 4 | Pure framework overhead (same kernel, isolates orchestration cost) |
 | `GraphDividend_MixedFilters` | Gaussian3x3 → Box3x3 → Median3x3 → Erode3x3 | Realistic 4-stage filter pipeline |
 | `VerifyChain_Box3x3` | Box3x3 × N (sweeps `--framework-chain-depths`, default 1, 4, 16, 64) | Graph build / verify cost vs N nodes; first-process lazy-alloc tax |
+| `ParallelBranches_Box3x3` | 4 independent Box3x3 branches sharing one input | Whether the graph runtime exploits scheduling parallelism on K branches with no data dependency |
 
 Each `GraphDividend_*` case times the same chain three ways and emits five metrics:
 
@@ -205,6 +206,21 @@ Each `GraphDividend_*` case times the same chain three ways and emits five metri
 
 Use `--framework-chain-depths 1,4,16,64,256` to sweep custom depths (defaults to `1,4,16,64`).
 
+`ParallelBranches_Box3x3` builds one graph with K = 4 independent Box3x3 nodes that share a single input image and write to K independent outputs. The K nodes have no data dependency on each other, so a competent scheduler is free to dispatch them concurrently across cores or targets. The strict-serial baseline is K back-to-back `vxuBox3x3` immediate-mode calls, which admit no parallelism.
+
+| Metric | Unit | Meaning |
+|:---|:---|:---|
+| `branches` | count | K — number of independent branches (4 in v1) |
+| `serial_immediate_ms` | ms | K back-to-back `vxuBox3x3` calls — strict-serial reference |
+| `parallel_graph_ms` | ms | One graph with K independent Box3x3 nodes — graph runtime is free to parallelize |
+| `parallelism_speedup` | × | `serial_immediate_ms / parallel_graph_ms`. **K = perfect parallelism, 1 = none** |
+| `parallelism_efficiency` | × | `parallelism_speedup / K`. **1.0 = perfect K-way parallelism, 1/K = none** |
+
+Interpreting `parallelism_efficiency`:
+- **≈ 1.0** at mid-range resolutions such as FHD: the runtime is exploiting the K-way opportunity well.
+- **> 1.0** at small resolutions: graph-framework dispatch savings (the same effect measured by `graph_dividend`) compound with parallelism, since the immediate-mode baseline pays the per-call dispatch tax K times.
+- **Well below 1.0, approaching 1/K**, at very large resolutions: memory bandwidth saturates before the cores do; the K branches all read the same input image and contend for L2/L3 cache. A value below 1/K would mean the graph is slower than the serial baseline.
+
 ## Output
 
 ### Terminal Summary
diff --git a/src/benchmarks/framework_benchmarks.cpp b/src/benchmarks/framework_benchmarks.cpp
@@ -237,6 +237,148 @@ BenchmarkResult runGraphDividend(const std::vector<ChainStage>& stages,
     return r;
 }
 
+// Time K back-to-back vxuBox3x3 calls writing to K independent outputs.
+// This is the strict-serial baseline for the parallel_branches scenario:
+// immediate-mode dispatch admits no scheduling parallelism even on
+// multi-core hosts, so it is the right "no parallelism opportunity"
+// reference to compare against the graph form.
+double timeSerialImmediateBranches(vx_context ctx, uint32_t width, uint32_t height,
+                                   int branches, int warmup, int iterations,
+                                   TestDataGenerator& gen) {
+    if (branches < 1) return 0.0;
+    ResourceTracker tracker;
+
+    vx_image input = gen.createFilledImage(ctx, width, height, VX_DF_IMAGE_U8);
+    if (vxGetStatus((vx_reference)input) != VX_SUCCESS) return 0.0;
+    tracker.trackImage(input);
+
+    std::vector<vx_image> outputs;
+    outputs.reserve(branches);
+    for (int i = 0; i < branches; i++) {
+        vx_image out = vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8);
+        if (vxGetStatus((vx_reference)out) != VX_SUCCESS) return 0.0;
+        tracker.trackImage(out);
+        outputs.push_back(out);
+    }
+
+    auto runOnce = [&]() -> vx_status {
+        for (int i = 0; i < branches; i++) {
+            vx_status s = vxuBox3x3(ctx, input, outputs[i]);
+            if (s != VX_SUCCESS) return s;
+        }
+        return VX_SUCCESS;
+    };
+
+    for (int i = 0; i < warmup; i++) runOnce();
+
+    std::vector<double> samples;
+    samples.reserve(iterations);
+    BenchmarkTimer timer;
+    for (int i = 0; i < iterations; i++) {
+        timer.start();
+        if (runOnce() != VX_SUCCESS) return 0.0;
+        timer.stop();
+        samples.push_back(timer.elapsed_ns());
+    }
+    return BenchmarkStats::compute(samples).median_ns;
+}
+
+// Build one graph with K independent Box3x3 nodes, all reading the same
+// input and writing to K independent real outputs. This is a textbook
+// parallelism opportunity: the K nodes have no data dependency on each
+// other, so a competent scheduler is free to dispatch them concurrently
+// across cores / targets / queues.
+double timeParallelGraphBranches(vx_context ctx, uint32_t width, uint32_t height,
+                                 int branches, int warmup, int iterations,
+                                 TestDataGenerator& gen) {
+    if (branches < 1) return 0.0;
+    ResourceTracker tracker;
+
+    vx_image input = gen.createFilledImage(ctx, width, height, VX_DF_IMAGE_U8);
+    if (vxGetStatus((vx_reference)input) != VX_SUCCESS) return 0.0;
+    tracker.trackImage(input);
+
+    vx_graph graph = vxCreateGraph(ctx);
+    if (vxGetStatus((vx_reference)graph) != VX_SUCCESS) return 0.0;
+    tracker.trackGraph(graph);
+
+    for (int i = 0; i < branches; i++) {
+        // Real (non-virtual) outputs ensure each branch produces an
+        // observable side effect so dead-code elimination can't silently
+        // drop branches.
+        vx_image out = vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8);
+        if (vxGetStatus((vx_reference)out) != VX_SUCCESS) return 0.0;
+        tracker.trackImage(out);
+
+        vx_node node = vxBox3x3Node(graph, input, out);
+        if (vxGetStatus((vx_reference)node) != VX_SUCCESS) return 0.0;
+        tracker.trackNode(node);
+    }
+
+    if (vxVerifyGraph(graph) != VX_SUCCESS) return 0.0;
+
+    for (int i = 0; i < warmup; i++) vxProcessGraph(graph);
+
+    std::vector<double> samples;
+    samples.reserve(iterations);
+    BenchmarkTimer timer;
+    for (int i = 0; i < iterations; i++) {
+        timer.start();
+        if (vxProcessGraph(graph) != VX_SUCCESS) return 0.0;
+        timer.stop();
+        samples.push_back(timer.elapsed_ns());
+    }
+    return BenchmarkStats::compute(samples).median_ns;
+}
+
+// K = 4 independent branches gives a typical multi-core host enough
+// opportunity to expose scheduling parallelism, while keeping the work small
+// enough that a single-core fallback still completes quickly. Future PRs
+// can promote this to a CLI option if cross-machine variance demands it.
+constexpr int kParallelBranchesCount = 4;
+
+BenchmarkResult runParallelBranches(vx_context ctx, const Resolution& res,
+                                    const BenchmarkConfig& cfg) {
+    BenchmarkResult r;
+    r.iterations = cfg.iterations;
+    r.warmup = cfg.warmup;
+
+    const int K = kParallelBranchesCount;
+    TestDataGenerator gen(cfg.seed);
+
+    double t_serial_imm = timeSerialImmediateBranches(
+        ctx, res.width, res.height, K, cfg.warmup, cfg.iterations, gen);
+    double t_parallel = timeParallelGraphBranches(
+        ctx, res.width, res.height, K, cfg.warmup, cfg.iterations, gen);
+
+    if (t_serial_imm <= 0.0 || t_parallel <= 0.0) {
+        r.supported = false;
+        r.skip_reason = "parallel branches timing failed";
+        return r;
+    }
+
+    double speedup = t_serial_imm / t_parallel;
+    double efficiency = speedup / static_cast<double>(K);
+
+    r.framework_metrics = {
+        {"branches",                static_cast<double>(K), "count", false},
+        {"serial_immediate_ms",     t_serial_imm / 1e6,     "ms",    false},
+        {"parallel_graph_ms",       t_parallel   / 1e6,     "ms",    false},
+        {"parallelism_speedup",     speedup,                "x",     true},
+        {"parallelism_efficiency",  efficiency,             "x",     true},
+    };
+
+    r.wall_clock.median_ns = t_parallel;
+    r.wall_clock.mean_ns = t_parallel;
+    r.wall_clock.min_ns = t_parallel;
+    r.wall_clock.max_ns = t_parallel;
+    r.wall_clock.sample_count = static_cast<size_t>(cfg.iterations);
+    r.megapixels_per_sec = BenchmarkStats::computeThroughput(
+        res.width, res.height, t_parallel);
+
+    return r;
+}
+
 // Per-N timings collected by runVerifyChain; one of these is produced per
 // chain depth and feeds both the per-N metrics and the slope regression.
 struct VerifySample {
@@ -530,5 +672,19 @@ std::vector<BenchmarkCase> registerFrameworkBenchmarks() {
         cases.push_back(bc);
     }
 
+    {
+        BenchmarkCase bc;
+        bc.name = "ParallelBranches_Box3x3";
+        bc.category = "framework_parallel";
+        bc.feature_set = "framework";
+        bc.kernel_enum = VX_KERNEL_BOX_3x3;
+        bc.required_kernels = {VX_KERNEL_BOX_3x3};
+        bc.framework_run = [](vx_context ctx, const Resolution& res,
+                              const BenchmarkConfig& cfg) -> BenchmarkResult {
+            return runParallelBranches(ctx, res, cfg);
+        };
+        cases.push_back(bc);
+    }
+
     return cases;
 }