Commit 3625bf8

Add parallel_branches framework benchmark
Third framework scenario: build one graph with K = 4 independent Box3x3 nodes that share an input image and write to K independent outputs. The K nodes have no data dependency on each other, so a competent scheduler is free to dispatch them concurrently across cores or targets. The strict-serial baseline is K back-to-back vxuBox3x3 immediate-mode calls, which admit no parallelism.

For each resolution, reports:

  branches                K (4 in v1)
  serial_immediate_ms     K vxu calls back-to-back
  parallel_graph_ms       one graph, K independent branches
  parallelism_speedup     serial / parallel (K = perfect, 1 = none)
  parallelism_efficiency  speedup / K (1.0 = perfect K-way, 1/K = none)

This is the v1 vendor-neutral version of the heterogeneous-scheduling question: no VX_NODE_TARGET pinning is attempted, no extension enumeration. We just compare default-scheduler graph execution against the strictly-serial immediate-mode dispatch and let the parallelism_efficiency number speak.

Smoke results on MIVisionX (12-core macOS host) tell a clean story:

  VGA  parallelism_efficiency = 5.46  (>1 -- dispatch savings dominate kernel work at small images)
  FHD  parallelism_efficiency = 0.96  (near-perfect 4x parallelism)
  4K   parallelism_efficiency = 0.41  (memory bandwidth saturates before cores do)

That is exactly the kind of insight no per-kernel benchmark can surface. The README documents the three regimes (super-linear at small res, near-perfect at FHD, bandwidth-bound at 4K) so the metric is readable without surprise.

Out of scope:
- --framework-branches CLI flag (K is fixed at 4 for v1; can be promoted later if cross-host comparisons demand it)
- VX_NODE_TARGET pinning / vendor-extension target enumeration
- per-branch chains (each branch is a single Box3x3; a multi-stage branch could be added without changing the metric set)

Co-authored-by: Cursor <cursoragent@cursor.com>
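As a rough illustration of how the two derived metrics relate, the sketch below recomputes them from a pair of timings. The names `ParallelismMetrics` and `deriveParallelism` are hypothetical and do not appear in the commit; only the two formulas are taken from the message above.

```cpp
#include <cassert>

// Hypothetical helper, not part of the commit. It mirrors the two formulas
// stated in the commit message.
struct ParallelismMetrics {
    double speedup;     // serial time / parallel time
    double efficiency;  // speedup / K
};

inline ParallelismMetrics deriveParallelism(double serial_ms, double parallel_ms,
                                            int branches) {
    // Both denominators must be positive for the ratios to be meaningful.
    assert(branches > 0 && parallel_ms > 0.0);
    const double speedup = serial_ms / parallel_ms;
    return {speedup, speedup / static_cast<double>(branches)};
}
```

Read in the other direction, speedup = efficiency × K, so with K = 4 the smoke figures correspond to speedups of roughly 21.8x (VGA), 3.8x (FHD) and 1.6x (4K).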
1 parent 33502b9 commit 3625bf8

2 files changed

Lines changed: 172 additions & 0 deletions


README.md

Lines changed: 16 additions & 0 deletions
@@ -180,6 +180,7 @@ Framework benchmarks are **opt-in** — they are not in the default run and do n
 | `GraphDividend_Box3x3_x4` | Box3x3 × 4 | Pure framework overhead (same kernel, isolates orchestration cost) |
 | `GraphDividend_MixedFilters` | Gaussian3x3 → Box3x3 → Median3x3 → Erode3x3 | Realistic 4-stage filter pipeline |
 | `VerifyChain_Box3x3` | Box3x3 × N (sweeps `--framework-chain-depths`, default 1, 4, 16, 64) | Graph build / verify cost vs N nodes; first-process lazy-alloc tax |
+| `ParallelBranches_Box3x3` | 4 independent Box3x3 branches sharing one input | Whether the graph runtime exploits scheduling parallelism on K branches with no data dependency |

 Each `GraphDividend_*` case times the same chain three ways and emits five metrics:

@@ -205,6 +206,21 @@ Each `GraphDividend_*` case times the same chain three ways and emits five metri

 Use `--framework-chain-depths 1,4,16,64,256` to sweep custom depths (defaults to `1,4,16,64`).

+`ParallelBranches_Box3x3` builds one graph with K = 4 independent Box3x3 nodes that share a single input image and write to K independent outputs. The K nodes have no data dependency on each other, so a competent scheduler is free to dispatch them concurrently across cores or targets. The strict-serial baseline is K back-to-back `vxuBox3x3` immediate-mode calls, which admit no parallelism.
+
+| Metric | Unit | Meaning |
+|:---|:---|:---|
+| `branches` | count | K, the number of independent branches (4 in v1) |
+| `serial_immediate_ms` | ms | K back-to-back `vxuBox3x3` calls (strict-serial reference) |
+| `parallel_graph_ms` | ms | One graph with K independent Box3x3 nodes (the graph runtime is free to parallelize) |
+| `parallelism_speedup` | × | `serial_immediate_ms / parallel_graph_ms`. **K = perfect parallelism, 1 = none** |
+| `parallelism_efficiency` | × | `parallelism_speedup / K`. **1.0 = perfect K-way parallelism, 1/K = none** |
+
+Interpreting `parallelism_efficiency`:
+- **≈ 1.0** at mid resolutions such as FHD: the runtime is exploiting the K-way opportunity well.
+- **> 1.0** at small resolutions: graph framework dispatch savings (the same effect measured by `graph_dividend`) compound with parallelism, since the immediate-mode baseline pays the per-call dispatch tax K times.
+- **Well below 1.0** (but above 1/K) at very large resolutions such as 4K: memory bandwidth saturates before the cores do; the K branches contend for the same input image and fight for L2/L3.
+
 ## Output

 ### Terminal Summary
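
The interpretation rules in the README hunk above could be applied mechanically along these lines. This is only a sketch: the `Regime` labels, the `classifyEfficiency` name and the 0.1 tolerance band are assumptions made for illustration, and neither the README nor the benchmark defines any thresholds.

```cpp
#include <cassert>

// Hypothetical regime labels; the names are illustrative, not from the repo.
enum class Regime { SuperLinear, NearPerfect, Partial, NoParallelism };

// Classifies a parallelism_efficiency value for K branches.
// `tol` is an assumed tolerance band around 1.0, not a value from the README.
inline Regime classifyEfficiency(double efficiency, int branches, double tol = 0.1) {
    assert(branches > 0 && tol >= 0.0);
    const double none = 1.0 / static_cast<double>(branches);  // 1/K = no parallelism
    if (efficiency > 1.0 + tol) return Regime::SuperLinear;    // dispatch savings compound
    if (efficiency >= 1.0 - tol) return Regime::NearPerfect;   // K-way opportunity used well
    if (efficiency > none) return Regime::Partial;             // some overlap, e.g. bandwidth-bound
    return Regime::NoParallelism;                              // at or below the serial level
}
```

Under these assumed thresholds, the smoke figures from the commit message (5.46, 0.96 and 0.41 with K = 4) fall into the super-linear, near-perfect and partial bands respectively.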

src/benchmarks/framework_benchmarks.cpp

Lines changed: 156 additions & 0 deletions
@@ -237,6 +237,148 @@ BenchmarkResult runGraphDividend(const std::vector<ChainStage>& stages,
     return r;
 }

+// Time K back-to-back vxuBox3x3 calls writing to K independent outputs.
+// This is the strict-serial baseline for the parallel_branches scenario:
+// immediate-mode dispatch admits no scheduling parallelism even on
+// multi-core hosts, so it is the right "no parallelism opportunity"
+// reference to compare against the graph form.
+double timeSerialImmediateBranches(vx_context ctx, uint32_t width, uint32_t height,
+                                   int branches, int warmup, int iterations,
+                                   TestDataGenerator& gen) {
+    if (branches < 1) return 0.0;
+    ResourceTracker tracker;
+
+    vx_image input = gen.createFilledImage(ctx, width, height, VX_DF_IMAGE_U8);
+    if (vxGetStatus((vx_reference)input) != VX_SUCCESS) return 0.0;
+    tracker.trackImage(input);
+
+    std::vector<vx_image> outputs;
+    outputs.reserve(branches);
+    for (int i = 0; i < branches; i++) {
+        vx_image out = vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8);
+        if (vxGetStatus((vx_reference)out) != VX_SUCCESS) return 0.0;
+        tracker.trackImage(out);
+        outputs.push_back(out);
+    }
+
+    auto runOnce = [&]() -> vx_status {
+        for (int i = 0; i < branches; i++) {
+            vx_status s = vxuBox3x3(ctx, input, outputs[i]);
+            if (s != VX_SUCCESS) return s;
+        }
+        return VX_SUCCESS;
+    };
+
+    for (int i = 0; i < warmup; i++) runOnce();
+
+    std::vector<double> samples;
+    samples.reserve(iterations);
+    BenchmarkTimer timer;
+    for (int i = 0; i < iterations; i++) {
+        timer.start();
+        if (runOnce() != VX_SUCCESS) return 0.0;
+        timer.stop();
+        samples.push_back(timer.elapsed_ns());
+    }
+    return BenchmarkStats::compute(samples).median_ns;
+}
+
+// Build one graph with K independent Box3x3 nodes, all reading the same
+// input and writing to K independent real outputs. This is a textbook
+// parallelism opportunity: the K nodes have no data dependency on each
+// other, so a competent scheduler is free to dispatch them concurrently
+// across cores / targets / queues.
+double timeParallelGraphBranches(vx_context ctx, uint32_t width, uint32_t height,
+                                 int branches, int warmup, int iterations,
+                                 TestDataGenerator& gen) {
+    if (branches < 1) return 0.0;
+    ResourceTracker tracker;
+
+    vx_image input = gen.createFilledImage(ctx, width, height, VX_DF_IMAGE_U8);
+    if (vxGetStatus((vx_reference)input) != VX_SUCCESS) return 0.0;
+    tracker.trackImage(input);
+
+    vx_graph graph = vxCreateGraph(ctx);
+    if (vxGetStatus((vx_reference)graph) != VX_SUCCESS) return 0.0;
+    tracker.trackGraph(graph);
+
+    for (int i = 0; i < branches; i++) {
+        // Real (non-virtual) outputs ensure each branch produces an
+        // observable side effect so dead-code elimination can't silently
+        // drop branches.
+        vx_image out = vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8);
+        if (vxGetStatus((vx_reference)out) != VX_SUCCESS) return 0.0;
+        tracker.trackImage(out);
+
+        vx_node node = vxBox3x3Node(graph, input, out);
+        if (vxGetStatus((vx_reference)node) != VX_SUCCESS) return 0.0;
+        tracker.trackNode(node);
+    }
+
+    if (vxVerifyGraph(graph) != VX_SUCCESS) return 0.0;
+
+    for (int i = 0; i < warmup; i++) vxProcessGraph(graph);
+
+    std::vector<double> samples;
+    samples.reserve(iterations);
+    BenchmarkTimer timer;
+    for (int i = 0; i < iterations; i++) {
+        timer.start();
+        if (vxProcessGraph(graph) != VX_SUCCESS) return 0.0;
+        timer.stop();
+        samples.push_back(timer.elapsed_ns());
+    }
+    return BenchmarkStats::compute(samples).median_ns;
+}
+
+// K = 4 independent branches is enough opportunity to expose any scheduling
+// parallelism on every modern multi-core host, while keeping the work small
+// enough that a single-core fallback still completes quickly. Future PRs
+// can promote this to a CLI option if cross-machine variance demands it.
+constexpr int kParallelBranchesCount = 4;
+
+BenchmarkResult runParallelBranches(vx_context ctx, const Resolution& res,
+                                    const BenchmarkConfig& cfg) {
+    BenchmarkResult r;
+    r.iterations = cfg.iterations;
+    r.warmup = cfg.warmup;
+
+    const int K = kParallelBranchesCount;
+    TestDataGenerator gen(cfg.seed);
+
+    double t_serial_imm = timeSerialImmediateBranches(
+        ctx, res.width, res.height, K, cfg.warmup, cfg.iterations, gen);
+    double t_parallel = timeParallelGraphBranches(
+        ctx, res.width, res.height, K, cfg.warmup, cfg.iterations, gen);
+
+    if (t_serial_imm <= 0.0 || t_parallel <= 0.0) {
+        r.supported = false;
+        r.skip_reason = "parallel branches timing failed";
+        return r;
+    }
+
+    double speedup = t_serial_imm / t_parallel;
+    double efficiency = speedup / static_cast<double>(K);
+
+    r.framework_metrics = {
+        {"branches", static_cast<double>(K), "count", false},
+        {"serial_immediate_ms", t_serial_imm / 1e6, "ms", false},
+        {"parallel_graph_ms", t_parallel / 1e6, "ms", false},
+        {"parallelism_speedup", speedup, "x", true},
+        {"parallelism_efficiency", efficiency, "x", true},
+    };
+
+    r.wall_clock.median_ns = t_parallel;
+    r.wall_clock.mean_ns = t_parallel;
+    r.wall_clock.min_ns = t_parallel;
+    r.wall_clock.max_ns = t_parallel;
+    r.wall_clock.sample_count = static_cast<size_t>(cfg.iterations);
+    r.megapixels_per_sec = BenchmarkStats::computeThroughput(
+        res.width, res.height, t_parallel);
+
+    return r;
+}
+
 // Per-N timings collected by runVerifyChain; one of these is produced per
 // chain depth and feeds both the per-N metrics and the slope regression.
 struct VerifySample {
@@ -530,5 +672,19 @@ std::vector<BenchmarkCase> registerFrameworkBenchmarks() {
         cases.push_back(bc);
     }

+    {
+        BenchmarkCase bc;
+        bc.name = "ParallelBranches_Box3x3";
+        bc.category = "framework_parallel";
+        bc.feature_set = "framework";
+        bc.kernel_enum = VX_KERNEL_BOX_3x3;
+        bc.required_kernels = {VX_KERNEL_BOX_3x3};
+        bc.framework_run = [](vx_context ctx, const Resolution& res,
+                              const BenchmarkConfig& cfg) -> BenchmarkResult {
+            return runParallelBranches(ctx, res, cfg);
+        };
+        cases.push_back(bc);
+    }
+
     return cases;
 }
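
Both timing functions in this diff follow the same pattern: untimed warmup runs, then a timed loop whose samples are reduced to a median. The sketch below shows that pattern in isolation using only the standard library. `medianOf` and `timeMedianNs` are hypothetical stand-ins for the repo's `BenchmarkStats` and `BenchmarkTimer`, whose internals are not shown in this commit; in particular, averaging the two middle samples for an even count is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Median of a sample set; returns 0.0 for an empty input.
// Assumption: an even-sized set averages its two middle values.
inline double medianOf(std::vector<double> samples) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return (samples.size() % 2 == 1)
               ? samples[mid]
               : 0.5 * (samples[mid - 1] + samples[mid]);
}

// Runs `work` `warmup` times untimed, then `iterations` times timed,
// and returns the median wall-clock duration in nanoseconds.
template <typename Fn>
double timeMedianNs(Fn&& work, int warmup, int iterations) {
    assert(warmup >= 0 && iterations >= 0);
    for (int i = 0; i < warmup; ++i) work();

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(t1 - t0).count());
    }
    return medianOf(std::move(samples));
}
```

A median is less sensitive than a mean to the occasional slow iteration, which is presumably why the benchmark reports `median_ns`; the commit itself does not state the reason.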
