
Commit 626e88f

Add verify_chain framework benchmark
The second framework scenario: time the cost of building and verifying
graphs of varying depth, plus the lazy-allocation tax paid on the very
first vxProcessGraph call.

For each chain depth N (configurable via --framework-chain-depths,
default 1,4,16,64), the benchmark rebuilds a fresh chain of N Box3x3
nodes and times four phases per N:

    n{N}_create_ms             vxCreateGraph + N node creations
    n{N}_verify_ms             vxVerifyGraph
    n{N}_first_process_ms      first vxProcessGraph (lazy alloc included)
    n{N}_steady_process_ms     median vxProcessGraph after warmup

A linear regression across the (N, verify_ms) samples then yields:

    verify_per_node_ms         per-node verify slope (ms/node)
    verify_intercept_ms        fixed verify cost
    first_process_overhead_ms  first - steady at deepest chain (the
                               one-shot tax: lazy alloc, kernel JIT,
                               target affinity selection, etc.)

These metrics tell the story of the OpenVX runtime's compilation
behavior in a way that no per-kernel measurement can. They surface
implementation choices like:

- whether verify cost is linear, super-linear, or has step
  discontinuities (e.g. first call loads kernel modules)
- how much per-node overhead the validator/optimizer adds
- how aggressive lazy allocation is (a large first_process_overhead_ms
  means the impl defers most setup until actual execution)

The runner already pre-checks bc.required_kernels, so the case skips
cleanly on impls without Box3x3.

Smoke results on MIVisionX show ~0.027 ms per added Box3x3 node during
verify and ~12 ms first-process overhead at depth 64 -- previously
invisible in any per-kernel benchmark.

Out of scope:

- other chain shapes (only Box3x3 here; mixed-kernel verify chains can
  be added later if useful)
- re-verify cost on parameter or dimension changes
- any heterogeneous-target scheduling effects (covered by PR #4)

Co-authored-by: Cursor <cursoragent@cursor.com>
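The first-versus-steady split described in the commit message can be sketched without OpenVX. The block below is a minimal illustration, not the benchmark's code: `PhaseTimes`, `timeOnceMs`, and `timeFirstVsSteady` are hypothetical names, `std::chrono::steady_clock` stands in for the project's `BenchmarkTimer`, and the median convention (mean of the two middle samples for an even count) is an assumption that may differ from `BenchmarkStats::compute`.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
struct PhaseTimes {
    double first_ms = 0.0;     // first call, including any one-shot setup
    double steady_ms = 0.0;    // median of the timed calls after warmup
    double overhead_ms = 0.0;  // max(0, first - steady)
};

// Time a single invocation of fn in milliseconds.
template <typename Fn>
double timeOnceMs(Fn&& fn) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Call `process` once (timed), `warmup` more times (untimed), then
// `iterations` times (timed), and report first vs. steady-state cost.
template <typename Fn>
PhaseTimes timeFirstVsSteady(Fn&& process, int warmup, int iterations) {
    assert(iterations > 0);
    PhaseTimes t;
    t.first_ms = timeOnceMs(process);

    for (int i = 0; i < warmup; ++i) process();

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        samples.push_back(timeOnceMs(process));
    }

    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    t.steady_ms = (samples.size() % 2 == 1)
        ? samples[mid]
        : 0.5 * (samples[mid - 1] + samples[mid]);

    // Clamp at zero: timer noise can make the first call look faster.
    t.overhead_ms = std::max(0.0, t.first_ms - t.steady_ms);
    return t;
}
```

In the benchmark the callable would wrap `vxProcessGraph(graph)`; here any callable works, which keeps the sketch testable without an OpenVX runtime.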
1 parent 7c93dfc commit 626e88f

4 files changed

Lines changed: 244 additions & 1 deletion


README.md

Lines changed: 16 additions & 0 deletions
@@ -120,6 +120,7 @@ cmake --build .
 | `--seed N` | PRNG seed for reproducible test data | `42` |
 | `--stability-threshold N` | CV% threshold for stability warnings | `15` |
 | `--max-retries N` | Max retries for unstable benchmarks (2x iterations each retry) | `0` |
+| `--framework-chain-depths N,N,...` | Chain depths swept by `VerifyChain_Box3x3` | `1,4,16,64` |
 
 #### Output
 
@@ -178,6 +179,7 @@ Framework benchmarks are **opt-in** — they are not in the default run and do n
 |:---|:---|:---|
 | `GraphDividend_Box3x3_x4` | Box3x3 × 4 | Pure framework overhead (same kernel, isolates orchestration cost) |
 | `GraphDividend_MixedFilters` | Gaussian3x3 → Box3x3 → Median3x3 → Erode3x3 | Realistic 4-stage filter pipeline |
+| `VerifyChain_Box3x3` | Box3x3 × N (sweeps `--framework-chain-depths`, default 1, 4, 16, 64) | Graph build / verify cost vs N nodes; first-process lazy-alloc tax |
 
 Each `GraphDividend_*` case times the same chain three ways and emits five metrics:
 
@@ -189,6 +191,20 @@ Each `GraphDividend_*` case times the same chain three ways and emits five metri
| `graph_speedup` | × | `sum_immediate_ms / graph_virtual_ms`. **>1 means the graph form beats summed immediate calls** — the headline framework dividend |
 | `virtual_dividend` | × | `graph_real_ms / graph_virtual_ms`. **>1 means virtual intermediates help** (runtime did something useful with the freedom) |
 
+`VerifyChain_Box3x3` rebuilds a chain of N Box3x3 nodes for each requested depth and reports per-N timings plus three aggregate metrics:
+
+| Metric | Unit | Meaning |
+|:---|:---|:---|
+| `n{N}_create_ms` | ms | `vxCreateGraph` + N node creations at depth N |
+| `n{N}_verify_ms` | ms | `vxVerifyGraph` cost at depth N |
+| `n{N}_first_process_ms` | ms | First `vxProcessGraph` call (often pays a one-shot lazy-allocation / kernel-init tax) |
+| `n{N}_steady_process_ms` | ms | Median `vxProcessGraph` cost after warmup |
+| `verify_per_node_ms` | ms/node | Linear-regression slope of verify cost over N: the per-node verify tax |
+| `verify_intercept_ms` | ms | Linear-regression intercept: the fixed verify cost, independent of chain length |
+| `first_process_overhead_ms` | ms | `first_process_ms - steady_process_ms` at the last configured depth (the deepest chain when depths are listed in ascending order), clamped at 0: the cost of the first execution beyond steady state |
+
+Use `--framework-chain-depths 1,4,16,64,256` to sweep custom depths (defaults to `1,4,16,64`).
+
 ## Output
 
 ### Terminal Summary
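The two regression metrics in the README table, `verify_per_node_ms` and `verify_intercept_ms`, come from an ordinary least-squares line through the (N, verify_ms) points. The following is a self-contained sketch of that fit using the same closed-form sums as the commit's `verifyRegression`; `LineFit` and `fitVerifyCost` are hypothetical names introduced here, and the explicit `valid` flag is an addition of this sketch rather than something the benchmark reports.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch; not part of the benchmark sources.
struct LineFit {
    double slope = 0.0;      // ms per node
    double intercept = 0.0;  // ms
    bool valid = false;      // false when fewer than 2 distinct depths exist
};

// Ordinary least-squares fit of verify_ms = intercept + slope * n.
LineFit fitVerifyCost(const std::vector<double>& depths,
                      const std::vector<double>& verify_ms) {
    assert(depths.size() == verify_ms.size());
    LineFit fit;
    if (depths.size() < 2) return fit;

    const double count = static_cast<double>(depths.size());
    double sum_x = 0, sum_y = 0, sum_xx = 0, sum_xy = 0;
    for (std::size_t i = 0; i < depths.size(); ++i) {
        sum_x += depths[i];
        sum_y += verify_ms[i];
        sum_xx += depths[i] * depths[i];
        sum_xy += depths[i] * verify_ms[i];
    }

    // denom is zero when every depth is identical; the slope is undefined.
    const double denom = count * sum_xx - sum_x * sum_x;
    if (denom == 0) return fit;

    fit.slope = (count * sum_xy - sum_x * sum_y) / denom;
    fit.intercept = (sum_y - fit.slope * sum_x) / count;
    fit.valid = true;
    return fit;
}
```

Feeding it synthetic points that lie exactly on a line, for example verify_ms = 0.5 + 0.027 * N at N = 1, 4, 16, 64, should recover that slope and intercept up to floating-point rounding. Real measurements will scatter around the line, so the slope is best read as an average per-node cost over the sweep.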

include/benchmark_config.h

Lines changed: 6 additions & 0 deletions
@@ -60,6 +60,12 @@ struct BenchmarkConfig {
 
     // Comparison
     std::vector<std::string> compare_files;
+
+    // Framework benchmarks: chain depths used by verify_chain (number of
+    // chained Box3x3 nodes). Each depth produces a per-N set of metrics and
+    // contributes to the verify-cost-vs-N slope. Default sweeps 1, 4, 16, 64
+    // nodes which is enough for a clean linear regression across most impls.
+    std::vector<int> framework_chain_depths = {1, 4, 16, 64};
 };
 
 // Default tensor dimensions for benchmarks
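The default sweep is ascending and duplicate-free, but a list supplied by a caller need not be. Repeated depths add no information to the slope, and a list with only one distinct depth leaves the regression without a defined slope. One possible way to tidy the list before use is sketched below; `normalizeDepths` is a hypothetical helper, not something this commit adds.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: drop non-positive depths, sort ascending, and remove
// duplicates so the last entry is the deepest chain.
std::vector<int> normalizeDepths(std::vector<int> depths) {
    depths.erase(std::remove_if(depths.begin(), depths.end(),
                                [](int n) { return n < 1; }),
                 depths.end());
    std::sort(depths.begin(), depths.end());
    depths.erase(std::unique(depths.begin(), depths.end()), depths.end());

    // Postcondition: every remaining depth is a valid chain length.
    assert(depths.empty() || depths.front() >= 1);
    return depths;
}
```

A caller could then require at least two entries in the result before trusting `verify_per_node_ms`.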

src/benchmarks/framework_benchmarks.cpp

Lines changed: 205 additions & 0 deletions
@@ -237,6 +237,197 @@ BenchmarkResult runGraphDividend(const std::vector<ChainStage>& stages,
     return r;
 }
 
+// Per-N timings collected by runVerifyChain; one of these is produced per
+// chain depth and feeds both the per-N metrics and the slope regression.
+struct VerifySample {
+    int n;                     // chain depth (number of Box3x3 nodes)
+    double create_ms;          // time to vxCreateGraph + add N nodes
+    double verify_ms;          // time of vxVerifyGraph
+    double first_process_ms;   // first vxProcessGraph (lazy alloc included)
+    double steady_process_ms;  // median of subsequent vxProcessGraph calls
+    bool ok;
+};
+
+// Build a graph of N Box3x3 nodes (input -> N-1 virtual intermediates ->
+// output) and return per-phase timings.
+VerifySample timeVerifyChain(vx_context ctx, uint32_t width, uint32_t height,
+                             int n, int warmup, int iterations,
+                             TestDataGenerator& gen) {
+    VerifySample s{};
+    s.n = n;
+    if (n < 1) return s;
+
+    ResourceTracker tracker;
+
+    vx_image input = gen.createFilledImage(ctx, width, height, VX_DF_IMAGE_U8);
+    if (vxGetStatus((vx_reference)input) != VX_SUCCESS) return s;
+    tracker.trackImage(input);
+
+    vx_image output = vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8);
+    if (vxGetStatus((vx_reference)output) != VX_SUCCESS) return s;
+    tracker.trackImage(output);
+
+    BenchmarkTimer timer;
+
+    // Phase 1: graph construction (vxCreateGraph + N node creations).
+    timer.start();
+    vx_graph graph = vxCreateGraph(ctx);
+    if (vxGetStatus((vx_reference)graph) != VX_SUCCESS) return s;
+    tracker.trackGraph(graph);
+
+    vx_image src = input;
+    for (int i = 0; i < n; i++) {
+        bool is_last = (i + 1 == n);
+        vx_image dst = is_last
+            ? output
+            : vxCreateVirtualImage(graph, width, height, VX_DF_IMAGE_U8);
+        if (vxGetStatus((vx_reference)dst) != VX_SUCCESS) return s;
+        if (!is_last) tracker.trackImage(dst);
+
+        vx_node node = vxBox3x3Node(graph, src, dst);
+        if (vxGetStatus((vx_reference)node) != VX_SUCCESS) return s;
+        tracker.trackNode(node);
+
+        src = dst;
+    }
+    timer.stop();
+    s.create_ms = timer.elapsed_ms();
+
+    // Phase 2: vxVerifyGraph. The headline framework metric.
+    timer.start();
+    if (vxVerifyGraph(graph) != VX_SUCCESS) return s;
+    timer.stop();
+    s.verify_ms = timer.elapsed_ms();
+
+    // Phase 3: first vxProcessGraph. Often pays a one-shot tax (lazy
+    // allocation of execution state, kernel JIT, target affinity selection)
+    // beyond the steady-state cost; this number minus the steady median is
+    // a useful "warm-up" signal.
+    timer.start();
+    if (vxProcessGraph(graph) != VX_SUCCESS) return s;
+    timer.stop();
+    s.first_process_ms = timer.elapsed_ms();
+
+    // Phase 4: steady-state. Run `warmup` more untimed calls, then take the
+    // median of `iterations` timed samples.
+    for (int i = 0; i < warmup; i++) vxProcessGraph(graph);
+    std::vector<double> samples;
+    samples.reserve(iterations);
+    for (int i = 0; i < iterations; i++) {
+        timer.start();
+        if (vxProcessGraph(graph) != VX_SUCCESS) return s;
+        timer.stop();
+        samples.push_back(timer.elapsed_ns());
+    }
+    s.steady_process_ms = BenchmarkStats::compute(samples).median_ns / 1e6;
+    s.ok = true;
+    return s;
+}
+
+// Least-squares fit of verify_ms = intercept + slope * n over the usable
+// samples. With one usable sample: slope 0, intercept = its verify_ms. With
+// none, or when all depths are equal, both outputs stay 0.
+void verifyRegression(const std::vector<VerifySample>& samples,
+                      double& slope_out, double& intercept_out) {
+    slope_out = 0;
+    intercept_out = 0;
+    int count = 0;
+    double sum_x = 0, sum_y = 0, sum_xx = 0, sum_xy = 0;
+    for (const auto& s : samples) {
+        if (!s.ok) continue;
+        double x = static_cast<double>(s.n);
+        double y = s.verify_ms;
+        sum_x += x; sum_y += y;
+        sum_xx += x * x; sum_xy += x * y;
+        count++;
+    }
+    if (count < 2) {
+        if (count == 1) intercept_out = sum_y;
+        return;
+    }
+    double denom = count * sum_xx - sum_x * sum_x;
+    if (denom == 0) return;
+    slope_out = (count * sum_xy - sum_x * sum_y) / denom;
+    intercept_out = (sum_y - slope_out * sum_x) / count;
+}
+
+// Build a chain of N Box3x3 nodes for several N and report per-N create /
+// verify / first-process / steady-process timings, plus regression-derived
+// per-node verify slope and the lazy-alloc overhead at the deepest chain.
+BenchmarkResult runVerifyChain(vx_context ctx, const Resolution& res,
+                               const BenchmarkConfig& cfg) {
+    BenchmarkResult r;
+    r.iterations = cfg.iterations;
+    r.warmup = cfg.warmup;
+
+    const auto& depths = cfg.framework_chain_depths;
+    if (depths.empty()) {
+        r.supported = false;
+        r.skip_reason = "no chain depths configured";
+        return r;
+    }
+
+    TestDataGenerator gen(cfg.seed);
+    std::vector<VerifySample> samples;
+    samples.reserve(depths.size());
+    for (int n : depths) {
+        if (n < 1) continue;
+        VerifySample s = timeVerifyChain(ctx, res.width, res.height, n,
+                                         cfg.warmup, cfg.iterations, gen);
+        if (!s.ok) {
+            r.supported = false;
+            r.skip_reason = "verify chain timing failed at depth " +
+                            std::to_string(n);
+            return r;
+        }
+        samples.push_back(s);
+    }
+
+    // Per-N metrics. Names embed the depth so downstream consumers can pick
+    // them apart trivially.
+    for (const auto& s : samples) {
+        std::string p = "n" + std::to_string(s.n) + "_";
+        r.framework_metrics.push_back({p + "create_ms",
+                                       s.create_ms, "ms", false});
+        r.framework_metrics.push_back({p + "verify_ms",
+                                       s.verify_ms, "ms", false});
+        r.framework_metrics.push_back({p + "first_process_ms",
+                                       s.first_process_ms, "ms", false});
+        r.framework_metrics.push_back({p + "steady_process_ms",
+                                       s.steady_process_ms, "ms", false});
+    }
+
+    // Aggregates: linear-regression slope + intercept of verify cost vs N,
+    // and the first-process overhead at the last (normally deepest) depth.
+    double slope_ms_per_node = 0, intercept_ms = 0;
+    verifyRegression(samples, slope_ms_per_node, intercept_ms);
+
+    r.framework_metrics.push_back({"verify_per_node_ms",
+                                   slope_ms_per_node, "ms/node", false});
+    r.framework_metrics.push_back({"verify_intercept_ms",
+                                   intercept_ms, "ms", false});
+
+    const auto& deepest = samples.back();
+    double first_overhead = deepest.first_process_ms - deepest.steady_process_ms;
+    if (first_overhead < 0) first_overhead = 0;
+    r.framework_metrics.push_back({"first_process_overhead_ms",
+                                   first_overhead, "ms", false});
+
+    // Surface the deepest-chain steady-state time as the canonical
+    // wall-clock so the row is sortable in scaling/top-N views without
+    // polluting Vision Score (framework results are filtered out there).
+    double primary_ns = deepest.steady_process_ms * 1e6;
+    r.wall_clock.median_ns = primary_ns;
+    r.wall_clock.mean_ns = primary_ns;
+    r.wall_clock.min_ns = primary_ns;
+    r.wall_clock.max_ns = primary_ns;
+    r.wall_clock.sample_count = static_cast<size_t>(cfg.iterations);
+    r.megapixels_per_sec = BenchmarkStats::computeThroughput(
+        res.width, res.height, primary_ns);
+
+    return r;
+}
+
 // Build the canonical "pure framework" chain: 4 Box3x3 nodes back-to-back.
 std::vector<ChainStage> makeBox3x3Chain() {
     ChainStage box;
@@ -325,5 +516,19 @@ std::vector<BenchmarkCase> registerFrameworkBenchmarks() {
         cases.push_back(bc);
     }
 
+    {
+        BenchmarkCase bc;
+        bc.name = "VerifyChain_Box3x3";
+        bc.category = "framework_compile";
+        bc.feature_set = "framework";
+        bc.kernel_enum = VX_KERNEL_BOX_3x3;
+        bc.required_kernels = {VX_KERNEL_BOX_3x3};
+        bc.framework_run = [](vx_context ctx, const Resolution& res,
+                              const BenchmarkConfig& cfg) -> BenchmarkResult {
+            return runVerifyChain(ctx, res, cfg);
+        };
+        cases.push_back(bc);
+    }
+
     return cases;
 }
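One detail of `runVerifyChain` is worth noting: it takes `samples.back()` as the deepest chain, which matches the largest N only when depths are listed in ascending order, and it assumes at least one depth survived the `n < 1` filter. The sketch below shows an order-independent alternative under those observations. `Sample`, `deepestSample`, and `firstProcessOverheadMs` are hypothetical names, and `Sample` is a trimmed-down stand-in for `VerifySample`, not the struct from the diff.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, trimmed-down stand-in for the VerifySample struct in the diff.
struct Sample {
    int n;                     // chain depth
    double first_process_ms;   // first vxProcessGraph
    double steady_process_ms;  // steady-state median
};

// Return the sample with the largest n, or nullptr when there are none.
const Sample* deepestSample(const std::vector<Sample>& samples) {
    const Sample* best = nullptr;
    for (const auto& s : samples) {
        if (best == nullptr || s.n > best->n) best = &s;
    }
    return best;
}

// first - steady at the deepest chain, clamped at 0 as in the diff.
// Returns 0 for an empty sample list instead of reading past the end.
double firstProcessOverheadMs(const std::vector<Sample>& samples) {
    const Sample* deepest = deepestSample(samples);
    if (deepest == nullptr) return 0.0;
    assert(deepest->n >= 1);  // depths are expected to be validated upstream
    const double overhead =
        deepest->first_process_ms - deepest->steady_process_ms;
    return overhead < 0 ? 0.0 : overhead;
}
```

With depths passed as `64,1,16`, this picks the N = 64 sample, whereas a position-based choice would report the overhead at N = 16.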

src/main.cpp

Lines changed: 17 additions & 1 deletion
@@ -39,7 +39,8 @@ static void printUsage(const char* prog) {
     printf(" --warmup N Warm-up iterations (default: 10)\n");
     printf(" --seed N PRNG seed (default: 42)\n");
     printf(" --stability-threshold N CV%% threshold for stability warning (default: 15)\n");
-    printf(" --max-retries N Max retries for unstable benchmarks (default: 0)\n\n");
+    printf(" --max-retries N Max retries for unstable benchmarks (default: 0)\n");
+    printf(" --framework-chain-depths N,N Chain depths for verify_chain (default: 1,4,16,64)\n\n");
 
     printf("Output:\n");
     printf(" --output-dir DIR Output directory (default: ./benchmark_results)\n");
@@ -161,6 +162,21 @@ static bool parseArgs(int argc, char* argv[], BenchmarkConfig& config) {
             config.stability_threshold = atof(argv[++i]);
         } else if (arg == "--max-retries" && i + 1 < argc) {
             config.max_retries = atoi(argv[++i]);
+        } else if (arg == "--framework-chain-depths" && i + 1 < argc) {
+            auto depth_strs = splitComma(argv[++i]);
+            config.framework_chain_depths.clear();
+            for (const auto& s : depth_strs) {
+                int n = atoi(s.c_str());
+                if (n > 0) {
+                    config.framework_chain_depths.push_back(n);
+                } else {
+                    printf("WARNING: Invalid chain depth '%s', skipping\n", s.c_str());
+                }
+            }
+            if (config.framework_chain_depths.empty()) {
+                printf("ERROR: No valid framework chain depths specified\n");
+                return false;
+            }
         } else if (arg == "--compare" && i + 1 < argc) {
             config.compare_files = splitComma(argv[++i]);
         } else {
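The depth parser above relies on `atoi`, which stops at the first non-digit, so a token such as `4x` is accepted as 4, and out-of-range input has undefined behavior. A stricter variant based on `std::strtol` is sketched below as one possible alternative. `parseChainDepths` is a hypothetical function, and it splits the string itself with `std::getline` rather than calling the project's `splitComma`, whose exact behavior is not shown in this diff.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical strict parser: "1,4,16,64" -> {1, 4, 16, 64}.
// Tokens that are empty, have trailing characters, overflow, or are < 1
// are dropped.
std::vector<int> parseChainDepths(const std::string& csv) {
    std::vector<int> depths;
    std::istringstream in(csv);
    std::string token;
    while (std::getline(in, token, ',')) {
        if (token.empty()) continue;

        errno = 0;
        char* end = nullptr;
        const long v = std::strtol(token.c_str(), &end, 10);
        assert(end != nullptr);  // strtol always sets a non-null endptr

        // Accept only when at least one digit was read and nothing follows.
        const bool whole_token = (end != token.c_str()) && (*end == '\0');
        if (!whole_token || errno == ERANGE || v < 1 || v > INT_MAX) continue;

        depths.push_back(static_cast<int>(v));
    }
    return depths;
}
```

A caller would still want to warn about each dropped token and fail when the result is empty, as the committed parser does.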
