Add verify_chain framework benchmark

kiritigowda · cursoragent · kiritigowda · commit 626e88fc4d03 · 2026-05-15T00:31:31.000-07:00
The second framework scenario: time the cost of building and verifying graphs of varying depth, plus the lazy-allocation tax paid on the very first vxProcessGraph call.

For each chain depth N (configurable via --framework-chain-depths, default 1,4,16,64), the benchmark rebuilds a fresh chain of N Box3x3 nodes and times four phases per N:

    n{N}_create_ms          vxCreateGraph + N node creations
    n{N}_verify_ms          vxVerifyGraph
    n{N}_first_process_ms   first vxProcessGraph (lazy alloc included)
    n{N}_steady_process_ms  median vxProcessGraph after warmup

A linear regression across the (N, verify_ms) samples then yields:

    verify_per_node_ms         per-node verify slope (ms/node)
    verify_intercept_ms        fixed verify cost
    first_process_overhead_ms  first - steady at deepest chain (the one-shot
                               tax: lazy alloc, kernel JIT, target affinity
                               selection, etc.)

These metrics tell the story of the OpenVX runtime's compilation behavior in a way that no per-kernel measurement can. They surface implementation choices like:

- whether verify cost is linear, super-linear, or has step discontinuities (e.g. first call loads kernel modules)
- how much per-node overhead the validator/optimizer adds
- how aggressive lazy allocation is (a large first_process_overhead_ms means the impl defers most setup until actual execution)

The runner already pre-checks bc.required_kernels, so the case skips cleanly on impls without Box3x3.

Smoke results on MIVisionX show ~0.027 ms per added Box3x3 node during verify and ~12 ms first-process overhead at depth 64 -- previously invisible in any per-kernel benchmark.

Out of scope:

- other chain shapes (only Box3x3 here; mixed-kernel verify chains can be added later if useful)
- re-verify cost on parameter or dimension changes
- any heterogeneous-target scheduling effects (covered by PR #4)

Co-authored-by: Cursor <cursoragent@cursor.com>
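
The slope and intercept described in the commit message come from an ordinary least-squares fit of verify_ms = intercept + slope * N. The following is a minimal standalone sketch of that fit, not the code from this commit: `DepthPoint`, `LineFit`, and `fitVerifyCost` are hypothetical names chosen for illustration, and the fallback behavior (slope 0 with fewer than two points or when all depths are identical) is modeled on `verifyRegression` in the diff below.

```cpp
#include <vector>

// Hypothetical stand-in for one (chain depth, verify time) measurement.
struct DepthPoint {
    int n;             // chain depth (number of nodes)
    double verify_ms;  // measured vxVerifyGraph time at that depth
};

// Result of the fit; the fields play the roles of verify_per_node_ms
// and verify_intercept_ms.
struct LineFit {
    double slope;      // ms per added node
    double intercept;  // fixed cost in ms
};

// Ordinary least-squares fit of verify_ms = intercept + slope * n.
// Returns {0, 0} for no points, {0, y} for a single point, and {0, 0}
// when every depth is identical (the slope is undefined in that case).
LineFit fitVerifyCost(const std::vector<DepthPoint>& pts) {
    LineFit fit{0.0, 0.0};
    if (pts.empty()) return fit;

    double sum_x = 0, sum_y = 0, sum_xx = 0, sum_xy = 0;
    for (const auto& p : pts) {
        const double x = static_cast<double>(p.n);
        sum_x += x;
        sum_y += p.verify_ms;
        sum_xx += x * x;
        sum_xy += x * p.verify_ms;
    }

    if (pts.size() == 1) {
        fit.intercept = sum_y;  // a single sample: treat it as fixed cost
        return fit;
    }

    const double count = static_cast<double>(pts.size());
    const double denom = count * sum_xx - sum_x * sum_x;
    if (denom == 0.0) return fit;  // all depths equal: no slope to estimate

    fit.slope = (count * sum_xy - sum_x * sum_y) / denom;
    fit.intercept = (sum_y - fit.slope * sum_x) / count;
    return fit;
}
```

As a sanity check, synthetic points that lie exactly on verify_ms = 0.5 + 0.027 * N at depths 1, 4, 16, and 64 should recover a slope of about 0.027 ms/node and an intercept of about 0.5 ms. The 0.027 figure echoes the smoke result quoted above; the 0.5 ms intercept is made up for illustration.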
diff --git a/README.md b/README.md
@@ -120,6 +120,7 @@ cmake --build .
 | `--seed N` | PRNG seed for reproducible test data | `42` |
 | `--stability-threshold N` | CV% threshold for stability warnings | `15` |
 | `--max-retries N` | Max retries for unstable benchmarks (2x iterations each retry) | `0` |
+| `--framework-chain-depths N,N,...` | Chain depths swept by `VerifyChain_Box3x3` | `1,4,16,64` |
 
 #### Output
 
@@ -178,6 +179,7 @@ Framework benchmarks are **opt-in** — they are not in the default run and do n
 |:---|:---|:---|
 | `GraphDividend_Box3x3_x4` | Box3x3 × 4 | Pure framework overhead (same kernel, isolates orchestration cost) |
 | `GraphDividend_MixedFilters` | Gaussian3x3 → Box3x3 → Median3x3 → Erode3x3 | Realistic 4-stage filter pipeline |
+| `VerifyChain_Box3x3` | Box3x3 × N (sweeps `--framework-chain-depths`, default 1, 4, 16, 64) | Graph build / verify cost vs N nodes; first-process lazy-alloc tax |
 
 Each `GraphDividend_*` case times the same chain three ways and emits five metrics:
 
@@ -189,6 +191,20 @@ Each `GraphDividend_*` case times the same chain three ways and emits five metri
 | `graph_speedup` | × | `sum_immediate_ms / graph_virtual_ms`. **>1 means the graph form beats summed immediate calls** — the headline framework dividend |
 | `virtual_dividend` | × | `graph_real_ms / graph_virtual_ms`. **>1 means virtual intermediates help** (runtime did something useful with the freedom) |
 
+`VerifyChain_Box3x3` rebuilds a chain of N Box3x3 nodes for each requested depth and reports per-N timings plus three aggregate metrics:
+
+| Metric | Unit | Meaning |
+|:---|:---|:---|
+| `n{N}_create_ms` | ms | `vxCreateGraph` + N node creations at depth N |
+| `n{N}_verify_ms` | ms | `vxVerifyGraph` cost at depth N |
+| `n{N}_first_process_ms` | ms | First `vxProcessGraph` call (often pays a one-shot lazy-allocation / kernel-init tax) |
+| `n{N}_steady_process_ms` | ms | Median `vxProcessGraph` cost after warmup |
+| `verify_per_node_ms` | ms/node | Linear-regression slope of verify cost over N — the per-node verify tax |
+| `verify_intercept_ms` | ms | Linear-regression intercept — fixed verify cost independent of chain length |
+| `first_process_overhead_ms` | ms | `first_process_ms - steady_process_ms` at the deepest chain (taken as the last depth listed; clamped to 0 if negative): the cost of the first execution beyond steady state |
+
+Use `--framework-chain-depths 1,4,16,64,256` to sweep custom depths (defaults to `1,4,16,64`).
+
 ## Output
 
 ### Terminal Summary
diff --git a/include/benchmark_config.h b/include/benchmark_config.h
@@ -60,6 +60,12 @@ struct BenchmarkConfig {
 
     // Comparison
     std::vector<std::string> compare_files;
+
+    // Framework benchmarks: chain depths used by verify_chain (number of
+    // chained Box3x3 nodes). Each depth produces a per-N set of metrics and
+    // contributes to the verify-cost-vs-N slope. Default sweeps 1, 4, 16, 64
+    // nodes which is enough for a clean linear regression across most impls.
+    std::vector<int> framework_chain_depths = {1, 4, 16, 64};
 };
 
 // Default tensor dimensions for benchmarks
diff --git a/src/benchmarks/framework_benchmarks.cpp b/src/benchmarks/framework_benchmarks.cpp
@@ -237,6 +237,197 @@ BenchmarkResult runGraphDividend(const std::vector<ChainStage>& stages,
     return r;
 }
 
+// Per-N timings collected by runVerifyChain; one of these is produced per
+// chain depth and feeds both the per-N metrics and the slope regression.
+struct VerifySample {
+    int n;                      // chain depth (number of Box3x3 nodes)
+    double create_ms;           // time to vxCreateGraph + add N nodes
+    double verify_ms;           // time of vxVerifyGraph
+    double first_process_ms;    // first vxProcessGraph (lazy alloc included)
+    double steady_process_ms;   // median of subsequent vxProcessGraph calls
+    bool ok;
+};
+
+// Build a graph of N Box3x3 nodes (input -> N-1 virtual intermediates ->
+// output) and return per-phase timings.
+VerifySample timeVerifyChain(vx_context ctx, uint32_t width, uint32_t height,
+                             int n, int warmup, int iterations,
+                             TestDataGenerator& gen) {
+    VerifySample s{};
+    s.n = n;
+    if (n < 1) return s;
+
+    ResourceTracker tracker;
+
+    vx_image input = gen.createFilledImage(ctx, width, height, VX_DF_IMAGE_U8);
+    if (vxGetStatus((vx_reference)input) != VX_SUCCESS) return s;
+    tracker.trackImage(input);
+
+    vx_image output = vxCreateImage(ctx, width, height, VX_DF_IMAGE_U8);
+    if (vxGetStatus((vx_reference)output) != VX_SUCCESS) return s;
+    tracker.trackImage(output);
+
+    BenchmarkTimer timer;
+
+    // Phase 1: graph construction (vxCreateGraph + N node creations).
+    timer.start();
+    vx_graph graph = vxCreateGraph(ctx);
+    if (vxGetStatus((vx_reference)graph) != VX_SUCCESS) return s;
+    tracker.trackGraph(graph);
+
+    vx_image src = input;
+    for (int i = 0; i < n; i++) {
+        bool is_last = (i + 1 == n);
+        vx_image dst = is_last
+            ? output
+            : vxCreateVirtualImage(graph, width, height, VX_DF_IMAGE_U8);
+        if (vxGetStatus((vx_reference)dst) != VX_SUCCESS) return s;
+        if (!is_last) tracker.trackImage(dst);
+
+        vx_node node = vxBox3x3Node(graph, src, dst);
+        if (vxGetStatus((vx_reference)node) != VX_SUCCESS) return s;
+        tracker.trackNode(node);
+
+        src = dst;
+    }
+    timer.stop();
+    s.create_ms = timer.elapsed_ms();
+
+    // Phase 2: vxVerifyGraph. The headline framework metric.
+    timer.start();
+    if (vxVerifyGraph(graph) != VX_SUCCESS) return s;
+    timer.stop();
+    s.verify_ms = timer.elapsed_ms();
+
+    // Phase 3: first vxProcessGraph. Often pays a one-shot tax (lazy
+    // allocation of execution state, kernel JIT, target affinity selection)
+    // beyond the steady-state cost; this number minus the steady median is
+    // a useful "warm-up" signal.
+    timer.start();
+    if (vxProcessGraph(graph) != VX_SUCCESS) return s;
+    timer.stop();
+    s.first_process_ms = timer.elapsed_ms();
+
+    // Phase 4: steady state. Run `warmup` more untimed iterations, then
+    // take the median of `iterations` timed samples.
+    for (int i = 0; i < warmup; i++) vxProcessGraph(graph);
+    std::vector<double> samples;
+    samples.reserve(iterations);
+    for (int i = 0; i < iterations; i++) {
+        timer.start();
+        if (vxProcessGraph(graph) != VX_SUCCESS) return s;
+        timer.stop();
+        samples.push_back(timer.elapsed_ns());
+    }
+    s.steady_process_ms = BenchmarkStats::compute(samples).median_ns / 1e6;
+    s.ok = true;
+    return s;
+}
+
+// Linear regression over (n, verify_ms) samples returning slope and intercept
+// of verify_ms = intercept + slope * n. With fewer than 2 usable points the
+// slope is 0 and the intercept is the single sample's verify_ms (or 0).
+void verifyRegression(const std::vector<VerifySample>& samples,
+                      double& slope_out, double& intercept_out) {
+    slope_out = 0;
+    intercept_out = 0;
+    int count = 0;
+    double sum_x = 0, sum_y = 0, sum_xx = 0, sum_xy = 0;
+    for (const auto& s : samples) {
+        if (!s.ok) continue;
+        double x = static_cast<double>(s.n);
+        double y = s.verify_ms;
+        sum_x += x; sum_y += y;
+        sum_xx += x * x; sum_xy += x * y;
+        count++;
+    }
+    if (count < 2) {
+        if (count == 1) intercept_out = sum_y;
+        return;
+    }
+    double denom = count * sum_xx - sum_x * sum_x;
+    if (denom == 0) return;
+    slope_out = (count * sum_xy - sum_x * sum_y) / denom;
+    intercept_out = (sum_y - slope_out * sum_x) / count;
+}
+
+// Build a chain of N Box3x3 nodes for several N and report per-N create /
+// verify / first-process / steady-process timings, plus regression-derived
+// per-node verify slope and the lazy-alloc overhead at the deepest chain.
+BenchmarkResult runVerifyChain(vx_context ctx, const Resolution& res,
+                               const BenchmarkConfig& cfg) {
+    BenchmarkResult r;
+    r.iterations = cfg.iterations;
+    r.warmup = cfg.warmup;
+
+    const auto& depths = cfg.framework_chain_depths;
+    if (depths.empty()) {
+        r.supported = false;
+        r.skip_reason = "no chain depths configured";
+        return r;
+    }
+
+    TestDataGenerator gen(cfg.seed);
+    std::vector<VerifySample> samples;
+    samples.reserve(depths.size());
+    for (int n : depths) {
+        if (n < 1) continue;
+        VerifySample s = timeVerifyChain(ctx, res.width, res.height, n,
+                                         cfg.warmup, cfg.iterations, gen);
+        if (!s.ok) {
+            r.supported = false;
+            r.skip_reason = "verify chain timing failed at depth " +
+                            std::to_string(n);
+            return r;
+        }
+        samples.push_back(s);
+    }
+
+    // Per-N metrics. Names embed the depth so downstream consumers can pick
+    // them apart trivially.
+    for (const auto& s : samples) {
+        std::string p = "n" + std::to_string(s.n) + "_";
+        r.framework_metrics.push_back({p + "create_ms",
+                                       s.create_ms, "ms", false});
+        r.framework_metrics.push_back({p + "verify_ms",
+                                       s.verify_ms, "ms", false});
+        r.framework_metrics.push_back({p + "first_process_ms",
+                                       s.first_process_ms, "ms", false});
+        r.framework_metrics.push_back({p + "steady_process_ms",
+                                       s.steady_process_ms, "ms", false});
+    }
+
+    // Aggregates: linear-regression slope + intercept of verify cost vs N,
+    // and the first-process overhead at the deepest chain.
+    double slope_ms_per_node = 0, intercept_ms = 0;
+    verifyRegression(samples, slope_ms_per_node, intercept_ms);
+
+    r.framework_metrics.push_back({"verify_per_node_ms",
+                                   slope_ms_per_node, "ms/node", false});
+    r.framework_metrics.push_back({"verify_intercept_ms",
+                                   intercept_ms, "ms", false});
+
+    const auto& deepest = samples.back();
+    double first_overhead = deepest.first_process_ms - deepest.steady_process_ms;
+    if (first_overhead < 0) first_overhead = 0;
+    r.framework_metrics.push_back({"first_process_overhead_ms",
+                                   first_overhead, "ms", false});
+
+    // Surface the deepest-chain steady-state time as the canonical
+    // wall-clock so the row is sortable in scaling/top-N views without
+    // polluting Vision Score (framework results are filtered out there).
+    double primary_ns = deepest.steady_process_ms * 1e6;
+    r.wall_clock.median_ns = primary_ns;
+    r.wall_clock.mean_ns = primary_ns;
+    r.wall_clock.min_ns = primary_ns;
+    r.wall_clock.max_ns = primary_ns;
+    r.wall_clock.sample_count = static_cast<size_t>(cfg.iterations);
+    r.megapixels_per_sec = BenchmarkStats::computeThroughput(
+        res.width, res.height, primary_ns);
+
+    return r;
+}
+
 // Build the canonical "pure framework" chain: 4 Box3x3 nodes back-to-back.
 std::vector<ChainStage> makeBox3x3Chain() {
     ChainStage box;
@@ -325,5 +516,19 @@ std::vector<BenchmarkCase> registerFrameworkBenchmarks() {
         cases.push_back(bc);
     }
 
+    {
+        BenchmarkCase bc;
+        bc.name = "VerifyChain_Box3x3";
+        bc.category = "framework_compile";
+        bc.feature_set = "framework";
+        bc.kernel_enum = VX_KERNEL_BOX_3x3;
+        bc.required_kernels = {VX_KERNEL_BOX_3x3};
+        bc.framework_run = [](vx_context ctx, const Resolution& res,
+                              const BenchmarkConfig& cfg) -> BenchmarkResult {
+            return runVerifyChain(ctx, res, cfg);
+        };
+        cases.push_back(bc);
+    }
+
     return cases;
 }
diff --git a/src/main.cpp b/src/main.cpp
@@ -39,7 +39,8 @@ static void printUsage(const char* prog) {
     printf("  --warmup N                    Warm-up iterations (default: 10)\n");
     printf("  --seed N                      PRNG seed (default: 42)\n");
     printf("  --stability-threshold N       CV%% threshold for stability warning (default: 15)\n");
-    printf("  --max-retries N               Max retries for unstable benchmarks (default: 0)\n\n");
+    printf("  --max-retries N               Max retries for unstable benchmarks (default: 0)\n");
+    printf("  --framework-chain-depths N,N  Chain depths for verify_chain (default: 1,4,16,64)\n\n");
 
     printf("Output:\n");
     printf("  --output-dir DIR              Output directory (default: ./benchmark_results)\n");
@@ -161,6 +162,21 @@ static bool parseArgs(int argc, char* argv[], BenchmarkConfig& config) {
             config.stability_threshold = atof(argv[++i]);
         } else if (arg == "--max-retries" && i + 1 < argc) {
             config.max_retries = atoi(argv[++i]);
+        } else if (arg == "--framework-chain-depths" && i + 1 < argc) {
+            auto depth_strs = splitComma(argv[++i]);
+            config.framework_chain_depths.clear();
+            for (const auto& s : depth_strs) {
+                int n = atoi(s.c_str());
+                if (n > 0) {
+                    config.framework_chain_depths.push_back(n);
+                } else {
+                    printf("WARNING: Invalid chain depth '%s', skipping\n", s.c_str());
+                }
+            }
+            if (config.framework_chain_depths.empty()) {
+                printf("ERROR: No valid framework chain depths specified\n");
+                return false;
+            }
         } else if (arg == "--compare" && i + 1 < argc) {
             config.compare_files = splitComma(argv[++i]);
         } else {