Benchmark Methodology Concerns: Identity Remap Pattern Inflates Performance Numbers #24

@simonCatBot

Issue: Benchmark Methodology Concerns for Remap Kernel

Summary

A review of the openvx-mark benchmarking methodology for the Remap kernel turned up several concerns that affect benchmark accuracy and reproducibility. The main one is that the current implementation uses idealized test conditions that may not represent real-world performance.

Environment

  • System: AMD RYZEN AI MAX+ PRO 395 w/ Radeon 8060S (gfx1151)
  • ROCm Version: 7.x
  • MIVisionX Version: Latest develop branch (as of June 2026)
  • Benchmark Tool: openvx-mark v1.1.0

Issues Identified

1. Identity Remap Pattern (Critical)

Location: src/test_data_generator.cpp:createRemap()

Problem: The benchmark uses an identity remap pattern: each destination pixel maps to the source pixel at the same relative position, which is an exact identity whenever the source and destination sizes match:

coords[y * dst_w + x].x = static_cast<vx_float32>(x * src_w) / static_cast<vx_float32>(dst_w);
coords[y * dst_w + x].y = static_cast<vx_float32>(y * src_h) / static_cast<vx_float32>(dst_h);

Impact:

  • Results in perfectly sequential memory access (highly cache-friendly)
  • Real-world remaps (fisheye correction, lens distortion) have scattered access patterns
  • GPU implementations may detect and optimize identity remaps
  • Measured throughput may be 2-5× higher than it would be for realistic workloads

Recommendation: Add alternative remap patterns:

  • Lens distortion (radial/tangential distortion model)
  • Random offsets within [-1, +1] pixel range
  • Worst-case (full random) access pattern for stress testing
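The jitter and full-random patterns above could be generated along these lines. This is a minimal sketch, not the benchmark's actual code: `Coord`, `RemapPattern`, and `makeRemap` are hypothetical names, `Coord` stands in for whatever coordinate type `createRemap()` fills, and the source and destination are assumed to be the same size. A fixed seed keeps the random patterns reproducible between runs.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for the coordinate type used by createRemap().
struct Coord { float x, y; };

enum class RemapPattern { Identity, Jitter, FullRandom };

// Builds a w x h remap table into a source image of the same size.
// The seed is fixed by default so every run sees the same "random" map.
std::vector<Coord> makeRemap(RemapPattern pattern, int w, int h,
                             std::uint32_t seed = 42) {
    std::vector<Coord> coords(static_cast<std::size_t>(w) * h);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> jitter(-1.0f, 1.0f);
    std::uniform_real_distribution<float> rand_x(0.0f, static_cast<float>(w - 1));
    std::uniform_real_distribution<float> rand_y(0.0f, static_cast<float>(h - 1));

    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            Coord& c = coords[static_cast<std::size_t>(y) * w + x];
            switch (pattern) {
            case RemapPattern::Identity:    // sequential reads (current behaviour)
                c = {static_cast<float>(x), static_cast<float>(y)};
                break;
            case RemapPattern::Jitter:      // identity plus an offset within one pixel
                c = {x + jitter(rng), y + jitter(rng)};
                break;
            case RemapPattern::FullRandom:  // worst case: any source pixel
                c = {rand_x(rng), rand_y(rng)};
                break;
            }
        }
    }
    return coords;
}
```

With the jitter pattern, samples at the image edge can land up to one pixel outside the source, so the result there depends on the border mode in use.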

2. Missing Memory Fences (Medium)

Location: src/benchmark_runner.cpp:runGraphMode() and runImmediateMode()

Problem: No explicit memory fence before/after timing calls:

timer.start();
vx_status s = vxProcessGraph(graph);  // May be reordered by compiler/CPU
timer.stop();

Impact: Compiler/hardware instruction reordering may affect timing accuracy

Recommendation: Add memory barriers:

// Requires #include <atomic>
std::atomic_thread_fence(std::memory_order_seq_cst);
timer.start();
vx_status s = vxProcessGraph(graph);
std::atomic_thread_fence(std::memory_order_seq_cst);
timer.stop();
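For reference, the same idea as a self-contained helper. This is a sketch under assumptions: `timeFencedNs` is a hypothetical name, and `std::chrono::steady_clock` stands in for the benchmark's own timer class. Note that fences order memory operations rather than the clock reads themselves, so this is best treated as a cheap precaution and not as a guarantee.

```cpp
#include <atomic>
#include <chrono>

// Times one call to `work` in nanoseconds, placing sequentially consistent
// fences on both sides of each clock read. Hypothetical helper; the
// benchmark's own timer would take the place of steady_clock.
template <typename Fn>
long long timeFencedNs(Fn&& work) {
    using Clock = std::chrono::steady_clock;

    std::atomic_thread_fence(std::memory_order_seq_cst);
    const Clock::time_point t0 = Clock::now();
    std::atomic_thread_fence(std::memory_order_seq_cst);

    work();  // e.g. vxProcessGraph(graph)

    std::atomic_thread_fence(std::memory_order_seq_cst);
    const Clock::time_point t1 = Clock::now();
    std::atomic_thread_fence(std::memory_order_seq_cst);

    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

In the runner this would be called as `timeFencedNs([&] { vxProcessGraph(graph); })`.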

3. Graph Reuse Across Iterations (Medium)

Problem: The same graph object is reused across all iterations without reconstruction:

// Warm-up
for (int i = 0; i < config_.warmup; i++) {
    vxProcessGraph(graph);  // Same graph reused
}

// Measurement
for (int i = 0; i < config_.iterations; i++) {
    timer.start();
    vxProcessGraph(graph);  // Same graph reused
    timer.stop();
}

Impact:

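  • Caches are always warm, so results do not represent cold-start performance
  • Memory allocation happens once, outside the timed region, so its cost never shows up in the measurement
  • JIT-compiled GPU kernels are already compiled and cached by the time measurement starts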

Recommendation: Add an option to recreate the graph on every iteration for cold-start testing
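A rough shape for such an option, written against generic callables so it does not depend on OpenVX headers. `measureColdAndWarm`, `ColdWarmTimes`, `build`, and `run` are all hypothetical names; for OpenVX, `build` would create and verify a graph (ideally returning an RAII wrapper that releases it) and `run` would call `vxProcessGraph`. Recreating the graph does not flush CPU/GPU caches or any driver-level kernel cache, so the "cold" numbers are only partially cold.

```cpp
#include <chrono>
#include <vector>

struct ColdWarmTimes {
    std::vector<double> cold_ms;  // build + first run, fresh object each time
    std::vector<double> warm_ms;  // run only, on one reused object
};

// `build` returns a fresh pipeline object; `run` executes it once.
// Both are caller-supplied; nothing here is OpenVX-specific.
template <typename Build, typename Run>
ColdWarmTimes measureColdAndWarm(Build build, Run run, int iterations) {
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;
    ColdWarmTimes out;

    // Cold: setup cost and first-run cost are inside the timed region.
    for (int i = 0; i < iterations; ++i) {
        const auto t0 = Clock::now();
        auto obj = build();
        run(obj);
        out.cold_ms.push_back(Ms(Clock::now() - t0).count());
    }

    // Warm: one object, one untimed warm-up run, then timed reuse.
    auto obj = build();
    run(obj);
    for (int i = 0; i < iterations; ++i) {
        const auto t0 = Clock::now();
        run(obj);
        out.warm_ms.push_back(Ms(Clock::now() - t0).count());
    }
    return out;
}
```

Reporting both series side by side would show how much of the headline number depends on warm state.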

4. No CPU Affinity (Low)

Problem: No thread pinning or CPU isolation

Impact: Context switches and scheduler effects can add noise to measurements

Recommendation: Add option to pin benchmark thread to specific CPU core
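On Linux, pinning could look roughly like the following. This is a sketch assuming glibc's `sched_setaffinity`/`sched_getaffinity` (a pid of 0 means the calling thread); the function names `pinCurrentThreadToCore` and `firstAllowedCore` are hypothetical, and other platforms would need a different API.

```cpp
#include <sched.h>

// Pins the calling thread to a single CPU core (Linux only).
// Returns true on success, false if the core is invalid or not permitted.
bool pinCurrentThreadToCore(int core) {
    if (core < 0 || core >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Returns the lowest-numbered core the calling thread may run on, or -1.
// Useful inside containers or cgroups where core 0 may not be available.
int firstAllowedCore() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int c = 0; c < CPU_SETSIZE; ++c) {
        if (CPU_ISSET(c, &set)) return c;
    }
    return -1;
}
```

Pinning only removes migration noise; other processes can still be scheduled on the same core unless it is also isolated (for example via `isolcpus` or a dedicated cpuset).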

5. Missing Graph Verification in Timing (Low)

Problem: vxVerifyGraph() is called once before warmup but not included in timing

Impact: First-call overhead not captured in measurements

Verification Results

Reproduction Attempt vs Forwarded Analysis

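Backend | Resolution | Forwarded MP/s | Reproduced MP/s | Variance
--------|------------|----------------|-----------------|-----------
CPU     | VGA        | 1,213.1        | 1,256.8         | +3.6% ✅
CPU     | FHD        | 1,261.5        | 1,239.0         | -1.8% ✅
HIP     | FHD        | 102,157.9      | 88,302.2        | -13.5% ⚠️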
  • CPU results match within expected run-to-run variance
  • HIP results are 8-14% lower, possibly due to identity remap optimization, thermal state, or a different ROCm version

OpenCL Backend Issue

Problem: Remap kernel not supported on OpenCL backend:

Remap SKIPPED (vxVerifyGraph failed (kernel not fully supported))

This means OpenCL results in performance comparisons cannot be reproduced.

Recommendations

High Priority

  1. Add realistic remap patterns - Identity remaps are not representative of real workloads
  2. Add memory fences around timing calls
  3. Document the identity remap assumption in benchmark results

Medium Priority

  1. Add CPU affinity option
  2. Add cold vs hot cache testing modes
  3. Include graph verification in timing (optional)

Documentation

  1. Clarify that results are "kernel-only" and don't include:
    • Graph construction/verification
    • Memory allocation
    • Host-to-device transfer overhead (for GPU; PCIe on discrete cards)
    • Realistic memory access patterns

Suggested Test Pattern: Lens Distortion

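// Radial distortion model (similar to camera calibration).
// Radii are normalized by the half-diagonal so k1/k2 are resolution-independent;
// with raw pixel radii the scale factor would push every sample out of bounds.
const float cx = width / 2.0f, cy = height / 2.0f;
const float k1 = 0.1f, k2 = 0.01f;              // Distortion coefficients
const float norm = std::sqrt(cx*cx + cy*cy);    // Requires #include <cmath>

for (vx_uint32 y = 0; y < height; ++y) {
    for (vx_uint32 x = 0; x < width; ++x) {
        const float dx = (x - cx) / norm;       // Normalized offset from center
        const float dy = (y - cy) / norm;
        const float r2 = dx*dx + dy*dy;
        const float r4 = r2 * r2;
        const float scale = 1.0f + k1*r2 + k2*r4;

        // Samples near the border may fall outside the source (border mode applies)
        coords[y*width+x].x = cx + dx * norm * scale;
        coords[y*width+x].y = cy + dy * norm * scale;
    }
}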

This would provide more realistic memory access patterns while still being deterministic.

Overall Assessment

Current Methodology Quality: 6.5/10

  • ✅ Good statistical rigor (outlier removal, percentiles, CV%)
  • ✅ High-resolution timing
  • ⚠️ Test data not representative (identity remaps)
  • ⚠️ Missing memory fences
  • ⚠️ No CPU isolation

The benchmark is directionally correct for comparing backends, but absolute MP/s numbers may be inflated for real-world workloads.

Related

  • Issue #1697 (CPU vs GPU performance comparison)
  • The forwarded analysis showing "400× speedup" is technically true for kernel-only but may be misleading for real pipelines

Labels: benchmark, methodology, performance, remap, accuracy
