Issue: Benchmark Methodology Concerns for Remap Kernel
Summary
After reviewing the openvx-mark benchmarking methodology for the Remap kernel, I identified several concerns that may affect benchmark accuracy and reproducibility. The current implementation uses idealized test conditions that may not represent real-world performance.
Environment
- System: AMD RYZEN AI MAX+ PRO 395 w/ Radeon 8060S (gfx1151)
- ROCm Version: 7.x
- MIVisionX Version: Latest develop branch (as of June 2026)
- Benchmark Tool: openvx-mark v1.1.0
Issues Identified
1. Identity Remap Pattern (Critical)
Location: src/test_data_generator.cpp:createRemap()
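Problem: The benchmark uses an identity remap pattern in which each destination pixel maps to the same relative position in the source. This is an exact identity when source and destination sizes match, and a uniform scale otherwise: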
coords[y * dst_w + x].x = static_cast<vx_float32>(x * src_w) / static_cast<vx_float32>(dst_w);
coords[y * dst_w + x].y = static_cast<vx_float32>(y * src_h) / static_cast<vx_float32>(dst_h);
Impact:
- Results in perfectly sequential memory access (highly cache-friendly)
- Real-world remaps (fisheye correction, lens distortion) have scattered access patterns
- GPU implementations may detect and optimize identity remaps
- Measured performance may be 2-5× better than realistic workloads
Recommendation: Add alternative remap patterns:
- Lens distortion (radial/tangential distortion model)
- Random offsets within [-1, +1] pixel range
- Worst-case (full random) access pattern for stress testing
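The random-offset pattern listed above could be generated roughly as follows. This is a minimal sketch, not the benchmark's code: `Coord2f` is a stand-in for whatever coordinate type `createRemap()` fills, and `makeJitteredRemap`, `max_offset`, and `seed` are hypothetical names. It assumes C++17 (for `std::clamp`) and equal source and destination sizes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Stand-in for the benchmark's coordinate type (assumed: two float fields).
struct Coord2f {
    float x;
    float y;
};

// Builds a w x h remap table where each destination pixel samples its own
// position jittered by up to +/- max_offset pixels, clamped to the image.
// Assumes w >= 1 and h >= 1.
std::vector<Coord2f> makeJitteredRemap(std::size_t w, std::size_t h,
                                       float max_offset, std::uint32_t seed) {
    std::mt19937 rng(seed);  // fixed seed keeps the pattern reproducible
    std::uniform_real_distribution<float> jitter(-max_offset, max_offset);
    std::vector<Coord2f> coords(w * h);
    const float max_x = static_cast<float>(w - 1);
    const float max_y = static_cast<float>(h - 1);
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < w; ++x) {
            const float sx = static_cast<float>(x) + jitter(rng);
            const float sy = static_cast<float>(y) + jitter(rng);
            coords[y * w + x].x = std::clamp(sx, 0.0f, max_x);
            coords[y * w + x].y = std::clamp(sy, 0.0f, max_y);
        }
    }
    return coords;
}
```

Using a fixed seed keeps the benchmark deterministic across runs, so results stay comparable while the access pattern is no longer perfectly sequential.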
2. Missing Memory Fences (Medium)
Location: src/benchmark_runner.cpp:runGraphMode() and runImmediateMode()
Problem: No explicit memory fence before/after timing calls:
timer.start();
vx_status s = vxProcessGraph(graph); // May be reordered by compiler/CPU
timer.stop();
Impact: Compiler/hardware instruction reordering may affect timing accuracy
Recommendation: Add memory barriers:
std::atomic_thread_fence(std::memory_order_seq_cst);
timer.start();
vx_status s = vxProcessGraph(graph);
std::atomic_thread_fence(std::memory_order_seq_cst);
timer.stop();
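The same idea can be packaged as a small helper so every timed call gets identical treatment. The sketch below is an illustration only: `timeOnceNs` is a hypothetical name, and `std::chrono::steady_clock` stands in for the benchmark's own `timer`. Note that `std::atomic_thread_fence` constrains the ordering of memory operations; whether it changes anything around an opaque library call such as `vxProcessGraph` is an assumption that should be checked by measurement.

```cpp
#include <atomic>
#include <chrono>
#include <utility>

// Times a single invocation of `work` and returns the elapsed nanoseconds.
// A fence is placed before the first clock read and before the second one,
// mirroring the placement suggested in this issue.
template <typename F>
long long timeOnceNs(F&& work) {
    using clock = std::chrono::steady_clock;
    std::atomic_thread_fence(std::memory_order_seq_cst);
    const auto t0 = clock::now();
    std::forward<F>(work)();
    std::atomic_thread_fence(std::memory_order_seq_cst);
    const auto t1 = clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

A call site would then look like `timeOnceNs([&] { vxProcessGraph(graph); })`, with the returned value fed into the existing statistics.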
3. Graph Reuse Across Iterations (Medium)
Problem: The same graph object is reused across all iterations without reconstruction:
// Warm-up
for (int i = 0; i < config_.warmup; i++) {
    vxProcessGraph(graph); // Same graph reused
}
// Measurement
for (int i = 0; i < config_.iterations; i++) {
    timer.start();
    vxProcessGraph(graph); // Same graph reused
    timer.stop();
}
Impact:
- Warm caches may not represent cold-start performance
- One-time memory allocation costs are excluded from the measurement
- JIT-compiled GPU kernels are already warm
Recommendation: Add an option to recreate the graph per iteration for cold-start testing
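One possible shape for such an option is a harness that takes the build and run steps as callables. This is a sketch under assumptions: `measureMs`, `build`, `run`, and the `cold` flag are hypothetical and not part of openvx-mark. In a real integration `build` would create and verify the graph, `run` would call `vxProcessGraph`, and the previous graph would need to be released before being replaced. Rebuilding the graph also does not by itself flush CPU or GPU caches.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `run(handle)` `iterations` times and returns per-iteration times in ms.
// hot (cold == false): the handle is built once and reused.
// cold (cold == true): the handle is rebuilt before every iteration.
// Only `run` is timed; the build cost is deliberately left outside the window.
template <typename Build, typename Run>
std::vector<double> measureMs(Build build, Run run, int iterations, bool cold) {
    using clock = std::chrono::steady_clock;
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(std::max(iterations, 0)));
    auto handle = build();  // built once up front; the hot path keeps reusing it
    for (int i = 0; i < iterations; ++i) {
        if (cold && i > 0) {
            handle = build();  // cold path: fresh object for every iteration
        }
        const auto t0 = clock::now();
        run(handle);
        const auto t1 = clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return samples;
}
```

Reporting both modes side by side would show how much of the headline number depends on warm state.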
4. No CPU Affinity (Low)
Problem: No thread pinning or CPU isolation
Impact: Context switches and scheduler effects can add noise to measurements
Recommendation: Add an option to pin the benchmark thread to a specific CPU core
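On Linux, pinning can be done with `sched_setaffinity`. The following is a minimal, Linux-only sketch; `pinCurrentThreadToCpu` is a hypothetical helper name, and it assumes a glibc toolchain where `g++` defines `_GNU_SOURCE` by default so the `CPU_*` macros are available.

```cpp
#include <sched.h>

// Pins the calling thread to a single CPU core.
// Returns true on success, false if the index is out of range or the
// kernel rejects the request (for example, the core is not permitted).
bool pinCurrentThreadToCpu(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        return false;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

Calling this once before the warm-up loop keeps warm-up and measurement on the same core. Pinning reduces migration noise but does not isolate the core from other processes.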
5. Missing Graph Verification in Timing (Low)
Problem: vxVerifyGraph() is called once before warmup but not included in timing
Impact: First-call overhead not captured in measurements
Verification Results
Reproduction Attempt vs Forwarded Analysis
| Backend | Resolution | Forwarded MP/s | Reproduced MP/s | Difference |
|---------|------------|----------------|-----------------|------------|
| CPU | VGA | 1,213.1 | 1,256.8 | +3.6% ✅ |
| CPU | FHD | 1,261.5 | 1,239.0 | -1.8% ✅ |
| HIP | FHD | 102,157.9 | 88,302.2 | -13.6% ⚠️ |
- CPU results match within expected variance ✅
- HIP results are 8-14% lower, possibly due to identity remap optimization, thermal state, or a different ROCm version
OpenCL Backend Issue
Problem: Remap kernel not supported on OpenCL backend:
Remap SKIPPED (vxVerifyGraph failed (kernel not fully supported))
This means OpenCL results in performance comparisons cannot be reproduced.
Recommendations
High Priority
- Add realistic remap patterns - Identity remaps are not representative of real workloads
- Add memory fences around timing calls
- Document the identity remap assumption in benchmark results
Medium Priority
- Add CPU affinity option
- Add cold vs hot cache testing modes
- Include graph verification in timing (optional)
Documentation
- Clarify that results are "kernel-only" and don't include:
  - Graph construction/verification
  - Memory allocation
  - PCIe transfer overhead (for GPU)
  - Realistic memory access patterns
Suggested Test Pattern: Lens Distortion
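// Radial distortion model (similar to camera calibration), evaluated per pixel (x, y)
float cx = width / 2.0f, cy = height / 2.0f;
float k1 = 0.1f, k2 = 0.01f; // Distortion coefficients (for a normalized radius)
// Normalize offsets so r2 is about 1 at the image edge. With raw pixel offsets,
// k1 * r2 is on the order of 1e5 at FHD and nearly every sample lands out of bounds.
float norm = (cx > cy) ? cx : cy;
float dx = (x - cx) / norm;
float dy = (y - cy) / norm;
float r2 = dx*dx + dy*dy;
float r4 = r2 * r2;
float scale = 1.0f + k1*r2 + k2*r4;
// Samples near the border can still fall slightly outside the source image
coords[y*width+x].x = cx + dx * norm * scale;
coords[y*width+x].y = cy + dy * norm * scale;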
This would provide more realistic memory access patterns while still being deterministic.
Overall Assessment
Current Methodology Quality: 6.5/10
- ✅ Good statistical rigor (outlier removal, percentiles, CV%)
- ✅ High-resolution timing
- ⚠️ Test data not representative (identity remaps)
- ⚠️ Missing memory fences
- ⚠️ No CPU isolation
The benchmark is directionally correct for comparing backends, but absolute MP/s numbers may be inflated for real-world workloads.
Related
- Issue #1697 (CPU vs GPU performance comparison)
- The forwarded analysis showing a "400× speedup" is technically true for kernel-only timing but may be misleading for real pipelines
Labels: benchmark, methodology, performance, remap, accuracy