Benchmark Methodology Concerns: Identity Remap Pattern Inflates Performance Numbers

# Issue: Benchmark Methodology Concerns for Remap Kernel

## Summary

A review of the openvx-mark benchmarking methodology for the Remap kernel turned up several concerns that may affect the accuracy and reproducibility of the reported numbers. The main one is that the current implementation uses idealized test conditions that may not represent real-world performance.

## Environment

- **System:** AMD RYZEN AI MAX+ PRO 395 w/ Radeon 8060S (gfx1151)
- **ROCm Version:** 7.x
- **MIVisionX Version:** Latest develop branch (as of June 2026)
- **Benchmark Tool:** openvx-mark v1.1.0

## Issues Identified

### 1. Identity Remap Pattern (Critical)

**Location:** `src/test_data_generator.cpp:createRemap()`

**Problem:** The benchmark uses an **identity remap** pattern: each destination pixel maps to the proportionally scaled source position, which is exactly the same pixel whenever the source and destination dimensions match:

```cpp
coords[y * dst_w + x].x = static_cast<vx_float32>(x * src_w) / static_cast<vx_float32>(dst_w);
coords[y * dst_w + x].y = static_cast<vx_float32>(y * src_h) / static_cast<vx_float32>(dst_h);
```

**Impact:**
- Results in perfectly sequential memory access (highly cache-friendly)
- Real-world remaps (fisheye correction, lens distortion) have scattered access patterns
- GPU implementations may detect and optimize identity remaps
- **Measured performance may be 2-5× better than realistic workloads**

**Recommendation:** Add alternative remap patterns:
- Lens distortion (radial/tangential distortion model)
- Random offsets within [-1, +1] pixel range
- Worst-case (full random) access pattern for stress testing
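
The jittered and fully random patterns from the list above could be generated along these lines. This is a minimal sketch, not the benchmark's code: `Coord2D`, `RemapPattern`, and `makeRemapTable` are hypothetical names, and `Coord2D` is only assumed to mirror the x/y layout of the coordinate type used in `createRemap()`. A fixed seed keeps the table identical across runs so results stay comparable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for the benchmark's coordinate type.
struct Coord2D {
    float x;
    float y;
};

enum class RemapPattern { Jitter, FullRandom };

// Builds a dst_w x dst_h table of source coordinates.
// Jitter:     scaled coordinate plus a uniform offset within +/-1 pixel.
// FullRandom: uniform over the whole source image (worst-case locality).
std::vector<Coord2D> makeRemapTable(RemapPattern pattern,
                                    std::uint32_t src_w, std::uint32_t src_h,
                                    std::uint32_t dst_w, std::uint32_t dst_h,
                                    std::uint32_t seed = 12345u) {
    assert(src_w > 1 && src_h > 1 && dst_w > 0 && dst_h > 0);
    std::mt19937 rng(seed);  // fixed seed: same table on every run
    std::uniform_real_distribution<float> jitter(-1.0f, 1.0f);
    std::uniform_real_distribution<float> any_x(0.0f, static_cast<float>(src_w - 1));
    std::uniform_real_distribution<float> any_y(0.0f, static_cast<float>(src_h - 1));

    std::vector<Coord2D> coords(static_cast<std::size_t>(dst_w) * dst_h);
    for (std::uint32_t y = 0; y < dst_h; ++y) {
        for (std::uint32_t x = 0; x < dst_w; ++x) {
            Coord2D& c = coords[static_cast<std::size_t>(y) * dst_w + x];
            if (pattern == RemapPattern::Jitter) {
                // Same scaling as the existing snippet, plus a small offset.
                c.x = static_cast<float>(x) * src_w / dst_w + jitter(rng);
                c.y = static_cast<float>(y) * src_h / dst_h + jitter(rng);
            } else {
                c.x = any_x(rng);
                c.y = any_y(rng);
            }
        }
    }
    return coords;
}
```

Jittered coordinates near the image edges can fall slightly outside the source, so results with this pattern also depend on the border mode configured for the Remap node.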

### 2. Missing Memory Fences (Medium)

**Location:** `src/benchmark_runner.cpp:runGraphMode()` and `runImmediateMode()`

**Problem:** No explicit memory fence before/after timing calls:

```cpp
timer.start();
vx_status s = vxProcessGraph(graph);  // May be reordered by compiler/CPU
timer.stop();
```

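**Impact:** Compiler or hardware reordering of memory operations around the clock reads may affect timing accuracy. Because `vxProcessGraph` is an opaque library call, the effect is likely small, but the fences are cheap insurance.

**Recommendation:** Add memory barriers (requires `<atomic>`):
```cpp
std::atomic_thread_fence(std::memory_order_seq_cst);  // order prior memory ops before the start timestamp
timer.start();
vx_status s = vxProcessGraph(graph);
std::atomic_thread_fence(std::memory_order_seq_cst);  // order the call's memory ops before the stop timestamp
timer.stop();
```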
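
For reference, the fenced timing region can be wrapped in a small helper so the fences cannot be forgotten at individual call sites. This is a sketch under assumptions: `timeOnce` is a hypothetical name, and `std::chrono::steady_clock` stands in for whatever clock the benchmark's `timer` uses internally.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <utility>

// Times one invocation of `work` and returns the elapsed time in nanoseconds.
// The fences order surrounding memory operations relative to the clock reads;
// they do not add any device-side synchronization.
template <typename Work>
std::chrono::nanoseconds timeOnce(Work&& work) {
    using Clock = std::chrono::steady_clock;  // monotonic clock
    std::atomic_thread_fence(std::memory_order_seq_cst);
    const Clock::time_point t0 = Clock::now();
    std::forward<Work>(work)();
    std::atomic_thread_fence(std::memory_order_seq_cst);
    const Clock::time_point t1 = Clock::now();
    assert(t1 >= t0);  // holds for any monotonic clock
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
}
```

A call site would then look like `timeOnce([&] { status = vxProcessGraph(graph); })`, with `status` declared outside the lambda.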

### 3. Graph Reuse Across Iterations (Medium)

**Problem:** The same graph object is reused across all iterations without reconstruction:

```cpp
// Warm-up
for (int i = 0; i < config_.warmup; i++) {
    vxProcessGraph(graph);  // Same graph reused
}

// Measurement
for (int i = 0; i < config_.iterations; i++) {
    timer.start();
    vxProcessGraph(graph);  // Same graph reused
    timer.stop();
}
```

**Impact:**
- Warm caches do not represent cold-start performance
- One-time memory allocation costs fall outside the timed region
- GPU kernels are already compiled and loaded before measurement begins

**Recommendation:** Add an option to recreate the graph on each iteration for cold-start testing
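
One way such an option could be structured is sketched below. The names `CacheMode` and `measureMs` are hypothetical. The three callables are assumed to wrap graph creation plus `vxVerifyGraph`, `vxProcessGraph`, and `vxReleaseGraph` respectively. Rebuilding the graph does not by itself flush CPU or GPU caches, nor any kernel cache the driver keeps, so this approximates a cold start rather than guaranteeing one.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

enum class CacheMode { Warm, Cold };

// Returns one sample per iteration, in milliseconds.
// Warm: build once, run `warmup` untimed iterations, then time each run.
// Cold: build a fresh graph for every sample and time only its first run;
//       warm-up is skipped by design. Build and destroy are never timed.
template <typename Build, typename Run, typename Destroy>
std::vector<double> measureMs(CacheMode mode, int warmup, int iterations,
                              Build build, Run run, Destroy destroy) {
    assert(warmup >= 0 && iterations > 0);
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iterations));

    if (mode == CacheMode::Warm) {
        auto graph = build();
        for (int i = 0; i < warmup; ++i) run(graph);
        for (int i = 0; i < iterations; ++i) {
            const auto t0 = Clock::now();
            run(graph);
            const auto t1 = Clock::now();
            samples.push_back(Ms(t1 - t0).count());
        }
        destroy(graph);
    } else {
        for (int i = 0; i < iterations; ++i) {
            auto graph = build();
            const auto t0 = Clock::now();
            run(graph);
            const auto t1 = Clock::now();
            samples.push_back(Ms(t1 - t0).count());
            destroy(graph);
        }
    }
    return samples;
}
```

Reporting the warm and cold sample sets side by side would make the gap between steady-state and first-run cost visible without changing the existing warm-path numbers.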

### 4. No CPU Affinity (Low)

**Problem:** No thread pinning or CPU isolation

**Impact:** Context switches and scheduler effects can add noise to measurements

**Recommendation:** Add option to pin benchmark thread to specific CPU core
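
On Linux, pinning could look roughly like the sketch below. It is Linux-specific: `sched_setaffinity` and the `CPU_*` macros are GNU extensions, and passing a pid of 0 applies the mask to the calling thread. The function names are hypothetical. Pinning only affects the host thread that submits work and has no effect on GPU scheduling.

```cpp
#include <cassert>
#include <sched.h>

// Pins the calling thread to `cpu`. Returns true on success.
bool pinCurrentThreadToCpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Returns the lowest-numbered CPU the calling thread may run on, or -1 on error.
// Useful inside containers or cgroups where CPU 0 may not be available.
int firstAllowedCpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}
```

The pin should be applied before the warm-up loop so that warm-up and measurement run on the same core.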

### 5. Missing Graph Verification in Timing (Low)

**Problem:** `vxVerifyGraph()` is called once before warmup but not included in timing

**Impact:** First-call overhead not captured in measurements

## Verification Results

### Reproduction Attempt vs Forwarded Analysis

| Backend | Resolution | Forwarded (MP/s) | Reproduced (MP/s) | Difference |
|---------|------------|------------------|-------------------|------------|
| CPU | VGA | 1,213.1 | 1,256.8 | +3.6% ✅ |
| CPU | FHD | 1,261.5 | 1,239.0 | -1.8% ✅ |
| HIP | FHD | 102,157.9 | 88,302.2 | -13.6% ⚠️ |

- **CPU results match within expected variance** ✅
- **HIP results are 8-14% lower** (13.6% in the FHD run shown above), possibly due to identity remap optimization, thermal state, or a different ROCm version
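
The last column of the table is the relative difference between the reproduced and the forwarded throughput, not a statistical variance. A minimal sketch of that calculation follows; the function name is hypothetical.

```cpp
#include <cassert>

// Relative difference of `measured` with respect to `reference`, in percent.
// Negative values mean the reproduced run was slower than the forwarded one.
double relativeDiffPercent(double reference, double measured) {
    assert(reference != 0.0);
    return (measured - reference) / reference * 100.0;
}
```

For example, the CPU VGA row corresponds to `relativeDiffPercent(1213.1, 1256.8)`.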

### OpenCL Backend Issue

**Problem:** Remap kernel not supported on OpenCL backend:
```
Remap SKIPPED (vxVerifyGraph failed (kernel not fully supported))
```

This means OpenCL results in performance comparisons cannot be reproduced.

## Recommendations

### High Priority

1. **Add realistic remap patterns** - Identity remaps are not representative of real workloads
2. **Add memory fences** around timing calls
3. **Document the identity remap assumption** in benchmark results

### Medium Priority

4. **Add CPU affinity option**
5. **Add cold vs hot cache testing modes**
6. **Include graph verification in timing** (optional)

### Documentation

7. **Clarify that results are "kernel-only"** and don't include:
   - Graph construction/verification
   - Memory allocation
   - Host-to-device transfer overhead (for GPU backends)
   - Realistic memory access patterns

## Suggested Test Pattern: Lens Distortion

```cpp
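// Radial distortion model (similar to camera calibration).
// r2 is normalized to [0, 1]; with raw pixel units, k1 * r2 would push
// almost every sample far outside the image.
const float cx = width / 2.0f, cy = height / 2.0f;
const float k1 = 0.1f, k2 = 0.01f;  // Distortion coefficients
const float inv_r2_max = 1.0f / (cx * cx + cy * cy);

for (vx_uint32 y = 0; y < height; y++) {
    for (vx_uint32 x = 0; x < width; x++) {
        const float dx = x - cx;
        const float dy = y - cy;
        const float r2 = (dx * dx + dy * dy) * inv_r2_max;
        const float scale = 1.0f + k1 * r2 + k2 * r2 * r2;

        // Samples near the edges land outside the source and use the border mode.
        coords[y * width + x].x = cx + dx * scale;
        coords[y * width + x].y = cy + dy * scale;
    }
}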
```

This would provide more realistic memory access patterns while still being deterministic.

## Overall Assessment

**Current Methodology Quality: 6.5/10**

- ✅ Good statistical rigor (outlier removal, percentiles, CV%)
- ✅ High-resolution timing
- ⚠️ Test data not representative (identity remaps)
- ⚠️ Missing memory fences
- ⚠️ No CPU isolation

The benchmark is **directionally correct** for comparing backends, but absolute MP/s numbers may be inflated for real-world workloads.

## Related

- Issue #1697 (CPU vs GPU performance comparison)
- The forwarded analysis showing "400× speedup" is technically true for kernel-only but may be misleading for real pipelines

---

**Labels:** `benchmark`, `methodology`, `performance`, `remap`, `accuracy`

