perf(scale): optimize ScaleImage with precomputed weights and slice access #32

Closed
simonCatBot wants to merge 3 commits into kiritigowda:main from simonCatBot:perf/under-2x

Conversation

@simonCatBot

Collaborator

Optimizes ScaleImage_Half and ScaleImage_Double for AMD EPYC Zen 3.

Changes:

  • Replace per-pixel f32 math with precomputed fixed-point weights (Q8)
  • Precompute source x positions for nearest-neighbor
  • Use direct slice access in inner loops (no get_pixel bounds checks)
  • Wire fast paths in vxu_scale_image_impl for full-region U8 images
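The first three changes listed above can be illustrated with a minimal sketch. It is not the PR's code: the names `XTap`, `precompute_taps`, and `scale_bilinear_q8`, the pixel-centre coordinate mapping, and the rounding choices are all assumptions made for illustration. The idea is that every floating-point computation happens once per row or column while building lookup tables, and the inner loop only does integer multiply-adds on row slices.

```rust
/// Hypothetical lookup entry for one destination coordinate:
/// the two source indices and the Q8 weight of the second one.
#[derive(Clone, Copy, Debug, PartialEq)]
struct XTap {
    x0: usize,
    x1: usize,
    w1: u32, // weight of x1 in Q8 (0..=256); the weight of x0 is 256 - w1
}

/// Precompute taps for one axis. Assumes pixel-centre mapping:
/// src = (dst + 0.5) * src_len / dst_len - 0.5, clamped to the image.
/// This is the only place where f32 math is used.
fn precompute_taps(src_len: usize, dst_len: usize) -> Vec<XTap> {
    assert!(src_len > 0);
    let scale = src_len as f32 / dst_len as f32;
    (0..dst_len)
        .map(|d| {
            let s = ((d as f32 + 0.5) * scale - 0.5).clamp(0.0, (src_len - 1) as f32);
            let x0 = s.floor() as usize;
            let x1 = (x0 + 1).min(src_len - 1);
            let w1 = ((s - x0 as f32) * 256.0 + 0.5) as u32; // round to Q8
            XTap { x0, x1, w1 }
        })
        .collect()
}

/// Bilinear scale of a tightly packed U8 image using the precomputed taps.
fn scale_bilinear_q8(src: &[u8], sw: usize, sh: usize, dst: &mut [u8], dw: usize, dh: usize) {
    assert_eq!(src.len(), sw * sh);
    assert_eq!(dst.len(), dw * dh);
    let xt = precompute_taps(sw, dw);
    let yt = precompute_taps(sh, dh);
    for (dy, ty) in yt.iter().enumerate() {
        // Slice out the two source rows and the destination row once per row.
        let row0 = &src[ty.x0 * sw..(ty.x0 + 1) * sw];
        let row1 = &src[ty.x1 * sw..(ty.x1 + 1) * sw];
        let out = &mut dst[dy * dw..(dy + 1) * dw];
        let wy1 = ty.w1;
        let wy0 = 256 - wy1;
        for (o, tx) in out.iter_mut().zip(&xt) {
            let wx1 = tx.w1;
            let wx0 = 256 - wx1;
            // Horizontal blends are Q8; the vertical blend makes the sum Q16.
            let top = row0[tx.x0] as u32 * wx0 + row0[tx.x1] as u32 * wx1;
            let bot = row1[tx.x0] as u32 * wx0 + row1[tx.x1] as u32 * wx1;
            // Add half of 1 << 16 to round, then drop the 16 fractional bits.
            *o = ((top * wy0 + bot * wy1 + (1 << 15)) >> 16) as u8;
        }
    }
}
```

The largest intermediate value is 255 * 256 * 256 plus the rounding term, which fits comfortably in a `u32`. Indexing into `row0` and `row1` is still bounds-checked by Rust in this sketch; it only avoids a per-pixel accessor call, which is one plausible reading of "direct slice access".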

Benchmarks (expected from PR #31 baseline):

| Kernel            | Before | After | Speedup |
|-------------------|--------|-------|---------|
| ScaleImage_Half   | 1.64x  | TBD   | TBD     |
| ScaleImage_Double | 1.67x  | TBD   | TBD     |

Testing:

  • cargo test --release passes, apart from the pre-existing warp_affine failure

Kiriti added 2 commits May 14, 2026 00:17
…ccess

- Add scale_image_nearest() with precomputed source x positions
- Add scale_image_bilinear() with Q8 fixed-point weights
- Wire fast paths in vxu_scale_image_impl for full-region U8 images
- Avoid per-pixel f32 math in hot loops
- Phase: use f32 atan2 instead of f64, direct slice access for S16 data
- NonLinearFilter: fast path for 3x3 all-ones mask with Min/Max/Median sorting network
- Add NonLinearMode enum and nonlinear_filter_3x3 to kernel_fast_paths
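For readers unfamiliar with the sorting-network approach mentioned for the 3x3 all-ones mask, here is a rough sketch. It uses the well-known 19-comparator median-of-9 network, which may or may not be the network used in this PR. The names `Mode`, `cswap`, and `filter_3x3_interior` are hypothetical (only `median9` appears in this conversation), and the sketch leaves border pixels untouched rather than implementing any border mode.

```rust
/// Hypothetical stand-in for the filter mode selection.
#[derive(Clone, Copy)]
enum Mode {
    Min,
    Max,
    Median,
}

/// Compare-exchange: afterwards p[i] <= p[j].
#[inline(always)]
fn cswap(p: &mut [u8; 9], i: usize, j: usize) {
    let (a, b) = (p[i], p[j]);
    p[i] = a.min(b);
    p[j] = a.max(b);
}

/// Median of nine values with a fixed sequence of 19 compare-exchanges.
fn median9(mut p: [u8; 9]) -> u8 {
    // Sort each of the three triples (0,1,2), (3,4,5), (6,7,8).
    cswap(&mut p, 1, 2); cswap(&mut p, 4, 5); cswap(&mut p, 7, 8);
    cswap(&mut p, 0, 1); cswap(&mut p, 3, 4); cswap(&mut p, 6, 7);
    cswap(&mut p, 1, 2); cswap(&mut p, 4, 5); cswap(&mut p, 7, 8);
    // p[6] becomes the max of the minima, p[2] the min of the maxima,
    // and p[4] the median of the three middle values.
    cswap(&mut p, 0, 3); cswap(&mut p, 5, 8); cswap(&mut p, 4, 7);
    cswap(&mut p, 3, 6); cswap(&mut p, 1, 4); cswap(&mut p, 2, 5);
    cswap(&mut p, 4, 7);
    // The median of p[2], p[4], p[6] is the median of all nine.
    cswap(&mut p, 4, 2); cswap(&mut p, 6, 4); cswap(&mut p, 4, 2);
    p[4]
}

/// Apply a 3x3 all-ones min/max/median filter to interior pixels of a
/// tightly packed U8 image. Border pixels of `dst` are left as they were.
fn filter_3x3_interior(src: &[u8], dst: &mut [u8], w: usize, h: usize, mode: Mode) {
    assert_eq!(src.len(), w * h);
    assert_eq!(dst.len(), w * h);
    if w < 3 || h < 3 {
        return;
    }
    for y in 1..h - 1 {
        // Take the three source rows as slices once per output row.
        let r0 = &src[(y - 1) * w..y * w];
        let r1 = &src[y * w..(y + 1) * w];
        let r2 = &src[(y + 1) * w..(y + 2) * w];
        let out = &mut dst[y * w..(y + 1) * w];
        for x in 1..w - 1 {
            let win = [
                r0[x - 1], r0[x], r0[x + 1],
                r1[x - 1], r1[x], r1[x + 1],
                r2[x - 1], r2[x], r2[x + 1],
            ];
            out[x] = match mode {
                // The window always holds nine values, so unwrap cannot fail.
                Mode::Min => win.iter().copied().min().unwrap(),
                Mode::Max => win.iter().copied().max().unwrap(),
                Mode::Median => median9(win),
            };
        }
    }
}
```

Because the comparison sequence is fixed and data-independent, the median path has no loops or data-dependent branches over the window, which is typically why a network is preferred over a general sort for a window this small.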
simonCatBot pushed a commit to simonCatBot/rustVX that referenced this pull request May 14, 2026
…lter fast paths

Move fast-path dispatch logic into kernel_fast_paths.rs thin wrappers
so vxu_impl.rs only adds ~13 lines total. Avoids .text bloat that
caused Subtract (0.648x) and Phase (0.808x) regressions in perf gate.

- try_scale_image_fast() — 1-line call in vxu_scale_image_impl
- try_phase_fast() — 3-line call in phase_s16
- try_nonlinear_filter_3x3_fast() — 4-line call in vxu_non_linear_filter_impl

All fast paths still gated on Gray format, full valid rect, non-Constant
border, and (for non-linear) 3x3 all-ones mask.
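The thin-wrapper pattern described in this commit message might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Image`, `Rect`, `Format`, and `Border` types, the `try_scale_fast` signature, the `#[inline(never)]` hint, and the integer pixel-centre mapping are not taken from rustVX. The wrapper checks the gating conditions itself and returns `false` when they do not hold, so the caller needs only a single `if`.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum Format { Gray, Rgb }

#[derive(Clone, Copy, PartialEq, Eq)]
enum Border { Undefined, Replicate, Constant(u8) }

#[derive(Clone, Copy, PartialEq, Eq)]
struct Rect { x0: usize, y0: usize, x1: usize, y1: usize }

/// Hypothetical image type: tightly packed, one byte per pixel when Gray.
struct Image {
    format: Format,
    width: usize,
    height: usize,
    valid: Rect,
    data: Vec<u8>,
}

impl Image {
    fn is_full_valid(&self) -> bool {
        self.valid == Rect { x0: 0, y0: 0, x1: self.width, y1: self.height }
    }
}

/// Returns true if the fast path handled the call, false if the caller
/// must fall back. Kept out of line (an assumption) so the caller's code
/// grows by only a call and a branch.
#[inline(never)]
fn try_scale_fast(src: &Image, dst: &mut Image, border: Border) -> bool {
    let eligible = src.format == Format::Gray
        && dst.format == Format::Gray
        && src.is_full_valid()
        && !matches!(border, Border::Constant(_))
        && src.width > 0 && src.height > 0
        && dst.width > 0 && dst.height > 0;
    if !eligible {
        return false;
    }
    // Nearest neighbour: precompute the source x for every destination x.
    // Integer form of floor((x + 0.5) * src_w / dst_w).
    let xs: Vec<usize> = (0..dst.width)
        .map(|x| (2 * x + 1) * src.width / (2 * dst.width))
        .collect();
    for dy in 0..dst.height {
        let sy = (2 * dy + 1) * src.height / (2 * dst.height);
        let srow = &src.data[sy * src.width..(sy + 1) * src.width];
        let drow = &mut dst.data[dy * dst.width..(dy + 1) * dst.width];
        for (d, &sx) in drow.iter_mut().zip(&xs) {
            *d = srow[sx];
        }
    }
    true
}

/// Caller side: one line of dispatch, then the existing generic path,
/// passed in here as a closure so the sketch stays self-contained.
fn scale_image(
    src: &Image,
    dst: &mut Image,
    border: Border,
    fallback: impl FnOnce(&Image, &mut Image, Border),
) {
    if try_scale_fast(src, dst, border) {
        return;
    }
    fallback(src, dst, border);
}
```

Keeping the eligibility checks inside the wrapper rather than at the call site is what lets the caller stay at one line, at the cost of one extra call on inputs that end up taking the generic path.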
@simonCatBot

Collaborator Author

Superseded by the approach in PR #34 (now closed). The original branch has build issues (a duplicate median9 definition), and the clean rebase also hit layout-shift problems in the perf gate. Will batch these optimizations into a future mega-PR.

@simonCatBot simonCatBot deleted the perf/under-2x branch May 27, 2026 17:14

2 participants