F16 variants - Update loads and stores to AVX2 - Group 6 #649
base: develop
F16 load store group 6
Pull request overview
This PR optimizes FP16 image processing operations by replacing scalar loads/stores with AVX2 vectorized intrinsics. The changes enable processing 24 elements at once with AVX2 (versus 12 with SSE), achieving 28-58% performance gains for PKD3 to PLN3 variants.
Key changes:
- Replaced SSE implementations with AVX2 variants for six augmentations (Copy, Color Jitter, Crop, Gridmask, Ricap, Crop and Patch)
- Direct FP16-to-FP32 AVX2 conversions eliminate intermediate scalar conversion loops
- Unified vectorization constants via conditional compilation for AVX2/SSE paths
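As a rough illustration of the "direct FP16-to-FP32 conversion" idea, the sketch below converts eight half-precision values per iteration with the F16C intrinsic `_mm256_cvtph_ps` and falls back to a scalar decoder otherwise. The helper names `halfBitsToFloat` and `convertHalfToFloat` are hypothetical and are not taken from this PR; the kernels in the PR may use different intrinsics or helpers, and this sketch assumes the build defines `__AVX2__` and `__F16C__` when those instruction sets are enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__AVX2__) && defined(__F16C__)
#include <immintrin.h>
#endif

// Scalar reference (hypothetical helper): decode one IEEE 754 binary16 bit pattern.
inline float halfBitsToFloat(std::uint16_t h)
{
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    std::uint32_t exponent = (h >> 10) & 0x1Fu;
    std::uint32_t mantissa = h & 0x3FFu;
    std::uint32_t bits;
    if (exponent == 0x1Fu) {
        bits = sign | 0x7F800000u | (mantissa << 13);                 // infinity or NaN
    } else if (exponent != 0) {
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);   // normal: rebias 15 -> 127
    } else if (mantissa == 0) {
        bits = sign;                                                  // signed zero
    } else {
        exponent = 113u;                                              // subnormal: renormalize
        while ((mantissa & 0x400u) == 0) { mantissa <<= 1; --exponent; }
        bits = sign | (exponent << 23) | ((mantissa & 0x3FFu) << 13);
    }
    float result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}

// Hypothetical helper: convert `count` half values (raw 16-bit patterns) to float.
inline void convertHalfToFloat(const std::uint16_t* src, float* dst, std::size_t count)
{
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;
#if defined(__AVX2__) && defined(__F16C__)
    // Vector body: one unaligned 128-bit load holds 8 halves, widened to 8 floats.
    for (; i + 8 <= count; i += 8) {
        const __m128i halves = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm256_storeu_ps(dst + i, _mm256_cvtph_ps(halves));
    }
#endif
    // Scalar tail (or the whole buffer when the vector path is compiled out).
    for (; i < count; ++i) {
        dst[i] = halfBitsToFloat(src[i]);
    }
}
```

Calling `convertHalfToFloat(in, out, n)` on a buffer of raw half bit patterns produces the same floats on either path, so the scalar decoder can serve as a reference when validating a vectorized kernel.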
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/modules/tensor/cpu/kernel/saturation.cpp | Added AVX2 fast path with direct F16-to-F32 SIMD operations for saturation adjustments across all layout combinations |
| src/modules/tensor/cpu/kernel/ricap.cpp | Implemented AVX2 vectorization for RICAP augmentation with unified vector increment handling |
| src/modules/tensor/cpu/kernel/hue.cpp | Added AVX2 support for hue adjustments with direct F16 conversions; fixed incorrect cast in scalar fallback path |
| src/modules/tensor/cpu/kernel/gridmask.cpp | Introduced AVX2 mask computation functions and integrated them across all gridmask layout paths |
| src/modules/tensor/cpu/kernel/crop_and_patch.cpp | Added AVX2 vectorization for crop-and-patch operations with unified alignment calculations |
| src/modules/tensor/cpu/kernel/crop.cpp | Implemented AVX2 fast path for crop operations across layout toggle scenarios |
| src/modules/tensor/cpu/kernel/copy.cpp | Added AVX2 support for copy operations with direct F16 SIMD loads/stores |
| src/modules/tensor/cpu/kernel/color_jitter.cpp | Implemented AVX2 color jitter computation with proper F16 boundary checking |
      else if ((srcDescPtr->c == 3) && (srcDescPtr->layout == RpptLayout::NHWC) && (dstDescPtr->layout == RpptLayout::NHWC))
      {
    -     Rpp32u alignedLength = bufferLength & ~3;
    +     // Rpp32u alignedLength = bufferLength & ~3;
Copilot AI (Dec 10, 2025):
Commented-out code should be removed rather than left in the codebase. The alignedLength calculation is now handled by the conditional compilation block above.
    // Rpp32u alignedLength = bufferLength & ~3;
Resolved
    // Rpp32u alignedLength = bufferLength & ~3;
Copilot AI (Dec 10, 2025):
Commented-out code should be removed rather than left in the codebase. The alignedLength calculation is now handled by the conditional compilation block above.
    // Rpp32u alignedLength = bufferLength & ~3;
Resolved
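The review comments above mention that `alignedLength` is now computed in a conditional compilation block. The sketch below shows one way such a block could look, assuming the 24-element (AVX2) and 12-element (SSE) step sizes quoted in the overview. The names `kVectorIncrementPkd3`, `roundDownToMultiple`, and `copyPkd3ToPln3` are hypothetical, and the inner loop is a plain C++ stand-in for the SIMD body. Note that a mask such as `bufferLength & ~3` only rounds down to a multiple of 4, so a step of 12 or 24 needs the divide-and-multiply form instead.

```cpp
#include <cassert>
#include <cstdint>

// Assumed step sizes for packed 3-channel data: 8 pixels x 3 channels with AVX2,
// 4 pixels x 3 channels with SSE. These constants are illustrative only.
#if defined(__AVX2__)
constexpr std::uint32_t kVectorIncrementPkd3 = 24;
#else
constexpr std::uint32_t kVectorIncrementPkd3 = 12;
#endif

// Round `length` down to a multiple of `step`; works for any non-zero step,
// unlike a bit mask, which requires a power-of-two step.
constexpr std::uint32_t roundDownToMultiple(std::uint32_t length, std::uint32_t step)
{
    return (length / step) * step;
}

// Hypothetical PKD3 -> PLN3 copy: `bufferLength` counts packed elements (pixels * 3).
template <typename T>
void copyPkd3ToPln3(const T* src, std::uint32_t bufferLength, T* dstR, T* dstG, T* dstB)
{
    assert(bufferLength % 3 == 0);
    const std::uint32_t alignedLength = roundDownToMultiple(bufferLength, kVectorIncrementPkd3);
    std::uint32_t i = 0;
    std::uint32_t pixel = 0;
    // Main loop: one full vector-sized chunk per iteration (SIMD body in a real kernel).
    for (; i < alignedLength; i += kVectorIncrementPkd3) {
        for (std::uint32_t k = 0; k < kVectorIncrementPkd3; k += 3, ++pixel) {
            dstR[pixel] = src[i + k];
            dstG[pixel] = src[i + k + 1];
            dstB[pixel] = src[i + k + 2];
        }
    }
    // Scalar tail: leftover pixels that do not fill a whole chunk.
    for (; i < bufferLength; i += 3, ++pixel) {
        dstR[pixel] = src[i];
        dstG[pixel] = src[i + 1];
        dstB[pixel] = src[i + 2];
    }
}
```

Keeping the step size in a single constant means the main loop bound and the tail start always agree, whichever path is compiled in.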
Hue and Saturation SSE Updates - F16 Copilot Fix
Codecov Report
❌ Patch coverage is
Coverage Diff
| | develop | #649 | +/- |
|---|---|---|---|
| Coverage | 88.16% | 88.38% | +0.21% |
| Files | 195 | 195 | |
| Lines | 82723 | 82420 | -303 |
| Hits | 72932 | 72839 | -93 |
| Misses | 9791 | 9581 | -210 |
This PR replaces scalar loads/stores and the scalar conversion to FP32 with AVX2 intrinsics. It adds nothing to, and removes nothing from, the external user API.