Curves: reduce transcendentals, divergence, and loads on Intel iGPU#598
Open
stefanatwork wants to merge 1 commit into master
Targets the SYCL/Xe code paths in the curve intersectors:

* curve_intersector_sweep.h (intersect_bezier_iterative_jacobian): share a single rsqrt(dot(dPdu,dPdu)) between the tangent normalization and the cos_err = P_err/length(dPdu) term, saving one sqrt/rsqrt per Jacobian iteration. Runs on both CPU and SYCL.
* curve_intersector_sweep.h (SYCL intersect_bezier_recursive_jacobian): replace the well-behaved test 'm < 0.2*length(W)' with the equivalent squared-form comparison 'm*m < 0.04*dot(W,W)', removing one sqrt per recursion step.
* curve_intersector_distance.h (DistanceCurve1Intersector1::intersect): the duplicated first-iteration prologue is now CPU-only (#if !defined(__SYCL_DEVICE_ONLY__)). On SYCL where W=1 the duplication only inflated kernel size; control flow now falls straight into the unified loop.
* curveNi_intersector.h (CurveNiIntersector1 and CurveNiIntersectorK SYCL paths): load offset (12 B) and scale (4 B) as 4 contiguous floats so IGC can coalesce them into a single 16-byte transaction instead of one Vec3f load plus a separate scalar float load.
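The shared-rsqrt change in the first item can be illustrated with a minimal standalone sketch. It is not the Embree code: `Vec3`, `mul`, `TangentAndCosErr`, and both function names are hypothetical stand-ins, and `1.0f / std::sqrt(x)` stands in for a hardware `rsqrt`.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a 3-component vector; not Embree's Vec3fa.
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
inline Vec3 mul(const Vec3& a, float s) { return {a.x*s, a.y*s, a.z*s}; }

// Hypothetical result pair: unit tangent and the cos_err term.
struct TangentAndCosErr { Vec3 T; float cos_err; };

// Before: the normalization and the cos_err term each evaluate a square root.
inline TangentAndCosErr tangent_two_sqrt(const Vec3& dPdu, float P_err) {
  const Vec3 T = mul(dPdu, 1.0f / std::sqrt(dot(dPdu, dPdu)));   // sqrt #1
  const float cos_err = P_err / std::sqrt(dot(dPdu, dPdu));      // sqrt #2
  return {T, cos_err};
}

// After: one reciprocal length is computed once and reused for both results.
inline TangentAndCosErr tangent_shared_rsqrt(const Vec3& dPdu, float P_err) {
  const float rcp_len = 1.0f / std::sqrt(dot(dPdu, dPdu));       // single rsqrt
  return {mul(dPdu, rcp_len), P_err * rcp_len};
}
```

Both variants agree up to rounding for a non-zero `dPdu`; the second simply replaces a division by `length(dPdu)` with a multiplication by the already available reciprocal length.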
Pull request overview
This PR optimizes the curve intersector implementations (especially SYCL/Xe paths) by reducing transcendental operations, minimizing divergence, and improving memory load coalescing to improve Intel iGPU performance.
Changes:
- Reuses a single `rsqrt(dot(dPdu,dPdu))` in the iterative Jacobian Bezier intersector to avoid redundant sqrt/rsqrt work.
- Rewrites a “well-behaved” recursion criterion into a squared-form comparison to avoid `length()`/`sqrt()` per recursion step.
- Reduces SYCL kernel bloat by making the duplicated “first-iteration” prologue CPU-only, and changes SYCL offset/scale loading to encourage a single 16B coalesced transaction.
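
A small sketch of the squared-form rewrite, using hypothetical helper names rather than the actual Embree functions. Squaring both sides of `m < 0.2*length(W)` gives `m*m < 0.04*dot(W,W)`, and the two tests agree whenever `m` is non-negative. Whether `m` is always non-negative in the real intersector is an assumption here, not something the PR text states.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a 3-component vector; not Embree's Vec3fa.
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Original form: one sqrt per evaluation via the vector length.
inline bool well_behaved_sqrt(float m, const Vec3& W) {
  return m < 0.2f * std::sqrt(dot(W, W));
}

// Squared form: no sqrt; equivalent to the test above only for m >= 0.
inline bool well_behaved_squared(float m, const Vec3& W) {
  return m * m < 0.04f * dot(W, W);
}
```

With `W = (3, 4, 0)` the threshold is `0.2 * 5 = 1`, so `m = 0.5` passes both tests and `m = 2` fails both. A negative value such as `m = -2` passes the sqrt form but fails the squared form, which is why the non-negativity assumption matters.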
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| kernels/geometry/curveNi_intersector.h | SYCL path loads offset+scale as 4 contiguous floats to improve iGPU load coalescing. |
| kernels/geometry/curve_intersector_sweep.h | Reduces transcendental usage in Jacobian iteration and avoids sqrt in the recursion “well-behaved” test. |
| kernels/geometry/curve_intersector_distance.h | Makes the unrolled first-iteration prologue CPU-only so SYCL falls into the unified loop (smaller kernel / less duplication). |
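
The CPU-only prologue pattern can be sketched as follows. The function `run_iterations` and its counting body are hypothetical placeholders for the real intersection loop; only the `__SYCL_DEVICE_ONLY__` guard mirrors what the PR describes.

```cpp
#include <cassert>

// Hypothetical loop: returns how many times the iteration body ran.
inline int run_iterations(int n) {
  int steps = 0;
  int i = 0;
#if !defined(__SYCL_DEVICE_ONLY__)
  // Host builds only: peeled first iteration, duplicated ahead of the loop.
  if (n > 0) { ++steps; i = 1; }
#endif
  // Unified loop: on the SYCL device, control flow starts here at i = 0.
  for (; i < n; ++i) ++steps;
  return steps;
}
```

Either way the body runs `n` times; the guard only decides whether the first iteration is emitted as a separate copy of the code, which is what affects kernel size on the device.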
Comment on lines +30 to +35

    /* offset (12 B) and scale (4 B) live in 16 contiguous bytes; loading
       them as 4 adjacent floats lets the SYCL/IGC compiler emit a single
       coalesced 16-byte load instead of one Vec3f load + one scalar load. */
    const float* offset_scale = (const float*)prim.offset(N);
    const Vec3fa offset(offset_scale[0], offset_scale[1], offset_scale[2]);
    const float scale = offset_scale[3];
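
The layout assumption behind this load can be checked in isolation. The sketch below uses hypothetical types (`OffsetScale`, `Loaded`) and a hypothetical helper `load_offset_scale`; it shows only that 12 bytes of offset followed by 4 bytes of scale can be read as four adjacent floats. Whether a given compiler then emits one 16-byte transaction is up to its code generator and is not demonstrated here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical packed layout: 3 floats of offset followed by 1 float of scale.
struct OffsetScale { float offset[3]; float scale; };
static_assert(sizeof(OffsetScale) == 16, "offset + scale should occupy 16 bytes");
static_assert(offsetof(OffsetScale, scale) == 12, "scale should follow offset directly");

// Hypothetical result type holding the four loaded values.
struct Loaded { float ox, oy, oz, scale; };

// Reads the 16 bytes as four adjacent floats in one copy.
// memcpy is used instead of a pointer cast to stay clear of aliasing issues.
inline Loaded load_offset_scale(const unsigned char* bytes) {
  float v[4];
  std::memcpy(v, bytes, sizeof(v));
  return {v[0], v[1], v[2], v[3]};
}
```

A caller would pass the address where the offset begins; the fourth element of the copy is then the scale.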
Comment on lines +316 to +319

    /* see CurveNiIntersector1::intersect for rationale of this 4xfloat load */
    const float* offset_scale = (const float*)prim.offset(N);
    const Vec3fa offset(offset_scale[0], offset_scale[1], offset_scale[2]);
    const float scale = offset_scale[3];
svenwoop approved these changes on May 6, 2026