Curves: reduce transcendentals, divergence, and loads on Intel iGPU by stefanatwork · Pull Request #598 · RenderKit/embree

stefanatwork · 2026-05-06T07:14:59Z

Targets the SYCL/Xe code paths in the curve intersectors:

curve_intersector_sweep.h (intersect_bezier_iterative_jacobian): share a single rsqrt(dot(dPdu,dPdu)) between the tangent normalization and the cos_err = P_err/length(dPdu) term, saving one sqrt/rsqrt per Jacobian iteration. Runs on both CPU and SYCL.
curve_intersector_sweep.h (SYCL intersect_bezier_recursive_jacobian): replace the well-behaved test 'm < 0.2*length(W)' with the equivalent squared-form comparison 'm*m < 0.04*dot(W,W)', removing one sqrt per recursion step.
curve_intersector_distance.h (DistanceCurve1Intersector1::intersect): the duplicated first-iteration prologue is now CPU-only (#if !defined(__SYCL_DEVICE_ONLY__)). On SYCL, where W=1, the duplication only inflated kernel size; control flow now falls straight into the unified loop.
curveNi_intersector.h (CurveNiIntersector1 and CurveNiIntersectorK SYCL paths): load offset (12 B) and scale (4 B) as 4 contiguous floats so IGC can coalesce them into a single 16-byte transaction instead of one Vec3f load plus a separate scalar float load.
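The shared reciprocal square root from the first item can be sketched in isolation. This is a minimal illustration, not the Embree code: `Vec3`, `two_roots`, and `shared_root` are hypothetical names, and `1.0f / std::sqrt(x)` stands in for whatever rsqrt the real intersector uses.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal 3-vector; Embree's own vector types are not used here.
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 operator*(const Vec3& a, float s) { return {a.x * s, a.y * s, a.z * s}; }

struct TangentAndCosErr { Vec3 T; float cos_err; };

// Baseline: the normalization and the length() division each take a square root.
inline TangentAndCosErr two_roots(const Vec3& dPdu, float P_err) {
  const Vec3 T = dPdu * (1.0f / std::sqrt(dot(dPdu, dPdu)));
  const float cos_err = P_err / std::sqrt(dot(dPdu, dPdu));
  return {T, cos_err};
}

// Shared form: one reciprocal square root feeds both the tangent and cos_err.
inline TangentAndCosErr shared_root(const Vec3& dPdu, float P_err) {
  const float len2 = dot(dPdu, dPdu);
  assert(len2 > 0.0f);                            // a zero-length tangent is not handled in this sketch
  const float rcp_len = 1.0f / std::sqrt(len2);   // stand-in for a hardware rsqrt
  return {dPdu * rcp_len, P_err * rcp_len};
}
```

Both functions return the same values up to rounding; the second simply makes the reuse explicit instead of relying on the compiler to merge the two roots.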


Copilot

Pull request overview

This PR optimizes the curve intersector implementations, especially the SYCL/Xe paths, for Intel iGPUs by reducing transcendental operations, minimizing divergence, and improving memory load coalescing.

Changes:

Reuses a single rsqrt(dot(dPdu,dPdu)) in the iterative Jacobian Bezier intersector to avoid redundant sqrt/rsqrt work.
Rewrites a “well-behaved” recursion criterion into a squared-form comparison to avoid length()/sqrt() per recursion step.
Reduces SYCL kernel bloat by making the duplicated “first-iteration” prologue CPU-only, and changes SYCL offset/scale loading to encourage a single 16B coalesced transaction.
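The squared-form rewrite of the well-behaved criterion can be sketched as follows. The sketch assumes `m` is non-negative, which is what makes squaring both sides safe; the PR text does not state this precondition, so treat it as an assumption. `Vec3` and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal 3-vector; not Embree's type.
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Original form: one square root per call via length(W).
inline bool well_behaved_sqrt(float m, const Vec3& W) {
  return m < 0.2f * std::sqrt(dot(W, W));
}

// Squared form: no square root. Agrees with the form above only for m >= 0
// (assumed here), because squaring is monotonic on non-negative values.
inline bool well_behaved_squared(float m, const Vec3& W) {
  assert(m >= 0.0f);
  return m * m < 0.04f * dot(W, W);
}
```

Values of `m` extremely close to `0.2*length(W)` may be classified differently by the two forms because of floating-point rounding.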

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| kernels/geometry/curveNi_intersector.h | SYCL path loads `offset`+`scale` as 4 contiguous floats to improve iGPU load coalescing. |
| kernels/geometry/curve_intersector_sweep.h | Reduces transcendental usage in Jacobian iteration and avoids `sqrt` in the recursion “well-behaved” test. |
| kernels/geometry/curve_intersector_distance.h | Makes the unrolled first-iteration prologue CPU-only so SYCL falls into the unified loop (smaller kernel / less duplication). |
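The CPU-only prologue pattern described for curve_intersector_distance.h can be illustrated with a toy loop. This is only a sketch of the preprocessor structure: `step` and `run` are hypothetical stand-ins, and the real intersector body is not reproduced. `__SYCL_DEVICE_ONLY__` is the macro the PR description names for the device compilation pass.

```cpp
#include <cassert>

// Hypothetical stand-in for one iteration of the intersection loop.
inline int step(int state) { return state + 1; }

inline int run(int iterations) {
  assert(iterations >= 0);
  int state = 0;
  int i = 0;
#if !defined(__SYCL_DEVICE_ONLY__)
  // Host build: the first iteration is peeled off ahead of the loop.
  if (iterations > 0) { state = step(state); i = 1; }
#endif
  // Device build: control flow falls straight into this unified loop,
  // so the loop body is emitted only once in the kernel.
  for (; i < iterations; i++) state = step(state);
  return state;
}
```

Both builds compute the same result; the guard only changes how much code is generated for the device.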


+        /* offset (12 B) and scale (4 B) live in 16 contiguous bytes; loading
+           them as 4 adjacent floats lets the SYCL/IGC compiler emit a single
+           coalesced 16-byte load instead of one Vec3f load + one scalar load. */
+        const float* offset_scale = (const float*)prim.offset(N);
+        const Vec3fa offset(offset_scale[0], offset_scale[1], offset_scale[2]);
+        const float scale = offset_scale[3];


+        /* see CurveNiIntersector1::intersect for rationale of this 4xfloat load */
+        const float* offset_scale = (const float*)prim.offset(N);
+        const Vec3fa offset(offset_scale[0], offset_scale[1], offset_scale[2]);
+        const float scale = offset_scale[3];
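The layout assumption behind the two snippets above can be checked with a small standalone sketch. `OffsetScale`, `Loaded`, and `load_offset_scale` are hypothetical names, not Embree's. The sketch uses `std::memcpy` rather than the pointer cast in the diff so that it stays strictly portable. Whether a given compiler emits a single 16-byte load is not something this code can demonstrate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical packed layout: a 12-byte offset immediately followed by a 4-byte scale.
struct OffsetScale { float offset[3]; float scale; };
static_assert(sizeof(OffsetScale) == 16, "offset and scale must fill 16 contiguous bytes");
static_assert(offsetof(OffsetScale, scale) == 12, "scale must directly follow offset");

struct Loaded { float ox, oy, oz, scale; };

// Read all four floats in one 16-byte copy instead of a 3-float read plus a scalar read.
inline Loaded load_offset_scale(const unsigned char* bytes) {
  assert(bytes != nullptr);
  float f[4];
  std::memcpy(f, bytes, sizeof(f));
  return {f[0], f[1], f[2], f[3]};
}
```

The static assertions document the precondition that makes the combined read valid: `scale` must sit directly after the three `offset` components with no padding in between.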


stefanatwork requested review from Copilot and svenwoop May 6, 2026 07:15

Copilot started reviewing on behalf of stefanatwork May 6, 2026 07:16

Copilot AI reviewed May 6, 2026


svenwoop approved these changes May 6, 2026

stefanatwork wants to merge 1 commit into master from sw/curve_optimizations

3 participants
