Curves: reduce transcendentals, divergence, and loads on Intel iGPU #598

Open
stefanatwork wants to merge 1 commit into master from sw/curve_optimizations

@stefanatwork
Collaborator

Targets the SYCL/Xe code paths in the curve intersectors:

* curve_intersector_sweep.h (intersect_bezier_iterative_jacobian): share a single rsqrt(dot(dPdu,dPdu)) between the tangent normalization and the cos_err = P_err/length(dPdu) term, saving one sqrt/rsqrt per Jacobian iteration. Runs on both CPU and SYCL.

* curve_intersector_sweep.h (SYCL intersect_bezier_recursive_jacobian): replace the well-behaved test 'm < 0.2*length(W)' with the equivalent squared-form comparison 'm*m < 0.04*dot(W,W)', removing one sqrt per recursion step.

* curve_intersector_distance.h (DistanceCurve1Intersector1::intersect): the duplicated first-iteration prologue is now CPU-only (#if !defined(__SYCL_DEVICE_ONLY__)). On SYCL where W=1 the duplication only inflated kernel size; control flow now falls straight into the unified loop.

* curveNi_intersector.h (CurveNiIntersector1 and CurveNiIntersectorK SYCL paths): load offset (12 B) and scale (4 B) as 4 contiguous floats so IGC can coalesce them into a single 16-byte transaction instead of one Vec3f load plus a separate scalar float load.
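
The first item above, sharing one reciprocal square root between the tangent normalization and the `cos_err` term, can be illustrated with a minimal standalone sketch. `Vec3`, `scale`, `TangentTerms`, and both function names are hypothetical stand-ins written for this example, not Embree code, and `1.0f / std::sqrt(...)` stands in for whatever `rsqrt` the intersector actually calls.

```cpp
#include <cmath>

// Hypothetical stand-in for a 3-component float vector (not Embree's Vec3fa).
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) {
  return a.x * b.x + a.y * b.y + a.z * b.z;
}

inline Vec3 scale(const Vec3& v, float s) {
  return {v.x * s, v.y * s, v.z * s};
}

// Illustrative result type: the normalized tangent and the cos_err term.
struct TangentTerms { Vec3 tangent; float cos_err; };

// Baseline: the normalization and the error term each compute the length.
inline TangentTerms tangent_terms_separate(const Vec3& dPdu, float P_err) {
  const Vec3 tangent = scale(dPdu, 1.0f / std::sqrt(dot(dPdu, dPdu)));
  const float cos_err = P_err / std::sqrt(dot(dPdu, dPdu));
  return {tangent, cos_err};
}

// Shared form: one reciprocal length feeds both results.
inline TangentTerms tangent_terms_shared(const Vec3& dPdu, float P_err) {
  const float rcp_len = 1.0f / std::sqrt(dot(dPdu, dPdu)); // stand-in for rsqrt
  return {scale(dPdu, rcp_len), P_err * rcp_len};
}
```

Both functions assume `dPdu` is non-zero. The shared form replaces the division in `cos_err` with a multiplication, so results can differ from the baseline in the last bits.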

Copilot AI left a comment

Pull request overview

This PR optimizes the curve intersector implementations (especially the SYCL/Xe paths) for Intel iGPUs by reducing transcendental operations, minimizing divergence, and improving memory load coalescing.

Changes:

  • Reuses a single rsqrt(dot(dPdu,dPdu)) in the iterative Jacobian Bezier intersector to avoid redundant sqrt/rsqrt work.
  • Rewrites a “well-behaved” recursion criterion into a squared-form comparison to avoid length()/sqrt() per recursion step.
  • Reduces SYCL kernel bloat by making the duplicated “first-iteration” prologue CPU-only, and changes SYCL offset/scale loading to encourage a single 16B coalesced transaction.
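
The squared-form rewrite of the "well-behaved" criterion can be sketched in isolation as follows. `Vec3` and the two function names are hypothetical and exist only for this illustration. The two forms agree when `m` is non-negative, which is an assumption of this sketch; the range of `m` in the actual intersector is not stated here.

```cpp
#include <cmath>

// Hypothetical stand-in for a 3-component float vector (not Embree's Vec3fa).
struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) {
  return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Original form: one sqrt per evaluation.
inline bool well_behaved_sqrt(float m, const Vec3& W) {
  return m < 0.2f * std::sqrt(dot(W, W));
}

// Squared form: no sqrt. Matches the form above only when m >= 0 (assumption),
// since squaring both sides of an inequality is only valid for non-negative values.
inline bool well_behaved_squared(float m, const Vec3& W) {
  return m * m < 0.04f * dot(W, W);
}
```

For a negative `m` the two tests can disagree: the original form accepts it whenever `W` is non-zero, while the squared form compares magnitudes. Values very close to the threshold may also flip because of floating-point rounding.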

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| kernels/geometry/curveNi_intersector.h | SYCL path loads offset+scale as 4 contiguous floats to improve iGPU load coalescing. |
| kernels/geometry/curve_intersector_sweep.h | Reduces transcendental usage in Jacobian iteration and avoids sqrt in the recursion “well-behaved” test. |
| kernels/geometry/curve_intersector_distance.h | Makes the unrolled first-iteration prologue CPU-only so SYCL falls into the unified loop (smaller kernel / less duplication). |


Comment on lines +30 to +35
/* offset (12 B) and scale (4 B) live in 16 contiguous bytes; loading
them as 4 adjacent floats lets the SYCL/IGC compiler emit a single
coalesced 16-byte load instead of one Vec3f load + one scalar load. */
const float* offset_scale = (const float*)prim.offset(N);
const Vec3fa offset(offset_scale[0], offset_scale[1], offset_scale[2]);
const float scale = offset_scale[3];
Comment on lines +316 to +319
/* see CurveNiIntersector1::intersect for rationale of this 4xfloat load */
const float* offset_scale = (const float*)prim.offset(N);
const Vec3fa offset(offset_scale[0], offset_scale[1], offset_scale[2]);
const float scale = offset_scale[3];
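
The layout the snippets above rely on, a 12-byte offset followed directly by a 4-byte scale, can be shown with a small standalone sketch. `Vec3`, `OffsetScale`, and `load_offset_scale` are hypothetical names for this example. It copies the bytes with `std::memcpy` instead of the pointer cast used in the PR, purely so that the sketch makes no alignment or aliasing assumptions; whether a compiler turns either form into a single 16-byte load depends on the target and is not shown here.

```cpp
#include <cstring>

// Hypothetical stand-in for a 3-component float vector (not Embree's Vec3fa).
struct Vec3 { float x, y, z; };

// Illustrative result type: the decoded offset and scale.
struct OffsetScale { Vec3 offset; float scale; };

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

// Reads offset (3 floats) and scale (1 float) from 16 contiguous bytes
// as four adjacent floats in one copy.
inline OffsetScale load_offset_scale(const unsigned char* bytes) {
  float f[4];
  std::memcpy(f, bytes, sizeof(f));
  return {{f[0], f[1], f[2]}, f[3]};
}
```
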
Comment thread kernels/geometry/curve_intersector_sweep.h