Optimize Indexing Backward Kernel with Sub-group Aggregation and Tailored Stride Dispatch #2749
Description
This PR optimizes the `indexing_backward` kernel on XPU by implementing a specialized aggregation strategy for sorted indices. The primary goal is to minimize global memory contention during gradient accumulation.
Key Optimizations
Duplicate Aggregation & Lookahead: Instead of performing one atomic update per element, the kernel identifies contiguous identical indices using an optimized lookahead mechanism (`SKIP_SORTED_INDICES`). This collapses multiple redundant updates into a single localized accumulation, significantly reducing atomic contention on `grad_weight`.
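The aggregation idea can be sketched on the CPU as follows. This is a simplified host-side model, not the device kernel: the function name and shapes here are illustrative, and the single write per distinct index stands in for the single atomic update the kernel would issue.

```cpp
#include <cstddef>
#include <vector>

// CPU sketch of duplicate aggregation over sorted indices: look ahead over
// each run of identical indices, accumulate the corresponding gradient
// values locally, then write once per distinct destination index (on the
// device this write would be the lone atomic update for that run).
void aggregate_sorted(const std::vector<long>& sorted_indices,
                      const std::vector<float>& grad,     // one value per index
                      std::vector<float>& grad_weight) {  // accumulation target
    std::size_t i = 0;
    while (i < sorted_indices.size()) {
        const long dst = sorted_indices[i];
        float local_sum = 0.0f;
        // Lookahead: skip over the run of identical indices, summing locally.
        while (i < sorted_indices.size() && sorted_indices[i] == dst) {
            local_sum += grad[i];
            ++i;
        }
        // One write per distinct index instead of one atomic per duplicate.
        grad_weight[static_cast<std::size_t>(dst)] += local_sum;
    }
}
```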
Sub-group Parallel Reduction: For clusters with high duplicate counts, the kernel utilizes sub-group shuffle primitives (`shift_group_left`) to perform parallel reductions. This ensures that large index blocks are processed across all lanes within a sub-group simultaneously, maximizing compute throughput.
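A host-side model of the shuffle-based reduction, assuming a power-of-two sub-group width: each "lane" holds a partial sum, and at each step a lane adds the value held by the lane `delta` positions to its left, halving `delta` until lane 0 holds the total. The sequential loop below mimics what `sycl::shift_group_left` achieves in lockstep on the device.

```cpp
#include <array>
#include <cstddef>

// CPU model of a sub-group tree reduction built on shift_group_left.
// At step delta, lane i adds the (pre-step) value of lane i + delta;
// after log2(W) steps lane 0 holds the sum of all W partial values.
template <std::size_t W>  // sub-group width, assumed a power of two
float subgroup_reduce(std::array<float, W> lanes) {
    for (std::size_t delta = W / 2; delta > 0; delta /= 2) {
        // Increasing-lane order reads only not-yet-updated values this step,
        // so it faithfully simulates the simultaneous device shuffle.
        for (std::size_t lane = 0; lane + delta < W; ++lane) {
            lanes[lane] += lanes[lane + delta];  // ~ shift_group_left(sg, v, delta)
        }
    }
    return lanes[0];  // lane 0 performs the single aggregated write
}
```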
Tiled Stride Optimization: Three specialized kernel variants are introduced to handle different data layouts:
`stride_1`: Optimized for scalar-like indexing with maximum throughput.
`small_stride`: Parallelizes across the feature dimension using local work-items.
`generic_stride`: Handles high-dimensional feature vectors with optimized memory tiling.
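The dispatch between the three variants can be sketched as below. The variant names follow the PR description; the small-stride cutoff is an illustrative assumption, not the tuned value used by the kernel.

```cpp
#include <cstdint>

// Hypothetical host-side dispatch over the three kernel variants,
// keyed on the feature stride of the gradient tensor.
enum class IndexingVariant { Stride1, SmallStride, GenericStride };

IndexingVariant pick_variant(std::int64_t stride) {
    if (stride == 1) {
        return IndexingVariant::Stride1;           // scalar-like indexing path
    }
    constexpr std::int64_t kSmallStrideMax = 64;   // assumed cutoff, not the tuned value
    if (stride <= kSmallStrideMax) {
        return IndexingVariant::SmallStride;       // parallelize over the feature dim
    }
    return IndexingVariant::GenericStride;         // tiled path for large feature vectors
}
```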
SLM-backed Duplicate Cache: A Shared Local Memory (SLM) cache (`smem_dups_cache`) coordinates duplicate counts within a sub-group, reducing redundant global memory fetches for index metadata.
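A minimal CPU sketch of the cache idea, with a `std::vector` standing in for the SLM buffer: each lane computes the run length of its sorted index once and stores it locally, so subsequent lanes consult the cache instead of re-scanning the global index array. The function name and parameters are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of the smem_dups_cache pattern: per-lane duplicate counts
// for a window of sorted indices, computed once and held in local memory.
std::vector<int> build_dup_cache(const std::vector<long>& sorted_indices,
                                 std::size_t start,   // window start (sub-group base)
                                 std::size_t width) { // sub-group width
    std::vector<int> cache(width, 0);  // stands in for the SLM buffer
    for (std::size_t lane = 0; lane < width; ++lane) {
        const std::size_t pos = start + lane;
        if (pos >= sorted_indices.size()) break;
        int dups = 1;
        // Lookahead over the run of identical indices starting at pos.
        while (pos + dups < sorted_indices.size() &&
               sorted_indices[pos + dups] == sorted_indices[pos]) {
            ++dups;
        }
        cache[lane] = dups;  // shared locally; avoids repeated global scans
    }
    return cache;
}
```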