Conversation

@syurkevi (Contributor) commented Jan 6, 2026

Description

This is a WIP PR implementing a fused SDPA training kernel. The current bottleneck appears to be the individual gemm ukernels: in their current setup they are ~3-5x slower than primitives-based gemms of the same size. Tuning allows for a large variation in runtime; however, certain tile sizes seem to be inaccessible due to hardcoded FMA values in microkernel_provider.cpp.
Ongoing work is focused on improving gemmstone strategies to more closely match the performance of standalone gemms. Once those are closer, the benefits of the fused kernel can be better realized.
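For context, the operation being fused is scaled dot-product attention, softmax(Q Kᵀ / √d) V, whose forward pass is otherwise realized as two separate gemms (Q·Kᵀ and P·V) plus a softmax in between. Below is a minimal pure-Python reference sketch of that computation (not oneDNN code; function and variable names are illustrative only), useful for checking a fused kernel's numerics on tiny shapes:

```python
import math

def sdpa_reference(Q, K, V):
    """Reference SDPA forward: softmax(Q K^T / sqrt(d)) V.

    Q: list of rows, each of length d (queries)
    K: list of rows, each of length d (keys)
    V: list of rows, one per key, each of length d_v (values)
    Returns the attention output, one row per query.
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)

    # First gemm: S = scale * Q K^T
    S = [[scale * sum(Q[i][t] * K[j][t] for t in range(d))
          for j in range(len(K))] for i in range(len(Q))]

    # Row-wise softmax with max subtraction for numerical stability
    P = []
    for row in S:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        P.append([x / z for x in e])

    # Second gemm: O = P V
    dv = len(V[0])
    return [[sum(P[i][j] * V[j][t] for j in range(len(V)))
             for t in range(dv)] for i in range(len(P))]
```

A fused kernel performs both gemms and the softmax in one pass, avoiding materializing S and P in global memory, which is where the hoped-for speedup over the chained standalone gemms comes from once the inner gemm ukernels reach parity.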

@syurkevi syurkevi requested review from a team as code owners January 6, 2026 22:42
@github-actions github-actions bot added platform:gpu-intel Codeowner: @oneapi-src/onednn-gpu-intel component:tests Codeowner: @oneapi-src/onednn-arch component:common labels Jan 6, 2026
@syurkevi syurkevi marked this pull request as draft January 6, 2026 22:44
@syurkevi syurkevi force-pushed the syurkevi/fused_sdpa_training branch from 2bcb1f5 to adf7f0e Compare January 10, 2026 06:12