@njriasan
Contributor

With warp specialization, it's harder to prove that code motion which may be desirable is safe. In the Gluon kernel, the order of operations is:

  1. Max
  2. Update alpha
  3. Update qk/p

We cannot easily match this because the intermediate TMEM operations introduce barriers between the exp operations. This has no direct impact on performance, but it may be necessary to enable the ping-pong computation more easily.
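
For readers less familiar with the pattern, the three-step order listed above can be sketched in plain Python for a single row. This is only an illustration of the ordering, not the kernel code: the function name `online_softmax_step`, the running-sum variable `l`, and the use of base-e `exp` are all assumptions made for the sketch.

```python
import math

def online_softmax_step(m_old, l_old, qk_row):
    """One online-softmax update for a single row (hypothetical helper).

    Follows the order: max -> update alpha -> update qk/p.
    Assumes qk_row contains at least one finite value.
    """
    # 1. Max: fold this block's row maximum into the running maximum.
    m_new = max(m_old, max(qk_row))

    # 2. Update alpha: factor that rescales previously accumulated state.
    #    With m_old == -inf (first block) this evaluates to 0.0.
    alpha = math.exp(m_old - m_new)

    # 3. Update qk/p: exponentiate the scores against the new maximum.
    p = [math.exp(x - m_new) for x in qk_row]

    # Running denominator, rescaled by alpha before adding the new block.
    l_new = alpha * l_old + sum(p)
    return m_new, l_new, alpha, p


# Usage: process a row in two blocks.
m, l = float("-inf"), 0.0
for block in ([1.0, 2.0], [3.0, 0.5]):
    m, l, alpha, p = online_softmax_step(m, l, block)
```

Both `alpha` and `p` come from exp operations that depend only on the new maximum, which is why anything forced in between them constrains how the two can be scheduled.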

@njriasan
Contributor Author

Note: This PR doesn't seem to have any impact on performance on its own.
