## Functionality improvements
- [P0] Dynamic per-group padding/unpadding kernels
  - Needed as a short/medium-term solution so users have a performant-ish way to pad token groups to multiples of the mxfp8 scaling block size (32), which is required to use the differentiable mxfp8 grouped mm. Naive torch padding requires a d2h sync and kills perf, so custom kernels are needed, which adds some usage friction.
  - Status: Done 🟢
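To make the requirement concrete, the padding these kernels perform amounts to rounding each group size up to the next multiple of the scaling block size and laying the groups out at the resulting offsets. A host-side sketch of just the offset math (names are illustrative; a real implementation must compute this on-device to avoid the d2h sync described above):

```python
# Naive reference for per-group padding to a multiple of the mxfp8
# scaling block size. Illustrates only the offset arithmetic; the
# actual kernels do this on-device without a d2h sync.

BLOCK = 32  # mxfp8 scaling block size

def round_up(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple."""
    return ((n + multiple - 1) // multiple) * multiple

def padded_group_offsets(group_sizes: list[int]) -> list[int]:
    """Exclusive prefix sum of padded group sizes: the start offset
    of each group (plus the final total) in the padded buffer."""
    offsets = [0]
    for size in group_sizes:
        offsets.append(offsets[-1] + round_up(size, BLOCK))
    return offsets

# Example: groups of sizes 10, 32, 45 pad to 32, 32, 64,
# giving offsets [0, 32, 64, 128].
```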
## Performance improvements
- [P0] Fused all2all dispatch + padding kernel
  - Needed to avoid the extra copy incurred by the standalone padding kernel described above, which hurts our speedup. The benefit of this approach is that, during the all-to-all dispatch, the receiver ranks are already allocating a buffer for the incoming tokens; if we write those tokens to locations aligned to multiples of 32, we avoid this expensive extra copy.
  - While we're at it, we can also write incoming tokens grouped by local expert, instead of grouped by remote/source rank, to avoid the token shuffle kernel step.
  - Status: WIP 🟡
  - Owner: @danielvegamyhre
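The receive-side offset computation behind this idea can be sketched as follows: given per-(source rank, local expert) token counts, each rank's chunk lands inside a block-aligned region belonging to its local expert, so tokens arrive already grouped by expert and padded. Pure-Python illustration; all names are hypothetical:

```python
# Sketch of receive-buffer offsets for the fused dispatch + padding
# idea: tokens are written grouped by local expert (not by source
# rank), with each expert's region padded to a multiple of the
# scaling block size.

BLOCK = 32  # mxfp8 scaling block size

def recv_offsets(counts: list[list[int]]) -> list[list[int]]:
    """counts[rank][expert] = tokens that rank sends for that local
    expert. Returns offsets[rank][expert] = start of that rank's chunk
    in the receive buffer, with expert regions block-aligned."""
    n_experts = len(counts[0])
    # Total tokens per local expert, padded up to BLOCK.
    totals = [sum(c[e] for c in counts) for e in range(n_experts)]
    padded = [((t + BLOCK - 1) // BLOCK) * BLOCK for t in totals]
    # Base offset of each expert's region.
    base = [0]
    for p in padded:
        base.append(base[-1] + p)
    # Within each expert region, rank chunks are laid out in rank order.
    running = base[:-1].copy()
    offsets = []
    for rank_counts in counts:
        row = []
        for e, c in enumerate(rank_counts):
            row.append(running[e])
            running[e] += c
        offsets.append(row)
    return offsets
```

With two ranks sending `[[10, 5], [3, 40]]` tokens for two local experts, expert 0's region spans `[0, 32)` and expert 1's starts at 32, so rank 1's chunks start at offsets 10 and 37 respectively.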
- [P1] Faster 3D weight quantization kernel for the backward pass dgrad computation, with RCEIL scaling, that writes scales directly to the ((32,4),4) layout for tcgen05 MMA
  - Current: ~5 TB/s; goal: ~6.4 TB/s
  - Currently writes scales in row-major order, requiring an additional lightweight kernel to convert to the per-group blocked layout
  - Status: WIP 🟡
  - Owner: @alexsamardzic
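For reference, RCEIL scaling (as we understand it) rounds the e8m0 scale exponent up: the scale is the smallest power of two `s` such that `amax / s` fits in the destination format (448 for float8_e4m3), which guarantees no overflow on cast. A pure-Python sketch of just the exponent math; names are hypothetical, and the edge-case handling should be verified against the actual kernel/PTX semantics:

```python
import math

E4M3_MAX = 448.0  # largest finite float8_e4m3 magnitude

def e8m0_scale_exp_rceil(amax: float) -> int:
    """RCEIL scale exponent: smallest integer e with
    amax / 2**e <= E4M3_MAX, i.e. ceil(log2(amax / E4M3_MAX))."""
    if amax == 0.0:
        return 0  # degenerate all-zero block; real kernels pick a fixed exponent
    return math.ceil(math.log2(amax / E4M3_MAX))

# amax == 448 needs no scaling (e = 0); amax == 500 rounds up to e = 1
# even though the "nearest" exponent would be 0, which is the point of RCEIL.
```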
- [P1] Faster dim0 quantization kernel with RCEIL scaling that writes scales directly to the ((32,4),4) layout for tcgen05 MMA
  - Current: ~5.5 TB/s; goal: ~6.4 TB/s
  - Currently writes scales in row-major order, requiring an additional lightweight kernel to convert to the per-group blocked layout
  - Status: Not started 🔴
  - Owner: None
- [P1] dim1 quantization kernel with RCEIL scaling that writes scales directly to the ((32,4),4) layout for tcgen05 MMA
  - Current: ~5.9 TB/s
  - Currently writes scales in row-major order, requiring an additional lightweight kernel to convert to the per-group blocked layout
  - Status: Not started 🔴
  - Owner: None
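The ((32,4),4) scale layout the three kernels above target can be made concrete with a small index function. One common reading of this layout (CuTe notation ((32,4),4):((16,4),1)) is that within each 128-row x 4-column tile of scales, element (r, c) lives at linear offset `(r % 32) * 16 + (r // 32) * 4 + c`. This is our interpretation and should be verified against the PTX ISA / CUTLASS documentation before relying on it:

```python
# Hypothetical index math for the ((32,4),4) blocked scale layout
# used by tcgen05 MMA: a 128x4 tile of scales stored with strides
# ((16,4),1) over the split row shape (32,4) and the 4 columns.

def blocked_scale_offset(r: int, c: int) -> int:
    """Linear offset of scale (r, c) inside one 128x4 scale tile."""
    assert 0 <= r < 128 and 0 <= c < 4
    return (r % 32) * 16 + (r // 32) * 4 + c
```

A useful sanity check on any such layout is that it is a bijection: the 512 (r, c) pairs must map to the 512 distinct offsets 0..511.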