[jit kernel] Support per_token_group_quant_8bit jit kernel#18905

Open
yuan-luo wants to merge 4 commits into sgl-project:main from yuan-luo:jit_per_token_group_quant

yuan-luo (Collaborator) commented Feb 16, 2026

Motivation

Add a JIT kernel for per_token_group_quant_8bit.
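For readers unfamiliar with the operation, here is a minimal pure-Python sketch of per-token-group quantization. It is not the kernel from this PR: the function names `per_token_group_quant_int8` and `dequant`, the `eps` floor, and the int8 target range (127) are assumptions chosen for illustration. The benchmark below targets `torch.float8_e4m3fn`, where the clamp bound would be that format's maximum (448) rather than 127. Each token's hidden vector is split into contiguous groups of `group_size` values, and each group gets its own scale derived from its absolute maximum.

```python
def per_token_group_quant_int8(x, group_size, eps=1e-10):
    """Quantize one token (a list of floats), one group at a time.

    Returns (q, scales): q holds integers in [-127, 127]; scales holds one
    float per group, so q[i] * scales[i // group_size] approximates x[i].
    Illustrative reference only, not the JIT kernel's implementation.
    """
    if len(x) % group_size != 0:
        raise ValueError("hidden_dim must be divisible by group_size")
    int8_max = 127  # assumption: int8 target; an FP8 target uses a different bound
    q, scales = [], []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        # Floor the absolute maximum at eps so an all-zero group cannot divide by zero.
        absmax = max(max(abs(v) for v in group), eps)
        scale = absmax / int8_max
        scales.append(scale)
        for v in group:
            q.append(max(-int8_max, min(int8_max, round(v / scale))))
    return q, scales


def dequant(q, scales, group_size):
    """Map quantized values back to floats using each group's scale."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]


token = [0.5, -1.0, 0.25, 2.0, 0.0, 0.0, -4.0, 1.0]
q, scales = per_token_group_quant_int8(token, group_size=4)
restored = dequant(q, scales, group_size=4)
```

Because the scale is per group, a large value in one group (here `-4.0`) does not reduce the resolution available to the other groups of the same token.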
Main:
[image]

PR:
[image]
[image]

Unit tests:

root@c7e9bb6a6789:/sgl-workspace/sglang_dev3# python ./python/sglang/jit_kernel/tests/test_per_token_group_quant_8bit.py
[2026-02-17 02:49:12] INFO utils.py:148: Note: detected 224 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2026-02-17 02:49:12] INFO utils.py:151: Note: NumExpr detected 224 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
[2026-02-17 02:49:12] INFO utils.py:164: NumExpr defaulting to 16 threads.
========================================================================================================= test session starts =========================================================================================================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0
rootdir: /sgl-workspace/sglang_dev3/python
configfile: pyproject.toml
plugins: anyio-4.12.1, asyncio-1.3.0, typeguard-4.4.4
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1872 items                                                                                                                                                                                                                  

python/sglang/jit_kernel/tests/test_per_token_group_quant_8bit.py ...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s. [  8%]
..s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s [ 20%]
...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s... [ 32%]
s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s.. [ 44%]
.s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s. [ 56%]
..s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s [ 67%]
...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s... [ 79%]
s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s...s.. [ 91%]
.s...s...s...s...s...s...s...s...s...s...sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss                                                                      [100%]

========================================================================================================== warnings summary ===========================================================================================================
../../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1303
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1303: PytestAssertRewriteWarning: Module already imported so cannot be rewritten; anyio
    self._mark_plugins_for_rewrite(hook, disable_autoload)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================================ 1320 passed, 552 skipped, 1 warning in 4.46s =============================================================================================

Benchmark:

root@c7e9bb6a6789:/sgl-workspace/sglang_dev3# python python/sglang/jit_kernel/benchmark/bench_per_token_group_quant_8bit.py 

......
per-token-group-quant-8bit-performance:
     num_tokens  hidden_dim  group_size num_ranks            dst_dtype                                                                                                                                         flags  Triton (Inaccurate)  SGL Kernel
0             1        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.043       2.396
1             1        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.050       2.412
2             1        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.062       2.531
3             1        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.098       2.693
4             1        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.128       2.557
5             1        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.187       2.547
6             1        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.143       2.722
7             1        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.183       2.823
8             1       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.167       2.609
9             1       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.190       2.603
10            1       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.175       2.785
11            1       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.247       2.937
12            4        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.097       2.599
13            4        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.136       2.730
14            4        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.112       2.777
15            4        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.198       2.883
16            4        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.209       2.639
17            4        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.263       2.803
18            4        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.249       2.768
19            4        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.290       2.885
20            4       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.411       2.707
21            4       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.455       2.903
22            4       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.414       2.944
23            4       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.498       2.995
24           16        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.189       2.616
25           16        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.227       2.811
26           16        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.207       2.762
27           16        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.300       2.868
28           16        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.485       2.779
29           16        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.540       2.957
30           16        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.509       2.963
31           16        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.596       3.082
32           16       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.172       2.972
33           16       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.208       3.166
34           16       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.212       3.198
35           16       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.313       3.327
36           64        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.445       2.779
37           64        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.479       2.920
38           64        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.479       2.954
39           64        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                2.526       3.057
40           64        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                4.086       3.709
41           64        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                4.161       3.847
42           64        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                4.169       3.890
43           64        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                4.221       3.988
44           64       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.856       5.435
45           64       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.954       5.718
46           64       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                6.957       5.650
47           64       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                7.015       5.698
48          256        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.813       3.466
49          256        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.842       3.721
50          256        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.853       3.702
51          256        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                3.915       3.812
52          256        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               10.527       7.749
53          256        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               10.635       9.002
54          256        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               10.654       8.930
55          256        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               10.699       8.038
56          256       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               21.622      14.800
57          256       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               21.745      15.989
58          256       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               21.727      15.920
59          256       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               21.815      15.004
60          768        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                7.477       5.842
61          768        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                7.564       6.166
62          768        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                7.575       6.143
63          768        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}                7.663       6.107
64          768        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               27.809      18.694
65          768        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               27.879      19.604
66          768        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               27.871      19.564
67          768        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               27.975      18.907
68          768       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               61.065      39.095
69          768       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               61.133      40.392
70          768       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               61.171      40.326
71          768       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               61.216      39.018
72         2048        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               16.693      11.638
73         2048        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               16.825      12.417
74         2048        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               16.806      12.367
75         2048        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               16.892      11.952
76         2048        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               70.897      45.047
77         2048        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               71.005      46.228
78         2048        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               70.993      46.307
79         2048        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               71.115      45.015
80         2048       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              159.583      99.404
81         2048       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              159.702     101.584
82         2048       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              159.687     101.807
83         2048       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              159.787      98.211
84         8192        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               61.034      39.023
85         8192        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               61.146      40.125
86         8192        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               61.160      40.141
87         8192        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}               61.249      38.938
88         8192        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              277.850     171.682
89         8192        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              277.943     175.706
90         8192        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              277.960     175.763
91         8192        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              277.772     169.681
92         8192       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              632.665     389.103
93         8192       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              632.763     397.270
94         8192       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              632.773     397.362
95         8192       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              632.867     383.703
96        16384        1536         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              120.134      75.228
97        16384        1536         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              120.267      77.015
98        16384        1536         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              120.291      76.906
99        16384        1536         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              120.349      74.654
100       16384        7168         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              553.748     340.575
101       16384        7168         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              553.890     347.350
102       16384        7168         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              553.909     347.446
103       16384        7168         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}              554.074     335.985
104       16384       16384         128      None  torch.float8_e4m3fn      {'column_major_scales': False, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}             1263.000     774.731
105       16384       16384         128      None  torch.float8_e4m3fn       {'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}             1264.000     790.398
106       16384       16384         128      None  torch.float8_e4m3fn        {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': False, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}             1264.000     790.300
107       16384       16384         128      None  torch.float8_e4m3fn         {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': False, 'masked_layout_mode': None}             1264.000     763.256
108           8        2048         128         8  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.615       2.908
109           8        2048         128         8  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               13.015    1289.000
110           8        2048         128         8  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               13.067    1289.000
111           8        2048         128         8  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               13.057    1289.000
112           8        2048         128        16  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.655       2.910
113           8        2048         128        16  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                7.509    1289.000
114           8        2048         128        16  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                7.461    1289.000
115           8        2048         128        16  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                7.433    1289.000
116           8        2048         128        32  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.635       2.919
117           8        2048         128        32  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                4.678    1288.000
118           8        2048         128        32  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                4.772    1289.000
119           8        2048         128        32  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                4.749    1289.000
120           8        2048         128        48  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.648       2.912
121           8        2048         128        48  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                3.844    1289.000
122           8        2048         128        48  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                3.755    1289.000
123           8        2048         128        48  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                3.743    1288.000
124          32        2048         128         8  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.831       3.019
125          32        2048         128         8  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               12.997    1289.000
126          32        2048         128         8  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               13.066    1289.000
127          32        2048         128         8  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               13.217    1289.000
128          32        2048         128        16  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.852       3.028
129          32        2048         128        16  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                7.518    1289.000
130          32        2048         128        16  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                7.466    1289.000
131          32        2048         128        16  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                7.633    1289.000
132          32        2048         128        32  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.819       3.034
133          32        2048         128        32  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                4.682    1289.000
134          32        2048         128        32  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                4.778    1289.000
135          32        2048         128        32  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                4.874    1289.000
136          32        2048         128        48  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                2.854       3.052
137          32        2048         128        48  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                3.801    1288.000
138          32        2048         128        48  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                3.773    1289.000
139          32        2048         128        48  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                3.933    1289.000
140         512        2048         128         8  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                6.558       5.671
141         512        2048         128         8  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               13.266    1289.000
142         512        2048         128         8  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               13.509    1289.000
143         512        2048         128         8  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               13.154    1289.000
144         512        2048         128        16  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                6.581       5.651
145         512        2048         128        16  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                8.337    1289.000
146         512        2048         128        16  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                7.694    1289.000
147         512        2048         128        16  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                9.591    1289.000
148         512        2048         128        32  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                6.570       5.669
149         512        2048         128        32  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                6.062    1289.000
150         512        2048         128        32  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                6.782    1289.000
151         512        2048         128        32  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                9.003    1289.000
152         512        2048         128        48  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}                6.603       5.671
153         512        2048         128        48  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}                5.882    1289.000
154         512        2048         128        48  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}                6.634    1289.000
155         512        2048         128        48  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}                8.805    1288.000
156        2048        2048         128         8  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               15.751      15.011
157        2048        2048         128         8  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               14.840    1289.000
158        2048        2048         128         8  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               16.473    1289.000
159        2048        2048         128         8  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               29.118    1289.000
160        2048        2048         128        16  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               15.806      14.940
161        2048        2048         128        16  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               13.846    1289.000
162        2048        2048         128        16  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               17.267    1289.000
163        2048        2048         128        16  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               27.476    1289.000
164        2048        2048         128        32  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               15.696      14.935
165        2048        2048         128        32  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               14.534    1288.000
166        2048        2048         128        32  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               15.711    1289.000
167        2048        2048         128        32  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               26.902    1289.000
168        2048        2048         128        48  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               15.731      15.021
169        2048        2048         128        48  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               13.742    1288.000
170        2048        2048         128        48  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               23.188    1288.000
171        2048        2048         128        48  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               26.604    1288.000
172        6144        2048         128         8  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               39.847      38.908
173        6144        2048         128         8  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               31.793    1289.000
174        6144        2048         128         8  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               34.743    1289.000
175        6144        2048         128         8  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               77.850    1289.000
176        6144        2048         128        16  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               39.926      38.914
177        6144        2048         128        16  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               31.172    1289.000
178        6144        2048         128        16  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               39.292    1289.000
179        6144        2048         128        16  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               75.004    1288.000
180        6144        2048         128        32  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               39.824      39.066
181        6144        2048         128        32  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               35.384    1289.000
182        6144        2048         128        32  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               60.427    1288.000
183        6144        2048         128        32  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               74.285    1289.000
184        6144        2048         128        48  torch.float8_e4m3fn          {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': None}               39.914      38.950
185        6144        2048         128        48  torch.float8_e4m3fn    {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'balanced'}               32.346    1289.000
186        6144        2048         128        48  torch.float8_e4m3fn  {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'imbalanced'}               62.767    1289.000
187        6144        2048         128        48  torch.float8_e4m3fn     {'column_major_scales': True, 'scale_tma_aligned': True, 'scale_ue8m0': True, 'fuse_silu_and_mul': True, 'masked_layout_mode': 'extreme'}               74.102    1289.000
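
For readers who want to compare the two timing columns row by row, the rows above can be parsed mechanically. This is a minimal sketch, not part of the PR's benchmark script: the field names (`int_columns`, `optional_column`, `t_first`, `t_second`) are hypothetical placeholders, and the meaning of each column is given by the table header, not by this code.

```python
import ast
import re

# One table row: index, three integer columns, an int-or-None column, the dtype,
# the flags dict, and two trailing timing columns.
ROW_RE = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(None|\d+)\s+(\S+)\s+(\{.*\})"
    r"\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_row(line):
    """Parse one benchmark row into a dict (field names are hypothetical)."""
    m = ROW_RE.match(line)
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    idx, c1, c2, c3, c4, dtype, flags, t_first, t_second = m.groups()
    return {
        "index": int(idx),
        "int_columns": (int(c1), int(c2), int(c3)),
        "optional_column": None if c4 == "None" else int(c4),
        "dtype": dtype,
        "flags": ast.literal_eval(flags),  # safe parse of the printed dict
        "t_first": float(t_first),
        "t_second": float(t_second),
    }


def timing_ratio(row):
    """Ratio of the first timing column to the second."""
    return row["t_first"] / row["t_second"]


# Row 101 from the table above, copied verbatim.
SAMPLE_ROW = (
    "101       16384        7168         128      None  torch.float8_e4m3fn       "
    "{'column_major_scales': True, 'scale_tma_aligned': False, 'scale_ue8m0': False, "
    "'fuse_silu_and_mul': False, 'masked_layout_mode': None}"
    "              553.890     347.350"
)

row = parse_row(SAMPLE_ROW)
print(row["flags"]["column_major_scales"], round(timing_ratio(row), 3))
```

Grouping the parsed rows by `flags["masked_layout_mode"]` makes it easier to look at the masked-layout rows separately from the rest.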

Modifications

Accuracy Tests

DeepSeek V3.2

➜  sglang git:(main) python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2-Exp \
  --trust-remote-code \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --tool-call-parser deepseekv31 \
  --reasoning-parser deepseek-v3 \
  --chat-template ./examples/chat_template/tool_chat_template_deepseekv32.jinja

Main:

➜  sglang git:(main) python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8 --port 30000
Downloading from https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl to /tmp/test.jsonl
/tmp/test.jsonl: 732kB [00:00, 14.1MB/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:22<00:00,  8.75it/s]
Accuracy: 0.985
Invalid: 0.000
Latency: 22.849 s
Output throughput: 862.559 token/s

PR:

➜  sglang_dev git:(jit_per_token_group_quant) python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8 --port 30000
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:18<00:00, 11.00it/s]
Accuracy: 0.985
Invalid: 0.000
Latency: 18.186 s
Output throughput: 1088.236 token/s
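
The two runs above report identical accuracy, so the difference is in latency and throughput. A small sketch of the arithmetic, using only the numbers printed above (the dict keys are illustrative names); note that this is a single 200-question run per branch, so the gap may include run-to-run noise:

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after`."""
    return (after - before) / before


# Values copied from the gsm8k outputs above.
main = {"latency_s": 22.849, "throughput_tok_s": 862.559}
pr = {"latency_s": 18.186, "throughput_tok_s": 1088.236}

latency_change = relative_change(main["latency_s"], pr["latency_s"])
throughput_change = relative_change(main["throughput_tok_s"], pr["throughput_tok_s"])

print(f"latency:    {latency_change:+.1%}")
print(f"throughput: {throughput_change:+.1%}")
```

This works out to roughly 20% lower latency and 26% higher output throughput for the PR branch in this run.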

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the quant LLM Quantization label Feb 16, 2026
@yuan-luo yuan-luo marked this pull request as draft February 16, 2026 17:11
@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces foundational support for per-token group 8-bit quantization within the JIT kernel framework. It provides a new CUDA kernel and its Python binding, enabling more efficient handling of quantized tensors. The changes aim to enhance performance for specific quantization strategies, although the implementation is currently a Work In Progress with known bugs that are being addressed.

Highlights

  • New JIT Kernel for 8-bit Quantization: Introduced a new CUDA JIT kernel, per_token_group_quant_8bit_kernel, designed for efficient per-token group 8-bit quantization, including support for different scale types (float and UE8M0 uint8_t) and column-major scale handling.
  • Python Wrapper: Added a Python interface (per_token_group_quant_8bit) to load and execute the newly implemented CUDA kernel, integrating it into the SGLang JIT kernel framework.
  • Comprehensive Testing: Included a new test suite (test_per_token_group_quant_8bit.py) that validates the new kernel's functionality across various configurations (e.g., num_tokens, hidden_dim, group_size, dst_dtype, and different flag combinations) by comparing its output against an existing Triton implementation.
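
To make the highlights above concrete, here is a pure-Python reference sketch of per-token group quantization. It is not the PR's CUDA kernel. It assumes the usual absmax scheme (one scale per contiguous group of `group_size` values in each token row), uses 448 as the largest finite `torch.float8_e4m3fn` value, and assumes that the UE8M0 option rounds each scale up to a power of two. It does not model the final cast to FP8 or the column-major scale layout; the function names are illustrative.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn


def quantize_group(values, scale_ue8m0=False, eps=1e-10):
    """Scale one group into the FP8 range; returns (scaled_values, scale)."""
    absmax = max(max(abs(v) for v in values), eps)
    scale = absmax / FP8_E4M3_MAX
    if scale_ue8m0:
        # Assumption: a UE8M0 scale is exponent-only, so round the scale up
        # to the next power of two.
        scale = 2.0 ** math.ceil(math.log2(scale))
    # Clamp to the representable range; the real kernel would then cast to FP8.
    scaled = [min(max(v / scale, -FP8_E4M3_MAX), FP8_E4M3_MAX) for v in values]
    return scaled, scale


def per_token_group_quant(x, group_size, scale_ue8m0=False):
    """x is a list of token rows; each row is split into contiguous groups."""
    q_rows, scale_rows = [], []
    for row in x:
        assert len(row) % group_size == 0, "hidden_dim must divide by group_size"
        q_row, s_row = [], []
        for start in range(0, len(row), group_size):
            q, s = quantize_group(row[start:start + group_size], scale_ue8m0)
            q_row.extend(q)
            s_row.append(s)
        q_rows.append(q_row)
        scale_rows.append(s_row)
    return q_rows, scale_rows


x = [[1.0, -2.0, 0.5, 4.0, 100.0, -50.0, 25.0, 0.0]]
q, scales = per_token_group_quant(x, group_size=4, scale_ue8m0=True)
print(scales[0])
```

With power-of-two scales the largest element of a group no longer maps exactly to 448, which trades some dynamic range for a scale that fits in a single exponent byte.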

Changelog
  • python/sglang/jit_kernel/csrc/gemm/per_token_group_quant_8bit.cuh
    • Added a new CUDA kernel for per-token group 8-bit quantization.
    • Implemented logic for calculating and applying quantization scales, including support for UE8M0 format.
    • Included handling for both row-major and column-major scale layouts.
  • python/sglang/jit_kernel/per_token_group_quant_8bit.py
    • Added a Python function to expose the per_token_group_quant_8bit CUDA kernel.
    • Integrated the kernel loading mechanism using load_jit and cache_once.
  • python/sglang/jit_kernel/tests/test_per_token_group_quant_8bit.py
    • Added a new pytest file to test the per_token_group_quant_8bit kernel.
    • Configured a wide range of test parameters to cover various input shapes, group sizes, and quantization flags.
    • Implemented assertions to compare the SGLang kernel's output with a reference Triton implementation.
Activity
  • The pull request is marked as 'Work In Progress' (WIP).
  • The author has indicated that there are still some bugs to resolve.
  • The checklist for code formatting, unit tests, documentation, and benchmarking is not yet complete, suggesting further work is needed before merging.
  • No CI tests have been triggered or passed yet, and no approvals have been received.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a JIT kernel for per-token group 8-bit quantization. The implementation is a work in progress and contains several issues that need to be addressed. I've identified critical compilation errors in the CUDA kernel, a significant performance issue due to redundant global memory access, and critical bugs in both the Python wrapper and the test suite. The Python wrapper incorrectly flattens tensors, leading to loss of essential shape information for the kernel. The test suite is not correctly invoking the new kernel, fails to test all code paths (like column-major layout), and will crash in its current state. My review includes detailed feedback and code suggestions to fix these problems.

@yuan-luo yuan-luo changed the title [jit kernel][WIP] Support per_token_group_quant_8bit jit kernel [jit kernel] Support per_token_group_quant_8bit jit kernel Feb 17, 2026
@yuan-luo yuan-luo marked this pull request as ready for review February 17, 2026 02:52
@yuan-luo
Collaborator Author

/tag-and-rerun-ci

@yuan-luo
Collaborator Author

@BBuf @DarkSharpness This PR is ready for review. Could you please help review it? Thanks.

@yuan-luo
Collaborator Author

/rerun-failed-ci

@yuan-luo yuan-luo force-pushed the jit_per_token_group_quant branch from c3f5220 to b4f1995 Compare February 17, 2026 08:32
@yuan-luo
Collaborator Author

/rerun-failed-ci

3 similar comments

@BBuf
Collaborator

BBuf commented Feb 18, 2026

Can you also provide a comparison with the AOT version from sgl-kernel in the benchmark script?

@BBuf
Collaborator

BBuf commented Feb 18, 2026

And we also need a new apply and gsm8k acc test comparison.

@yuan-luo yuan-luo force-pushed the jit_per_token_group_quant branch 2 times, most recently from a84c129 to deeea98 Compare February 18, 2026 11:11
@yuan-luo
Collaborator Author

> And we also need a new apply and gsm8k acc test comparison.

Updated with the DeepSeek V3.2 gsm8k results. No accuracy drop.

@yuan-luo
Collaborator Author

> Can you also provide a comparison with the AOT version from sgl-kernel in the benchmark script?

According to the gsm8k test results, performance does not drop relative to sgl-kernel.
Given that sgl-kernel will be obsoleted, I prefer not to add it to the benchmark script.

@yuan-luo yuan-luo force-pushed the jit_per_token_group_quant branch from deeea98 to 78477c9 Compare February 18, 2026 11:37
@yuan-luo
Collaborator Author

/rerun-failed-ci

@yuan-luo
Collaborator Author

/rerun-failed-ci

@yuan-luo yuan-luo force-pushed the jit_per_token_group_quant branch from 78477c9 to a4cd6a8 Compare February 18, 2026 13:49
@yuan-luo
Collaborator Author

/rerun-failed-ci

6 similar comments

@yuan-luo
Collaborator Author

The failing CI job also fails on main.

ResourceWarning: Enable tracemalloc to get the object allocation traceback
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
config.json: 1.37kB [00:00, 8.62MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 216/216 [00:00<00:00, 2.17MB/s]
preprocessor_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 350/350 [00:00<00:00, 3.76MB/s]
tokenizer_config.json: 5.70kB [00:00, 31.9MB/s]
vocab.json: 2.78MB [00:00, 9.18MB/s]
merges.txt: 1.67MB [00:00, 6.24MB/s]
tokenizer.json: 7.03MB [00:00, 20.6MB/s]
chat_template.json: 1.05kB [00:00, 7.15MB/s]
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
model-00001-of-00002.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.98G/3.98G [00:03<00:00, 1.28GB/s]
model-00002-of-00002.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.53G/3.53G [00:02<00:00, 1.47GB/s]
model.safetensors.index.json: 65.4kB [00:00, 200MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.30it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.28it/s]

Compiling num tokens (num_tokens=8192):   0%|                                                                                                                                                                                 | 0/58 [00:00<?, ?it/s]/usr/local/lib/python3.12/dist-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
Compiling num tokens (num_tokens=4): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:20<00:00,  2.82it/s]
Capturing num tokens (num_tokens=4 avail_mem=21.20 GB): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:06<00:00,  8.92it/s]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:123: UserWarning: resource_tracker: process died unexpectedly, relaunching.  Some resources might leak.
  warnings.warn('resource_tracker: process died unexpectedly, '
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.17it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.11it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.12it/s]

.
======================================================================
ERROR: setUpClass (__main__.TestPiecewiseCudaGraphDeepSeek)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/sgl-workspace/sglang_dev/./test/registered/cuda_graph/test_piecewise_cuda_graph_small_1_gpu.py", line 110, in setUpClass
    cls.process = popen_launch_server(
                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/test/test_utils.py", line 968, in popen_launch_server
    raise Exception(error_msg + ". Check server logs for errors.")
Exception: Server process exited with code 1. Check server logs for errors.

----------------------------------------------------------------------
Ran 7 tests in 451.099s

FAILED (errors=1, skipped=1)
