[gluon][examples] MoE bmm1 in Gluon #10047

Open
Mogball wants to merge 8 commits into main from jeffniu/bmm1
Mogball (Collaborator) commented Apr 15, 2026

The main beneficial optimizations are:

  1. Separate loaders for the weights and the weight scales, which allows asymmetric pipelining. A separate partition for the scale-factor load might help as well, but I didn't experiment with it.
  2. Epilogue optimizations to reduce the number of instructions. Epilogue instruction issue (especially with 8 warps) starves the other warps of instruction issue slots, so optimizing it improves overall performance. I also repurposed the 2 idle leftover warps as a store partition to shorten the critical path in the epilogue.
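
As a rough illustration of what "asymmetric pipelining" means in item 1, here is a toy CPU analogue in plain Python. It is not Gluon code and does not use any Gluon API: the bounded queues stand in for shared-memory ring buffers, and the names `run_pipeline`, `weight_stages`, and `scale_stages` as well as the stage counts are hypothetical, chosen only for this sketch. Each loader runs ahead of the consumer by its own number of stages.

```python
import queue
import threading


def run_pipeline(num_tiles, weight_stages=4, scale_stages=2):
    """Toy analogue: two independent loaders feed a single consumer."""
    # Bounded queues play the role of ring buffers. maxsize is how many
    # tiles each loader may run ahead of the consumer (its pipeline depth).
    weight_q = queue.Queue(maxsize=weight_stages)
    scale_q = queue.Queue(maxsize=scale_stages)

    def load_weights():
        for k in range(num_tiles):
            weight_q.put(float(k + 1))  # hypothetical weight tile

    def load_scales():
        for _ in range(num_tiles):
            scale_q.put(0.5)  # hypothetical scale factor

    loaders = [
        threading.Thread(target=load_weights),
        threading.Thread(target=load_scales),
    ]
    for t in loaders:
        t.start()

    acc = 0.0
    for _ in range(num_tiles):
        # The compute step needs both operands of the same tile, so it
        # blocks until each loader has produced its next item.
        acc += weight_q.get() * scale_q.get()

    for t in loaders:
        t.join()
    return acc
```

Because the two queues are sized independently, the larger operand can be buffered more deeply than the smaller one, which is the degree of freedom a single shared loader would not have.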

Performance on gpt-oss-120b shapes vs. triton_kernels.matmul, on synthetic "realistic" logits:

GPT-OSS-120B MoE MM1 E=128 EP=4 ES=8 B=2880x5760
Peak: 5 PFLOPS, 8 TBPS
batch_size                                            example                                          reference
----------------------------------------------------------------------------------------------------------------
       128        29.12 TFLOPS (  0.6%)    5.27 TBPS ( 65.9%)        22.74 TFLOPS (  0.5%)    4.11 TBPS ( 51.4%)
       160        36.67 TFLOPS (  0.7%)    5.96 TBPS ( 74.5%)        27.66 TFLOPS (  0.6%)    4.49 TBPS ( 56.2%)
       192        39.57 TFLOPS (  0.8%)    6.00 TBPS ( 75.0%)        28.94 TFLOPS (  0.6%)    4.39 TBPS ( 54.9%)
       224        52.64 TFLOPS (  1.1%)    5.99 TBPS ( 74.9%)        39.21 TFLOPS (  0.8%)    4.46 TBPS ( 55.8%)
       256        55.48 TFLOPS (  1.1%)    6.46 TBPS ( 80.7%)        40.17 TFLOPS (  0.8%)    4.67 TBPS ( 58.4%)
       320        67.89 TFLOPS (  1.4%)    6.64 TBPS ( 83.0%)        41.19 TFLOPS (  0.8%)    4.03 TBPS ( 50.4%)
       384        73.96 TFLOPS (  1.5%)    6.43 TBPS ( 80.4%)        45.74 TFLOPS (  0.9%)    3.98 TBPS ( 49.7%)
       448        91.18 TFLOPS (  1.8%)    6.64 TBPS ( 83.0%)        53.82 TFLOPS (  1.1%)    3.92 TBPS ( 49.0%)
       512        78.16 TFLOPS (  1.6%)    5.36 TBPS ( 67.0%)        60.78 TFLOPS (  1.2%)    4.17 TBPS ( 52.1%)
       640        90.66 TFLOPS (  1.8%)    5.17 TBPS ( 64.6%)        58.41 TFLOPS (  1.2%)    3.33 TBPS ( 41.6%)
       768        93.65 TFLOPS (  1.9%)    4.98 TBPS ( 62.3%)        61.17 TFLOPS (  1.2%)    3.25 TBPS ( 40.7%)
       896       103.66 TFLOPS (  2.1%)    5.04 TBPS ( 63.1%)        66.78 TFLOPS (  1.3%)    3.25 TBPS ( 40.6%)
      1024       135.60 TFLOPS (  2.7%)    5.16 TBPS ( 64.5%)        86.31 TFLOPS (  1.7%)    3.28 TBPS ( 41.0%)
      1280       131.21 TFLOPS (  2.6%)    3.80 TBPS ( 47.5%)       119.41 TFLOPS (  2.4%)    3.46 TBPS ( 43.3%)
      1536       175.65 TFLOPS (  3.5%)    4.26 TBPS ( 53.3%)       145.39 TFLOPS (  2.9%)    3.53 TBPS ( 44.1%)
      1792       181.42 TFLOPS (  3.6%)    3.85 TBPS ( 48.1%)       148.73 TFLOPS (  3.0%)    3.15 TBPS ( 39.4%)
      2048       188.32 TFLOPS (  3.8%)    3.96 TBPS ( 49.5%)       165.47 TFLOPS (  3.3%)    3.48 TBPS ( 43.5%)
      2560       205.57 TFLOPS (  4.1%)    3.45 TBPS ( 43.2%)       177.60 TFLOPS (  3.6%)    2.98 TBPS ( 37.3%)
      3072       263.31 TFLOPS (  5.3%)    3.62 TBPS ( 45.2%)       253.05 TFLOPS (  5.1%)    3.48 TBPS ( 43.5%)
      3584       283.21 TFLOPS (  5.7%)    3.38 TBPS ( 42.2%)       290.26 TFLOPS (  5.8%)    3.46 TBPS ( 43.2%)
      4096       348.60 TFLOPS (  7.0%)    3.61 TBPS ( 45.2%)       329.95 TFLOPS (  6.6%)    3.42 TBPS ( 42.8%)
      5120       421.37 TFLOPS (  8.4%)    3.57 TBPS ( 44.6%)       385.63 TFLOPS (  7.7%)    3.27 TBPS ( 40.8%)
      6144       524.65 TFLOPS ( 10.5%)    3.69 TBPS ( 46.1%)       473.72 TFLOPS (  9.5%)    3.33 TBPS ( 41.6%)
      7168       626.20 TFLOPS ( 12.5%)    3.79 TBPS ( 47.4%)       582.67 TFLOPS ( 11.7%)    3.53 TBPS ( 44.1%)
      8192       675.19 TFLOPS ( 13.5%)    3.64 TBPS ( 45.6%)       629.79 TFLOPS ( 12.6%)    3.40 TBPS ( 42.5%)
      9216       729.35 TFLOPS ( 14.6%)    3.62 TBPS ( 45.2%)       674.98 TFLOPS ( 13.5%)    3.35 TBPS ( 41.9%)
     10240       749.61 TFLOPS ( 15.0%)    3.40 TBPS ( 42.4%)       685.04 TFLOPS ( 13.7%)    3.10 TBPS ( 38.8%)
     11264       806.46 TFLOPS ( 16.1%)    3.34 TBPS ( 41.8%)       721.29 TFLOPS ( 14.4%)    2.99 TBPS ( 37.3%)
     12288       895.78 TFLOPS ( 17.9%)    3.38 TBPS ( 42.3%)       805.38 TFLOPS ( 16.1%)    3.04 TBPS ( 38.0%)
     13312      1008.57 TFLOPS ( 20.2%)    3.53 TBPS ( 44.1%)       921.30 TFLOPS ( 18.4%)    3.22 TBPS ( 40.3%)
     14336      1001.71 TFLOPS ( 20.0%)    3.24 TBPS ( 40.5%)       915.14 TFLOPS ( 18.3%)    2.96 TBPS ( 37.0%)
     15360      1108.54 TFLOPS ( 22.2%)    3.35 TBPS ( 41.8%)      1020.08 TFLOPS ( 20.4%)    3.08 TBPS ( 38.5%)
     16384      1182.28 TFLOPS ( 23.6%)    3.41 TBPS ( 42.6%)      1083.60 TFLOPS ( 21.7%)    3.12 TBPS ( 39.0%)
     17408      1255.35 TFLOPS ( 25.1%)    3.44 TBPS ( 43.0%)      1106.65 TFLOPS ( 22.1%)    3.03 TBPS ( 37.9%)
     18432      1339.98 TFLOPS ( 26.8%)    3.50 TBPS ( 43.7%)      1188.63 TFLOPS ( 23.8%)    3.10 TBPS ( 38.8%)
     19456      1411.14 TFLOPS ( 28.2%)    3.49 TBPS ( 43.6%)      1239.33 TFLOPS ( 24.8%)    3.06 TBPS ( 38.3%)
     20480      1465.49 TFLOPS ( 29.3%)    3.46 TBPS ( 43.3%)      1329.22 TFLOPS ( 26.6%)    3.14 TBPS ( 39.2%)
     21504      1429.77 TFLOPS ( 28.6%)    3.23 TBPS ( 40.3%)      1228.47 TFLOPS ( 24.6%)    2.77 TBPS ( 34.7%)
     22528      1432.90 TFLOPS ( 28.7%)    3.08 TBPS ( 38.5%)      1283.04 TFLOPS ( 25.7%)    2.76 TBPS ( 34.5%)
     23552      1530.40 TFLOPS ( 30.6%)    3.17 TBPS ( 39.6%)      1304.38 TFLOPS ( 26.1%)    2.70 TBPS ( 33.8%)
     24576      1531.43 TFLOPS ( 30.6%)    3.05 TBPS ( 38.1%)      1378.61 TFLOPS ( 27.6%)    2.74 TBPS ( 34.3%)
     25600      1609.22 TFLOPS ( 32.2%)    3.09 TBPS ( 38.7%)      1403.54 TFLOPS ( 28.1%)    2.70 TBPS ( 33.7%)
     26624      1591.86 TFLOPS ( 31.8%)    2.96 TBPS ( 37.0%)      1461.78 TFLOPS ( 29.2%)    2.72 TBPS ( 34.0%)
     27648      1708.82 TFLOPS ( 34.2%)    3.08 TBPS ( 38.5%)      1554.22 TFLOPS ( 31.1%)    2.80 TBPS ( 35.0%)
     28672      1715.65 TFLOPS ( 34.3%)    3.00 TBPS ( 37.5%)      1582.07 TFLOPS ( 31.6%)    2.77 TBPS ( 34.6%)
     29696      1756.95 TFLOPS ( 35.1%)    2.98 TBPS ( 37.3%)      1562.01 TFLOPS ( 31.2%)    2.65 TBPS ( 33.1%)
     30720      1801.20 TFLOPS ( 36.0%)    2.97 TBPS ( 37.2%)      1681.00 TFLOPS ( 33.6%)    2.77 TBPS ( 34.7%)
     31744      1979.10 TFLOPS ( 39.6%)    3.17 TBPS ( 39.7%)      1754.99 TFLOPS ( 35.1%)    2.82 TBPS ( 35.2%)

And on uniform logits:

batch_size                                            example                                          reference
----------------------------------------------------------------------------------------------------------------
       128        73.67 TFLOPS (  1.5%)    5.41 TBPS ( 67.7%)        60.70 TFLOPS (  1.2%)    4.46 TBPS ( 55.7%)
       160        96.46 TFLOPS (  1.9%)    5.50 TBPS ( 68.7%)        78.10 TFLOPS (  1.6%)    4.45 TBPS ( 55.6%)
       192       115.43 TFLOPS (  2.3%)    5.43 TBPS ( 67.9%)        94.71 TFLOPS (  1.9%)    4.45 TBPS ( 55.7%)
       224       141.31 TFLOPS (  2.8%)    5.47 TBPS ( 68.3%)       115.26 TFLOPS (  2.3%)    4.46 TBPS ( 55.7%)
       256       155.77 TFLOPS (  3.1%)    5.47 TBPS ( 68.4%)       126.31 TFLOPS (  2.5%)    4.44 TBPS ( 55.5%)
       320       205.17 TFLOPS (  4.1%)    5.46 TBPS ( 68.2%)       161.57 TFLOPS (  3.2%)    4.30 TBPS ( 53.7%)
       384       251.10 TFLOPS (  5.0%)    5.50 TBPS ( 68.8%)       195.10 TFLOPS (  3.9%)    4.28 TBPS ( 53.5%)
       448       295.50 TFLOPS (  5.9%)    5.53 TBPS ( 69.2%)       228.63 TFLOPS (  4.6%)    4.28 TBPS ( 53.5%)
       512       316.07 TFLOPS (  6.3%)    5.34 TBPS ( 66.7%)       254.07 TFLOPS (  5.1%)    4.29 TBPS ( 53.6%)
       640       414.90 TFLOPS (  8.3%)    5.43 TBPS ( 67.9%)       293.48 TFLOPS (  5.9%)    3.84 TBPS ( 48.0%)
       768       489.21 TFLOPS (  9.8%)    5.43 TBPS ( 67.9%)       350.19 TFLOPS (  7.0%)    3.89 TBPS ( 48.6%)
       896       553.62 TFLOPS ( 11.1%)    5.39 TBPS ( 67.4%)       405.50 TFLOPS (  8.1%)    3.95 TBPS ( 49.4%)
      1024       576.71 TFLOPS ( 11.5%)    4.92 TBPS ( 61.5%)       463.70 TFLOPS (  9.3%)    3.95 TBPS ( 49.4%)
      1280       682.95 TFLOPS ( 13.7%)    4.76 TBPS ( 59.5%)       571.87 TFLOPS ( 11.4%)    3.98 TBPS ( 49.8%)
      1536       837.25 TFLOPS ( 16.7%)    4.99 TBPS ( 62.4%)       679.20 TFLOPS ( 13.6%)    4.05 TBPS ( 50.6%)
      1792       934.09 TFLOPS ( 18.7%)    4.84 TBPS ( 60.5%)       686.46 TFLOPS ( 13.7%)    3.56 TBPS ( 44.5%)
      2048       937.04 TFLOPS ( 18.7%)    4.25 TBPS ( 53.2%)       714.85 TFLOPS ( 14.3%)    3.24 TBPS ( 40.6%)
      2560      1081.82 TFLOPS ( 21.6%)    4.07 TBPS ( 50.9%)      1020.85 TFLOPS ( 20.4%)    3.84 TBPS ( 48.0%)
      3072      1313.97 TFLOPS ( 26.3%)    4.21 TBPS ( 52.7%)      1209.26 TFLOPS ( 24.2%)    3.88 TBPS ( 48.5%)
      3584      1533.25 TFLOPS ( 30.7%)    4.27 TBPS ( 53.3%)      1410.67 TFLOPS ( 28.2%)    3.93 TBPS ( 49.1%)
      4096      1399.52 TFLOPS ( 28.0%)    3.42 TBPS ( 42.7%)      1245.96 TFLOPS ( 24.9%)    3.04 TBPS ( 38.1%)
      5120      1489.01 TFLOPS ( 29.8%)    2.97 TBPS ( 37.1%)      1320.69 TFLOPS ( 26.4%)    2.63 TBPS ( 32.9%)
      6144      1901.82 TFLOPS ( 38.0%)    3.21 TBPS ( 40.2%)      1638.02 TFLOPS ( 32.8%)    2.77 TBPS ( 34.6%)
      7168      2234.87 TFLOPS ( 44.7%)    3.28 TBPS ( 41.0%)      1987.25 TFLOPS ( 39.7%)    2.92 TBPS ( 36.5%)
      8192      1999.71 TFLOPS ( 40.0%)    2.59 TBPS ( 32.4%)      1747.04 TFLOPS ( 34.9%)    2.26 TBPS ( 28.3%)
      9216      2024.82 TFLOPS ( 40.5%)    2.38 TBPS ( 29.7%)      1802.45 TFLOPS ( 36.0%)    2.12 TBPS ( 26.5%)
     10240      2233.54 TFLOPS ( 44.7%)    2.38 TBPS ( 29.8%)      1987.52 TFLOPS ( 39.8%)    2.12 TBPS ( 26.5%)
     11264      2423.11 TFLOPS ( 48.5%)    2.40 TBPS ( 30.0%)      2163.00 TFLOPS ( 43.3%)    2.14 TBPS ( 26.8%)
     12288      2230.25 TFLOPS ( 44.6%)    2.05 TBPS ( 25.7%)      1971.60 TFLOPS ( 39.4%)    1.82 TBPS ( 22.7%)
     13312      2386.76 TFLOPS ( 47.7%)    2.07 TBPS ( 25.9%)      2089.59 TFLOPS ( 41.8%)    1.81 TBPS ( 22.7%)
     14336      2535.85 TFLOPS ( 50.7%)    2.08 TBPS ( 25.9%)      2214.30 TFLOPS ( 44.3%)    1.81 TBPS ( 22.7%)
     15360      2708.63 TFLOPS ( 54.2%)    2.10 TBPS ( 26.3%)      2371.09 TFLOPS ( 47.4%)    1.84 TBPS ( 23.0%)
     16384      2495.78 TFLOPS ( 49.9%)    1.85 TBPS ( 23.1%)      2256.94 TFLOPS ( 45.1%)    1.67 TBPS ( 20.9%)
     17408      2613.94 TFLOPS ( 52.3%)    1.85 TBPS ( 23.1%)      2350.97 TFLOPS ( 47.0%)    1.66 TBPS ( 20.8%)
     18432      2600.95 TFLOPS ( 52.0%)    1.76 TBPS ( 22.0%)      2376.24 TFLOPS ( 47.5%)    1.61 TBPS ( 20.1%)
     19456      2720.43 TFLOPS ( 54.4%)    1.77 TBPS ( 22.1%)      2445.94 TFLOPS ( 48.9%)    1.59 TBPS ( 19.9%)
     20480      2694.19 TFLOPS ( 53.9%)    1.68 TBPS ( 21.0%)      2418.54 TFLOPS ( 48.4%)    1.51 TBPS ( 18.9%)
     21504      2667.33 TFLOPS ( 53.3%)    1.61 TBPS ( 20.1%)      2432.76 TFLOPS ( 48.7%)    1.47 TBPS ( 18.4%)
     22528      2773.18 TFLOPS ( 55.5%)    1.62 TBPS ( 20.3%)      2531.76 TFLOPS ( 50.6%)    1.48 TBPS ( 18.5%)
     23552      2867.55 TFLOPS ( 57.4%)    1.62 TBPS ( 20.3%)      2604.20 TFLOPS ( 52.1%)    1.48 TBPS ( 18.4%)
     24576      2736.19 TFLOPS ( 54.7%)    1.50 TBPS ( 18.8%)      2499.52 TFLOPS ( 50.0%)    1.37 TBPS ( 17.2%)
     25600      2789.45 TFLOPS ( 55.8%)    1.49 TBPS ( 18.6%)      2554.81 TFLOPS ( 51.1%)    1.37 TBPS ( 17.1%)
     26624      2882.35 TFLOPS ( 57.6%)    1.50 TBPS ( 18.7%)      2669.93 TFLOPS ( 53.4%)    1.39 TBPS ( 17.4%)
     27648      2856.47 TFLOPS ( 57.1%)    1.45 TBPS ( 18.1%)      2624.29 TFLOPS ( 52.5%)    1.33 TBPS ( 16.7%)
     28672      2851.52 TFLOPS ( 57.0%)    1.41 TBPS ( 17.7%)      2621.20 TFLOPS ( 52.4%)    1.30 TBPS ( 16.2%)
     29696      2826.90 TFLOPS ( 56.5%)    1.37 TBPS ( 17.1%)      2483.67 TFLOPS ( 49.7%)    1.20 TBPS ( 15.1%)
     30720      2894.51 TFLOPS ( 57.9%)    1.37 TBPS ( 17.2%)      2527.41 TFLOPS ( 50.5%)    1.20 TBPS ( 15.0%)
     31744      2875.63 TFLOPS ( 57.5%)    1.34 TBPS ( 16.7%)      2505.68 TFLOPS ( 50.1%)    1.16 TBPS ( 14.5%)
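
For reading the tables above, the percentages in parentheses appear to be each measurement divided by the stated peak (5 PFLOPS, 8 TBPS). The sketch below reproduces that arithmetic and the example-vs-reference speedup for a single row. The helper names `percent_of_peak` and `speedup` are hypothetical and are not taken from the benchmark script.

```python
PEAK_TFLOPS = 5000.0  # "Peak: 5 PFLOPS" from the table header
PEAK_TBPS = 8.0       # "Peak: ... 8 TBPS" from the table header


def percent_of_peak(value, peak):
    """Return value as a percentage of peak."""
    return 100.0 * value / peak


def speedup(example, reference):
    """Ratio of the example kernel's throughput to the reference's."""
    return example / reference


# First row of the "realistic" logits table (batch_size = 128).
example_tflops, example_tbps = 29.12, 5.27
reference_tflops = 22.74

print(f"{percent_of_peak(example_tflops, PEAK_TFLOPS):.1f}% of peak FLOPS")
print(f"{percent_of_peak(example_tbps, PEAK_TBPS):.1f}% of peak bandwidth")
print(f"{speedup(example_tflops, reference_tflops):.2f}x over reference")
```

At small batch sizes the bandwidth percentage is the more informative column, since compute utilization there is on the order of 1%.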

Comment thread on python/examples/gluon/05-moe-bmm1-fused-gather.py (outdated)
Mogball marked this pull request as ready for review April 16, 2026 21:50
Mogball requested a review from ptillet as a code owner April 16, 2026 21:50
Mogball changed the title from "[WIP][Gluon] MoE bmm1 in Gluon" to "[gluon][examples] MoE bmm1 in Gluon" Apr 16, 2026