
refactor apply_w8a8_block_fp8_linear in fp8_utils #6545


Open
wants to merge 10 commits into main

Conversation

ChangyiYang

See also: #4353

Motivation

Refactor apply_w8a8_block_fp8_linear to make the logic clearer and more meaningful.

Modifications

Only refactors apply_w8a8_block_fp8_linear in fp8_utils.

Checklist

@ChangyiYang
Author

ChangyiYang commented May 25, 2025

Hi! I have created a new commit. Here are some bullet points (a rough sketch of the resulting dispatch flow follows the list below):

  • Refactored dispatch logic to return only the selected matmul implementation function at initialization time to avoid filtering overhead.
  • For implementations requiring runtime validation (e.g., shape or dtype constraints), added fallback to Triton when conditions are not met.
  • Moved runtime condition checks ahead of computation to avoid redundant operations and unnecessary overhead.
  • Extracted w8a8_block_fp8_matmul_deepgemm from w8a8_block_fp8_matmul to enable direct usage of DeepGEMM kernels when applicable.
  • Retained the function w8a8_block_fp8_matmul as the unified entry point for matrix multiplication, to avoid requiring users to manually check whether DeepGEMM is available each time they use it.
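To make the intent concrete, here is a minimal sketch of the dispatch shape described above. Apart from w8a8_block_fp8_matmul and w8a8_block_fp8_matmul_deepgemm, every name is an illustrative placeholder (assumed, not taken from the PR), and the bodies are stubs:

```python
def _deepgemm_is_available() -> bool:
    # Stand-in for a one-time availability probe done at initialization.
    return False


def _deepgemm_supports(A, B, block_size, output_dtype) -> bool:
    # Stand-in for the runtime shape/dtype constraint check; it runs before
    # any computation, so unsupported inputs cost nothing extra.
    return False


def _w8a8_block_fp8_matmul_triton_impl(A, B, As, Bs, block_size, output_dtype):
    raise NotImplementedError  # stands in for the existing Triton kernel path


def w8a8_block_fp8_matmul_deepgemm(A, B, As, Bs, block_size, output_dtype):
    raise NotImplementedError  # stands in for the extracted DeepGEMM kernel call


def w8a8_block_fp8_matmul(A, B, As, Bs, block_size, output_dtype):
    # Unified entry point: callers never probe DeepGEMM availability themselves.
    if _deepgemm_is_available() and _deepgemm_supports(A, B, block_size, output_dtype):
        return w8a8_block_fp8_matmul_deepgemm(A, B, As, Bs, block_size, output_dtype)
    return _w8a8_block_fp8_matmul_triton_impl(A, B, As, Bs, block_size, output_dtype)


def dispatch_w8a8_block_fp8_matmul():
    # Resolved once at initialization: returns only the selected implementation
    # instead of filtering the candidate backends on every call.
    if _deepgemm_is_available():
        return w8a8_block_fp8_matmul  # DeepGEMM path with Triton fallback
    return _w8a8_block_fp8_matmul_triton_impl
```

With this shape, the per-call cost is limited to the lightweight runtime constraint check; everything else is decided once at initialization.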

Some choices I made, which may still have room for improvement:

  • In w8a8_block_fp8_matmul_deepgemm, I reuse the same assertion checks as w8a8_block_fp8_matmul, which may include some unnecessary checks.

Please check whether any further modification is needed!

@ChangyiYang
Author

Hi! Here are the modifications in this commit (a sketch of the resulting layout follows the list below):

  • Split out w8a8_block_fp8_matmul_triton.
  • Retained w8a8_block_fp8_matmul only for testing purposes.
  • Benchmarks that use DeepGEMM now call w8a8_block_fp8_matmul_deepgemm directly.
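
For illustration only, the layout after this change could look roughly like the sketch below; the wrapper body and the availability probe are assumptions, not the PR's actual code:

```python
def _deepgemm_is_available() -> bool:
    return False  # hypothetical availability probe, not the PR's identifier


def w8a8_block_fp8_matmul_triton(A, B, As, Bs, block_size, output_dtype):
    raise NotImplementedError  # the Triton kernel, now a standalone function


def w8a8_block_fp8_matmul_deepgemm(A, B, As, Bs, block_size, output_dtype):
    raise NotImplementedError  # the DeepGEMM kernel, callable directly


def w8a8_block_fp8_matmul(A, B, As, Bs, block_size, output_dtype):
    # Retained only for testing: a thin wrapper that picks a backend so tests
    # do not have to check DeepGEMM availability themselves.
    if _deepgemm_is_available():
        return w8a8_block_fp8_matmul_deepgemm(A, B, As, Bs, block_size, output_dtype)
    return w8a8_block_fp8_matmul_triton(A, B, As, Bs, block_size, output_dtype)


# Benchmarks that target DeepGEMM skip the wrapper and call
# w8a8_block_fp8_matmul_deepgemm directly.
```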

Feel free to tell me if anything needs further modification!

@ChangyiYang
Author

Hi! Typo fixed. Feel free to tell me if there are any more issues!

@ChangyiYang ChangyiYang requested a review from BBuf May 27, 2025 03:09
@zhyncs
Member

zhyncs commented May 27, 2025

cc @HaiShaw

@ChangyiYang
Author

@Alcanderian hi, I clicked "Update branch" to merge from main, and it seems the workflows need to be approved again. Can you kindly approve them again?

@ChangyiYang ChangyiYang force-pushed the refactor_apply_w8a8_block_fp8_linear branch from d9da17a to 48e3781 Compare May 27, 2025 04:08
@zhyncs zhyncs added the ready-to-merge The PR is ready to merge after the CI is green. label May 27, 2025
@ChangyiYang
Author

@Alcanderian thanks for pointing that out! It is fixed now. Can you run CI again?

@ChangyiYang
Author

ChangyiYang commented May 28, 2025

Hi! I think I have fixed two minor bugs, and the CI looks correct (the failing check is not actually failing, and 9 pipelines are still blocking). Feel free to tell me if any more adjustments are needed before merging! Thank you all for the help :)

Labels
ready-to-merge The PR is ready to merge after the CI is green.

6 participants