Conversation

@DarkSharpness (Collaborator) commented Jan 1, 2026

Motivation

Currently, SGLang uses the PyTorch native API to store keys and values into the KV cache, which is highly inefficient. Even when the two copies are overlapped on two CUDA streams, performance remains poor.
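
For reference, the replaced path looks roughly like the sketch below. This is a minimal illustration, not SGLang's actual code: the tensor names, the flat cache layout, and the `loc` slot-index tensor are assumptions.

```python
# Minimal sketch of the two-stream PyTorch-native store path described above.
# Names and layout are illustrative assumptions, not SGLang's actual code.
import torch

def store_kv_two_streams(k_cache, v_cache, k, v, loc, s_k, s_v):
    with torch.cuda.stream(s_k):
        k_cache[loc] = k   # generic indexed-copy kernel launch
    with torch.cuda.stream(s_v):
        v_cache[loc] = v   # overlapped on a second stream
    # Even overlapped, each copy pays the launch and indexing overhead
    # of a generic scatter -- the inefficiency this PR targets.
```

The two streams (`s_k`, `s_v`) would be created once via `torch.cuda.Stream()` and reused across layers.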

Modifications

This PR is a superset of #9775, the AOT kernel in SGLang. We introduce many aggressive optimizations to minimize latency, especially for cases where num_kv_head * head_dim is large (e.g. 1024 for Llama 3.1 8B on a single GPU).
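
To illustrate the idea only (the PR itself ships a CUDA kernel; every name below is a hypothetical placeholder), here is a minimal Triton sketch of a fused store: one launch copies both K and V for all tokens, replacing the two indexed copies of the native path. It assumes contiguous `(num_tokens, item_size)` inputs with `item_size = num_kv_head * head_dim` a power of two.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def store_kv_kernel(k_ptr, v_ptr, k_cache_ptr, v_cache_ptr, loc_ptr,
                    ITEM_SIZE: tl.constexpr):  # num_kv_head * head_dim
    pid = tl.program_id(0)                     # one program per token
    slot = tl.load(loc_ptr + pid)              # destination cache slot
    offs = tl.arange(0, ITEM_SIZE)             # ITEM_SIZE: power of two
    tl.store(k_cache_ptr + slot * ITEM_SIZE + offs,
             tl.load(k_ptr + pid * ITEM_SIZE + offs))
    tl.store(v_cache_ptr + slot * ITEM_SIZE + offs,
             tl.load(v_ptr + pid * ITEM_SIZE + offs))

def store_kv(k_cache, v_cache, k, v, loc):
    # k/v: (num_tokens, item_size) contiguous; loc: (num_tokens,) int
    store_kv_kernel[(loc.numel(),)](k, v, k_cache, v_cache, loc,
                                    ITEM_SIZE=k.shape[-1])
```

A single fused launch like this amortizes launch overhead and enables wide vectorized loads/stores, which is where the gains at large `item_size` come from.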

This PR also fixes some minor errors in qknorm and moves norm.cuh to elementwise/qknorm.cuh.

Accuracy Tests

Benchmarking and Profiling

Latency (μs) on B200. "PyTorch 2 Stream" is the current SGLang implementation.
| item_size | batch_size | SGL AOT Kernel | SGL JIT Kernel | PyTorch Compile | PyTorch 2 Stream |
|---|---|---|---|---|---|
| 64 | 1 | 1.475760 | 1.009964 | 2.879933 | 2.137862 |
| 64 | 2 | 1.479312 | 1.008584 | 1.332617 | 2.165983 |
| 64 | 4 | 1.495821 | 1.015186 | 1.366358 | 2.651818 |
| 64 | 8 | 1.526468 | 1.024717 | 1.368124 | 3.605899 |
| 64 | 16 | 1.546549 | 1.030255 | 1.373649 | 3.568185 |
| 64 | 32 | 1.551000 | 1.029608 | 1.369916 | 3.574067 |
| 64 | 64 | 1.550317 | 1.033704 | 1.381571 | 3.578550 |
| 64 | 128 | 1.549651 | 1.039844 | 1.400229 | 3.606233 |
| 64 | 256 | 1.556941 | 1.057436 | 1.428082 | 3.641356 |
| 64 | 512 | 1.574894 | 1.083511 | 1.458229 | 3.716218 |
| 64 | 1024 | 1.603472 | 1.166626 | 1.525649 | 3.891339 |
| 64 | 2048 | 1.654045 | 1.438690 | 1.632083 | 3.951207 |
| 64 | 4096 | 1.911969 | 2.006328 | 2.048600 | 4.124578 |
| 64 | 8192 | 2.149733 | 3.094516 | 3.121697 | 6.165115 |
| 64 | 16384 | 3.372198 | 5.360094 | 5.390531 | 10.150578 |
| 128 | 1 | 1.453023 | 1.023631 | 2.690649 | 2.086427 |
| 128 | 2 | 1.502191 | 1.023050 | 2.881370 | 2.623525 |
| 128 | 4 | 1.509589 | 1.025213 | 2.970286 | 3.469111 |
| 128 | 8 | 1.549430 | 1.032561 | 2.958842 | 3.456083 |
| 128 | 16 | 1.560619 | 1.035725 | 2.966853 | 3.461197 |
| 128 | 32 | 1.565613 | 1.037580 | 2.979216 | 3.461016 |
| 128 | 64 | 1.562321 | 1.036385 | 3.018421 | 3.487620 |
| 128 | 128 | 1.564134 | 1.046548 | 3.093920 | 3.529567 |
| 128 | 256 | 1.569323 | 1.055802 | 3.158427 | 3.568132 |
| 128 | 512 | 1.587742 | 1.085573 | 3.297440 | 3.737223 |
| 128 | 1024 | 1.623137 | 1.165268 | 3.536560 | 3.838918 |
| 128 | 2048 | 1.721122 | 1.441099 | 4.393200 | 4.023174 |
| 128 | 4096 | 1.941110 | 2.025127 | 6.513865 | 5.991636 |
| 128 | 8192 | 2.717290 | 3.156197 | 11.109573 | 9.710951 |
| 128 | 16384 | 4.957720 | 5.841623 | 20.513093 | 19.146407 |
| 256 | 1 | 1.493269 | 1.031297 | 2.813659 | 2.569373 |
| 256 | 2 | 1.503021 | 1.033991 | 2.881263 | 3.465180 |
| 256 | 4 | 1.523203 | 1.029978 | 2.963893 | 3.453150 |
| 256 | 8 | 1.613259 | 1.036595 | 2.957579 | 3.465483 |
| 256 | 16 | 1.629727 | 1.044609 | 2.989868 | 3.466683 |
| 256 | 32 | 1.632352 | 1.048610 | 3.015526 | 3.491033 |
| 256 | 64 | 1.632348 | 1.043375 | 3.083132 | 3.517918 |
| 256 | 128 | 1.627357 | 1.049786 | 3.158427 | 3.562083 |
| 256 | 256 | 1.635622 | 1.070493 | 3.317974 | 3.724180 |
| 256 | 512 | 1.660629 | 1.096066 | 3.529316 | 3.823884 |
| 256 | 1024 | 1.698602 | 1.176087 | 4.392987 | 4.031836 |
| 256 | 2048 | 1.908339 | 1.475349 | 6.517553 | 6.001262 |
| 256 | 4096 | 2.393894 | 2.073650 | 11.060907 | 9.847213 |
| 256 | 8192 | 4.742189 | 4.138124 | 20.497467 | 19.360554 |
| 256 | 16384 | 7.464768 | 6.823438 | 37.542733 | 36.374718 |
| 512 | 1 | 1.640000 | 1.032392 | 2.840718 | 3.352254 |
| 512 | 2 | 1.646201 | 1.029739 | 2.942000 | 3.450450 |
| 512 | 4 | 1.673959 | 1.036462 | 2.940267 | 3.399091 |
| 512 | 8 | 1.812954 | 1.040000 | 2.982378 | 3.458252 |
| 512 | 16 | 1.812832 | 1.043403 | 3.011333 | 3.476167 |
| 512 | 32 | 1.820072 | 1.040862 | 3.083919 | 3.511279 |
| 512 | 64 | 1.821294 | 1.045258 | 3.157684 | 3.558164 |
| 512 | 128 | 1.828406 | 1.065219 | 3.325315 | 3.711719 |
| 512 | 256 | 1.843185 | 1.090376 | 3.535657 | 3.822081 |
| 512 | 512 | 1.863613 | 1.173345 | 4.393135 | 4.019737 |
| 512 | 1024 | 1.932815 | 1.481121 | 6.518250 | 5.994115 |
| 512 | 2048 | 2.475768 | 2.018677 | 11.018027 | 10.348198 |
| 512 | 4096 | 5.049683 | 4.044462 | 20.509706 | 20.092819 |
| 512 | 8192 | 7.490104 | 6.604655 | 37.552732 | 36.982570 |
| 512 | 16384 | 12.875270 | 11.570343 | 71.475271 | 70.465395 |
| 1024 | 1 | 2.098284 | 1.031026 | 2.829718 | 3.322475 |
| 1024 | 2 | 2.148030 | 1.037002 | 2.941216 | 3.445124 |
| 1024 | 4 | 2.205063 | 1.039145 | 2.984213 | 3.446770 |
| 1024 | 8 | 2.326964 | 1.042905 | 3.017680 | 3.482164 |
| 1024 | 16 | 2.350025 | 1.038645 | 3.082880 | 3.509426 |
| 1024 | 32 | 2.357849 | 1.043327 | 3.185493 | 3.562909 |
| 1024 | 64 | 2.384495 | 1.063795 | 3.311520 | 3.709295 |
| 1024 | 128 | 2.397448 | 1.089555 | 3.529307 | 3.813200 |
| 1024 | 256 | 2.394682 | 1.165577 | 4.395813 | 4.024131 |
| 1024 | 512 | 2.443349 | 1.474329 | 6.522480 | 5.993570 |
| 1024 | 1024 | 3.948729 | 2.061465 | 11.022880 | 10.497917 |
| 1024 | 2048 | 6.112113 | 4.064229 | 20.509120 | 20.146477 |
| 1024 | 4096 | 8.128917 | 6.614896 | 37.534833 | 37.019556 |
| 1024 | 8192 | 13.480798 | 11.590386 | 71.482375 | 70.432544 |
| 1024 | 16384 | 24.767629 | 21.567357 | 139.389059 | 138.508355 |

End-to-end throughput gain for Llama 3.1 8B on B200 in send_one: 248.5 → 254.5.
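
For reference, per-op latencies like those in the table can be measured with CUDA events; the sketch below is a minimal harness (the benchmark script in this PR may differ).

```python
# Minimal CUDA-event timing loop; returns average microseconds per call.
import torch

def bench_us(fn, warmup=20, iters=200):
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) * 1e3 / iters  # total ms -> us per call
```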

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@DarkSharpness DarkSharpness requested a review from BBuf as a code owner January 1, 2026 17:53
@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @DarkSharpness, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request aims to dramatically enhance the performance of KV cache operations in SGLang by integrating a new JIT-compiled CUDA kernel. The current PyTorch-based approach is identified as a bottleneck, and this change introduces highly optimized memory copy operations for KV cache storage. The improvements are particularly beneficial for large language models, and the PR provides extensive benchmarks validating the significant latency reductions achieved. Additionally, it includes minor corrections and structural improvements to the QK-Norm kernel.

Highlights

  • JIT KV Cache Support: Introduces a Just-In-Time (JIT) compiled kernel for storing key-value (KV) cache, significantly improving efficiency over the existing PyTorch native API.
  • Performance Optimizations: Implements aggressive optimizations tailored for scenarios with large 'num_kv_head * head_dim' (e.g., Llama 3.1 8B), leading to substantial latency reductions as demonstrated by benchmarks.
  • QK-Norm Refactoring & Fixes: Addresses minor errors in the 'qknorm' functionality and reorganizes its underlying CUDA kernel file by moving 'norm.cuh' to 'elementwise/qknorm.cuh'.
  • Comprehensive Benchmarking: Includes detailed performance benchmarks comparing the new SGL JIT Kernel against SGL AOT Kernel, PyTorch Compile, and PyTorch 2 Stream, showcasing the JIT kernel's superior performance across various batch and item sizes.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a JIT-compiled kernel for setting the KV cache, aiming to improve performance over the existing PyTorch-based implementation. The changes include the new CUDA kernel, its Python interface, corresponding benchmarks, and unit tests. Additionally, it refactors the qknorm kernel by renaming and moving files.

My review focuses on the new JIT kernel implementation and its usage. I've identified a potential performance improvement in the Python wrapper for the new kernel and a minor style issue in one of the benchmark files. Overall, the changes look good and the performance gains shown in the benchmarks are impressive.

@DarkSharpness (Collaborator, Author) commented

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Jan 1, 2026