GatherBlockQuantized supports zero points and 8 bits for uint8 dtype#25214
Merged
Conversation
f725743 to 63c8f9a
39b55db to a4784ef
jiafatom reviewed Jul 9, 2025
jiafatom approved these changes Jul 9, 2025
kunal-vaishnavi approved these changes Jul 9, 2025
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Jul 10, 2025
…icrosoft#25214) Add support for uint8 GatherBlockQuantized in the following two areas: allow zero points, and add a `bits` attribute with support for bits=8. The major change is updating shape inference, along with unit tests to cover both. Note that only the CPU implementation is included; the CUDA implementation will be added later in another PR. Previously, zero points were not supported when dtype is uint8; only 4-bit quantization without zero points was supported. This change enables sharing the 8-bit-quantized lm_head weights between GatherBlockQuantized and MatMulNBits. For example, when K is a multiple of `block_size`, typical input and output shapes are: data has shape (N, K) for 8 bits, or (N, K / 2) for 4 bits; scales has shape (N, k_blocks), where k_blocks = K / block_size; zero_points has shape (N, k_blocks) for 8 bits, or (N, (k_blocks + 1) / 2) for 4 bits; output has shape (..., K), where ... is the shape of `indices`.
qti-yuduo pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Aug 8, 2025
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025
jiafatom added a commit to microsoft/onnxruntime-genai that referenced this pull request Aug 22, 2025
Follow up on the idea in #1461. Now that GatherBlockQuantized is implemented in microsoft/onnxruntime#25214, we can tie the embeddings here. Tested on phi-4-mini-instruct: the CPU model size drops from 5.15 GB to 2.69 GB (a 47.8% reduction).
Add support for uint8 GatherBlockQuantized in the following two areas:

* Allow zero points.
* Add a `bits` attribute and support bits=8.

The major change is updating shape inference, along with unit tests to cover both. Note that only the CPU implementation is included here; the CUDA implementation will be added later in another PR.

Motivation and Context

Previously, zero points were not supported when dtype is uint8; only 4-bit quantization without zero points was supported. This change enables sharing the 8-bit-quantized weights of lm_head between GatherBlockQuantized and MatMulNBits.

For example, when K is a multiple of `block_size`, typical input and output shapes are as follows:

* data has shape (N, K) for 8 bits, or (N, K / 2) for 4 bits.
* scales has shape (N, k_blocks), where k_blocks = K / block_size.
* zero_points has shape (N, k_blocks) for 8 bits, or (N, (k_blocks + 1) / 2) for 4 bits.
* output has shape (..., K), where ... is the shape of `indices`.
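The shape conventions above can be sketched in NumPy for the 8-bit case. This is an illustrative model of the operator's semantics under the shapes listed in the description, not the actual kernel code; the function name and the dequantization formula `(data - zero_point) * scale` are assumptions for the sketch.

```python
import numpy as np

def gather_block_quantized_8bit(data, scales, zero_points, indices, block_size):
    """Illustrative sketch (not the real kernel) of GatherBlockQuantized, bits=8.

    data:        (N, K) uint8, block-quantized along the last axis
    scales:      (N, k_blocks) float32, where k_blocks = K // block_size
    zero_points: (N, k_blocks) uint8
    indices:     integer array of any shape, values in [0, N)
    returns:     float32 array of shape indices.shape + (K,)
    """
    N, K = data.shape
    assert K % block_size == 0, "sketch assumes K is a multiple of block_size"
    # Expand the per-block scales and zero points to per-element along K.
    s = np.repeat(scales, block_size, axis=1)        # (N, K)
    z = np.repeat(zero_points, block_size, axis=1)   # (N, K)
    # Assumed dequantization: (q - zero_point) * scale, done per block.
    dequant = (data.astype(np.float32) - z.astype(np.float32)) * s
    # Gather dequantized rows; output shape is indices.shape + (K,).
    return dequant[indices]

# Small usage example with the shapes from the description.
N, K, block_size = 4, 8, 4
data = np.random.randint(0, 256, size=(N, K), dtype=np.uint8)
scales = np.full((N, K // block_size), 0.5, dtype=np.float32)
zero_points = np.full((N, K // block_size), 128, dtype=np.uint8)
indices = np.array([0, 2])
out = gather_block_quantized_8bit(data, scales, zero_points, indices, block_size)
print(out.shape)  # (2, 8), i.e. indices.shape + (K,)
```

For the 4-bit case, `data` would instead pack two elements per byte, shrinking its last dimension to K / 2 (and zero_points to (k_blocks + 1) / 2), which is why the 8-bit form shares the same layout as unpacked weights.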