GatherBlockQuantized supports zero points and 8 bits for uint8 dtype#25214
Merged
Conversation
f725743 to 63c8f9a
39b55db to a4784ef
jiafatom reviewed Jul 9, 2025
jiafatom approved these changes Jul 9, 2025
kunal-vaishnavi approved these changes Jul 9, 2025
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Jul 10, 2025
…icrosoft#25214) Add support for uint8 GatherBlockQuantized in the following two areas: allow zero points, and add a `bits` attribute with support for bits=8. The major change is updating shape inference, along with unit tests to cover both. Note that only the CPU implementation is included; the CUDA implementation will be added later in another PR. Previously, zero points were not supported when dtype is uint8; only 4-bit quantization without zero points was supported. This change enables sharing the 8-bit-quantized lm_head weights between GatherBlockQuantized and MatMulNBits. For example, when K is a multiple of `block_size`, typical input and output shapes are: data has shape (N, K) for 8 bits, or (N, K / 2) for 4 bits; scales has shape (N, k_blocks), where k_blocks = K / block_size; zero_points has shape (N, k_blocks) for 8 bits, or (N, (k_blocks + 1) / 2) for 4 bits; output has shape (..., K), where ... is the shape of `indices`.
qti-yuduo pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Aug 8, 2025
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025
jiafatom added a commit to microsoft/onnxruntime-genai that referenced this pull request Aug 22, 2025
Follow up on the idea in #1461. Now that GatherBlockQuantized is implemented in microsoft/onnxruntime#25214, we can tie the embeddings here. Tested on phi-4-mini-instruct: the CPU model size drops from 5.15 GB to 2.69 GB (a 47.8% reduction).
Add support for uint8 GatherBlockQuantized in the following two areas:

* Allow zero points.
* Add a `bits` attribute and support bits=8.

The major change is updating shape inference, along with unit tests to cover both. Note that only the CPU implementation is included here; the CUDA implementation will be added later in another PR.

Motivation and Context

Previously, zero points were not supported when dtype is uint8; only 4-bit quantization without zero points was supported. This change enables sharing the 8-bit-quantized weights of lm_head between GatherBlockQuantized and MatMulNBits.

For example, when K is a multiple of `block_size`, typical input and output shapes are as follows:

* data has shape (N, K) for 8 bits, or (N, K / 2) for 4 bits.
* scales has shape (N, k_blocks), where k_blocks = K / block_size.
* zero_points has shape (N, k_blocks) for 8 bits, or (N, (k_blocks + 1) / 2) for 4 bits.
* output has shape (..., K), where ... is the shape of `indices`.
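The shape conventions above can be sketched in NumPy for the 8-bit case. This is an illustrative model of the operator's semantics under the shapes listed in the description, not the actual kernel code; the function name and the dequantization formula `(data - zero_point) * scale` are assumptions for the sketch.

```python
import numpy as np

def gather_block_quantized_8bit(data, scales, zero_points, indices, block_size):
    """Illustrative sketch (not the real kernel) of GatherBlockQuantized, bits=8.

    data:        (N, K) uint8, block-quantized along the last axis
    scales:      (N, k_blocks) float32, where k_blocks = K // block_size
    zero_points: (N, k_blocks) uint8
    indices:     integer array of any shape, values in [0, N)
    returns:     float32 array of shape indices.shape + (K,)
    """
    N, K = data.shape
    assert K % block_size == 0, "sketch assumes K is a multiple of block_size"
    # Expand the per-block scales and zero points to per-element along K.
    s = np.repeat(scales, block_size, axis=1)        # (N, K)
    z = np.repeat(zero_points, block_size, axis=1)   # (N, K)
    # Assumed dequantization: (q - zero_point) * scale, done per block.
    dequant = (data.astype(np.float32) - z.astype(np.float32)) * s
    # Gather dequantized rows; output shape is indices.shape + (K,).
    return dequant[indices]

# Small usage example with the shapes from the description.
N, K, block_size = 4, 8, 4
data = np.random.randint(0, 256, size=(N, K), dtype=np.uint8)
scales = np.full((N, K // block_size), 0.5, dtype=np.float32)
zero_points = np.full((N, K // block_size), 128, dtype=np.uint8)
indices = np.array([0, 2])
out = gather_block_quantized_8bit(data, scales, zero_points, indices, block_size)
print(out.shape)  # (2, 8), i.e. indices.shape + (K,)
```

For the 4-bit case, `data` would instead pack two elements per byte, shrinking its last dimension to K / 2 (and zero_points to (k_blocks + 1) / 2), which is why the 8-bit form shares the same layout as unpacked weights.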