
GatherBlockQuantized supports zero points and 8 bits for uint8 dtype#25214

Merged
tianleiwu merged 3 commits into main from tlwu/gather_block_quant_8bits on Jul 9, 2025

Conversation

@tianleiwu
Contributor

@tianleiwu tianleiwu commented Jun 29, 2025

Add support for uint8 GatherBlockQuantized in the following two areas:

  • Allow zero points.
  • Add bits attribute and support bits=8.

The major change is updating shape inference; unit tests are updated to cover these cases.

Note that this PR includes only the CPU implementation; the CUDA implementation will be added later in another PR.

Motivation and Context

Previously, zero points were not supported when the dtype is uint8, and only 4-bit quantization without zero points was supported.
This change makes it possible to share 8-bit quantized lm_head weights between GatherBlockQuantized and MatMulNBits.
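
To make the sharing concrete, here is a hypothetical NumPy sketch (not code from this PR). It assumes MatMulNBits packs B as (N, k_blocks, blob_size) with blob_size equal to block_size when bits=8; under that assumption, the 8-bit GatherBlockQuantized data tensor is the same bytes viewed with a different shape:

```python
import numpy as np

# Hypothetical sizes, for illustration only.
N, K, block_size = 4, 16, 8
k_blocks = K // block_size

# 8-bit quantized lm_head weight as GatherBlockQuantized consumes it: (N, K).
data = np.random.randint(0, 256, size=(N, K), dtype=np.uint8)

# Assumed MatMulNBits layout (N, k_blocks, blob_size); for bits=8,
# blob_size == block_size, so the buffer can be reinterpreted without a copy.
b_view = data.reshape(N, k_blocks, block_size)
assert b_view.base is data  # shared storage: no duplicated lm_head weight
```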

For example, when K is a multiple of `block_size`, typical input and output shapes are as follows (a NumPy reference sketch follows the list):

  • data has shape (N, K) for 8 bits, or (N, K / 2) for 4 bits.
  • scales has shape (N, k_blocks), where k_blocks = K / block_size.
  • zero_points has shape (N, k_blocks) for 8 bits, or (N, (k_blocks + 1) / 2) for 4 bits.
  • output has shape (..., K), where ... is the shape of `indices`.
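
Below is a minimal NumPy reference of the bits=8 path, offered as a sketch only (not the operator's kernel). It assumes the usual block-dequantization formula (q - zero_point) * scale, quantization along the last axis, and gather_axis=0; the helper name `gather_block_quantized_ref` is hypothetical:

```python
import numpy as np

def gather_block_quantized_ref(data, scales, zero_points, indices,
                               block_size, gather_axis=0):
    """Illustrative bits=8 reference: dequantize per block along the
    last axis, then gather along gather_axis."""
    # Broadcast each block's scale/zero_point across its block_size columns.
    s = np.repeat(scales, block_size, axis=1)        # (N, K)
    zp = np.repeat(zero_points, block_size, axis=1)  # (N, K)
    dequant = (data.astype(np.float32) - zp.astype(np.float32)) * s
    return np.take(dequant, indices, axis=gather_axis)

# Hypothetical sizes for a quick shape check.
N, K, block_size = 8, 32, 16
data = np.random.randint(0, 256, size=(N, K), dtype=np.uint8)
scales = np.random.rand(N, K // block_size).astype(np.float32)
zero_points = np.random.randint(0, 256, size=(N, K // block_size), dtype=np.uint8)
indices = np.array([[0, 3], [5, 7]])

out = gather_block_quantized_ref(data, scales, zero_points, indices, block_size)
assert out.shape == (2, 2, K)  # output shape is indices.shape + (K,)
```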

@tianleiwu tianleiwu marked this pull request as draft June 29, 2025 23:50
@tianleiwu tianleiwu changed the title GatherBlockQuantized adds bits attributes and support zero points for uint8 dtype GatherBlockQuantized supports zero points and 8 bits for uint8 dtype Jun 30, 2025
@tianleiwu tianleiwu marked this pull request as ready for review July 2, 2025 05:28
@tianleiwu tianleiwu marked this pull request as draft July 2, 2025 17:34
@tianleiwu tianleiwu force-pushed the tlwu/gather_block_quant_8bits branch from f725743 to 63c8f9a on July 7, 2025 19:33
@tianleiwu tianleiwu marked this pull request as ready for review July 7, 2025 19:34
@tianleiwu tianleiwu marked this pull request as draft July 7, 2025 23:14
@tianleiwu tianleiwu marked this pull request as ready for review July 8, 2025 00:47
@tianleiwu tianleiwu force-pushed the tlwu/gather_block_quant_8bits branch from 39b55db to a4784ef on July 8, 2025 06:08
@tianleiwu tianleiwu merged commit ce8796d into main Jul 9, 2025
103 of 115 checks passed
@tianleiwu tianleiwu deleted the tlwu/gather_block_quant_8bits branch July 9, 2025 03:58
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request Jul 10, 2025
qti-yuduo pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Aug 8, 2025
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025
jiafatom added a commit to microsoft/onnxruntime-genai that referenced this pull request Aug 22, 2025
Follow up on the idea in #1461. Now that GatherBlockQuantized is implemented in microsoft/onnxruntime#25214, we can tie the embedding here.

Tested on phi-4-mini-instruct: CPU model size drops from 5.15 GB to 2.69 GB (a 47.8% reduction).
