This feedback was generated by Copilot.
Positive Feedback
We used Olive to export and quantize all 8 Gemma4 models (E2B, E4B, 26B, 31B — base and instruct variants) across multiple formats (f16, bf16, Q4_K_M, NF4). The experience was excellent — 64 total exports, all verified. Here is our feedback.
MobiusModelBuilder pass (PR #2406)
Seamless integration with mobius for ONNX model building. The 2-pass pipeline (build → quantize) in the Olive recipe format is clean and intuitive. The Gemma4 recipes (PR microsoft/olive-recipes#381) worked out of the box.
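For reference, here is roughly what our two-pass recipe looked like, expressed as the Python dict form of an Olive workflow config. The pass names come from the PRs above; the model id and option layout are illustrative assumptions, not a copy of the shipped Gemma4 recipes.

```python
# Sketch of a two-pass Olive recipe (build -> quantize). Pass names come
# from PR #2406 / olive-recipes#381; the model id and keys shown here are
# illustrative assumptions, not copied from the shipped recipes.
from olive.workflows import run as olive_run

recipe = {
    "input_model": {
        "type": "HfModel",
        "model_path": "google/gemma-4-e2b-it",  # hypothetical HF id
    },
    "passes": {
        # Pass 1: build the ONNX graph via mobius.
        "build": {"type": "MobiusModelBuilder"},
        # Pass 2: quantize the built graph (Q4_K_M via kquant here).
        "quantize": {"type": "OnnxKQuantQuantization"},
    },
    "output_dir": "models/gemma4-e2b-q4_k_m",
}

olive_run(recipe)
```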
Built-in cupy GPU acceleration for kquant
The auto-GPU detection in OnnxKQuantQuantization is excellent: we measured a 28x speedup over CPU numpy when cupy and CUDA are available (tested on 4M-element weight tensors). The fallback to CPU numpy is seamless.
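As best we can tell, the pattern is the familiar try-import-fall-back idiom. A minimal sketch of the behavior we observed, not the pass's actual source:

```python
# Sketch of the GPU-with-CPU-fallback pattern we observed; this is our
# reading of the behavior, not OnnxKQuantQuantization's actual source.
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    xp = cp
except Exception:
    xp = np  # seamless fallback to CPU numpy

def absmax_scales(weights, block_size=32):
    """Per-block absmax scales, computed on GPU when cupy is usable."""
    w = xp.asarray(weights).reshape(-1, block_size)
    return xp.abs(w).max(axis=1)
```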
OnnxBnb4Quantization (NF4)
Blazing fast — 42ms for 67M parameters. Delegates correctly to ORT's MatMulBnb4Quantizer.
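For readers unfamiliar with NF4: each block of weights is scaled by its absmax, and every value is snapped to the nearest of 16 fixed quantile levels. A numpy sketch of the math (the actual pass delegates this to ORT's MatMulBnb4Quantizer, which runs in native code, hence the speed):

```python
# What NF4 quantization computes, sketched in numpy. The real pass
# delegates to ORT's MatMulBnb4Quantizer; this is only the math.
import numpy as np

# The 16 NF4 levels from the QLoRA paper, rounded to 4 decimals here.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(weights, block_size=64):
    """Blockwise NF4: scale each block by its absmax, snap to levels.
    Assumes weights.size is a multiple of block_size."""
    w = weights.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True)
    scales = np.where(scales == 0.0, 1.0, scales)  # guard all-zero blocks
    normalized = w / scales
    # Index of the nearest NF4 level for every element.
    idx = np.abs(normalized[..., None] - NF4_LEVELS).argmin(axis=-1)
    return idx.astype(np.uint8), scales.squeeze(1)
```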
Suggestions for Improvement
1. Document the cupy GPU acceleration
The auto-GPU feature in kquant is great but easy to miss. Users may not know they need to run pip install cupy-cuda12x to get the 28x faster quantization path. Consider:
- Adding a note in the pass docstring or README
- Printing an info message when cupy is detected and GPU is used, versus falling back to CPU (see the sketch after this list)
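Something as small as this in the pass setup would surface it; the logger wiring and placement are hypothetical, only the message is our suggestion:

```python
# Hypothetical placement inside the pass setup; the logger wiring is an
# assumption, the suggestion is just to report which path was chosen.
import logging

logger = logging.getLogger(__name__)

try:
    import cupy  # noqa: F401
    logger.info("cupy detected: kquant quantization will run on GPU")
except ImportError:
    logger.info(
        "cupy not found: kquant falls back to CPU numpy; "
        "run 'pip install cupy-cuda12x' to enable GPU acceleration"
    )
```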
2. GPU acceleration for RTN quantization
OnnxBlockWiseRtnQuantization currently uses CPU-only numpy for _quantize_ndarray (used by Gather/embedding quantization). For very large models (31B+), this becomes a bottleneck. A PyTorch or cupy GPU path (similar to kquant) would be straightforward to add — the math is simple per-group min/max/scale/round.
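To make the suggestion concrete, here is the per-group math in PyTorch, which runs on GPU for free when the tensor lives there. This is a sketch under assumed asymmetric uint4 settings, not a drop-in replacement for _quantize_ndarray:

```python
# Sketch of a GPU RTN path in PyTorch: per-group min/max/scale/round.
# Assumes asymmetric uint4; not a drop-in for _quantize_ndarray.
import torch

def rtn_quantize(weights: torch.Tensor, group_size: int = 32, bits: int = 4):
    """Blockwise round-to-nearest on whatever device `weights` is on."""
    qmax = (1 << bits) - 1
    w = weights.reshape(-1, group_size)
    wmin = w.min(dim=1, keepdim=True).values
    wmax = w.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero_point = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, qmax)
    return q.to(torch.uint8), scale, zero_point

# Usage: move the weights to GPU once, quantize, copy results back.
# q, s, zp = rtn_quantize(torch.from_numpy(np_weights).cuda())
```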
3. Simpler path to quantize existing ONNX models
Currently, re-quantizing a pre-built ONNX model requires going through the full MobiusModelBuilder pipeline again. It would be nice to have a lighter path that takes an existing ONNX model directory as input and only runs the quantization pass(es).
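Sketching what we have in mind in the same recipe shape as above; whether an ONNX input model can feed these passes directly today is exactly the open question, so the type name here is an assumption:

```python
# Hypothetical ONNX-only recipe: take an existing export dir straight to
# quantization, skipping the MobiusModelBuilder rebuild. The input-model
# type name is our assumption about how Olive would spell this.
from olive.workflows import run as olive_run

recipe = {
    "input_model": {
        "type": "OnnxModel",                    # assumed type name
        "model_path": "models/gemma4-e2b-f16",  # existing ONNX export
    },
    "passes": {
        "quantize": {"type": "OnnxKQuantQuantization"},
    },
    "output_dir": "models/gemma4-e2b-q4_k_m",
}

olive_run(recipe)
```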
Our Workflow
HuggingFace model → mobius (build ONNX) → Olive (quantize) → upload to HF
- 8 Gemma4 models × 8 variants = 64 total exports
- Variants: f16, bf16, Q4_K_M (kquant), NF4 (bnb4), plus CPU/CUDA targets
- All tested and verified end-to-end
Thank you for the great tooling!