This feedback was generated by Copilot.
Positive Feedback
We used Olive to export and quantize all 8 Gemma4 models (E2B, E4B, 26B, 31B — base and instruct variants) across multiple formats (f16, bf16, Q4_K_M, NF4). The experience was excellent — 64 total exports, all verified. Here is our feedback.
MobiusModelBuilder pass (PR #2406)
Seamless integration with mobius for ONNX model building. The 2-pass pipeline (build → quantize) in the Olive recipe format is clean and intuitive. The Gemma4 recipes (PR microsoft/olive-recipes#381) worked out of the box.
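For reference, here is roughly what our two-pass recipe looked like, expressed as the Python dict form of an Olive workflow config. The pass names come from the PRs above; the model id and option layout are illustrative assumptions, not a copy of the shipped Gemma4 recipes.

```python
# Sketch of a two-pass Olive recipe (build -> quantize). Pass names come
# from PR #2406 / olive-recipes#381; the model id and keys shown here are
# illustrative assumptions, not copied from the shipped recipes.
from olive.workflows import run as olive_run

recipe = {
    "input_model": {
        "type": "HfModel",
        "model_path": "google/gemma-4-e2b-it",  # hypothetical HF id
    },
    "passes": {
        # Pass 1: build the ONNX graph via mobius.
        "build": {"type": "MobiusModelBuilder"},
        # Pass 2: quantize the built graph (Q4_K_M via kquant here).
        "quantize": {"type": "OnnxKQuantQuantization"},
    },
    "output_dir": "models/gemma4-e2b-q4_k_m",
}

olive_run(recipe)
```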
Built-in cupy GPU acceleration for kquant
The auto-GPU detection in OnnxKQuantQuantization is excellent: we measured a 28x speedup over CPU numpy when cupy and CUDA are available (tested on 4M-element weight tensors). The fallback to CPU numpy is seamless.
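As best we can tell, the pattern is the familiar try-import-fall-back idiom. A minimal sketch of the behavior we observed, not the pass's actual source:

```python
# Sketch of the GPU-with-CPU-fallback pattern we observed; this is our
# reading of the behavior, not OnnxKQuantQuantization's actual source.
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    xp = cp
except Exception:
    xp = np  # seamless fallback to CPU numpy

def absmax_scales(weights, block_size=32):
    """Per-block absmax scales, computed on GPU when cupy is usable."""
    w = xp.asarray(weights).reshape(-1, block_size)
    return xp.abs(w).max(axis=1)
```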
OnnxBnb4Quantization (NF4)
Blazing fast — 42ms for 67M parameters. Delegates correctly to ORT's MatMulBnb4Quantizer.
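For readers unfamiliar with NF4: each block of weights is scaled by its absmax, and every value is snapped to the nearest of 16 fixed quantile levels. A numpy sketch of the math (the actual pass delegates this to ORT's MatMulBnb4Quantizer, which runs in native code, hence the speed):

```python
# What NF4 quantization computes, sketched in numpy. The real pass
# delegates to ORT's MatMulBnb4Quantizer; this is only the math.
import numpy as np

# The 16 NF4 levels from the QLoRA paper, rounded to 4 decimals here.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(weights, block_size=64):
    """Blockwise NF4: scale each block by its absmax, snap to levels.
    Assumes weights.size is a multiple of block_size."""
    w = weights.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True)
    scales = np.where(scales == 0.0, 1.0, scales)  # guard all-zero blocks
    normalized = w / scales
    # Index of the nearest NF4 level for every element.
    idx = np.abs(normalized[..., None] - NF4_LEVELS).argmin(axis=-1)
    return idx.astype(np.uint8), scales.squeeze(1)
```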
Suggestions for Improvement
1. Document the cupy GPU acceleration
The auto-GPU feature in kquant is great but easy to miss. Users may not know they need to run pip install cupy-cuda12x to get the 28x faster quantization path. Consider:
- Adding a note in the pass docstring or README
- Printing an info message when cupy is detected and GPU is used, versus falling back to CPU (see the sketch after this list)
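Something as small as this in the pass setup would surface it; the logger wiring and placement are hypothetical, only the message is our suggestion:

```python
# Hypothetical placement inside the pass setup; the logger wiring is an
# assumption, the suggestion is just to report which path was chosen.
import logging

logger = logging.getLogger(__name__)

try:
    import cupy  # noqa: F401
    logger.info("cupy detected: kquant quantization will run on GPU")
except ImportError:
    logger.info(
        "cupy not found: kquant falls back to CPU numpy; "
        "run 'pip install cupy-cuda12x' to enable GPU acceleration"
    )
```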
2. GPU acceleration for RTN quantization
OnnxBlockWiseRtnQuantization currently uses CPU-only numpy for _quantize_ndarray (used by Gather/embedding quantization). For very large models (31B+), this becomes a bottleneck. A PyTorch or cupy GPU path (similar to kquant) would be straightforward to add — the math is simple per-group min/max/scale/round.
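To make the suggestion concrete, here is the per-group math in PyTorch, which runs on GPU for free when the tensor lives there. This is a sketch under assumed asymmetric uint4 settings, not a drop-in replacement for _quantize_ndarray:

```python
# Sketch of a GPU RTN path in PyTorch: per-group min/max/scale/round.
# Assumes asymmetric uint4; not a drop-in for _quantize_ndarray.
import torch

def rtn_quantize(weights: torch.Tensor, group_size: int = 32, bits: int = 4):
    """Blockwise round-to-nearest on whatever device `weights` is on."""
    qmax = (1 << bits) - 1
    w = weights.reshape(-1, group_size)
    wmin = w.min(dim=1, keepdim=True).values
    wmax = w.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero_point = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, qmax)
    return q.to(torch.uint8), scale, zero_point

# Usage: move the weights to GPU once, quantize, copy results back.
# q, s, zp = rtn_quantize(torch.from_numpy(np_weights).cuda())
```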
3. Simpler path to quantize existing ONNX models
Currently, re-quantizing a pre-built ONNX model requires going through the full MobiusModelBuilder pipeline again. It would be nice to have a lighter path that takes an existing ONNX model directory as input and only runs the quantization pass(es).
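Sketching what we have in mind in the same recipe shape as above; whether an ONNX input model can feed these passes directly today is exactly the open question, so the type name here is an assumption:

```python
# Hypothetical ONNX-only recipe: take an existing export dir straight to
# quantization, skipping the MobiusModelBuilder rebuild. The input-model
# type name is our assumption about how Olive would spell this.
from olive.workflows import run as olive_run

recipe = {
    "input_model": {
        "type": "OnnxModel",                    # assumed type name
        "model_path": "models/gemma4-e2b-f16",  # existing ONNX export
    },
    "passes": {
        "quantize": {"type": "OnnxKQuantQuantization"},
    },
    "output_dir": "models/gemma4-e2b-q4_k_m",
}

olive_run(recipe)
```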
Our Workflow
HuggingFace model → mobius (build ONNX) → Olive (quantize) → upload to HF
- 8 Gemma4 models × 8 variants = 64 total exports
- Variants: f16, bf16, Q4_K_M (kquant), NF4 (bnb4), plus CPU/CUDA targets
- All tested and verified end-to-end
Thank you for the great tooling!