
Olive feedback: excellent quantization experience with Gemma4 models #2440

@justinchuby

Description


This feedback was generated by Copilot.

Positive Feedback

We used Olive to export and quantize all 8 Gemma4 models (E2B, E4B, 26B, and 31B, in base and instruct variants) across multiple formats (f16, bf16, Q4_K_M, NF4). The experience was excellent: 64 total exports, all verified. Here is our feedback.

MobiusModelBuilder pass (PR #2406)

Integration with mobius for ONNX model building is seamless. The two-pass pipeline (build → quantize) in the Olive recipe format is clean and intuitive, and the Gemma4 recipes (PR microsoft/olive-recipes#381) worked out of the box.

Built-in cupy GPU acceleration for kquant

The auto-GPU detection in OnnxKQuantQuantization is excellent: we measured a 28x speedup over CPU numpy when cupy and CUDA are available (tested on 4M-element weights). The fallback to CPU numpy is seamless.
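
Because cupy mirrors the numpy API, the dual-backend pattern is easy to picture. A minimal sketch of the idea, assuming a simple absmax scheme (illustrative only, not Olive's actual kquant implementation; `quantize_blocks` is a made-up name):

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)

try:
    import cupy as cp  # e.g. pip install cupy-cuda12x for CUDA 12.x
    _USE_GPU = cp.cuda.is_available()
except ImportError:
    cp = None
    _USE_GPU = False


def quantize_blocks(weights: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Block-wise absmax int8 quantization, on GPU when cupy + CUDA are usable."""
    xp = cp if _USE_GPU else np  # same array API either way
    logger.info("quantization backend: %s", "cupy/GPU" if _USE_GPU else "numpy/CPU")

    # Assumes weights.size is a multiple of block_size, for brevity.
    w = xp.asarray(weights, dtype=xp.float32).reshape(-1, block_size)
    scale = xp.abs(w).max(axis=1, keepdims=True) / 127.0
    q = xp.rint(w / xp.maximum(scale, 1e-12)).astype(xp.int8)
    return cp.asnumpy(q) if _USE_GPU else q
```

The `xp` switch is also why the fallback feels seamless: the GPU and CPU paths are literally the same lines of code.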

OnnxBnb4Quantization (NF4)

Blazing fast: 42 ms for 67M parameters. It delegates correctly to ORT's MatMulBnb4Quantizer.
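
For context, NF4 normalizes each weight block by its absmax and snaps every value to the nearest of 16 fixed levels (the NormalFloat4 codebook from the QLoRA paper). A rough numpy sketch of that mapping; the pass itself delegates the real work to ORT, so this is purely illustrative:

```python
import numpy as np

# NormalFloat4 codebook (16 levels) from the QLoRA paper, rounded to 4 decimals.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
], dtype=np.float32)


def nf4_quantize_block(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one weight block: normalize by absmax, snap to nearest NF4 level."""
    absmax = float(np.abs(block).max()) or 1.0  # avoid division by zero
    normalized = block / absmax
    # 4-bit index of the nearest codebook entry for each weight.
    indices = np.abs(normalized[:, None] - NF4_LEVELS[None, :]).argmin(axis=1)
    return indices.astype(np.uint8), absmax
```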


Suggestions for Improvement

1. Document the cupy GPU acceleration

The auto-GPU feature in kquant is great but easy to miss. Users may not know they need `pip install cupy-cuda12x` to get 28x faster quantization. Consider:

  • Adding a note in the pass docstring or README
  • Printing an info message when cupy is detected and GPU is used (vs falling back to CPU)

2. GPU acceleration for RTN quantization

OnnxBlockWiseRtnQuantization currently uses CPU-only numpy for _quantize_ndarray (used for Gather/embedding quantization). For very large models (31B+), this becomes a bottleneck. A PyTorch or cupy GPU path (similar to kquant) would be straightforward to add, since the math is simple per-group min/max/scale/round; see the sketch below.
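
To make the "simple math" claim concrete, here is a minimal sketch of per-group asymmetric RTN in numpy (function name and exact scheme are illustrative, not the _quantize_ndarray internals). Every step is a per-group reduction or an elementwise op, so swapping numpy for cupy, or arrays for torch tensors on CUDA, gives the GPU path almost for free:

```python
import numpy as np


def rtn_quantize_groups(w: np.ndarray, group_size: int = 32, bits: int = 4):
    """Per-group asymmetric round-to-nearest: min/max -> scale/zero-point -> round."""
    qmax = (1 << bits) - 1
    # Assumes w.size is a multiple of group_size, for brevity.
    g = w.astype(np.float32).reshape(-1, group_size)
    gmin = g.min(axis=1, keepdims=True)
    gmax = g.max(axis=1, keepdims=True)
    scale = np.maximum((gmax - gmin) / qmax, 1e-12)
    zero_point = np.rint(-gmin / scale)
    q = np.clip(np.rint(g / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point
```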

3. Simpler path to quantize existing ONNX models

Currently, re-quantizing a pre-built ONNX model requires going through the full MobiusModelBuilder pipeline again. It would be nice to have a lighter path that takes an existing ONNX model directory as input and only runs the quantization pass(es).
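
A hypothetical sketch of what such a recipe could look like, with an existing ONNX model as the input and only the quantization pass in the pipeline. The field names here are illustrative, not a confirmed Olive schema:

```json
{
  "input_model": {
    "type": "ONNXModel",
    "model_path": "gemma4-e2b-it-f16/model.onnx"
  },
  "passes": {
    "quantize": { "type": "OnnxKQuantQuantization", "quant_type": "Q4_K_M" }
  },
  "output_dir": "gemma4-e2b-it-q4km"
}
```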


Our Workflow

HuggingFace model → mobius (build ONNX) → Olive (quantize) → upload to HF
  • 8 Gemma4 models × 8 variants = 64 total exports
  • Variants: f16, bf16, Q4_K_M (kquant), NF4 (bnb4), plus CPU/CUDA targets
  • All tested and verified end-to-end

Thank you for the great tooling!
