Show-and-tell: my advanced-gguf-quantizer tool for NVFP4,MXFP6,Q2,Q3,Q4 #23853

michaelw9999 · 2026-05-29T04:51:03Z

Hi everyone,
I wanted to share a tool I've been working on for a while, having started as a total newbie to the AI world about a year ago. I feel it's finally working well enough to be useful to the llama.cpp/AI community. It's still a work in progress and should by no means be considered "PR worthy", even on its own. I'm hoping the llama.cpp community can help make it better and find bugs. I'm still learning more every day, so it is far from perfect.

I'm calling this an advanced-gguf-quantizer. I decided a while back to not make any PRs for these changes. This project's scope is huge, as it relies on tons of separate components all working together. It's also the backend for my other experiments and side projects. This is a standalone tool with CUDA llama.cpp as its engine. I hope this was the right approach.

There are a lot of cleanup tasks still to do, along with more improvements to the tool itself and to how it encodes. Much of this still needs consolidating and de-slopping from the AI that was recently used to merge all the micro projects together. Not all of this was 100% AI-generated, but it would not have been possible without it.

This tool leverages a range of quantization techniques and tricks and is designed to produce the highest-quality llama.cpp model possible. It was originally meant for NVFP4; it now also quantizes MXFP6, and I've recently added Q2/Q3/Q4. The most distinctive part is what I'm calling RSF (Refined Scale Fit), which I'll briefly explain further below.

Short summary of what this does:

- Uses imatrix for activation-aware quantization of NVFP4, MXFP6, and mixed NVFP4/MXFP8 models.
- Takes an input BF16 model and KLD file, runs a configurable series of quantization techniques and candidate searches that evaluate PPL, KLD, and the other metrics layer by layer, and produces a final, highly optimized, calibrated GGUF model, aiming for a Pareto-optimal combination across every candidate. It automatically determines the best distribution of layers based on the specified parameters and search depth. This can run as a quick quantize mode or as a deeper search that can take hours. A finished, working model can be reloaded later to continue optimizing if desired.
- Uses CUDA-accelerated encoding and evaluation for llama-perplexity and llama-imatrix, and for patching the model during the run; this would otherwise not be feasible.
- Has optional recipe and project files with keyed checkpoints to make models reproducible.
- Can start/stop and resume from checkpoints without losing much progress.
- Provides a final quantization report with a tensor manifest and assignment log.
- Can repair and edit GGUFs in place without needing to requantize the entire model.
- Has an optional text-based UI and wizard for creating new projects and loading existing ones, in an attempt to make it user friendly; this still needs improvement.
- Has various SKILLS.md files and instructions that let agents monitor, understand, and adjust run parameters as needed.
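
To make the "Pareto" goal in the summary concrete, here is a minimal sketch of picking the non-dominated candidates on two axes (size and mean KLD, lower is better for both). It is not the tool's actual search: the `Candidate` class and `pareto_front` function are hypothetical names, and whole-model rows from the comparison table later in this post stand in for the per-layer candidates the tool evaluates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str        # hypothetical label for a quantization choice
    size_gb: float   # resulting model size
    mean_kld: float  # mean KL divergence against the BF16 base


def pareto_front(cands):
    """Return candidates not dominated on (size_gb, mean_kld); lower is better."""
    front = []
    best_kld = float("inf")
    # Walk from smallest to largest; keep a candidate only if it beats the
    # best KLD seen among all smaller (or equal-size) candidates.
    for c in sorted(cands, key=lambda c: (c.size_gb, c.mean_kld)):
        if c.mean_kld < best_kld:
            front.append(c)
            best_kld = c.mean_kld
    return front


# Size and mean KLD figures taken from the comparison table in this post.
cands = [
    Candidate("Q4_0", 16.04, 0.037750),
    Candidate("a-g-q", 15.95, 0.058781),
    Candidate("Converted ModelOpt", 19.65, 0.078689),
]
front = pareto_front(cands)
for c in front:
    print(f"{c.name}: {c.size_gb} GB, mean KLD {c.mean_kld}")
```

On these two axes alone, the converted ModelOpt model is dominated (larger and higher KLD), while the other two trade size against KLD. A real search would presumably weigh more metrics, such as speed.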

RSF (Refined Scale Fit) is the final optimization: it does the 'what if' for each of the best candidates determined in the search stage, then makes small scale adjustments to refit and refine each NVFP4 weight scale against what the final PPL/KLD data would become. This consistently brought significant improvements in quality and a reduction in model size across the board.
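
As a rough illustration of "small scale adjustments", the sketch below refits the scale of a single 16-weight block on a 4-bit E2M1 grid. It is a simplified stand-in, not RSF itself: RSF refits against PPL/KLD, whereas this uses an importance-weighted reconstruction error as a proxy, keeps the scale as a plain float (no FP8 scale rounding), and all function names, the `span` and `steps` parameters, and the random "importance" vector are assumptions made for the example.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[:0:-1], E2M1])  # signed grid, ascending


def quantize_block(w, scale):
    """Round each w/scale to the nearest grid value, then dequantize."""
    idx = np.abs(w[:, None] / scale - GRID[None, :]).argmin(axis=1)
    return GRID[idx] * scale


def weighted_err(w, q, importance):
    """Importance-weighted squared error (proxy metric, an assumption)."""
    return float(np.sum(importance * (w - q) ** 2))


def refit_scale(w, importance, steps=33, span=0.25):
    """Try scales within +/-span of the max-abs scale; keep the best one."""
    base = float(np.abs(w).max()) / E2M1[-1]
    if base == 0.0:
        return 1.0  # all-zero block: any scale reproduces it exactly
    # Include the unmodified scale so the refit can never be worse than it.
    candidates = np.concatenate(
        [[base], base * np.linspace(1.0 - span, 1.0 + span, steps)]
    )
    errs = [weighted_err(w, quantize_block(w, s), importance) for s in candidates]
    return float(candidates[int(np.argmin(errs))])


rng = np.random.default_rng(0)
w = rng.normal(size=16)                  # one block of weights
imp = rng.uniform(0.1, 1.0, size=16)     # stand-in for imatrix importance
base_scale = float(np.abs(w).max()) / E2M1[-1]
fit_scale = refit_scale(w, imp)
base_err = weighted_err(w, quantize_block(w, base_scale), imp)
refit_err = weighted_err(w, quantize_block(w, fit_scale), imp)
print(f"max-abs scale error: {base_err:.6f}, refit scale error: {refit_err:.6f}")
```

Because the max-abs scale is always among the candidates, the refit error is never higher than the starting error on the proxy metric. Whether that carries over to PPL/KLD is exactly what a model-level evaluation has to confirm.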
For example, these are the results for Qwen3.6-27B-NVFP4-MTP:

====== Perplexity statistics ======
Mean PPL(Q)                   :   7.128133 ±   0.047613
Mean PPL(base)                :   6.900856 ±   0.045374
Cor(ln(PPL(Q)), ln(PPL(base))):  98.54%
Mean ln(PPL(Q)/PPL(base))     :   0.032404 ±   0.001137
Mean PPL(Q)/PPL(base)         :   1.032935 ±   0.001175
Mean PPL(Q)-PPL(base)         :   0.227277 ±   0.008252

====== KL divergence statistics ======
Mean    KLD:   0.058781 ±   0.000935
Maximum KLD:  20.671190
99.9%   KLD:   4.869176
99.0%   KLD:   0.591435
95.0%   KLD:   0.165489
90.0%   KLD:   0.096133
Median  KLD:   0.019355
10.0%   KLD:   0.000493
 5.0%   KLD:   0.000136
 1.0%   KLD:   0.000018
 0.1%   KLD:   0.000002
Minimum KLD:  -0.000024

====== Token probability statistics ======
Mean    Δp: -0.342 ± 0.017 %
Maximum Δp: 99.944%
99.9%   Δp: 45.234%
99.0%   Δp: 16.022%
95.0%   Δp:  7.179%
90.0%   Δp:  4.097%
75.0%   Δp:  0.790%
Median  Δp: -0.006%
25.0%   Δp: -1.105%
10.0%   Δp: -4.813%
 5.0%   Δp: -8.400%
 1.0%   Δp: -21.551%
 0.1%   Δp: -63.523%
Minimum Δp: -99.502%
RMS Δp    :  6.613 ± 0.060 %
Same top p: 90.450 ± 0.076 %
Top flip weight:   0.009715
Top prob RMSE  :   0.080946
Entropy RMSE   :   0.248197
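
For anyone unfamiliar with these statistics, here is a minimal sketch of how the headline numbers can be derived from base and quantized logits. It reflects my understanding of the definitions (KLD of the quantized distribution relative to the base, Δp as the change in probability of the correct token, "same top p" as how often both models agree on the top token) and is not the llama-perplexity implementation; the `compare` function and the random logits are assumptions for the example.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def compare(base_logits, quant_logits, targets):
    """Per-token comparison of a quantized model against its base."""
    lp_b = log_softmax(base_logits)
    lp_q = log_softmax(quant_logits)
    p_b = np.exp(lp_b)
    kld = (p_b * (lp_b - lp_q)).sum(axis=-1)      # KL(base || quant) per token
    rows = np.arange(len(targets))
    # Change in the probability assigned to the correct (target) token.
    dp = np.exp(lp_q[rows, targets]) - p_b[rows, targets]
    return {
        "mean_kld": float(kld.mean()),
        "same_top": float((lp_b.argmax(-1) == lp_q.argmax(-1)).mean()),
        "mean_dp": float(dp.mean()),
        "rms_dp": float(np.sqrt((dp ** 2).mean())),
        # ln(PPL(Q)/PPL(base)) is the mean difference in target log-probability.
        "ln_ppl_ratio": float((lp_b[rows, targets] - lp_q[rows, targets]).mean()),
    }


rng = np.random.default_rng(0)
base = rng.normal(size=(64, 32))                   # 64 tokens, vocab of 32
quant = base + 0.1 * rng.normal(size=base.shape)   # simulated quantization noise
targets = rng.integers(0, 32, size=64)
stats = compare(base, quant, targets)
for key, value in stats.items():
    print(f"{key}: {value:.6f}")
```

KL divergence is non-negative in exact arithmetic, so a slightly negative "Minimum KLD" like the one reported above is most likely floating-point rounding on tokens where the two distributions are almost identical.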

| source | size | ppl | ln(ppl ratio) | mean kld | p99 kld | p99.9 kld | max kld | RMS Δp | Same top p | pp512 (t/s) | tg512 (t/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Q4_0 | 16.04 GB | 7.033601 | 0.019053 | 0.037750 | 0.334307 | 3.459529 | 27.287447 | 5.246% | 92.49% | 3664.31 | 81.74 |
| a-g-q | 15.95 GB | 7.128133 | 0.032404 | 0.058781 | 0.591435 | 4.869176 | 20.671190 | 6.613% | 90.45% | 5612.53 | 80.76 |
| Converted ModelOpt | 19.65 GB | 7.267070 | 0.051708 | 0.078689 | 0.803567 | 5.857053 | 26.001289 | 7.700% | 89.091% | 5566.96 | 64.03 |

I have experimented quite a bit with the tuning parameters and search depth, that is, how much effort to spend to get better scales and values. There are still too many search candidates, so there is a lot of room to speed this up. A lot of effort went into keeping as much data on the device as possible, preventing unnecessary host/device copies, and working within the available memory; I am sure this can become faster with time if anyone wants to help.

I have not yet run formal quality benchmarks; I have only confirmed coherency by evaluating the replies to my difficult questions when chatting with the model.

Another exciting thing: this does not need to be used just for NVFP4. In a single experimental try, it improved the Q2 and Q3 metrics on Qwen3.5-9B, and there's room to explore further:

| model | ppl | ln(ppl ratio) | mean kld | p99 kld | p99.9 kld | max kld | RMS Δp | Same top p |
|---|---|---|---|---|---|---|---|---|
| Q2_K baseline | 9.29 | 0.126563 | 0.215156 | 1.869003 | 7.017129 | 28.695597 | 12.958% | 80.147% |
| Q2_K a-g-q | 9.23 | 0.120322 | 0.213368 | 1.868077 | 6.809549 | 28.500494 | 12.843% | 80.248% |
| Q3_K_M baseline | 8.58 | 0.047611 | 0.076897 | 0.655403 | 3.245732 | 24.672821 | 7.671% | 87.957% |
| Q3_K_M a-g-q | 8.52 | 0.039882 | 0.075962 | 0.636669 | 3.279779 | 23.609482 | 7.593% | 88.085% |
| NVFP4 a-g-q | 8.41 | 0.026870 | 0.062676 | 0.497322 | 2.517230 | 24.389235 | 6.792% | 88.816% |

I hope to see what models people can come up with using this (with imatrix data other than wikitrain as well).
I would love to know how this works on DGX or even bigger setups with more VRAM and/or power, if anyone can test it; I do not have those at my disposal (I wish I did!).

michaelw9999 · 2026-06-01T04:35:56Z

This is the model I've spent the most quantizer time on so far; it shows the benefit of mixing MXFP6 with NVFP4.
Qwen3.6-27B-NVFP4-MXFP6-MTP-GGUF
This is a combination of NVFP4, Q4_K, MXFP6, and Q6_K.
The quantizer suggested this combination based on the requirements and limitations it was given.
The MTP tensors are quantized to MXFP6, but the majority of the model remains NVFP4 to keep the speed up.
The NVFP4 repack with additional tuning further increases NVFP4 prefill by ~10% and is nearly equivalent on token generation, while the use of its input_scale shows a more significant increase in model quality than seen previously.

Speed on branch NVFP4repack+MXFP6+CUDA:

| qwen35 27B NVFP4-MXFP6 |  15.23 GiB |    27.32 B | CUDA       |  99 |  pp512 |      5675.78 ± 94.88 |
| qwen35 27B NVFP4-MXFP6 |  15.23 GiB |    27.32 B | CUDA       |  99 |  tg128 |         78.63 ± 1.49 |

PPL/KLD results for wiki.test:

====== Perplexity statistics ======
Mean PPL(Q)                   :   7.085396 ±   0.047221
Mean PPL(base)                :   6.900856 ±   0.045374
Cor(ln(PPL(Q)), ln(PPL(base))):  98.71%
Mean ln(PPL(Q)/PPL(base))     :   0.026390 ±   0.001067
Mean PPL(Q)/PPL(base)         :   1.026742 ±   0.001095
Mean PPL(Q)-PPL(base)         :   0.184541 ±   0.007658

====== KL divergence statistics ======
Mean    KLD:   0.054343 ±   0.000910
Maximum KLD:  22.079569
99.9%   KLD:   4.463678
99.0%   KLD:   0.534394
95.0%   KLD:   0.152599
90.0%   KLD:   0.088228
Median  KLD:   0.018090
10.0%   KLD:   0.000447
 5.0%   KLD:   0.000122
 1.0%   KLD:   0.000016
 0.1%   KLD:   0.000002
Minimum KLD:  -0.000056

====== Token probability statistics ======
Mean    Δp: -0.220 ± 0.017 %
Maximum Δp: 99.832%
99.9%   Δp: 45.711%
99.0%   Δp: 15.914%
95.0%   Δp:  7.224%
90.0%   Δp:  4.181%
75.0%   Δp:  0.864%
Median  Δp: -0.002%
25.0%   Δp: -0.956%
10.0%   Δp: -4.468%
 5.0%   Δp: -7.877%
 1.0%   Δp: -20.814%
 0.1%   Δp: -60.731%
Minimum Δp: -99.832%
RMS Δp    :  6.452 ± 0.059 %
Same top p: 90.869 ± 0.075 %
