Show-and-tell: my advanced-gguf-quantizer tool for NVFP4,MXFP6,Q2,Q3,Q4 #23853
Replies: 1 comment
This is my latest model and the one I've spent the most quantizer time on so far; it shows the benefit of mixing MXFP6 with NVFP4. Speed on branch NVFP4repack+MXFP6+CUDA: PPL/KLD results for wiki.test:
Hi everyone,
I wanted to share a new tool that I've been working on for a while, having started as a total newbie to the AI world about a year ago. I feel it finally works well enough to be useful to the llama.cpp/AI community. It's still a work in progress and should by no means be considered "PR worthy", even on its own. I'm hoping the llama.cpp community can help make it better and find bugs. I'm still learning more every day, so this is far from perfect.
I'm calling this an advanced-gguf-quantizer. I decided a while back to not make any PRs for these changes. This project's scope is huge, as it relies on tons of separate components all working together. It's also the backend for my other experiments and side projects. This is a standalone tool with CUDA llama.cpp as its engine. I hope this was the right approach.
There are a lot of cleanup tasks to do, plus more improvements to the tool itself and to how it encodes; much of this still needs consolidating and de-slopping from the AI that was recently used to merge all the micro projects together. Not all of this was AI-generated, but it would not have been possible without it.
This tool leverages a bunch of techniques and quantization tricks, and is designed to make the highest-quality llama.cpp model possible. It was originally meant for NVFP4; it now also quantizes MXFP6, and I've recently added Q2/Q3/Q4. I'm calling the most unique part RSF (Refined Scale Fit), which I'll briefly explain below.
Short summary of what this does:
RSF (Refined Scale Fit) is the final optimization: it runs a "what if" for each of the best candidates found in the search stage, then makes small scale adjustments to refit and refine each NVFP4 weight scale against what the final PPL/KLD data would become. This consistently brought significant quality improvements and reduced model size across the board.
For example, this is Qwen3.6-27B-NVFP4-MTP:
I have experimented quite a bit with the tuning parameters and search depth, i.e. how much effort to spend on finding better scales and values. There are still too many search candidates, so there is a lot of room to speed this up. A lot of effort went into keeping as much data on the device as possible, avoiding unnecessary host/device copies, and working within the available memory, and I am sure this can become faster with time if anyone wants to help.
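To make the scale-refit idea above a bit more concrete, here is a minimal sketch in plain Python. It is not the tool's actual code: the function names and the `span`/`steps` parameters are made up for illustration, and it scores candidates with an importance-weighted squared error on a single block as a stand-in for the PPL/KLD-driven objective described above. It also ignores that real NVFP4 block scales are themselves stored in FP8.

```python
# Illustrative sketch only; names and parameters are hypothetical,
# not taken from the advanced-gguf-quantizer source.

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block, scale):
    """Round each weight to the nearest signed E2M1 level at the given scale."""
    if scale <= 0.0:
        return [0.0] * len(block)
    out = []
    for w in block:
        mag = min(E2M1_LEVELS, key=lambda q: abs(abs(w) / scale - q))
        out.append(mag * scale if w >= 0 else -mag * scale)
    return out


def block_error(block, scale, importance=None):
    """Importance-weighted squared error (assumed proxy for the real objective)."""
    importance = importance or [1.0] * len(block)
    deq = quantize_block(block, scale)
    return sum(i * (w - d) ** 2 for w, d, i in zip(block, deq, importance))


def refine_scale(block, importance=None, span=0.25, steps=32):
    """Try scales around the max-abs baseline and keep the lowest-error one."""
    base = max(abs(w) for w in block) / E2M1_LEVELS[-1]
    best_scale, best_err = base, block_error(block, base, importance)
    for k in range(steps + 1):
        # Candidates sweep from base*(1-span) to base*(1+span).
        cand = base * (1.0 - span + 2.0 * span * k / steps)
        err = block_error(block, cand, importance)
        if err < best_err:
            best_scale, best_err = cand, err
    return best_scale, best_err


if __name__ == "__main__":
    weights = [0.1 * i - 0.8 for i in range(16)]  # one 16-element block
    scale, err = refine_scale(weights)
    print("refined scale:", scale, "error:", err)
```

Because the search starts from the max-abs baseline, the refined scale can never score worse than the baseline on the chosen error measure; the number of candidates (`steps`) is the obvious knob trading search time for quality.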
I have not yet done formal quality benchmarks, and have only confirmed coherency by evaluating the replies to my difficult questions when chatting with the model.
Another exciting thing: this does not need to be used just for NVFP4. In a single experimental run, it improved the Q2 and Q3 metrics on Qwen3.5-9B, and there's room to explore further:
I hope to see what models people come up with using this (with imatrix data other than wikitrain as well).
I would love to know how this works on DGX or even bigger setups with more VRAM and/or power if anyone can test it, but I do not have those at my disposal (I wish I did!)