Replies: 1 comment
Yeah, it's expected. But the point of the paper is that no calibration is required, which comes from the observation that catching some peaks is the most important thing in quantization. That similar quantization performance is achieved by such a simple method supports the concept.
---
A new quantization method called SINQ (from Sinkhorn-Normalized Quantization) has been published, see the paper, and promptly there is an issue in mainline `llama.cpp` to implement it as a "low hanging fruit".

The main idea of the SINQ paper is to use what they call "double scaling". I.e., if we consider a model tensor as a collection of `N x M` tiles, then instead of having a single float scale per tile, we represent it as $S_{ij} = f_i g_j$, where $i$ and $j$ are the tensor row and column indices, respectively. The idea seemed intriguing enough, but it adds an additional matrix-vector multiplication to the computation graph (activations times column scales $g_j$), which adds a significant overhead to a potential implementation in `ik_llama.cpp` (or `llama.cpp`). Hence, I decided to first see how this quantization method compares to the quantization types available in `ik_llama.cpp`.
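To make the "double scaling" idea concrete, here is a minimal sketch, assuming a Sinkhorn-style alternating normalization (my own illustration with made-up function names, not necessarily the exact procedure from the paper): row and column RMS factors are pulled out of a tile, and the balanced tile is then quantized round-to-nearest with a single float scale. The overhead mentioned above comes from the column scales $g_j$, which have to be applied to the activations (or turned into an extra matrix-vector multiplication) at inference time.

```python
# Minimal sketch of "double scaling": alternately normalize row and column RMS
# of a tile, accumulating the factors into row scales f and column scales g,
# then quantize the balanced tile with plain round-to-nearest.
import numpy as np

def sinkhorn_double_scale(W: np.ndarray, iters: int = 8):
    """Factor W (N x M) as diag(f) @ A @ diag(g), with A roughly balanced."""
    A = W.astype(np.float64).copy()
    f = np.ones(W.shape[0])
    g = np.ones(W.shape[1])
    for _ in range(iters):
        r = np.sqrt(np.mean(A * A, axis=1)) + 1e-12   # per-row RMS
        A /= r[:, None]; f *= r
        c = np.sqrt(np.mean(A * A, axis=0)) + 1e-12   # per-column RMS
        A /= c[None, :]; g *= c
    return A, f, g

def quantize_tile(W: np.ndarray, nbits: int = 4, iters: int = 8):
    """Return (q, scale, f, g); dequantized tile is f[:,None] * g[None,:] * scale * q."""
    A, f, g = sinkhorn_double_scale(W, iters)
    qmax = 2 ** (nbits - 1) - 1
    scale = np.max(np.abs(A)) / qmax                  # one float scale for the balanced tile
    q = np.clip(np.round(A / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale, f, g

# Quick check on a tile with strongly imbalanced rows
W = np.random.randn(64, 128) * np.random.lognormal(sigma=2.0, size=(64, 1))
q, s, f, g = quantize_tile(W)
W_hat = f[:, None] * g[None, :] * (s * q)
print("relative RMS error:", np.sqrt(np.mean((W - W_hat) ** 2) / np.mean(W * W)))
```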
In their evaluation the SINQ authors focus on the dense Qwen3 series of models (Qwen3-1.7B, Qwen3-14B and Qwen3-32B). In the interest of time spent, I'll use just Qwen3-14B (which I happened to have lying around on my computer). They have results for Wikitext2 perplexity (Table 1 of the paper). Given that they don't state the context used to evaluate PPL, and given that PPL values computed with `llama`/`ik_llama.cpp` are different from what one gets using Python evaluation tools, we will look at the quantization error $E_Q$ defined as

$$E_Q = \frac{PPL(Q)}{PPL(bf16)} - 1,$$

where $Q$ is the quantization type, $PPL(Q)$ is the perplexity of the quantized model, and $PPL(bf16)$ is the perplexity of the `bf16` model. $E_Q$ is (nearly) independent of the specifics of the perplexity implementation and the context length.
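As a quick worked example of the definition (the quantized-model perplexity here is made up, just to illustrate the arithmetic): a hypothetical quantization with $PPL(Q) = 8.90$ against the paper's $PPL(bf16) = 8.64$ quoted below would give

$$E_Q = \frac{8.90}{8.64} - 1 \approx 0.030,$$

i.e., a 3% increase in perplexity relative to the `bf16` model.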
For Qwen3-14B the paper quotes $PPL(bf16) = 8.64$. The table shows what we get with `ik_llama.cpp` for a few context lengths.

I.e., their perplexity calculation with unknown context length corresponds to `ik_llama.cpp` perplexity with a context length of somewhere between 512 and 768. For simplicity I will just use a context of 512.

Based on the model sizes quoted in the paper, the output tensor and the token embedding tensor must have been left unquantized at `bf16`. For Qwen3-14B the embedding size is 5120 and the vocabulary size is 151936, so these two tensors contribute 3.112 GB to the model size. Excluding output and embeddings, Qwen3-14B has 13.212 B parameters. Their "3-bit" model size is 9.25 GB, so `(9.25 - 3.112)/13.212 x 8 = 3.72` bits-per-weight (bpw). The "4-bit" model size is 10.56 GB, so `(10.56 - 3.112)/13.212 x 8 = 4.51` bpw. I.e., their "3-bit" is larger than `IQ3_K`, `IQ3_S`, `Q3_K` (3.4375 bpw), while the "4-bit" is about the same as `Q4_K` and `IQ4_K`.
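Just to spell out the bpw arithmetic above, here is a small Python sketch (the constant names and the helper function are mine; sizes are in GB = 10^9 bytes):

```python
# Back out the effective bits-per-weight (bpw) of the SINQ models from the
# published model sizes, assuming output + token embedding stay in bf16.
embed_dim, vocab = 5120, 151936                      # Qwen3-14B
unquantized_gb = 2 * embed_dim * vocab * 2 / 1e9     # two bf16 tensors -> ~3.112 GB
quantized_params = 13.212e9                          # parameters excluding output/embeddings

def effective_bpw(model_size_gb: float) -> float:
    return (model_size_gb - unquantized_gb) * 1e9 * 8 / quantized_params

print(f"SINQ '3-bit': {effective_bpw(9.25):.2f} bpw")   # ~3.72
print(f"SINQ '4-bit': {effective_bpw(10.56):.2f} bpw")  # ~4.51
```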
The next table shows a comparison between the SINQ quants and a few `ik_llama.cpp` quants for $E_Q$. To be consistent with the SINQ paper, output and embedding tensors are left as `bf16`, and the effective bpw listed is only for the repeating layers. All `ik_llama.cpp` quantizations are "pure" (`--pure` command line option), except for `IQ2_KL`, which uses `IQ4_KS` for `attn_v` and `attn_output`.

A plot of the data in the above table is shown in the following graph (note that the y-axis is logarithmic).
We see that SINQ cannot even match k-quants (published ~2.5 years ago). The newer `ik_llama.cpp` quants achieve the same quantization error as SINQ using 0.7-0.9 fewer bits per weight.

So, in short, not worth adding this new quantization method.