Add missing GGUF quantization types (TQ1_0, TQ2_0, MXFP4) & move legacy types to the bottom (#2191)
* Add missing GGUF quantization types (TQ1_0, TQ2_0, MXFP4)
Sync the quantization types table with the source of truth at
huggingface.js/packages/gguf/src/quant-descriptions.ts:
- Add TQ1_0 and TQ2_0 (ternary quantization)
- Add MXFP4 (4-bit Microscaling Block Floating Point)
- Fix missing period on Q8_1 description
* Move legacy quantization types to bottom of table
Reorganize the table to place legacy types (Q8_0, Q8_1, Q5_0, Q5_1,
Q4_0, Q4_1) at the bottom with a separator row for better clarity.
| BF16 |[Wikipedia](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format)| 16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number. |
| Q8_K |[GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305)| 8-bit quantization (`q`). Each block has 256 weights. Only used for quantizing intermediate results. All 2-6 bit dot products are implemented for this quantization type. Weight formula: `w = q * block_scale`. |
| Q8_0 |[GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249)| 8-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). |
| Q8_1 |[GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290)| 8-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today). |
| Q5_0 |[GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249)| 5-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). |
| Q5_1 |[GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290)| 5-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today). |
| Q4_0 |[GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249)| 4-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). |
| Q4_1 |[GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290)| 4-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today). |
*If there's any inaccuracy in the table above, please open a PR on [this file](https://github.com/huggingface/huggingface.js/blob/main/packages/gguf/src/quant-descriptions.ts).*
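To make the weight formulas concrete, here is a minimal illustrative sketch (not llama.cpp's actual implementation) of Q8_0-style round-to-nearest block quantization: each block of 32 weights stores one scale plus 32 signed 8-bit values, and weights are reconstructed as `w = q * block_scale`. The function names and the scale choice (`amax / 127`) are assumptions for illustration only.

```python
def quantize_q8_0_block(weights):
    """Quantize one block of 32 floats to (scale, list of int8 q values).

    Round-to-nearest: q = round(w / block_scale), clamped to int8 range.
    """
    assert len(weights) == 32
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0 if amax > 0 else 1.0
    qs = [max(-128, min(127, round(w / scale))) for w in weights]
    return scale, qs

def dequantize_q8_0_block(scale, qs):
    """Reconstruct approximate weights: w = q * block_scale."""
    return [q * scale for q in qs]

# Round-trip one block and measure the reconstruction error.
block = [0.5, -1.0, 0.25, 0.0] + [0.1] * 28
scale, qs = quantize_q8_0_block(block)
recon = dequantize_q8_0_block(scale, qs)
max_err = max(abs(a - b) for a, b in zip(block, recon))
```

A Q8_1-style variant would additionally store a `block_minimum` and reconstruct with `w = q * block_scale + block_minimum`, trading extra per-block storage for a better fit when the block's values are not centered around zero.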