|
167 | 167 | "Dataset: WikiText-2.\n", |
168 | 168 | "\n", |
169 | 169 | "\n", |
170 | | - "| Model (preset) | Perplexity Increase % (↓ better) | Disk Storage Reduction Δ % (↓ better) | VRAM Reduction Δ % (↓ better) | First-token Latency Δ % (↓ better) | Throughput Δ % (↑ better) |\n", |
171 | | - "| ------------------------------------------- | -------------------------------: | ------------------------------------: | ----------------------------: | ---------------------------------: | ------------------------: |\n", |
172 | | - "| GPT2 (gpt2_base_en_cnn_dailymail) | 1.0% | -50.1% ↓ | -41.1% ↓ | +0.7% ↑ | +20.1% ↑ |\n", |
173 | | - "| OPT (opt_125m_en) | 10.0% | -49.8% ↓ | -47.0% ↓ | +6.7% ↑ | -15.7% ↓ |\n", |
174 | | - "| Bloom (bloom_1.1b_multi) | 7.0% | -47.0% ↓ | -54.0% ↓ | +1.8% ↑ | -15.7% ↓ |\n", |
175 | | - "| Gemma3 (gemma3_1b) | 3.0% | -51.5% ↓ | -51.8% ↓ | +39.5% ↑ | +5.7% ↑ |\n", |
| 170 | + "| Model (preset) | Perplexity Increase % (\u2193 better) | Disk Storage Reduction \u0394 % (\u2193 better) | VRAM Reduction \u0394 % (\u2193 better) | First-token Latency \u0394 % (\u2193 better) | Throughput \u0394 % (\u2191 better) |\n", |
| 171 | + "| --------------------------------- | -------------------------------: | ------------------------------------: | ----------------------------: | ---------------------------------: | ------------------------: |\n", |
| 172 | + "| GPT2 (gpt2_base_en_cnn_dailymail) | 1.0% | -50.1% \u2193 | -41.1% \u2193 | +0.7% \u2191 | +20.1% \u2191 |\n", |
| 173 | + "| OPT (opt_125m_en) | 10.0% | -49.8% \u2193 | -47.0% \u2193 | +6.7% \u2191 | -15.7% \u2193 |\n", |
| 174 | + "| Bloom (bloom_1.1b_multi) | 7.0% | -47.0% \u2193 | -54.0% \u2193 | +1.8% \u2191 | -15.7% \u2193 |\n", |
| 175 | + "| Gemma3 (gemma3_1b) | 3.0% | -51.5% \u2193 | -51.8% \u2193 | +39.5% \u2191 | +5.7% \u2191 |\n", |
176 | 176 | "\n", |
177 | 177 | "\n", |
178 | 178 | "Detailed benchmarking numbers and scripts are available\n", |
|
191 | 191 | }, |
192 | 192 | { |
193 | 193 | "cell_type": "markdown", |
194 | | - "source": "## GPTQ vs AWQ?\n\nBoth GPTQ and AWQ are weight-only quantization methods that require calibration\ndata. Here's how to choose between them:\n\n| Aspect | GPTQ | AWQ |\n| ------ | ---- | --- |\n| **Algorithm** | Hessian-based second-order optimization | Grid search for activation-aware scales |\n| **Quantization speed** | Slower (requires Hessian estimation) | Faster (no Hessian computation) |\n| **Bit-widths supported** | 2/3/4/8-bit | Only 4-bit supported for now |\n| **Accuracy** | Often slightly better on decoder LLMs | Competitive, especially on encoder models |\n| **Memory during quantization** | Higher (Hessian storage) | Lower |\n| **Calibration sensitivity** | May overfit calibration set, affecting out-of-distribution performance | Less prone to overfitting |\n\n**Choose GPTQ when:**\n\n* You need bit-widths other than 4 (e.g., 2-bit or 8-bit).\n* Maximum accuracy is critical and you can afford longer quantization time.\n* You're working with decoder-only LLMs where GPTQ may have a slight edge.\n\n**Choose AWQ when:**\n\n* You need faster quantization (AWQ is typically 2-3x faster than GPTQ).\n* Memory during quantization is constrained.\n* 4-bit is sufficient for your use case.\n* Your model will be used on diverse/out-of-distribution data (AWQ is less prone to overfitting on calibration data).", |
195 | | - "metadata": {} |
| 194 | + "metadata": { |
| 195 | + "colab_type": "text" |
| 196 | + }, |
| 197 | + "source": [ |
| 198 | + "## GPTQ vs AWQ?\n", |
| 199 | + "\n", |
| 200 | + "Both GPTQ and AWQ are weight-only quantization methods that require calibration\n", |
| 201 | + "data. Here's how to choose between them:\n", |
| 202 | + "\n", |
| 203 | + "| Aspect | GPTQ | AWQ |\n", |
| 204 | + "| ------ | ---- | --- |\n", |
| 205 | + "| **Algorithm** | Hessian-based second-order optimization | Grid search for activation-aware scales |\n", |
| 206 | + "| **Quantization speed** | Slower (requires Hessian estimation) | Faster (no Hessian computation) |\n", |
| 207 | + "| **Bit-widths supported** | 2/3/4/8-bit | 4-bit |\n", |
| 208 | + "| **Accuracy** | Often slightly better on decoder LLMs | Competitive, especially on encoder models |\n", |
| 209 | + "| **Memory during quantization** | Higher (Hessian storage) | Lower |\n", |
| 210 | + "| **Calibration sensitivity** | May overfit calibration set, affecting out-of-distribution performance | Less prone to overfitting |\n", |
| 211 | + "\n", |
| 212 | + "**Choose GPTQ when:**\n", |
| 213 | + "\n", |
| 214 | + "* You need bit-widths other than 4 (e.g., 2-bit or 8-bit).\n", |
| 215 | + "* Maximum accuracy is critical and you can afford longer quantization time.\n", |
| 216 | + "* You're working with decoder-only LLMs where GPTQ may have a slight edge.\n", |
| 217 | + "\n", |
| 218 | + "**Choose AWQ when:**\n", |
| 219 | + "\n", |
| 220 | + "* You need faster quantization (AWQ is typically 2-3x faster than GPTQ).\n", |
| 221 | + "* Memory during quantization is constrained.\n", |
| 222 | + "* 4-bit is sufficient for your use case.\n", |
| 223 | + "* Your model will be used on diverse/out-of-distribution data (AWQ is less prone to overfitting on calibration data)." |
| 224 | + ] |
196 | 225 | }, |
197 | 226 | { |
198 | 227 | "cell_type": "markdown", |
|
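The GPTQ-vs-AWQ decision rules in the cell above can be condensed into a toy helper. This is purely illustrative and not part of Keras, KerasHub, or any quantization library; it just encodes the rules of thumb from the comparison table:

```python
# Toy decision helper encoding the GPTQ-vs-AWQ guidance above.
# Illustrative only; not an API of any real quantization library.

def pick_method(bits, fast_quantization=False, low_memory=False, ood_data=False):
    """Return 'gptq' or 'awq' following the rules of thumb in the table."""
    if bits != 4:
        return "gptq"  # AWQ currently supports only 4-bit
    if fast_quantization or low_memory or ood_data:
        return "awq"   # faster, lower memory, less calibration overfit
    return "gptq"      # default to the accuracy-oriented choice
```

For example, needing 2-bit weights forces GPTQ, while a 4-bit job with tight quantization-time memory points to AWQ.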
207 | 236 | "* Use a representative calibration set; small slices are only for demos.\n", |
208 | 237 | "* Start with W4 group_size=128; tune per model/task." |
209 | 238 | ] |
210 | | - }, |
211 | | - { |
212 | | - "cell_type": "markdown", |
213 | | - "metadata": {}, |
214 | | - "source": [] |
215 | 239 | } |
216 | 240 | ], |
217 | 241 | "metadata": { |
|
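The "start with W4 group_size=128" tip above can be sanity-checked with quick arithmetic. The sketch below assumes one fp16 scale (16 bits) and one 4-bit zero point per group of weights; real on-disk formats vary, so treat the numbers as a rough estimate rather than a prediction of the benchmark table:

```python
# Back-of-the-envelope storage estimate for group-wise W4 quantization.
# Assumes one fp16 scale and one 4-bit zero point per group; real
# formats differ, so this is a rough sketch only.

def w4_bits_per_weight(group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight: 4 payload bits plus amortized metadata."""
    return 4 + (scale_bits + zero_bits) / group_size

bpw = w4_bits_per_weight()  # 4 + 20/128 = 4.15625 bits/weight
ratio = 16 / bpw            # compression vs. fp16 weights
print(f"{bpw} bits/weight, {ratio:.2f}x smaller than fp16")
```

Smaller groups give finer-grained scales (usually better accuracy) at higher metadata overhead, which is why group_size is worth tuning per model/task.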