"""
Title: GPTQ Quantization in Keras
Author: [Jyotinder Singh](https://x.com/Jyotinder_Singh)
Date created: 2025/10/16
Last modified: 2025/10/16
Description: How to run weight-only GPTQ quantization for Keras & KerasHub models.
Accelerator: GPU
"""

"""
## What is GPTQ?

GPTQ ("Generative Pre-trained Transformer Quantization") is a post-training,
weight-only quantization method that uses a second-order approximation of the
loss (via a Hessian estimate) to minimize the error introduced when compressing
weights to lower precision, typically 4-bit integers.
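
Concretely, for a layer with weight matrix `W` and calibration inputs `X`, GPTQ
picks quantized weights `W_q` that approximately minimize the reconstruction
error `||W X - W_q X||^2`. The Hessian of this objective, `H = 2 X X^T`, is
estimated from the calibration data and used to decide how the error introduced
by each quantized weight is compensated for in the weights that have not been
quantized yet.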

Unlike standard post-training techniques, GPTQ keeps activations in higher
precision and only quantizes the weights. This often preserves model quality in
low bit-width settings while still providing large storage and memory savings.

Keras supports GPTQ quantization for KerasHub models via the
`keras.quantizers.GPTQConfig` class.
"""

"""
## Load a KerasHub model

This guide uses the `Gemma3CausalLM` model from KerasHub, a small (1B
parameter) causal language model.
"""
import keras
from keras_hub.models import Gemma3CausalLM
from datasets import load_dataset


prompt = "Keras is a"

model = Gemma3CausalLM.from_preset("gemma3_1b")

# Generate with the full-precision model to get a baseline for comparison.
outputs = model.generate(prompt, max_length=30)
print(outputs)

"""
## Configure & run GPTQ quantization

You can configure GPTQ quantization via the `keras.quantizers.GPTQConfig` class.

The GPTQ configuration requires a calibration dataset and tokenizer, which it
uses to estimate the Hessian and quantization error. Here, we use a small slice
of the WikiText-2 dataset for calibration.

You can tune several parameters to trade off speed, memory, and accuracy. The
most important of these are `weight_bits` (the bit-width to quantize weights to)
and `group_size` (the number of weights that share one quantization scale). The
group size controls the granularity of quantization: smaller groups typically
yield better accuracy but are slower to quantize and may use more memory. A good
starting point is `group_size=128` for 4-bit quantization (`weight_bits=4`).
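
To build intuition for what `weight_bits` and `group_size` mean, here is a rough
sketch of plain group-wise round-to-nearest quantization (illustrative only, not
the actual Keras implementation): each group of `group_size` weights gets its
own scale and zero point, and each weight is rounded to one of
`2 ** weight_bits` levels.

```python
import numpy as np


def groupwise_quantize_dequantize(row, weight_bits=4, group_size=128):
    # Asymmetric round-to-nearest per group; returns the dequantized values.
    qmax = 2**weight_bits - 1
    out = np.empty_like(row)
    for start in range(0, len(row), group_size):
        group = row[start : start + group_size]
        scale = max((group.max() - group.min()) / qmax, 1e-8)
        zero_point = np.round(-group.min() / scale)
        q = np.clip(np.round(group / scale + zero_point), 0, qmax)
        out[start : start + group_size] = (q - zero_point) * scale
    return out


weights = np.random.randn(1024).astype("float32")
error = np.abs(weights - groupwise_quantize_dequantize(weights)).mean()
print(f"Mean absolute rounding error: {error:.5f}")
```

GPTQ improves on this round-to-nearest baseline: it quantizes weights one column
at a time and uses the Hessian estimated from the calibration data to adjust the
remaining, not-yet-quantized weights so that the layer's output error stays
small.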

In this example, we first prepare a tiny calibration set, and then run GPTQ on
the model using the `.quantize(...)` API.
"""

# Calibration slice (use a larger/representative set in practice).
texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")["text"]

# Split the raw paragraphs into short sentence-like strings for calibration.
calibration_dataset = [
    s + "." for text in texts for s in map(str.strip, text.split(".")) if s
]

gptq_config = keras.quantizers.GPTQConfig(
    dataset=calibration_dataset,
    tokenizer=model.preprocessor.tokenizer,
    weight_bits=4,  # quantize weights to 4 bits
    group_size=128,  # number of weights sharing a quantization scale
    num_samples=256,  # calibration samples to use
    sequence_length=256,  # token length of each calibration sample
    hessian_damping=0.01,  # diagonal damping for numerical stability
    symmetric=False,  # asymmetric quantization
    activation_order=False,  # set True for activation-order quantization
)

model.quantize("gptq", config=gptq_config)

# Generate with the quantized model to compare against the baseline output.
outputs = model.generate(prompt, max_length=30)
print(outputs)

"""
## Model Export

The GPTQ-quantized model can be saved to a preset and reloaded elsewhere, just
like any other KerasHub model.
"""

model.save_to_preset("gemma3_gptq_w4gs128_preset")
model_from_preset = Gemma3CausalLM.from_preset("gemma3_gptq_w4gs128_preset")
output = model_from_preset.generate(prompt, max_length=30)
print(output)
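
# (Optional) Check the preset's size on disk. This sketch only uses the Python
# standard library; the exact number depends on the model and on the non-weight
# assets stored alongside the quantized weights.
from pathlib import Path

preset_bytes = sum(
    f.stat().st_size
    for f in Path("gemma3_gptq_w4gs128_preset").rglob("*")
    if f.is_file()
)
print(f"Preset size on disk: {preset_bytes / 1e6:.1f} MB")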

"""
## Performance & Benchmarking

Micro-benchmarks were collected on a single NVIDIA GeForce RTX 4070 Ti Super
(16 GB). Baselines are FP32.

Dataset: WikiText-2.

| Model (preset)                    | Perplexity Increase % (↓ better) | Disk Storage Δ % (↓ better) | VRAM Δ % (↓ better) | First-token Latency Δ % (↓ better) | Throughput Δ % (↑ better) |
| --------------------------------- | -------------------------------: | ---------------------------: | -------------------: | ----------------------------------: | -------------------------: |
| GPT2 (gpt2_base_en_cnn_dailymail) | 1.0%                              | -50.1% ↓                     | -41.1% ↓             | +0.7% ↑                              | +20.1% ↑                    |
| OPT (opt_125m_en)                 | 10.0%                             | -49.8% ↓                     | -47.0% ↓             | +6.7% ↑                              | -15.7% ↓                    |
| Bloom (bloom_1.1b_multi)          | 7.0%                              | -47.0% ↓                     | -54.0% ↓             | +1.8% ↑                              | -15.7% ↓                    |
| Gemma3 (gemma3_1b)                | 3.0%                              | -51.5% ↓                     | -51.8% ↓             | +39.5% ↑                             | +5.7% ↑                     |

Detailed benchmarking numbers and scripts are available
[here](https://github.com/keras-team/keras/pull/21641).

### Analysis

There is a notable reduction in disk space and VRAM usage across all models,
with disk space savings around 50% and VRAM savings ranging from 41% to 54%.
The reported disk savings understate the true weight compression because presets
also include non-weight assets such as tokenizer and configuration files.

Perplexity increases only marginally, indicating that model quality is largely
preserved after quantization.
"""

"""
## Practical tips

* GPTQ is a post-training technique; further training after quantization is not supported.
* Always use the model's own tokenizer for calibration.
* Use a representative calibration set; small slices are only for demos.
* Start with `weight_bits=4` and `group_size=128`, then tune per model/task (see the example below).
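
If quality drops too much at the defaults, a common next step is to use smaller
groups and activation ordering. The configuration below is an illustrative
starting point (reusing the `calibration_dataset` from above), not a
recommendation; the best values depend on the model and task.

```python
more_accurate_config = keras.quantizers.GPTQConfig(
    dataset=calibration_dataset,
    tokenizer=model.preprocessor.tokenizer,
    weight_bits=4,
    group_size=64,  # smaller groups: finer-grained scales, slower quantization
    symmetric=False,
    activation_order=True,  # quantize the most influential columns first
)
```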
"""