Description: Export Gemma models from KerasHub to Hugging Face and serve with vLLM for fast inference.
Accelerator: TPU and GPU
"""
"""
## Introduction
This guide demonstrates how to export Gemma models from KerasHub to the Hugging Face format and serve them using vLLM for efficient, high-throughput inference. We'll walk through the process step-by-step, from loading a pre-trained Gemma model in KerasHub to running inference with vLLM in a Google Colab environment.
vLLM is an optimized serving engine for large language models that leverages techniques like PagedAttention to enable continuous batching and high GPU utilization. By exporting KerasHub models to a compatible format, you can take advantage of vLLM's performance benefits while starting from the Keras ecosystem.
At present, export is supported only for Gemma 2 and its presets; coverage of more KerasHub models is planned.
**Note:** We'll perform the model export on a TPU runtime (for efficiency with larger models) and then switch to a GPU runtime for serving with vLLM, as vLLM [does not support TPU v2 on Colab](https://docs.vllm.ai/en/v0.5.5/getting_started/tpu-installation.html).
"""
"""
## Setup
First, install the required libraries. Select a TPU runtime in Colab before running these cells.
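A minimal install sketch follows; the exact package set is an assumption (at minimum, this guide needs recent `keras-hub` and `keras` releases).
"""

"""shell
pip install -q -U keras-hub keras
"""

"""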
Load a pre-trained Gemma 2 model from KerasHub using the `gemma2_instruct_2b_en` preset. This is an instruction-tuned variant suitable for conversational tasks.
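The loading step might look like the sketch below. The preset name comes from this guide; note that downloading Gemma weights requires accepting the Gemma license and authenticating with Kaggle (or Hugging Face).
"""

import keras_hub

# Load the instruction-tuned Gemma 2 2B preset from KerasHub.
gemma_lm = keras_hub.models.GemmaCausalLM.from_preset("gemma2_instruct_2b_en")
gemma_lm.summary()

"""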
**Note:** The export method needs to map the weights from Keras to safetensors, so it requires roughly double the RAM needed to load a preset. This is also why we run the export on a Colab TPU instance, which offers more memory than the GPU runtime.
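A sketch of the export step is below. The output directory and the `export_to_transformers` call are assumptions made here for illustration; consult the KerasHub documentation for the exact export API in your installed version.
"""

# Directory that will hold the Hugging Face-compatible checkpoint
# (config, tokenizer files, and model.safetensors).
SERVABLE_CKPT_DIR = "/content/gemma_exported"

# Hypothetical export call (method name assumed for illustration): writes the
# Keras weights out as safetensors plus the config/tokenizer files that vLLM
# and the Hugging Face ecosystem expect.
gemma_lm.export_to_transformers(SERVABLE_CKPT_DIR)

"""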
Save the files to Google Drive. This is needed because vLLM currently [does not support TPU v2 on Colab](https://docs.vllm.ai/en/v0.5.5/getting_started/tpu-installation.html) and cannot dynamically switch the backend to CPU. Switch to a different Colab GPU instance for serving after saving. If you are using Cloud TPU or GPU from the start, you may skip this step.
**Note:** The `model.safetensors` file is ~9.5 GB for Gemma 2 2B, so make sure you have enough free space in your Google Drive.
"""
import os
import shutil

from google.colab import drive
drive.mount("/content/drive")
drive_dir = "/content/drive/MyDrive/gemma_exported"
# Remove any existing exports with the same name
if os.path.exists(drive_dir):
    shutil.rmtree(drive_dir)
    print("✅ Existing export removed")
# Copy the exported model to Google Drive
shutil.copytree(SERVABLE_CKPT_DIR, drive_dir)
print("✅ Model copied to Google Drive")
"""
Verify the file sizes to make sure nothing was corrupted during the copy.
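A minimal sketch for listing the copied files, assuming the `drive_dir` path defined above:
"""

import os

# Print each copied file with its size so you can compare against the export.
for fname in sorted(os.listdir(drive_dir)):
    size_gb = os.path.getsize(os.path.join(drive_dir, fname)) / (1024**3)
    print(f"{fname}: {size_gb:.2f} GB")

"""
## Serving with vLLM

After switching to a Colab GPU runtime and installing vLLM (`pip install vllm`), point it at the exported checkpoint and generate text. The snippet below is a minimal sketch: the Drive path, prompt, and sampling parameters are illustrative choices rather than the only valid ones.
"""

from google.colab import drive
from vllm import LLM, SamplingParams

# Re-mount Google Drive on the new GPU runtime so the exported files are visible.
drive.mount("/content/drive")

# Load the exported Hugging Face-format checkpoint. On older GPUs (e.g. T4) you
# may need to pass dtype="half", since bfloat16 is unsupported there.
llm = LLM(model="/content/drive/MyDrive/gemma_exported")

# Example sampling configuration; tune these for your use case.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

outputs = llm.generate(["What is Keras?"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)

"""
## Conclusion
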
You've now successfully exported a KerasHub Gemma model to Hugging Face format and served it with vLLM for efficient inference. This setup enables high-throughput generation, suitable for production or batch processing.
Experiment with different prompts, sampling parameters, or larger Gemma variants (ensure sufficient GPU memory). For deployment beyond Colab, consider Docker containers or cloud instances.