Commit dc08b7b
fix: Maverick FP8 deployment - force p5.48xlarge and increase cache size
- Force p5.48xlarge instance type for Maverick FP8 model (requires ~400GB GPU memory)
- Increase cache volume size from 300Gi to 500Gi for large model weights
- Update docs: reorganize flow to Scout + Open WebUI as default, curl testing as optional
- Tested successfully on p5.48xlarge in ap-northeast-1
1 parent f95e193 commit dc08b7b

File tree

2 files changed: +52 −52 lines


blueprints/inference/llama4-vllm-gpu/llama4-vllm-deployment-70b.yml

5 additions, 5 deletions
```diff
@@ -125,12 +125,12 @@ spec:
           sizeLimit: 64Gi
       - name: cache
         emptyDir:
-          sizeLimit: 300Gi
-      # EKS Auto Mode: Force p4de/p5 instance for Maverick FP8 model
-      # Maverick FP8 requires ~400GB GPU memory, p4de.24xlarge or p5.48xlarge (640GB) is sufficient
+          sizeLimit: 500Gi
+      # EKS Auto Mode: Force p5.48xlarge for Maverick FP8 model
+      # Maverick FP8 requires ~400GB GPU memory, p5.48xlarge (8x H100 80GB = 640GB) is required
+      # Note: p4d.24xlarge (320GB) is NOT sufficient, p4de.24xlarge may work but p5 is recommended
       nodeSelector:
-        eks.amazonaws.com/compute-type: auto
-        eks.amazonaws.com/instance-category: p
+        node.kubernetes.io/instance-type: p5.48xlarge
       tolerations:
         - key: "nvidia.com/gpu"
           operator: "Exists"
```
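The sizing comments in this hunk can be sanity-checked with simple arithmetic. A minimal sketch — the per-GPU memory figures are the publicly documented specs for these instance types, not something taken from this commit:

```shell
# Sanity-check the GPU-memory sizing comments with arithmetic.
# GPU counts and per-GPU memory are the public AWS specs (assumed here).
required_gb=400          # approximate Maverick FP8 footprint, per the comment
p4d_total=$((8 * 40))    # p4d.24xlarge: 8x A100 40GB
p4de_total=$((8 * 80))   # p4de.24xlarge: 8x A100 80GB
p5_total=$((8 * 80))     # p5.48xlarge: 8x H100 80GB
for pair in "p4d.24xlarge $p4d_total" "p4de.24xlarge $p4de_total" "p5.48xlarge $p5_total"; do
  set -- $pair
  if [ "$2" -ge "$required_gb" ]; then
    echo "$1: ${2}GB total GPU memory - enough for ~${required_gb}GB of weights"
  else
    echo "$1: ${2}GB total GPU memory - NOT enough"
  fi
done
```

This reproduces the diff's reasoning: p4d.24xlarge totals 320GB and falls short, while p4de.24xlarge and p5.48xlarge both total 640GB.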

website/docs/blueprints/inference/GPUs/llama4-vllm.md

47 additions, 47 deletions
@@ -274,9 +274,54 @@ llama4-vllm-svc ClusterIP 172.20.xxx.xx <none> 8000/TCP 10m
 ```
 
 
-## Testing the Llama 4 Model
+## Deploy Open WebUI and Chat with Llama 4
 
-Now it's time to test the Llama 4 chat model.
+Now, let's deploy Open WebUI, which provides a ChatGPT-style chat interface to interact with the Llama 4 model.
+
+**Step 1:** Deploy Open WebUI
+
+```bash
+kubectl apply -f open-webui.yaml
+```
+
+**Output:**
+
+```text
+namespace/open-webui created
+deployment.apps/open-webui created
+service/open-webui created
+```
+
+**Step 2:** Verify the deployment
+
+```bash
+kubectl get pods -n open-webui
+```
+
+```text
+NAME                         READY   STATUS    RESTARTS   AGE
+open-webui-xxxxxxxxx-xxxxx   1/1     Running   0          2m
+```
+
+**Step 3:** Access the Open WebUI
+
+```bash
+kubectl -n open-webui port-forward svc/open-webui 8080:80
+```
+
+Open your browser and navigate to [http://localhost:8080](http://localhost:8080)
+
+**Step 4:** Register and start chatting
+
+1. Sign up with your name, email, and password
+2. Click "New Chat"
+3. Select the Llama 4 Scout model from the dropdown
+4. Start chatting!
+
+
+## Testing with curl (Optional)
+
+You can also test the Llama 4 model directly using curl commands.
 
 **Step 1:** Port-forward the vLLM service
 
@@ -357,51 +402,6 @@ curl -X POST http://localhost:8000/v1/chat/completions \
 }'
 ```
 
-## Deploy Open WebUI
-
-Now, let's deploy Open WebUI, which provides a ChatGPT-style chat interface to interact with the Llama 4 model.
-
-**Step 1:** Deploy Open WebUI
-
-```bash
-cd ai-on-eks/blueprints/inference/llama4-vllm-gpu/
-kubectl apply -f open-webui.yaml
-```
-
-**Output:**
-
-```text
-namespace/open-webui created
-deployment.apps/open-webui created
-service/open-webui created
-```
-
-**Step 2:** Verify the deployment
-
-```bash
-kubectl get pods -n open-webui
-```
-
-```text
-NAME                         READY   STATUS    RESTARTS   AGE
-open-webui-xxxxxxxxx-xxxxx   1/1     Running   0          2m
-```
-
-**Step 3:** Access the Open WebUI
-
-```bash
-kubectl -n open-webui port-forward svc/open-webui 8080:80
-```
-
-Open your browser and navigate to [http://localhost:8080](http://localhost:8080)
-
-**Step 4:** Register and start chatting
-
-1. Sign up with your name, email, and password
-2. Click "New Chat"
-3. Select the Llama 4 model from the dropdown
-4. Start chatting!
-
 
 ## Monitoring and Observability
 
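The optional curl test targets vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch of the request body — the `model` id below is an assumption for illustration; query `GET /v1/models` on the running service for the actual served name:

```shell
# Sketch of the JSON body the optional curl test posts to vLLM.
# The model id is an assumption - list the served models with:
#   curl -s http://localhost:8000/v1/models
BODY='{
  "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  "messages": [{"role": "user", "content": "Hello, Llama 4!"}],
  "max_tokens": 64
}'
# Validate the payload locally before sending it anywhere.
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload OK"
# With the Step 1 port-forward active, the request itself would be:
#   curl -s -X POST http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$BODY"
```

Keeping the payload in a variable makes it easy to tweak the prompt or `max_tokens` without re-typing the whole curl command.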