Commit dc08b7b
fix: Maverick FP8 deployment - force p5.48xlarge and increase cache size
- Force p5.48xlarge instance type for Maverick FP8 model (requires ~400GB GPU memory)
- Increase cache volume size from 300Gi to 500Gi for large model weights
- Update docs: reorganize flow to Scout + Open WebUI as default, curl testing as optional
- Tested successfully on p5.48xlarge in ap-northeast-1
1 parent f95e193 commit dc08b7b

File tree

2 files changed: +52 −52 lines


blueprints/inference/llama4-vllm-gpu/llama4-vllm-deployment-70b.yml

5 additions, 5 deletions
```diff
@@ -125,12 +125,12 @@ spec:
           sizeLimit: 64Gi
       - name: cache
         emptyDir:
-          sizeLimit: 300Gi
-      # EKS Auto Mode: Force p4de/p5 instance for Maverick FP8 model
-      # Maverick FP8 requires ~400GB GPU memory, p4de.24xlarge or p5.48xlarge (640GB) is sufficient
+          sizeLimit: 500Gi
+      # EKS Auto Mode: Force p5.48xlarge for Maverick FP8 model
+      # Maverick FP8 requires ~400GB GPU memory, p5.48xlarge (8x H100 80GB = 640GB) is required
+      # Note: p4d.24xlarge (320GB) is NOT sufficient, p4de.24xlarge may work but p5 is recommended
       nodeSelector:
-        eks.amazonaws.com/compute-type: auto
-        eks.amazonaws.com/instance-category: p
+        node.kubernetes.io/instance-type: p5.48xlarge
       tolerations:
         - key: "nvidia.com/gpu"
           operator: "Exists"
```
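The sizing comments in this hunk can be sanity-checked with simple arithmetic. A minimal sketch — the per-GPU memory figures are the publicly documented specs for these instance types, not something taken from this commit:

```shell
# Sanity-check the GPU-memory sizing comments with arithmetic.
# GPU counts and per-GPU memory are the public AWS specs (assumed here).
required_gb=400          # approximate Maverick FP8 footprint, per the comment
p4d_total=$((8 * 40))    # p4d.24xlarge: 8x A100 40GB
p4de_total=$((8 * 80))   # p4de.24xlarge: 8x A100 80GB
p5_total=$((8 * 80))     # p5.48xlarge: 8x H100 80GB
for pair in "p4d.24xlarge $p4d_total" "p4de.24xlarge $p4de_total" "p5.48xlarge $p5_total"; do
  set -- $pair
  if [ "$2" -ge "$required_gb" ]; then
    echo "$1: ${2}GB total GPU memory - enough for ~${required_gb}GB of weights"
  else
    echo "$1: ${2}GB total GPU memory - NOT enough"
  fi
done
```

This reproduces the diff's reasoning: p4d.24xlarge totals 320GB and falls short, while p4de.24xlarge and p5.48xlarge both total 640GB.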

website/docs/blueprints/inference/GPUs/llama4-vllm.md

47 additions, 47 deletions
@@ -274,9 +274,54 @@ llama4-vllm-svc ClusterIP 172.20.xxx.xx <none> 8000/TCP 10m
 ```
 
 
-## Testing the Llama 4 Model
+## Deploy Open WebUI and Chat with Llama 4
 
-Now it's time to test the Llama 4 chat model.
+Now, let's deploy Open WebUI, which provides a ChatGPT-style chat interface to interact with the Llama 4 model.
+
+**Step 1:** Deploy Open WebUI
+
+```bash
+kubectl apply -f open-webui.yaml
+```
+
+**Output:**
+
+```text
+namespace/open-webui created
+deployment.apps/open-webui created
+service/open-webui created
+```
+
+**Step 2:** Verify the deployment
+
+```bash
+kubectl get pods -n open-webui
+```
+
+```text
+NAME                         READY   STATUS    RESTARTS   AGE
+open-webui-xxxxxxxxx-xxxxx   1/1     Running   0          2m
+```
+
+**Step 3:** Access the Open WebUI
+
+```bash
+kubectl -n open-webui port-forward svc/open-webui 8080:80
+```
+
+Open your browser and navigate to [http://localhost:8080](http://localhost:8080)
+
+**Step 4:** Register and start chatting
+
+1. Sign up with your name, email, and password
+2. Click "New Chat"
+3. Select the Llama 4 Scout model from the dropdown
+4. Start chatting!
+
+
+## Testing with curl (Optional)
+
+You can also test the Llama 4 model directly using curl commands.
 
 **Step 1:** Port-forward the vLLM service
 
@@ -357,51 +402,6 @@ curl -X POST http://localhost:8000/v1/chat/completions \
 }'
 ```
 
-## Deploy Open WebUI
-
-Now, let's deploy Open WebUI, which provides a ChatGPT-style chat interface to interact with the Llama 4 model.
-
-**Step 1:** Deploy Open WebUI
-
-```bash
-cd ai-on-eks/blueprints/inference/llama4-vllm-gpu/
-kubectl apply -f open-webui.yaml
-```
-
-**Output:**
-
-```text
-namespace/open-webui created
-deployment.apps/open-webui created
-service/open-webui created
-```
-
-**Step 2:** Verify the deployment
-
-```bash
-kubectl get pods -n open-webui
-```
-
-```text
-NAME                         READY   STATUS    RESTARTS   AGE
-open-webui-xxxxxxxxx-xxxxx   1/1     Running   0          2m
-```
-
-**Step 3:** Access the Open WebUI
-
-```bash
-kubectl -n open-webui port-forward svc/open-webui 8080:80
-```
-
-Open your browser and navigate to [http://localhost:8080](http://localhost:8080)
-
-**Step 4:** Register and start chatting
-
-1. Sign up with your name, email, and password
-2. Click "New Chat"
-3. Select the Llama 4 model from the dropdown
-4. Start chatting!
-
 
 ## Monitoring and Observability
 
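The optional curl test targets vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. A minimal sketch of the request body — the `model` id below is an assumption for illustration; query `GET /v1/models` on the running service for the actual served name:

```shell
# Sketch of the JSON body the optional curl test posts to vLLM.
# The model id is an assumption - list the served models with:
#   curl -s http://localhost:8000/v1/models
BODY='{
  "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  "messages": [{"role": "user", "content": "Hello, Llama 4!"}],
  "max_tokens": 64
}'
# Validate the payload locally before sending it anywhere.
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload OK"
# With the Step 1 port-forward active, the request itself would be:
#   curl -s -X POST http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$BODY"
```

Keeping the payload in a variable makes it easy to tweak the prompt or `max_tokens` without re-typing the whole curl command.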