Commit 0e5c50f

refactor: Move Llama 4 Trainium2 deployment to Helm charts

- Remove YAML deployment files from main repo
- Update documentation to use Helm-based deployment
- Add model compilation (tracing) requirement documentation
- Reference ai-on-eks-charts repo for deployment configs

1 parent 1cf3115

File tree

5 files changed: +171 lines, -737 lines
Lines changed: 20 additions & 114 deletions
@@ -1,125 +1,31 @@
-# Llama 4 with vLLM on Amazon EKS using Trainium2
+# Llama 4 with vLLM on Trainium2
 
-This blueprint deploys Llama 4 models (Scout and Maverick) using vLLM with NxD Inference on AWS Trainium2 instances.
-
-## Overview
-
-Llama 4 models use a Mixture of Experts (MoE) architecture that requires significant compute resources. This blueprint leverages AWS Trainium2 (trn2) instances with the Neuron SDK for cost-effective inference.
-
-## Model Support
-
-| Model | Parameters | Experts | Instance Required | tensor_parallel_size |
-|-------|------------|---------|-------------------|---------------------|
-| Llama 4 Scout | 17B active / ~109B total | 16 | trn2.48xlarge | 64 |
-| Llama 4 Maverick | 17B active / ~400B total | 128 | trn2.48xlarge | 64 |
-
-## Prerequisites
-
-1. EKS cluster with Trainium2 node support
-2. Neuron device plugin installed
-3. Hugging Face account with Llama 4 model access
-4. Neuron SDK 2.21+ with vLLM-Neuron plugin
+This blueprint has been moved to the [AI on EKS Inference Charts](https://github.com/awslabs/ai-on-eks-charts) repository.
 
 ## Deployment
 
-### Step 1: Export Hugging Face Token
-
-```bash
-export HUGGING_FACE_HUB_TOKEN=$(echo -n "your-hf-token" | base64)
-```
-
-### Step 2: Deploy Llama 4 Scout
-
-```bash
-envsubst < llama4-vllm-trn2-deployment.yaml | kubectl apply -f -
-```
-
-### Step 3: Deploy Open WebUI (Optional)
-
-```bash
-kubectl apply -f open-webui.yaml
-kubectl -n open-webui port-forward svc/open-webui 8080:80
-```
-
-### Step 4: Test with curl
-
-```bash
-kubectl -n llama4-vllm port-forward svc/llama4-vllm-trn2-svc 8000:8000
-
-# Text completion
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
-    "messages": [{"role": "user", "content": "Hello!"}]
-  }'
-
-# Multimodal (image + text)
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
-    "messages": [{
-      "role": "user",
-      "content": [
-        {"type": "image_url", "image_url": {"url": "https://httpbin.org/image/png"}},
-        {"type": "text", "text": "Describe this image"}
-      ]
-    }]
-  }'
-```
-
-## Deploying Maverick Model
-
-For the larger Maverick model with 128 experts:
+Deploy using Helm:
 
 ```bash
-envsubst < llama4-vllm-trn2-maverick.yaml | kubectl apply -f -
+# Add the Helm repository
+helm repo add ai-on-eks https://awslabs.github.io/ai-on-eks-charts
+helm repo update
+
+# Deploy Llama 4 Scout on Trainium2
+helm install llama4-scout ai-on-eks/inference-charts \
+  -f https://raw.githubusercontent.com/awslabs/ai-on-eks-charts/main/charts/inference-charts/values-llama-4-scout-17b-vllm-neuron.yaml \
+  --set inference.modelServer.env.NEURON_COMPILED_ARTIFACTS="s3://your-bucket/llama4-neuron-artifacts/"
+
+# Deploy Llama 4 Maverick on Trainium2
+helm install llama4-maverick ai-on-eks/inference-charts \
+  -f https://raw.githubusercontent.com/awslabs/ai-on-eks-charts/main/charts/inference-charts/values-llama-4-maverick-17b-vllm-neuron.yaml \
+  --set inference.modelServer.env.NEURON_COMPILED_ARTIFACTS="s3://your-bucket/llama4-neuron-artifacts/"
 ```
 
-## Key Configuration
-
-### Neuron Configuration
-
-The deployment includes optimized Neuron configuration for Llama 4:
-
-- `tensor_parallel_size=64`: Distributes the model across all 64 Neuron cores
-- `context_encoding_buckets`: Optimized bucket sizes for variable input lengths
-- `async_mode=true`: Enables asynchronous execution for better throughput
-- `cp_degree=16`: Context parallelism for efficient attention computation
-
-### Resource Requirements
+## Documentation
 
-| Resource | Scout | Maverick |
-|----------|-------|----------|
-| Neuron Devices | 32 | 32 |
-| CPU | 128-192 | 128-192 |
-| Memory | 512-768 Gi | 512-768 Gi |
-| Model Storage | 1 Ti | 2 Ti |
-
-## Model Compilation (Tracing)
-
-Before deployment, models must be compiled (traced) for Neuron. This is typically done once and the artifacts are stored:
-
-```python
-# Example tracing script (run on a trn2 instance)
-import os
-os.environ['NEURON_COMPILED_ARTIFACTS'] = "/models/traced_models/Llama-4-Scout-17B-16E-Instruct"
-
-# Tracing happens automatically on first run with vLLM-Neuron
-```
-
-## Cleanup
-
-```bash
-kubectl delete -f open-webui.yaml
-kubectl delete -f llama4-vllm-trn2-deployment.yaml
-# Or for Maverick:
-kubectl delete -f llama4-vllm-trn2-maverick.yaml
-```
+See the full documentation at: https://awslabs.github.io/ai-on-eks/docs/blueprints/inference/Neuron/llama4-trn2
 
-## References
+## Important Note
 
-- [NxD Inference Llama 4 Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/llama4-tutorial.html)
-- [vLLM-Neuron Plugin](https://github.com/aws-neuron/vllm-neuron)
-- [AWS Neuron SDK](https://awsdocs-neuron.readthedocs-hosted.com/)
+Llama 4 models on Trainium2 require **pre-compiled (traced) model artifacts**. See the [NxD Inference Llama 4 Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/llama4-tutorial.html) for compilation instructions.
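The chat-completions requests removed from the README above follow the OpenAI-compatible API that vLLM serves, so they can also be built programmatically instead of with curl. A minimal sketch of constructing those request bodies (the model name and endpoint are taken from the diff; the service must still be port-forwarded before anything can be POSTed, and the URL/names are otherwise assumptions):

```python
import json

# Endpoint and model name as used in the removed README examples (assumed:
# reachable only after `kubectl -n llama4-vllm port-forward ... 8000:8000`).
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

def text_request(prompt: str) -> str:
    """Build the JSON body for a plain text chat completion."""
    return json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    })

def multimodal_request(image_url: str, prompt: str) -> str:
    """Build the JSON body for an image + text chat completion."""
    return json.dumps({
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    })

# These bodies mirror the curl payloads byte-for-byte in structure; POST them
# to BASE_URL with any HTTP client once the port-forward is in place.
print(text_request("Hello!"))
```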

blueprints/inference/llama4-vllm-trn2/llama4-vllm-trn2-deployment.yaml

Lines changed: 0 additions & 202 deletions
This file was deleted.
