# Llama 4 with vLLM on Trainium2

This blueprint has been moved to the [AI on EKS Inference Charts](https://github.com/awslabs/ai-on-eks-charts) repository.

## Deployment

Deploy using Helm:

```bash
# Add the Helm repository
helm repo add ai-on-eks https://awslabs.github.io/ai-on-eks-charts
helm repo update

# Deploy Llama 4 Scout on Trainium2
helm install llama4-scout ai-on-eks/inference-charts \
  -f https://raw.githubusercontent.com/awslabs/ai-on-eks-charts/main/charts/inference-charts/values-llama-4-scout-17b-vllm-neuron.yaml \
  --set inference.modelServer.env.NEURON_COMPILED_ARTIFACTS="s3://your-bucket/llama4-neuron-artifacts/"

# Deploy Llama 4 Maverick on Trainium2
helm install llama4-maverick ai-on-eks/inference-charts \
  -f https://raw.githubusercontent.com/awslabs/ai-on-eks-charts/main/charts/inference-charts/values-llama-4-maverick-17b-vllm-neuron.yaml \
  --set inference.modelServer.env.NEURON_COMPILED_ARTIFACTS="s3://your-bucket/llama4-neuron-artifacts/"
```

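After installing a release, a quick smoke test can confirm the model server is up. The pod label, service name, and port below are assumptions based on the release name used above and vLLM's default OpenAI-compatible server; verify the actual names with `kubectl get svc` for your release.

```shell
# List pods created by the Helm release (label selector is an assumption; adjust to your chart)
kubectl get pods -l app.kubernetes.io/instance=llama4-scout

# Forward the vLLM service port locally (service name is an assumption; verify with `kubectl get svc`)
kubectl port-forward svc/llama4-scout 8000:8000 &

# Query the OpenAI-compatible chat completions endpoint exposed by vLLM
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

A successful response returns a JSON chat completion object, confirming the traced model loaded on the Neuron cores.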
## Documentation

See the full documentation at: https://awslabs.github.io/ai-on-eks/docs/blueprints/inference/Neuron/llama4-trn2

## Important Note

Llama 4 models on Trainium2 require **pre-compiled (traced) model artifacts**. See the [NxD Inference Llama 4 Tutorial](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/tutorials/llama4-tutorial.html) for compilation instructions.
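As a minimal sketch of that workflow: pointing `NEURON_COMPILED_ARTIFACTS` at a persistent path lets vLLM-Neuron trace the model on its first run (typically done once on a trn2 instance) and reuse the cached artifacts afterwards. The path and model directory below are placeholders.

```python
# Sketch: tracing happens automatically on the first vLLM-Neuron run when
# NEURON_COMPILED_ARTIFACTS points at a writable path; later runs reuse it.
# The path is a placeholder -- use your own artifact location (e.g. an S3-backed volume).
import os

os.environ["NEURON_COMPILED_ARTIFACTS"] = "/models/traced_models/Llama-4-Scout-17B-16E-Instruct"
```

The same artifact path is what the Helm values above expect in the `NEURON_COMPILED_ARTIFACTS` setting.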