Commit 178bf01

Add stage 1 playbook: PVC model + LLMInferenceService on xKS

End-to-end walkthrough: create PVC, download model, deploy LLMInferenceService, send inference request, and trace the request through the networking stack (Gateway -> Router -> Scheduler -> vLLM).

Signed-off-by: Aneesh Puttur <aneeshputtur@gmail.com>
1 parent 01a96ca

1 file changed (+212, -0)
docs/playbooks/stage1.md

# Stage 1: End-to-end PVC model + LLMInferenceService on xKS

Deploy a model from PVC-backed storage, expose it via a KServe LLMInferenceService, send an inference request, and trace the request through the networking stack.

**Prerequisites:**

- rhaii-on-xks deployed (`make deploy-all`)
- Inference gateway configured (`./scripts/setup-gateway.sh`)

---
## Step 1: Create namespace

```bash
export NAMESPACE=llm-inference
kubectl create namespace $NAMESPACE
```
## Step 2: Create PVC

```bash
kubectl apply -n $NAMESPACE -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen2-7b-model
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF
```
## Step 3: Download model into PVC

```bash
kubectl apply -n $NAMESPACE -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: qwen2-7b-downloader
spec:
  backoffLimit: 3
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: downloader
          image: python:3.11-slim
          command:
            - bash
            - -c
            - |
              pip install huggingface_hub==0.30.2
              python3 -c "
              from huggingface_hub import snapshot_download
              snapshot_download('Qwen/Qwen2.5-7B-Instruct', local_dir='/models/Qwen2.5-7B-Instruct')
              print('Download complete')
              "
          volumeMounts:
            - name: model-storage
              mountPath: /models
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: qwen2-7b-model
EOF

# Watch progress
kubectl logs -n $NAMESPACE job/qwen2-7b-downloader -f

# Verify completion
kubectl get job qwen2-7b-downloader -n $NAMESPACE
```
## Step 4: Configure pull secret

```bash
kubectl get secret redhat-pull-secret -n istio-system -o json | \
  jq 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp,
          .metadata.annotations, .metadata.labels, .metadata.ownerReferences) |
      .metadata.namespace = "'$NAMESPACE'"' | \
  kubectl apply -f -

kubectl patch serviceaccount default -n $NAMESPACE \
  -p '{"imagePullSecrets": [{"name": "redhat-pull-secret"}]}'
```
## Step 5: Deploy LLMInferenceService

```bash
kubectl apply -n $NAMESPACE -f - <<'EOF'
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: qwen2-7b
spec:
  model:
    name: Qwen/Qwen2.5-7B-Instruct
    uri: pvc://qwen2-7b-model/Qwen2.5-7B-Instruct
  replicas: 1
  router:
    gateway: {}
    route: {}
    scheduler: {}
  template:
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Equal"
        value: "present"
        effect: "NoSchedule"
    containers:
      - name: main
        resources:
          limits:
            cpu: "4"
            memory: 64Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "2"
            memory: 32Gi
            nvidia.com/gpu: "1"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
            scheme: HTTPS
          initialDelaySeconds: 120
          periodSeconds: 30
          timeoutSeconds: 30
          failureThreshold: 5
EOF

# Watch until READY=True
kubectl get llmisvc -n $NAMESPACE -w
```
## Step 6: Send inference request

```bash
SERVICE_URL=$(kubectl get llmisvc qwen2-7b -n $NAMESPACE -o jsonpath='{.status.url}')
echo "Service URL: $SERVICE_URL"

curl -s -k -X POST "${SERVICE_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 20
  }' | python3 -m json.tool
```
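The same request can be sent from Python instead of curl. A minimal sketch using only the standard library; it assumes `SERVICE_URL` has been exported as in Step 6, and it skips TLS verification to mirror `curl -k` (the gateway in this setup is assumed to serve a self-signed certificate):

```python
import json
import os
import ssl
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 20) -> dict:
    """Build an OpenAI-compatible chat-completions payload (same shape as the curl body)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send_chat_request(service_url: str, payload: dict) -> dict:
    """POST the payload to /v1/chat/completions, skipping TLS verification like curl -k."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = urllib.request.Request(
        f"{service_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)


if __name__ == "__main__":
    payload = build_chat_request("Qwen/Qwen2.5-7B-Instruct", "Hello")
    # Reuses SERVICE_URL from Step 6 if it is exported in the environment
    url = os.environ.get("SERVICE_URL")
    if url:
        print(json.dumps(send_chat_request(url, payload), indent=2))
```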
## Step 7: Trace request through the networking stack

```bash
# Gateway (ingress)
echo "=== Gateway ==="
kubectl get gateway -n opendatahub inference-gateway
kubectl logs -n opendatahub \
  -l gateway.networking.k8s.io/gateway-name=inference-gateway \
  --tail=10

# HTTPRoute (created automatically by KServe)
echo ""
echo "=== HTTPRoute ==="
kubectl get httproute -n $NAMESPACE

# Router (routes to scheduler)
echo ""
echo "=== Router Logs ==="
kubectl logs -n $NAMESPACE \
  -l serving.kserve.io/llminferenceservice=qwen2-7b \
  -c router --tail=10

# Scheduler (picks best replica)
echo ""
echo "=== Scheduler Logs ==="
kubectl logs -n $NAMESPACE \
  -l serving.kserve.io/llminferenceservice=qwen2-7b \
  -c scheduler --tail=10

# vLLM (inference)
echo ""
echo "=== vLLM Logs ==="
kubectl logs -n $NAMESPACE \
  -l serving.kserve.io/llminferenceservice=qwen2-7b,serving.kserve.io/component=model \
  --tail=10
```
### Request flow

```
Client
  |
  v
Gateway (Istio, opendatahub/inference-gateway, port 80)
  |
  v
Router -> Scheduler (router-scheduler pod, mTLS)
  |
  v
vLLM Pod (GPU, serves model from PVC)
  |
  v
Response back: vLLM -> Scheduler -> Router -> Gateway -> Client
```
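The response that travels back through this stack follows the OpenAI-compatible chat-completions shape that vLLM serves. A small helper, sketched here for convenience (`extract_reply` is not part of the playbook's tooling), to pull the generated text out of the JSON returned in Step 6:

```python
def extract_reply(response: dict) -> str:
    """Return the assistant message text from an OpenAI-compatible
    chat-completions response (the shape vLLM returns in Step 6)."""
    return response["choices"][0]["message"]["content"]


# Example response, abbreviated to the fields used above:
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Hello! How can I help?"}}
    ]
}
print(extract_reply(sample))  # prints: Hello! How can I help?
```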
