| 1 | +# Stage 1: End-to-end PVC model + LLMInferenceService on xKS |
| 2 | + |
| 3 | +Deploy a model from PVC-backed storage, expose it through a KServe LLMInferenceService, send an inference request, and trace that request through the networking stack. |
| 4 | + |
| 5 | +**Prerequisites:** |
| 6 | +- rhaii-on-xks deployed (`make deploy-all`) |
| 7 | +- Inference gateway configured (`./scripts/setup-gateway.sh`) |
| 8 | + |
| 9 | +--- |
| 10 | + |
| 11 | +## Step 1: Create namespace |
| 12 | + |
| 13 | +```bash |
| 14 | +export NAMESPACE=llm-inference |
| 15 | +kubectl create namespace $NAMESPACE |
| 16 | +``` |
| 17 | + |
| 18 | +## Step 2: Create PVC |
| 19 | + |
| 20 | +```bash |
| 21 | +kubectl apply -n $NAMESPACE -f - <<'EOF' |
| 22 | +apiVersion: v1 |
| 23 | +kind: PersistentVolumeClaim |
| 24 | +metadata: |
| 25 | +  name: qwen2-7b-model |
| 26 | +spec: |
| 27 | +  accessModes: |
| 28 | +    - ReadWriteOnce |
| 29 | +  resources: |
| 30 | +    requests: |
| 31 | +      storage: 20Gi |
| 32 | +EOF |
| 33 | +``` |
| 34 | + |
| 35 | +## Step 3: Download model into PVC |
| 36 | + |
| 37 | +```bash |
| 38 | +kubectl apply -n $NAMESPACE -f - <<'EOF' |
| 39 | +apiVersion: batch/v1 |
| 40 | +kind: Job |
| 41 | +metadata: |
| 42 | +  name: qwen2-7b-downloader |
| 43 | +spec: |
| 44 | +  backoffLimit: 3 |
| 45 | +  ttlSecondsAfterFinished: 300 |
| 46 | +  template: |
| 47 | +    spec: |
| 48 | +      restartPolicy: OnFailure |
| 49 | +      containers: |
| 50 | +        - name: downloader |
| 51 | +          image: python:3.11-slim |
| 52 | +          command: |
| 53 | +            - bash |
| 54 | +            - -c |
| 55 | +            - | |
| 56 | +              pip install huggingface_hub==0.30.2 |
| 57 | +              python3 -c " |
| 58 | +              from huggingface_hub import snapshot_download |
| 59 | +              snapshot_download('Qwen/Qwen2.5-7B-Instruct', local_dir='/models/Qwen2.5-7B-Instruct') |
| 60 | +              print('Download complete') |
| 61 | +              " |
| 62 | +          volumeMounts: |
| 63 | +            - name: model-storage |
| 64 | +              mountPath: /models |
| 65 | +          resources: |
| 66 | +            requests: |
| 67 | +              cpu: "500m" |
| 68 | +              memory: 512Mi |
| 69 | +      volumes: |
| 70 | +        - name: model-storage |
| 71 | +          persistentVolumeClaim: |
| 72 | +            claimName: qwen2-7b-model |
| 73 | +EOF |
| 74 | + |
| 75 | +# Watch progress |
| 76 | +kubectl logs -n $NAMESPACE job/qwen2-7b-downloader -f |
| 77 | + |
| 78 | +# Verify completion |
| 79 | +kubectl get job qwen2-7b-downloader -n $NAMESPACE |
| 80 | +``` |
| 81 | + |
| 82 | +## Step 4: Configure pull secret |
| 83 | + |
| 84 | +```bash |
| 85 | +kubectl get secret redhat-pull-secret -n istio-system -o json | \ |
| 86 | +  jq --arg ns "$NAMESPACE" 'del(.metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, |
| 87 | +      .metadata.annotations, .metadata.labels, .metadata.ownerReferences) | |
| 88 | +      .metadata.namespace = $ns' | \ |
| 89 | +  kubectl apply -f - |
| 90 | + |
| 91 | +kubectl patch serviceaccount default -n $NAMESPACE \ |
| 92 | +  -p '{"imagePullSecrets": [{"name": "redhat-pull-secret"}]}' |
| 93 | +``` |
| 94 | + |
| 95 | +## Step 5: Deploy LLMInferenceService |
| 96 | + |
| 97 | +```bash |
| 98 | +kubectl apply -n $NAMESPACE -f - <<'EOF' |
| 99 | +apiVersion: serving.kserve.io/v1alpha1 |
| 100 | +kind: LLMInferenceService |
| 101 | +metadata: |
| 102 | +  name: qwen2-7b |
| 103 | +spec: |
| 104 | +  model: |
| 105 | +    name: Qwen/Qwen2.5-7B-Instruct |
| 106 | +    uri: pvc://qwen2-7b-model/Qwen2.5-7B-Instruct |
| 107 | +  replicas: 1 |
| 108 | +  router: |
| 109 | +    gateway: {} |
| 110 | +    route: {} |
| 111 | +    scheduler: {} |
| 112 | +  template: |
| 113 | +    tolerations: |
| 114 | +      - key: "nvidia.com/gpu" |
| 115 | +        operator: "Equal" |
| 116 | +        value: "present" |
| 117 | +        effect: "NoSchedule" |
| 118 | +    containers: |
| 119 | +      - name: main |
| 120 | +        resources: |
| 121 | +          limits: |
| 122 | +            cpu: "4" |
| 123 | +            memory: 64Gi |
| 124 | +            nvidia.com/gpu: "1" |
| 125 | +          requests: |
| 126 | +            cpu: "2" |
| 127 | +            memory: 32Gi |
| 128 | +            nvidia.com/gpu: "1" |
| 129 | +        livenessProbe: |
| 130 | +          httpGet: |
| 131 | +            path: /health |
| 132 | +            port: 8000 |
| 133 | +            scheme: HTTPS |
| 134 | +          initialDelaySeconds: 120 |
| 135 | +          periodSeconds: 30 |
| 136 | +          timeoutSeconds: 30 |
| 137 | +          failureThreshold: 5 |
| 138 | +EOF |
| 139 | + |
| 140 | +# Watch until READY=True |
| 141 | +kubectl get llmisvc -n $NAMESPACE -w |
| 142 | +``` |
| 143 | + |
| 144 | +## Step 6: Send inference request |
| 145 | + |
| 146 | +```bash |
| 147 | +SERVICE_URL=$(kubectl get llmisvc qwen2-7b -n $NAMESPACE -o jsonpath='{.status.url}') |
| 148 | +echo "Service URL: $SERVICE_URL" |
| 149 | + |
| 150 | +curl -s -k -X POST "${SERVICE_URL}/v1/chat/completions" \ |
| 151 | +  -H "Content-Type: application/json" \ |
| 152 | +  -d '{ |
| 153 | +    "model": "Qwen/Qwen2.5-7B-Instruct", |
| 154 | +    "messages": [{"role": "user", "content": "Hello"}], |
| 155 | +    "max_tokens": 20 |
| 156 | +  }' | python3 -m json.tool |
| 157 | +``` |
| 158 | + |
| 159 | +## Step 7: Trace request through the networking stack |
| 160 | + |
| 161 | +```bash |
| 162 | +# Gateway (ingress) |
| 163 | +echo "=== Gateway ===" |
| 164 | +kubectl get gateway -n opendatahub inference-gateway |
| 165 | +kubectl logs -n opendatahub \ |
| 166 | +  -l gateway.networking.k8s.io/gateway-name=inference-gateway \ |
| 167 | +  --tail=10 |
| 168 | + |
| 169 | +# HTTPRoute (created automatically by KServe) |
| 170 | +echo "" |
| 171 | +echo "=== HTTPRoute ===" |
| 172 | +kubectl get httproute -n $NAMESPACE |
| 173 | + |
| 174 | +# Router (routes to scheduler) |
| 175 | +echo "" |
| 176 | +echo "=== Router Logs ===" |
| 177 | +kubectl logs -n $NAMESPACE \ |
| 178 | +  -l serving.kserve.io/llminferenceservice=qwen2-7b \ |
| 179 | +  -c router --tail=10 |
| 180 | + |
| 181 | +# Scheduler (picks best replica) |
| 182 | +echo "" |
| 183 | +echo "=== Scheduler Logs ===" |
| 184 | +kubectl logs -n $NAMESPACE \ |
| 185 | +  -l serving.kserve.io/llminferenceservice=qwen2-7b \ |
| 186 | +  -c scheduler --tail=10 |
| 187 | + |
| 188 | +# vLLM (inference) |
| 189 | +echo "" |
| 190 | +echo "=== vLLM Logs ===" |
| 191 | +kubectl logs -n $NAMESPACE \ |
| 192 | +  -l serving.kserve.io/llminferenceservice=qwen2-7b,serving.kserve.io/component=model \ |
| 193 | +  --tail=10 |
| 194 | +``` |
| 195 | + |
| 196 | +### Request flow |
| 197 | + |
| 198 | +``` |
| 199 | +Client |
| 200 | +  | |
| 201 | +  v |
| 202 | +Gateway (Istio, opendatahub/inference-gateway, port 80) |
| 203 | +  | |
| 204 | +  v |
| 205 | +Router -> Scheduler (router-scheduler pod, mTLS) |
| 206 | +  | |
| 207 | +  v |
| 208 | +vLLM Pod (GPU, serves model from PVC) |
| 209 | +  | |
| 210 | +  v |
| 211 | +Response back: vLLM -> Scheduler -> Router -> Gateway -> Client |
| 212 | +``` |
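
As a follow-up to Step 6, here is a minimal sketch of a helper that prints only the assistant's reply instead of the full JSON document. It assumes the service returns the OpenAI-style chat completion shape (`choices[0].message.content`) and that `python3` is available on the client, as the `python3 -m json.tool` call in Step 6 already implies. The name `extract_reply` is hypothetical and not part of rhaii-on-xks or KServe.

```shell
#!/usr/bin/env bash
# extract_reply (hypothetical helper): read a chat completion JSON document
# on stdin and print the content of the first choice's message.
# Exits non-zero if stdin is not JSON or lacks the expected fields.
extract_reply() {
  python3 -c '
import json, sys

try:
    body = json.load(sys.stdin)
    print(body["choices"][0]["message"]["content"])
except (ValueError, KeyError, IndexError, TypeError) as err:
    # ValueError covers malformed JSON; the others cover a missing field.
    sys.stderr.write("unexpected response: %r\n" % (err,))
    sys.exit(1)
'
}

# Usage sketch, assuming SERVICE_URL was set as in Step 6:
#   curl -s -k -X POST "${SERVICE_URL}/v1/chat/completions" \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Qwen/Qwen2.5-7B-Instruct",
#          "messages": [{"role": "user", "content": "Hello"}],
#          "max_tokens": 20}' | extract_reply
```

Because the helper exits non-zero on an error payload, it can also serve as a simple smoke test after the service reports ready.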