| theme | default |
|---|---|
| background | #f7f7f7 |
| title | LLMOps Platform Workshop |
| info | Governing AI in Difficult Times |
| class | text-center |
| highlighter | shiki |
| lineNumbers | false |
| fonts | |
| drawings | |
| transition | slide-left |
| mdc | true |
| css | unocss |
| favicon | /favicon.ico |
| defaults | |
CEO & Cofounder @ Resizes
CNCF Kubestronaut
GitHub: github.com/guillermotti
LinkedIn: linkedin.com/in/guillermotti
CTO & Cofounder @ Resizes
CNCF Golden Kubestronaut
GitHub: github.com/kaskol10
LinkedIn: linkedin.com/in/ramiroalvfer
- Most teams rely on third-party LLM providers.
- Costs, policy changes, and availability can shift overnight.
- We need a path to technological sovereignty.
- Part 1: Deploy and run self-hosted inference locally on Kubernetes using kind+vLLM.
- Part 2: Create a simple LLMOps Platform on top of EKS.
CLIs:
Machine: 12 GB RAM minimum (16 GB recommended), 20 GB disk free.
Accounts: Hugging Face token (read) + license accepted for meta-llama/Llama-3.2-1B-Instruct.
CLIs:
Machine: 128 MB RAM minimum, 20 MB disk free.
Accounts: workshop IAM user keys for aws / eksctl, provided by Resizes!
All manifests and docs live in the repo:
git clone https://github.com/resizes/asc-workshop.git
cd asc-workshop
Repository: github.com/resizes/asc-workshop
flowchart TB
subgraph clients["Clients"]
C["curl · Python · OpenAI client"]
end
subgraph cluster["kind cluster"]
L["vLLM — Llama 3.2 1B (svc/vllm-server)"]
Q["vLLM — Qwen Coder optional (svc/vllm-coder)"]
PVC["PVC — model weights cache"]
TOK["Secret hf-token-secret\nLlama pull only"]
end
HF["Hugging Face"]
C -->|"kubectl port-forward"| L
C -.->|"optional"| Q
TOK --> L
L --> PVC
Q --> PVC
L --> HF
Q --> HF
Running Kubernetes locally is easy:
kind create cluster --config kind/kind-config.yaml
Confirm the cluster is up and running:
kubectl get nodes
Expected: 1 control-plane + 2 workers.
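The expected layout can also be checked from a script. The sketch below is an illustration rather than part of the workshop repo: it counts nodes by role from the output of `kubectl get nodes -o json`, assuming the control plane carries the well-known `node-role.kubernetes.io/control-plane` label. The helper name `count_node_roles` and the sample data are hypothetical.

```python
import json


def count_node_roles(nodes_json: str) -> dict:
    """Count control-plane and worker nodes in `kubectl get nodes -o json` output."""
    counts = {"control-plane": 0, "worker": 0}
    for node in json.loads(nodes_json)["items"]:
        labels = node["metadata"].get("labels", {})
        # Assumption: control-plane nodes carry this well-known label.
        if "node-role.kubernetes.io/control-plane" in labels:
            counts["control-plane"] += 1
        else:
            counts["worker"] += 1
    return counts


# Hypothetical sample shaped like the kubectl output for this lab.
sample_nodes = json.dumps({
    "items": [
        {"metadata": {"name": "llmops-control-plane",
                      "labels": {"node-role.kubernetes.io/control-plane": ""}}},
        {"metadata": {"name": "llmops-worker", "labels": {}}},
        {"metadata": {"name": "llmops-worker2", "labels": {}}},
    ]
})
print(count_node_roles(sample_nodes))
```

With real data, the JSON string would come from running `kubectl get nodes -o json` and reading its output.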
Account: If you do not have one yet, sign up at huggingface.co/join and complete verification if Hugging Face asks for it.
Model license: The vLLM deployment pulls Meta Llama 3.2 1B Instruct from Hugging Face. Open the model card and accept the license / request access if prompted:
meta-llama/Llama-3.2-1B-Instruct
Until access is granted, pulls will fail even with a valid token.
- Open Settings → Access Tokens (avatar → Settings → Access Tokens).
- Click Create new token.
- Name: any label (e.g. asc-workshop).
- Type: choose Classic (simplest for this lab) with permission Read, which is enough to download models your account can access. If you use Fine-grained instead, grant Read on repository meta-llama/Llama-3.2-1B-Instruct (and any other gated repos you pull).
- Click Create, then copy the token; it is shown only once. Keep it until you paste it into the command on the next slide (do not commit it to git).
Resources go in the default namespace. Replace YOUR_TOKEN_HERE with the value you copied (no quotes in the literal):
kubectl create secret generic hf-token-secret \
--from-literal=token=YOUR_TOKEN_HERE
Check that the secret was created:
kubectl get secret hf-token-secret
Models need persistence:
kubectl apply -f kind/pvc.yaml
kubectl get pvc
Purpose: persist downloaded model files between pod restarts.
Serving models in Kubernetes is as easy as running one command:
kubectl apply -f kind/deployment.yaml
Notes:
- first run downloads model (~2.5 GB)
- can take several minutes
- probes may need larger initialDelaySeconds on slower machines
Check pod status:
kubectl get pod
The pod should be accessible:
kubectl apply -f kind/service.yaml
If the pod is not running and ready, the party doesn't start:
kubectl get pods -w
kubectl logs -l app.kubernetes.io/name=vllm -f
Ready signal:
- READY 1/1
- server listening on :8000
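Instead of watching the logs, readiness can be polled from a script. The following is a minimal sketch, not part of the workshop repo: it assumes a port-forward to localhost:8000 is already active and treats an HTTP 200 from `/v1/models` as "ready". The helper names `models_endpoint_ready` and `wait_until_ready`, and the default timeouts, are hypothetical.

```python
import time
import urllib.error
import urllib.request


def models_endpoint_ready(url: str = "http://localhost:8000/v1/models") -> bool:
    """Return True once the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_until_ready(probe, timeout_s: float = 600.0, interval_s: float = 5.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() every interval_s seconds until it returns True or time runs out."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call such as `wait_until_ready(models_endpoint_ready)` would then block until the server answers or ten minutes pass. The `probe`, `sleep`, and `clock` parameters are injectable so the loop can be exercised without a running cluster.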
Let's create a tunnel between Kubernetes network and your host:
kubectl port-forward svc/vllm-server 8000:8000
Use another terminal for the API calls:
curl http://localhost:8000/v1/models | jq .
vLLM is 3 years old, but there is already a legacy API:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"prompt": "Kubernetes is a container orchestration platform that",
"max_tokens": 50,
"temperature": 0.7
}' | jq .
The new API schema is more agent-ready:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [
{"role": "system", "content": "You are an expert DevOps assistant."},
{"role": "user", "content": "What is LLMOps and why is it important?"}
],
"max_tokens": 200,
"temperature": 0.7
}' | jq .
Install and run:
python3 -m venv .venv && source .venv/bin/activate
pip install openai --index-url https://pypi.org/simple
cat <<'PY' > demo.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "system", "content": "You are an expert DevOps assistant."},
        {"role": "user", "content": "Explain what technological sovereignty in AI means."},
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
PY
python demo.py
Low temperature (0.1–0.3) for code.
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [
{"role": "system", "content": "You are an expert Python developer. Respond only with clean, working code and a brief explanation."},
{"role": "user", "content": "Write a Python function that retries a failed HTTP request up to 3 times with exponential backoff."}
],
"max_tokens": 400,
"temperature": 0.2
}' | jq -r '.choices[0].message.content'
Also great for building infra:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [
{"role": "system", "content": "You are a Kubernetes expert. Output only valid YAML."},
{"role": "user", "content": "Write a Kubernetes CronJob that runs a curl command to a health endpoint every 5 minutes and logs the response."}
],
"max_tokens": 400,
"temperature": 0.1
}' | jq -r '.choices[0].message.content'
Nice for adding automatic pull request reviews:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [
{"role": "system", "content": "You are a senior software engineer doing a code review. Be concise and focus on bugs and improvements."},
{"role": "user", "content": "Review this Python code:\n\ndef get_user(id):\n db = connect_db()\n result = db.query(f'\''SELECT * FROM users WHERE id = {id}'\'')\n return result"}
],
"max_tokens": 300,
"temperature": 0.3
}' | jq -r '.choices[0].message.content'
How agents are being created:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [
{"role": "system", "content": "You are a debugging expert. Think through the problem step by step before giving your answer. Format your response as:\n\nThinking:\n<your reasoning>\n\nAnswer:\n<your conclusion>"},
{"role": "user", "content": "A Kubernetes pod keeps restarting every 2 minutes. The logs show the process exits with code 137. What is causing this and how do I fix it?"}
],
"max_tokens": 500,
"temperature": 0.3
}' | jq -r '.choices[0].message.content'
Very interesting for planning:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [
{"role": "system", "content": "You are a senior DevOps architect. When given a problem, first list the steps to solve it, then implement each step."},
{"role": "user", "content": "I need to set up a CI/CD pipeline that: builds a Docker image, runs tests, pushes to a registry, and deploys to Kubernetes. Give me the GitHub Actions workflow."}
],
"max_tokens": 600,
"temperature": 0.2
}' | jq -r '.choices[0].message.content'
On a typical local kind cluster, two vLLM pods compete heavily for CPU and memory. Before deploying Qwen, tear down the general-model workload so the coder model can load and serve reliably.
If any terminal still has kubectl port-forward pointing at svc/vllm-server, stop it first (Ctrl+C).
kubectl delete deployment vllm-server
kubectl delete service vllm-server
kubectl delete pvc vllm-models
Deleting the PVC frees disk and removes the cached Llama weights from that volume; if you deploy the Llama manifest again later, the model will be downloaded again.
If your machine has enough resources, you can skip this step and run Llama and Qwen at the same time (general API on port 8000, coder on 8001); in that case you need two port-forwards.
The Llama 3.2 1B model is a great general-purpose model, but for coding tasks a specialized model does significantly better. We deploy Qwen 2.5 Coder 1.5B Instruct — no Hugging Face token required.
In production you often route workloads to different models (general vs code); here we either replace Llama with Qwen to save resources, or run both if you skipped Step 11.
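The routing idea can be sketched in a few lines. This is an illustrative toy, not something the workshop ships: it assumes both port-forwards from this lab are active (Llama on 8000, Qwen on 8001), and the `Backend` class, the `pick_backend` helper, and the keyword list are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    base_url: str
    model: str


# Endpoints and model IDs as used in this lab (port-forwards on 8000 and 8001).
GENERAL = Backend("http://localhost:8000/v1", "meta-llama/Llama-3.2-1B-Instruct")
CODER = Backend("http://localhost:8001/v1", "Qwen/Qwen2.5-Coder-1.5B-Instruct")

# Hypothetical keyword list; a production router would use something more robust.
CODE_HINTS = ("code", "function", "bug", "yaml", "python", "helm", "unit test")


def pick_backend(prompt: str) -> Backend:
    """Send code-looking prompts to the coder model, everything else to the general one."""
    text = prompt.lower()
    if any(hint in text for hint in CODE_HINTS):
        return CODER
    return GENERAL


print(pick_backend("Write a Python function that parses CPU requests").model)
print(pick_backend("What is LLMOps and why is it important?").model)
```

The chosen `base_url` and `model` can be passed straight to an OpenAI-compatible client. Part 2 of the workshop moves this kind of decision into a gateway instead of client code.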
Deploy the PVC, deployment and service:
kubectl apply -f kind/pvc-coder.yaml
kubectl apply -f kind/deployment-coder.yaml
kubectl apply -f kind/service-coder.yaml
Pod should be up & running:
kubectl get pods
If you removed Llama in Step 11, you should see only the coder pod (names will vary):
NAME READY STATUS RESTARTS AGE
vllm-coder-xxx 1/1 Running 0 2m
If you skipped Step 11 and still run Llama, you will see both vllm-server and vllm-coder pods.
Expose Qwen on a local port (here 8001). Stop any previous port-forward that still targets vllm-server unless you intentionally kept Llama running:
kubectl port-forward svc/vllm-coder 8001:8000
Run it against port 8001:
curl http://localhost:8001/v1/models | jq .
Coding without burning money is possible:
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
"messages": [
{"role": "system", "content": "You are an expert Python developer. Write clean, well-tested code."},
{"role": "user", "content": "Write a Python function that parses a Kubernetes resource string like '\''100m'\'' (millicores) or '\''2'\'' (cores) and returns the value in millicores as an integer. Include unit tests using pytest."}
],
"max_tokens": 600,
"temperature": 0.1
}' | jq -r '.choices[0].message.content'
Building platforms is possible as well:
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
"messages": [
{"role": "system", "content": "You are a Kubernetes and Helm expert. Output only valid YAML with comments."},
{"role": "user", "content": "Write a Helm values.yaml for a web application deployment with: configurable replicas, resource requests/limits, ingress with TLS, horizontal pod autoscaler, and a PostgreSQL dependency."}
],
"max_tokens": 700,
"temperature": 0.1
}' | jq -r '.choices[0].message.content'
Kill the bugs:
curl http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
"messages": [
{"role": "system", "content": "You are a debugging expert. Identify all bugs, explain each one, and provide the fixed code."},
{"role": "user", "content": "Find the bugs in this Python code:\n\nimport threading\n\nresults = []\n\ndef fetch(url):\n import urllib.request\n data = urllib.request.urlopen(url).read()\n results.append(data)\n\nurls = ['\''http://example.com'\''] * 10\nthreads = [threading.Thread(target=fetch, args=(u,)) for u in urls]\nfor t in threads: t.start()\nprint(f'\''Fetched {len(results)} pages'\'')"}
],
"max_tokens": 500,
"temperature": 0.2
}' | jq -r '.choices[0].message.content'
Only works when both vllm-server and vllm-coder are running. Same coding prompt to Llama (8000) vs Qwen (8001):
from openai import OpenAI
general = OpenAI(base_url="http://localhost:8000/v1", api_key="not-required")
coder = OpenAI(base_url="http://localhost:8001/v1", api_key="not-required")
prompt = (
"Write a Python context manager that measures and prints "
"the execution time of a code block."
)
Same file / session as Part 1 (prompt, general, coder):
def ask(client, model, label):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are an expert Python developer. Be concise."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=400,
        temperature=0.1,
    )
    print(f"\n{'='*60}")
    print(f"Model: {label}")
    print('='*60)
    print(response.choices[0].message.content)

ask(general, "meta-llama/Llama-3.2-1B-Instruct", "Llama 3.2 1B (general)")
ask(coder, "Qwen/Qwen2.5-Coder-1.5B-Instruct", "Qwen 2.5 Coder 1.5B (specialized)")
Takeaway: The specialized coder model typically produces more idiomatic code, better error handling, and more complete solutions, even at a similar parameter count. Choosing the right model for the workload is a core LLMOps decision.
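When judging the two answers, it helps to have a baseline in mind. The block below is one hand-written version of what the prompt asks for, offered as a reference point and not as output from either model. The name `timed` and the output format are arbitrary choices.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str = "block"):
    """Measure and print the wall-clock time of the enclosed code block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so the timing is always reported.
        elapsed = time.perf_counter() - start
        print(f"{label} took {elapsed:.4f}s")


with timed("sleep"):
    time.sleep(0.05)
```

Points worth checking in a model's answer: whether it uses a monotonic timer such as `time.perf_counter`, and whether the time is still printed when the block raises.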
Before Part 2, free RAM, CPU, and disk on your laptop and remove the local cluster so kubectl is not still pointed at kind.
- Stop any kubectl port-forward to vllm-server or vllm-coder (Ctrl+C in those terminals).
- From the repository root (same context you used for Part 1):
kubectl delete deployment vllm-server vllm-coder --ignore-not-found
kubectl delete service vllm-server vllm-coder --ignore-not-found
kubectl delete pvc vllm-models vllm-models-coder --ignore-not-found
kubectl delete secret hf-token-secret --ignore-not-found
kind delete cluster --name llmops
Cluster name llmops matches kind/kind-config.yaml.
Optional — reclaim Docker image / layer disk after the lab:
docker system prune -a
- EKS cluster named with your GitHub username
- GPU managed node group (g4dn.xlarge) for Ollama
- LiteLLM as the single OpenAI-compatible gateway
- Amazon Bedrock (Devstral-2-123B-Instruct-2512 via mistral.devstral-2-123b) via IRSA
- Open WebUI, OpenCode: all through LiteLLM
- kagent in namespace kagent: agents, kmcp, and UI on-cluster via Helm; this workshop points the default OpenAI-compatible provider at LiteLLM in llmops (BYO OpenAI-compatible)
- Hermes Agent (Nous Research) in namespace hermes: gateway, dashboard, and OpenAI-compatible API (Docker runbook); main chat goes to LiteLLM with alias devstral-2-123b-instruct-bedrock (Bedrock in llmops, same pattern as OpenAI-compatible base_url); optional Telegram bot via secret hermes-telegram
| What | Secret / auth |
|---|---|
| aws / eksctl / cluster | Lab IAM user keys |
| LiteLLM → Bedrock | IRSA · SA litellm · llmops |
| kagent → LiteLLM | kagent-openai/OPENAI_API_KEY = master key (same as OWUI / OpenCode) |
| Hermes → LiteLLM | hermes-litellm → OPENAI_API_KEY · model devstral-2-123b-instruct-bedrock (09 / 05) |
| Hermes /v1 API | hermes-gateway/API_SERVER_KEY (doc) |
| Hermes Telegram (opt.) | hermes-telegram · TELEGRAM_* (doc) |
flowchart TB
subgraph clients["Clients"]
OWUI["Open WebUI"]
OC["OpenCode"]
KPF["Browser → kagent UI\nkubectl port-forward"]
HPF["Browser → Hermes dashboard\nkubectl port-forward"]
end
subgraph cluster["EKS"]
subgraph ns_llm["namespace llmops"]
LIT["LiteLLM"]
OL["Ollama on GPU nodes"]
end
subgraph ns_kagent["namespace kagent"]
KAG["kagent — controller, UI,\nagents, kmcp, tools"]
end
subgraph ns_hermes["namespace hermes"]
HER["Hermes — gateway + dashboard"]
end
end
subgraph saas["External APIs"]
BR["Bedrock"]
TG["Telegram Bot API"]
end
OWUI --> LIT
OC --> LIT
KPF --> KAG
HPF --> HER
KAG -->|"OpenAI /v1 + Bearer\n(workshop: in-cluster LiteLLM)"| LIT
HER -->|"custom base_url + OPENAI_API_KEY"| LIT
HER <-->|"optional: TELEGRAM_*\n(secret hermes-telegram)"| TG
LIT --> OL
LIT --> BR
- AWS CLI + WORKSHOP_AWS_ACCOUNT_ID (from STS) + CLUSTER_NAME / eu-west-1
- eksctl create cluster -f eks/eksctl-cluster.yaml (GPU node + OIDC)
- IRSA: IAM role + 01-litellm-serviceaccount.yaml
- Namespace llmops + secret litellm-secrets (LITELLM_MASTER_KEY only)
- sed ServiceAccount placeholder + kubectl apply -k eks/manifests/ (LiteLLM + Ollama)
- ollama pull llama3.2:1b + qwen2.5-coder:1.5b
- Verify LiteLLM (port-forward, curl Ollama alias + Bedrock alias)
- Helm Open WebUI → eks/helm/open-webui-values.yaml
- OpenCode on your laptop → LiteLLM via port-forward
- Helm kagent, namespace kagent: kagent-crds then kagent chart
- Hermes, namespace hermes: hermes-litellm + hermes-gateway secrets, 09–11 manifests (LiteLLM /v1 + devstral-2-123b-instruct-bedrock); optional Telegram via hermes-telegram
Set AWS access keys and region. Use your GitHub username as cluster name:
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="eu-west-1"
export CLUSTER_NAME="your-github-handle" # EKS cluster name
Check credentials and capture the AWS account ID used for IAM and ARNs:
export WORKSHOP_AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
echo "WORKSHOP_AWS_ACCOUNT_ID=$WORKSHOP_AWS_ACCOUNT_ID"
aws sts get-caller-identity
From the repository root, patch the cluster name and create the cluster:
sed -i '' "s/REPLACE_GITHUB_USERNAME/${CLUSTER_NAME}/g" eks/eksctl-cluster.yaml # macOS
# sed -i "s/REPLACE_GITHUB_USERNAME/${CLUSTER_NAME}/g" eks/eksctl-cluster.yaml # Linux
eksctl create cluster -f eks/eksctl-cluster.yaml
Verify nodes and GPU allocatable (g4dn.xlarge, label workload: gpu). If nvidia.com/gpu is missing, install the NVIDIA device plugin for EKS and re-check.
kubectl get nodes -o wide
kubectl get nodes -o json | jq '.items[].status.allocatable | "nvidia: " + (.["nvidia.com/gpu"] // "none")'
Cluster template enables EBS CSI; manifests set default gp3 StorageClass for ollama-data.
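The same GPU check can be done in Python when jq is not available. This is a minimal sketch mirroring the jq filter above. It reads the same `kubectl get nodes -o json` output; the function name `gpu_allocatable` and the sample node names are hypothetical.

```python
import json


def gpu_allocatable(nodes_json: str) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count (0 when the resource is absent)."""
    result = {}
    for node in json.loads(nodes_json)["items"]:
        allocatable = node["status"].get("allocatable", {})
        # Kubernetes reports extended resources as strings, e.g. "1".
        result[node["metadata"]["name"]] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Hypothetical sample: one GPU node and one node without the resource.
sample_gpu_nodes = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node"},
         "status": {"allocatable": {"cpu": "4", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node"},
         "status": {"allocatable": {"cpu": "2"}}},
    ]
})
print(gpu_allocatable(sample_gpu_nodes))
```

A node that reports 0 here corresponds to the "none" case in the jq output, which is the signal to install the NVIDIA device plugin.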
Policy WorkshopBedrockInvoke: arn:aws:iam::${WORKSHOP_AWS_ACCOUNT_ID}:policy/WorkshopBedrockInvoke (must exist in the same account as WORKSHOP_AWS_ACCOUNT_ID from Step 1).
Resolve OIDC (after cluster exists). Re-export WORKSHOP_AWS_ACCOUNT_ID if you opened a new shell:
export WORKSHOP_AWS_ACCOUNT_ID="${WORKSHOP_AWS_ACCOUNT_ID:-$(aws sts get-caller-identity --query Account --output text)}"
export OIDC_ISSUER="$(aws eks describe-cluster --name "$CLUSTER_NAME" --region "$AWS_REGION" \
--query "cluster.identity.oidc.issuer" --output text)"
export OIDC_ID="$(basename "$OIDC_ISSUER")"
Write /tmp/litellm-trust-policy.json (Federated OIDC + system:serviceaccount:llmops:litellm):
cat > /tmp/litellm-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${WORKSHOP_AWS_ACCOUNT_ID}:oidc-provider/oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
"oidc.eks.${AWS_REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:llmops:litellm"
}
}
}]
}
EOF
Create role + attach policy:
aws iam create-role \
--role-name litellm-bedrock-eu-west-1 \
--assume-role-policy-document file:///tmp/litellm-trust-policy.json \
--description "IRSA for LiteLLM (llmops/litellm)"
aws iam attach-role-policy \
--role-name litellm-bedrock-eu-west-1 \
--policy-arn arn:aws:iam::${WORKSHOP_AWS_ACCOUNT_ID}:policy/WorkshopBedrockInvoke
If create-role fails with EntityAlreadyExists:
aws iam update-assume-role-policy \
--role-name litellm-bedrock-eu-west-1 \
--policy-document file:///tmp/litellm-trust-policy.json
Proxy token only (Open WebUI / OpenCode / curl). Do not put AWS keys in this secret.
kubectl create namespace llmops --dry-run=client -o yaml | kubectl apply -f -
kubectl create secret generic litellm-secrets \
-n llmops \
--from-literal=LITELLM_MASTER_KEY="$(openssl rand -hex 24)"
Optional Ollama Cloud (uncomment the block in 05-litellm-configmap.yaml if you use it):
kubectl create secret generic ollama-cloud \
-n llmops \
--from-literal=OLLAMA_API_KEY="YOUR_OLLAMA_CLOUD_KEY"
Substitute REPLACE_AWS_ACCOUNT_ID in the ServiceAccount with WORKSHOP_AWS_ACCOUNT_ID (Step 1), then apply:
export WORKSHOP_AWS_ACCOUNT_ID="${WORKSHOP_AWS_ACCOUNT_ID:-$(aws sts get-caller-identity --query Account --output text)}"
sed -i '' "s/REPLACE_AWS_ACCOUNT_ID/${WORKSHOP_AWS_ACCOUNT_ID}/g" eks/manifests/01-litellm-serviceaccount.yaml # macOS
# sed -i "s/REPLACE_AWS_ACCOUNT_ID/${WORKSHOP_AWS_ACCOUNT_ID}/g" eks/manifests/01-litellm-serviceaccount.yaml # Linux
kubectl apply -k eks/manifests/
kubectl -n llmops rollout status deploy/ollama --timeout=600s
kubectl -n llmops rollout status deploy/litellm --timeout=300s
After ConfigMap edits:
kubectl -n llmops rollout restart deploy/litellm
Let's use the same models as before:
kubectl -n llmops exec deploy/ollama -- ollama pull llama3.2:1b
kubectl -n llmops exec deploy/ollama -- ollama pull qwen2.5-coder:1.5b
Check models:
kubectl -n llmops exec deploy/ollama -- ollama list
Terminal 1 (port-forward):
kubectl -n llmops port-forward svc/litellm 4000:4000
Terminal 2 (load key and test Ollama via LiteLLM, then Bedrock via IRSA):
export LITELLM_MASTER_KEY="$(kubectl -n llmops get secret litellm-secrets -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)"
curl -s http://127.0.0.1:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"qwen-coder-local","messages":[{"role":"user","content":"Say hello."}]}' | jq .
curl -s http://127.0.0.1:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"devstral-2-123b-instruct-bedrock","messages":[{"role":"user","content":"Ping"}]}' | jq .
The open source chat client:
helm repo add open-webui https://helm.openwebui.com/
helm repo update
helm upgrade --install open-webui open-webui/open-webui \
-n llmops \
-f eks/helm/open-webui-values.yaml
Let's use it:
kubectl -n llmops port-forward svc/open-webui 8080:80
Prerequisites: LiteLLM running (Step 7 port-forward). Full walkthrough: OpenCode Quickstart (LiteLLM).
kubectl -n llmops port-forward svc/litellm 4000:4000
If you don't have OpenCode installed:
curl -fsSL https://opencode.ai/install | bash
# or: npm install -g opencode-ai OR brew install sst/tap/opencode
opencode --version
mkdir -p ~/.config/opencode
Create ~/.config/opencode/opencode.json (global). baseURL must match the port-forward (http://127.0.0.1:4000/v1). Keys under models must match model_name in eks/manifests/05-litellm-configmap.yaml (workshop aliases):
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"litellm": {
"npm": "@ai-sdk/openai-compatible",
"name": "LiteLLM",
"options": { "baseURL": "http://127.0.0.1:4000/v1" },
"models": {
"llama-3-2-1b-local": { "name": "Llama 3.2 1B (Ollama)" },
"qwen-coder-local": { "name": "Qwen 2.5 Coder (Ollama)" },
"devstral-2-123b-instruct-bedrock": { "name": "Devstral 2 123B (Bedrock)" }
}
}
}
}
Let's run it:
export LITELLM_MASTER_KEY="$(kubectl -n llmops get secret litellm-secrets -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)"
echo $LITELLM_MASTER_KEY # Copy the result
opencode
In OpenCode: /connect → provider LiteLLM (must match "name" in opencode.json) → API key = LITELLM_MASTER_KEY. Then /models and pick an alias.
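If a model does not show up under /models or requests fail, a common cause is a key in opencode.json that does not match a LiteLLM alias. The sketch below is a hypothetical pre-flight check, not an OpenCode or LiteLLM feature: it compares the `models` keys of a config string against the three aliases named in this workshop. The helper name `unknown_models` and the typo in the sample are made up for illustration.

```python
import json

# Aliases named in this workshop for the LiteLLM config (model_name values).
LITELLM_ALIASES = {
    "llama-3-2-1b-local",
    "qwen-coder-local",
    "devstral-2-123b-instruct-bedrock",
}


def unknown_models(opencode_config: str, provider: str = "litellm") -> list:
    """Return model keys in the OpenCode config that are not known LiteLLM aliases."""
    config = json.loads(opencode_config)
    models = config["provider"][provider]["models"]
    return sorted(set(models) - LITELLM_ALIASES)


# Hypothetical config with one mistyped alias ("qwen-coder" instead of "qwen-coder-local").
sample_opencode_config = json.dumps({
    "provider": {
        "litellm": {
            "options": {"baseURL": "http://127.0.0.1:4000/v1"},
            "models": {
                "llama-3-2-1b-local": {"name": "Llama 3.2 1B (Ollama)"},
                "qwen-coder": {"name": "Qwen 2.5 Coder (Ollama)"},
            },
        }
    }
})
print(unknown_models(sample_opencode_config))
```

An empty list means every key in the config maps to an alias LiteLLM knows about. If you edit 05-litellm-configmap.yaml, the alias set here would need the same change.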
kagent runs the controller, UI, bundled agents, kmcp, and (by default) a bundled PostgreSQL for evaluation. Install it in its own namespace so it stays isolated from llmops.
Prerequisite: LiteLLM is already running in llmops (Steps 5–7). This lab treats LiteLLM as an OpenAI-compatible upstream: set providers.openAI.config.baseUrl to the in-cluster LiteLLM /v1 endpoint and use LITELLM_MASTER_KEY as the bearer token (same pattern as BYO OpenAI-compatible).
General Helm install: Installing kagent — Using Helm.
Install the CRDs chart once per cluster (release name kagent-crds). Skip this if you already installed it.
helm install kagent-crds oci://ghcr.io/kagent-dev/kagent/helm/kagent-crds \
--namespace kagent \
--create-namespace
The chart defaults to providers.openAI.apiKeySecretRef: kagent-openai and apiKeySecretKey: OPENAI_API_KEY. Reuse the workshop LITELLM_MASTER_KEY (from litellm-secrets in llmops) as that API key so kagent authenticates to LiteLLM like Open WebUI / OpenCode.
export LITELLM_MASTER_KEY="$(kubectl -n llmops get secret litellm-secrets -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)"
kubectl create secret generic kagent-openai -n kagent \
--from-literal=OPENAI_API_KEY="$LITELLM_MASTER_KEY" \
--dry-run=client -o yaml | kubectl apply -f -
To use another secret name or key, set providers.openAI.apiKeySecretRef / providers.openAI.apiKeySecretKey (Helm values).
Install or upgrade the kagent chart. providers.openAI.model must be a model_name from eks/manifests/05-litellm-configmap.yaml (e.g. qwen-coder-local, llama-3-2-1b-local, devstral-2-123b-instruct-bedrock). baseUrl must include the /v1 suffix LiteLLM exposes.
helm upgrade --install kagent oci://ghcr.io/kagent-dev/kagent/helm/kagent \
--namespace kagent \
-f eks/helm/kagent-values.yaml
Wait until workloads are ready:
kubectl -n kagent get pods
kubectl -n kagent wait --for=condition=Available deployment --all --timeout=600s
The chart exposes svc/kagent-ui on port 8080. If your laptop already uses 8080 for Open WebUI, forward a different local port:
kubectl port-forward -n kagent svc/kagent-ui 8082:8080
Then open http://localhost:8082 in the browser.
Hermes Agent is an agent runtime with tools, memory, and an optional OpenAI-compatible HTTP API on port 8642 when API_SERVER_ENABLED=true.
Point the LLM at LiteLLM (base_url) with OPENAI_API_KEY = LITELLM_MASTER_KEY. 09-hermes-configmap.yaml sets model to devstral-2-123b-instruct-bedrock — the LiteLLM model_name alias that routes to Bedrock (defined in 05-litellm-configmap.yaml). Change model in 09 to another alias from 05 (e.g. qwen-coder-local) if you prefer.
Let's deploy it:
kubectl apply -f eks/manifests/08-hermes-namespace.yaml
export LITELLM_MASTER_KEY="$(kubectl -n llmops get secret litellm-secrets -o jsonpath='{.data.LITELLM_MASTER_KEY}' | base64 -d)"
kubectl create secret generic hermes-litellm -n hermes \
--from-literal=LITELLM_MASTER_KEY="$LITELLM_MASTER_KEY" \
--dry-run=client -o yaml | kubectl apply -f -
kubectl create secret generic hermes-gateway -n hermes \
--from-literal=API_SERVER_KEY="$(openssl rand -hex 24)" \
--dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f eks/manifests/09-hermes-configmap.yaml \
-f eks/manifests/10-hermes-deployment.yaml \
-f eks/manifests/11-hermes-service.yaml
kubectl -n hermes rollout status deploy/hermes --timeout=600s
kubectl -n hermes port-forward svc/hermes 8643:8642 9120:9119
Dashboard: http://localhost:9120. Optional curl against the gateway (Bearer = API_SERVER_KEY from hermes-gateway):
export HERMES_API_KEY="$(kubectl -n hermes get secret hermes-gateway -o jsonpath='{.data.API_SERVER_KEY}' | base64 -d)"
curl -sS http://127.0.0.1:8643/v1/models -H "Authorization: Bearer $HERMES_API_KEY" | jq .
Full guide: Telegram | Hermes Agent (groups, webhooks, voice, /topic, allowlists, troubleshooting). On EKS, Hermes reaches Telegram with long polling (outbound); no public URL required unless you switch to webhook mode.
- Open @BotFather → /newbot → pick a display name and a …_bot username → copy the API token. If it leaks: /revoke in BotFather, then update the Kubernetes secret.
- Your numeric Telegram user ID (not @username): message @userinfobot or @get_id_bot. Hermes uses TELEGRAM_ALLOWED_USERS so only listed users can use the bot (security).
The hermes Deployment injects TELEGRAM_BOT_TOKEN and TELEGRAM_ALLOWED_USERS from secret hermes-telegram via secretKeyRef (optional: true — Hermes still starts if you skip Telegram). Keys must match exactly (comma-separated user IDs in TELEGRAM_ALLOWED_USERS).
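Because a malformed allowlist is easy to miss, the value can be sanity-checked before it goes into the secret. This is a hypothetical pre-flight helper and makes no claim about how Hermes itself parses the variable: it only enforces what the slide states, namely a comma-separated list of numeric IDs. The function name `parse_allowed_users` and the sample IDs are made up.

```python
def parse_allowed_users(value: str) -> list:
    """Split a comma-separated list of numeric Telegram user IDs, rejecting anything else."""
    ids = []
    for part in value.split(","):
        part = part.strip()
        # Usernames such as "@alice" or empty entries are not valid numeric IDs.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"not a numeric Telegram user ID: {part!r}")
        ids.append(int(part))
    return ids


print(parse_allowed_users("123456789, 987654321"))
```

Passing an `@username` or leaving a trailing comma raises `ValueError`, which is cheaper to catch here than after a rollout restart.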
kubectl create secret generic hermes-telegram -n hermes \
--from-literal=TELEGRAM_BOT_TOKEN='REPLACE_WITH_BOTFATHER_TOKEN' \
--from-literal=TELEGRAM_ALLOWED_USERS='REPLACE_WITH_YOUR_NUMERIC_ID' \
--dry-run=client -o yaml | kubectl apply -f -
kubectl -n hermes rollout restart deploy/hermes
kubectl -n hermes rollout status deploy/hermes --timeout=600s
Verify: DM your bot in Telegram; replies should appear within seconds on a normal EKS node pool.
Webhooks: for serverless / sleep-on-idle platforms, use TELEGRAM_WEBHOOK_URL + TELEGRAM_WEBHOOK_SECRET instead of polling — webhook mode.
Groups: if DM works but not groups → Privacy mode: BotFather Bot Settings → Group Privacy → Turn off, then remove and re-add the bot; or make the bot group admin. See works in DMs but not groups and TELEGRAM_GROUP_ALLOWED_CHATS / TELEGRAM_GROUP_ALLOWED_USERS.
Optional: BotFather /setdescription, /setabouttext, /setcommands — Customize your bot.
Same commands as Cleanup — Part 1 (before Part 2). Stop kubectl port-forward first.
kubectl delete deployment vllm-server vllm-coder --ignore-not-found
kubectl delete service vllm-server vllm-coder --ignore-not-found
kubectl delete pvc vllm-models vllm-models-coder --ignore-not-found
kubectl delete secret hf-token-secret --ignore-not-found
kind delete cluster --name llmops
Optional disk cleanup:
docker system prune -a
From the repository root, with the same credentials and CLUSTER_NAME as Step 1:
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="eu-west-1"
export CLUSTER_NAME="your-github-handle"
aws sts get-caller-identity
helm uninstall open-webui -n llmops 2>/dev/null || true
helm uninstall kagent -n kagent 2>/dev/null || true
helm uninstall kagent-crds -n kagent 2>/dev/null || true
kubectl delete namespace kagent --wait=true 2>/dev/null || true
kubectl delete namespace hermes --wait=true 2>/dev/null || true
kubectl delete namespace llmops --wait=true 2>/dev/null || true
eksctl delete cluster --name "$CLUSTER_NAME" --region "$AWS_REGION" --wait
aws eks list-clusters --region "$AWS_REGION" --query 'clusters' --output text
The IAM role from Step 3 can stay; after a new cluster exists, run Step 3 again so the trust policy matches the new OIDC issuer.
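Re-running Step 3 amounts to regenerating the trust policy for the new issuer. As a rough illustration of what that document contains, the sketch below builds the same JSON structure as the Step 3 heredoc from an account ID and an OIDC issuer URL. The function name `irsa_trust_policy`, the account ID, and the issuer ID are placeholders, and the workshop itself uses the shell heredoc, not this code.

```python
import json
from urllib.parse import urlparse


def irsa_trust_policy(account_id: str, oidc_issuer: str,
                      namespace: str = "llmops", service_account: str = "litellm") -> dict:
    """Build the IRSA trust policy for one service account from the cluster's OIDC issuer URL."""
    parsed = urlparse(oidc_issuer)
    # The provider ID is the issuer URL without the https:// scheme.
    provider = parsed.netloc + parsed.path
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{provider}:aud": "sts.amazonaws.com",
                    f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                }
            },
        }],
    }


# Placeholder account ID and issuer ID, for illustration only.
policy = irsa_trust_policy(
    "111122223333",
    "https://oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE",
)
print(json.dumps(policy, indent=2))
```

Only the issuer ID changes between clusters, which is why the role can stay while the trust policy has to be rewritten.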
Questions?






