This repository contains all configuration files and deployment manifests for running production-ready Large Language Models on Kubernetes with GPU support and full observability.
- Proxmox VE 7.x or later
- NVIDIA Tesla T4 GPU (or compatible)
- `clusterctl` CLI installed
- Cluster API managed cluster deployed
- `kubectl` and `helm` installed
- Dynatrace tenant (for observability)
```shell
git clone https://github.com/isItObservable/K8s-LLM
cd K8s-LLM
```

If you don't have a Dynatrace tenant, you can create a trial using the following link: Dynatrace Trial

Once you have your tenant, save the Dynatrace tenant URL in the variable `DT_TENANT_URL` (for example: `https://dedededfrf.live.dynatrace.com`):

```shell
DT_TENANT_URL=<YOUR TENANT URL>
```
The Dynatrace Operator requires several tokens:
- Token to deploy and configure the various components
- Token to ingest metrics and Traces
Create a token for the operator with the following scopes:
- Create ActiveGate tokens
- Read entities
- Read Settings
- Write Settings
- Access problem and event feed, metrics and topology
- Read configuration
- Write configuration
- Paas integration - installer downloader
Save the token value; we will store it later in a Kubernetes secret.

```shell
API_TOKEN=<YOUR TOKEN VALUE>
```

Create a data ingest token with the following scopes:
- Ingest metrics (metrics.ingest)
- Ingest logs (logs.ingest)
- Ingest events (events.ingest)
- Ingest OpenTelemetry
- Read metrics
```shell
DATA_INGEST_TOKEN=<YOUR TOKEN VALUE>
```

Repository structure:

```
├── cluster-api/                # Cluster provisioning with Proxmox
│   └── capi_llm_cluster.yaml   # Cluster configuration
├── gpu-operator/               # NVIDIA GPU Operator setup
│   ├── values.yaml             # Helm values
│   └── time-slicing.yaml       # GPU sharing configuration
├── llm-serving/                # LLM deployment configurations
│   └── llm.yaml                # Ollama serving framework
└── observability/              # Monitoring and logging
    ├── otel-collector/         # OpenTelemetry Collector
    └── dynatrace/              # Dynatrace integration
```
```shell
# Initialize Cluster API with Proxmox provider
clusterctl init --infrastructure proxmox

# Create the cluster
kubectl apply -f cluster-api/capi_llm_cluster.yaml

# Wait for cluster to be ready
kubectl get clusters -w

# Get the kubeconfig and merge it into ~/.kube/config
cp ~/.kube/config ~/.kube/config-bck
clusterctl get kubeconfig observable-llm > observable-llm.kubeconfig
export KUBECONFIG=~/.kube/config:./observable-llm.kubeconfig
kubectl config view
kubectl config view --flatten > one-config.yaml
mv one-config.yaml ~/.kube/config
```
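For orientation, the manifest applied above pairs a Cluster API `Cluster` with a Proxmox infrastructure object. The sketch below is an assumption about its general shape, not the repo's actual file (the real `capi_llm_cluster.yaml` also carries control-plane and machine templates; the API versions shown follow the Proxmox provider's published CRDs and may differ by release):

```yaml
# Hedged sketch only -- see cluster-api/capi_llm_cluster.yaml for the real manifest.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: observable-llm
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]   # must match the Flannel podCidr used below
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: ProxmoxCluster
    name: observable-llm
```

Note that the pod CIDR here has to agree with the `podCidr` passed to the Flannel chart later in this guide.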
To get the nodes to the `Ready` state, you need a CNI installed. Make sure `kubectl` now points to the new cluster created by Cluster API, then install Flannel:
```shell
kubectl create ns kube-flannel
kubectl label --overwrite ns kube-flannel pod-security.kubernetes.io/enforce=privileged
helm repo add flannel https://flannel-io.github.io/flannel/
helm install flannel --set podCidr="192.168.0.0/16" --namespace kube-flannel flannel/flannel
```

Enable strict ARP in kube-proxy and restart it:

```shell
kubectl get configmap kube-proxy -n kube-system -o yaml | \
  sed -e "s/strictARP: false/strictARP: true/" | \
  kubectl apply -f - -n kube-system
kubectl rollout restart ds kube-proxy -n kube-system
```

```shell
# Add NVIDIA Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

# Install with custom values
helm install gpu-operator nvidia/gpu-operator \
  -n nvidia-gpu-operator --create-namespace \
  -f gpu-operator/values.yaml
```
```shell
# Wait for all pods to be ready
kubectl wait --for=condition=ready pod \
  -l app=nvidia-driver-daemonset \
  -n nvidia-gpu-operator --timeout=300s
```
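GPU time-slicing lets several pods share the single Tesla T4. A minimal sketch of what `gpu-operator/time-slicing.yaml` (applied in the next step) typically contains, following NVIDIA's documented time-slicing ConfigMap format — the replica count of 4 is an assumption, and the `time-slicing` data key must match the `nvidia.com/device-plugin.config` node label used below:

```yaml
# Hedged sketch of gpu-operator/time-slicing.yaml; the repo's file may differ.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-gpu-operator
data:
  time-slicing: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # assumption: each physical GPU is advertised as 4 schedulable GPUs
```

With this in place, a node with one T4 advertises `nvidia.com/gpu: 4`, so up to four pods can be scheduled onto it concurrently (with no memory isolation between them).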
```shell
# Apply time-slicing configuration
kubectl apply -f gpu-operator/time-slicing.yaml

# Label GPU nodes
kubectl label node gpu-worker-0 \
  nvidia.com/device-plugin.config=time-slicing
```

```shell
# Step 1: Gateway API standard CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml
```
```shell
# Step 2: Inference Extension CRDs (separate repo)
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/1.3.0/manifests.yaml
```
```shell
helm upgrade -i --create-namespace \
  --namespace kgateway-system \
  --version v2.1.2 kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds

helm upgrade -i -n kgateway-system kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway \
  --set inferenceExtension.enabled=true \
  --version v2.1.2
```

Install the inference pool:
```shell
helm install ollama-pool \
  --namespace llm-inference \
  --set inferencePool.modelServers.matchLabels.app=ollama \
  --set inferencePool.modelServers.targetPort=11434 \
  --set provider.name=kgateway \
  --version v1.3.0 \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```
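The InferencePool by itself does not expose anything; traffic reaches it through a Gateway and an HTTPRoute. A hedged sketch of what such routing objects can look like — the object names and the `kgateway` GatewayClass name are assumptions, and the InferencePool API group follows the v1 Inference Extension CRDs:

```yaml
# Hedged sketch: adapt names and the GatewayClass to your installation.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  namespace: llm-inference
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ollama-route
  namespace: llm-inference
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - backendRefs:
        - group: inference.networking.k8s.io   # InferencePool lives outside the core Gateway API group
          kind: InferencePool
          name: ollama-pool
```

The key difference from a plain HTTPRoute is the `backendRefs` entry: it points at the `ollama-pool` InferencePool instead of a Service, so the endpoint picker can route each request to a model server.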
```shell
# Deploy the Ollama serving stack
kubectl apply -f llm-serving/llm.yaml
```

Once the deployment is finished, load two models:

```shell
kubectl exec -n llm-inference deploy/ollama -- ollama pull llama3.1:8b
kubectl exec -n llm-inference deploy/ollama -- ollama pull deepseek-r1:7b
```

#### Deploy the cert-manager
```shell
echo "Deploying Cert Manager (for OpenTelemetry Operator)"
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.10.0/cert-manager.yaml

# Wait for the webhook pod to be ready
kubectl wait pod -l app.kubernetes.io/component=webhook -n cert-manager --for=condition=Ready --timeout=2m

# Deploy the OpenTelemetry Operator
sleep 10
echo "Deploying the OpenTelemetry Operator"
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
```

```shell
# Deploy the Dynatrace Operator
helm upgrade dynatrace-operator oci://public.ecr.aws/dynatrace/dynatrace-operator \
  --version 1.7.0 \
  --create-namespace --namespace dynatrace \
  --install \
  --atomic

kubectl -n dynatrace wait pod --for=condition=ready --selector=app.kubernetes.io/name=dynatrace-operator,app.kubernetes.io/component=webhook --timeout=300s
```
Replace the tenant URL placeholder in the DynaKube manifest (the `sed -i ''` form is for BSD/macOS; on Linux, use `sed -i` without the empty string), then create the secrets:

```shell
sed -i '' "s,TENANTURL_TOREPLACE,$DT_TENANT_URL," observability/dynatrace/dynakube.yaml

kubectl -n dynatrace create secret generic dynakube --from-literal="apiToken=$API_TOKEN" --from-literal="dataIngestToken=$DATA_INGEST_TOKEN"
kubectl create secret generic dynatrace --from-literal=dynatrace_oltp_url="$DT_TENANT_URL" --from-literal=dt_api_token="$DATA_INGEST_TOKEN"
```
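For reference, a minimal sketch of what a DynaKube custom resource can look like — field names follow the public DynaKube CRD, but the CRD version varies by operator release and the actual `observability/dynatrace/dynakube.yaml` in this repo may enable different capabilities:

```yaml
# Hedged sketch only; the repo's dynakube.yaml is authoritative.
apiVersion: dynatrace.com/v1beta3   # assumption: version depends on the operator release
kind: DynaKube
metadata:
  name: dynakube
  namespace: dynatrace
spec:
  apiUrl: TENANTURL_TOREPLACE/api   # rewritten by the sed command above
  oneAgent:
    cloudNativeFullStack: {}
  activeGate:
    capabilities:
      - routing
      - kubernetes-monitoring
      - metrics-ingest
```

The operator reads the `dynakube` secret created above (keys `apiToken` and `dataIngestToken`) because its name matches the DynaKube resource's name.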
```shell
kubectl apply -f observability/otel-collector/rbac.yaml
kubectl apply -f observability/otel-collector/openTelemetry-manifest_ds.yaml
kubectl apply -f observability/otel-collector/openTelemetry-manifest_statefulset.yaml
```

Enable debug logs for troubleshooting:
```shell
# Set log level to debug
kubectl set env deployment/gpu-operator \
  -n nvidia-gpu-operator LOG_LEVEL=debug

# View logs
kubectl logs -n nvidia-gpu-operator -l app=gpu-operator -f
```

```shell
# Check GPU resources on nodes
kubectl describe node gpu-worker-0 | grep -A5 "Allocatable"

# Verify NVIDIA driver installation
kubectl exec -it -n nvidia-gpu-operator \
  daemonset/nvidia-driver-daemonset -- nvidia-smi

# Check DCGM metrics
kubectl exec -n nvidia-gpu-operator \
  deploy/nvidia-dcgm-exporter -- curl -s localhost:9400/metrics | head -50
```

| Model | VRAM (Q4) | Use Case | Performance |
|---|---|---|---|
| DeepSeek Coder 6.7B | 3.14 GB | Code generation | 15-20 tok/s |
| Llama 3.1 8B | 4 GB | Strong reasoning | 12-18 tok/s |


