This repository contains all configuration files and deployment manifests for running production-ready Large Language Models on Kubernetes with GPU support and full observability.
- Proxmox VE 7.x or later
- NVIDIA Tesla T4 GPU (or compatible)
- `clusterctl` CLI installed
- Cluster API managed cluster deployed
- `kubectl` and `helm` installed
- Dynatrace tenant (for observability)
```shell
git clone https://github.com/isItObservable/K8s-LLM
cd K8s-LLM
```

If you don't have a Dynatrace tenant, you can create a trial using the following link: Dynatrace Trial

Once you have your tenant, save the Dynatrace tenant URL in the variable `DT_TENANT_URL` (for example: `https://dedededfrf.live.dynatrace.com`):

```shell
DT_TENANT_URL=<YOUR TENANT URL>
```
The Dynatrace Operator requires several tokens:
- Token to deploy and configure the various components
- Token to ingest metrics and Traces
Create a token for the operator with the following scopes:
- Create ActiveGate tokens
- Read entities
- Read Settings
- Write Settings
- Access problem and event feed, metrics and topology
- Read configuration
- Write configuration
- Paas integration - installer downloader
Save the token value; we will store it later in a Kubernetes secret.

```shell
API_TOKEN=<YOUR TOKEN VALUE>
```

Create a data ingest token with the following scopes:
- Ingest metrics (metrics.ingest)
- Ingest logs (logs.ingest)
- Ingest events (events.ingest)
- Ingest OpenTelemetry
- Read metrics
```shell
DATA_INGEST_TOKEN=<YOUR TOKEN VALUE>
```

Repository structure:

```
├── cluster-api/                # Cluster provisioning with Proxmox
│   └── capi_llm_cluster.yaml   # Cluster configuration
├── gpu-operator/               # NVIDIA GPU Operator setup
│   ├── values.yaml             # Helm values
│   └── time-slicing.yaml       # GPU sharing configuration
├── llm-serving/                # LLM deployment configurations
│   └── llm.yaml                # Ollama serving framework
└── observability/              # Monitoring and logging
    ├── otel-collector/         # OpenTelemetry Collector
    └── dynatrace/              # Dynatrace integration
```
```shell
# Initialize Cluster API with Proxmox provider
clusterctl init --infrastructure proxmox

# Create the cluster
kubectl apply -f cluster-api/capi_llm_cluster.yaml

# Wait for cluster to be ready
kubectl get clusters -w

# Get the kubeconfig and merge it into ~/.kube/config
cp ~/.kube/config ~/.kube/config-bck
clusterctl get kubeconfig observable-llm > observable-llm.kubeconfig
export KUBECONFIG=~/.kube/config:./observable-llm.kubeconfig
kubectl config view
kubectl config view --flatten > one-config.yaml
mv one-config.yaml ~/.kube/config
```
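For orientation, the manifest applied above pairs a Cluster API `Cluster` with a Proxmox infrastructure object. The sketch below is an assumption about its general shape, not the repo's actual file (the real `capi_llm_cluster.yaml` also carries control-plane and machine templates; the API versions shown follow the Proxmox provider's published CRDs and may differ by release):

```yaml
# Hedged sketch only -- see cluster-api/capi_llm_cluster.yaml for the real manifest.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: observable-llm
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]   # must match the Flannel podCidr used below
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: ProxmoxCluster
    name: observable-llm
```

Note that the pod CIDR here has to agree with the `podCidr` passed to the Flannel chart later in this guide.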
To get the nodes to the `Ready` state, you need a CNI installed. Make sure `kubectl` now points to the new cluster created by Cluster API, then install Flannel:
```shell
kubectl create ns kube-flannel
kubectl label --overwrite ns kube-flannel pod-security.kubernetes.io/enforce=privileged
helm repo add flannel https://flannel-io.github.io/flannel/
helm install flannel --set podCidr="192.168.0.0/16" --namespace kube-flannel flannel/flannel
```

Enable strict ARP in kube-proxy and restart it:

```shell
kubectl get configmap kube-proxy -n kube-system -o yaml | \
  sed -e "s/strictARP: false/strictARP: true/" | \
  kubectl apply -f - -n kube-system
kubectl rollout restart ds kube-proxy -n kube-system
```

```shell
# Add NVIDIA Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

# Install with custom values
helm install gpu-operator nvidia/gpu-operator \
  -n nvidia-gpu-operator --create-namespace \
  -f gpu-operator/values.yaml
```
```shell
# Wait for all pods to be ready
kubectl wait --for=condition=ready pod \
  -l app=nvidia-driver-daemonset \
  -n nvidia-gpu-operator --timeout=300s
```
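GPU time-slicing lets several pods share the single Tesla T4. A minimal sketch of what `gpu-operator/time-slicing.yaml` (applied in the next step) typically contains, following NVIDIA's documented time-slicing ConfigMap format — the replica count of 4 is an assumption, and the `time-slicing` data key must match the `nvidia.com/device-plugin.config` node label used below:

```yaml
# Hedged sketch of gpu-operator/time-slicing.yaml; the repo's file may differ.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-gpu-operator
data:
  time-slicing: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # assumption: each physical GPU is advertised as 4 schedulable GPUs
```

With this in place, a node with one T4 advertises `nvidia.com/gpu: 4`, so up to four pods can be scheduled onto it concurrently (with no memory isolation between them).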
```shell
# Apply time-slicing configuration
kubectl apply -f gpu-operator/time-slicing.yaml

# Label GPU nodes
kubectl label node gpu-worker-0 \
  nvidia.com/device-plugin.config=time-slicing
```

```shell
# Step 1: Gateway API standard CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml
```
```shell
# Step 2: Inference Extension CRDs (separate repo)
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/1.3.0/manifests.yaml
```
```shell
helm upgrade -i --create-namespace \
  --namespace kgateway-system \
  --version v2.1.2 kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds

helm upgrade -i -n kgateway-system kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway \
  --set inferenceExtension.enabled=true \
  --version v2.1.2
```

Install the inference pool:
```shell
helm install ollama-pool \
  --namespace llm-inference \
  --set inferencePool.modelServers.matchLabels.app=ollama \
  --set inferencePool.modelServers.targetPort=11434 \
  --set provider.name=kgateway \
  --version v1.3.0 \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```
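The InferencePool by itself does not expose anything; traffic reaches it through a Gateway and an HTTPRoute. A hedged sketch of what such routing objects can look like — the object names and the `kgateway` GatewayClass name are assumptions, and the InferencePool API group follows the v1 Inference Extension CRDs:

```yaml
# Hedged sketch: adapt names and the GatewayClass to your installation.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
  namespace: llm-inference
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ollama-route
  namespace: llm-inference
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - backendRefs:
        - group: inference.networking.k8s.io   # InferencePool lives outside the core Gateway API group
          kind: InferencePool
          name: ollama-pool
```

The key difference from a plain HTTPRoute is the `backendRefs` entry: it points at the `ollama-pool` InferencePool instead of a Service, so the endpoint picker can route each request to a model server.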
```shell
# Deploy the Ollama serving stack
kubectl apply -f llm-serving/llm.yaml
```

Once the deployment is finished, load two models:

```shell
kubectl exec -n llm-inference deploy/ollama -- ollama pull llama3.1:8b
kubectl exec -n llm-inference deploy/ollama -- ollama pull deepseek-r1:7b
```

#### Deploy the cert-manager
```shell
echo "Deploying Cert Manager (for OpenTelemetry Operator)"
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.10.0/cert-manager.yaml

# Wait for the webhook pod to be ready
kubectl wait pod -l app.kubernetes.io/component=webhook -n cert-manager --for=condition=Ready --timeout=2m

# Deploy the OpenTelemetry Operator
sleep 10
echo "Deploying the OpenTelemetry Operator"
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
```

```shell
# Deploy the Dynatrace Operator
helm upgrade dynatrace-operator oci://public.ecr.aws/dynatrace/dynatrace-operator \
  --version 1.7.0 \
  --create-namespace --namespace dynatrace \
  --install \
  --atomic

kubectl -n dynatrace wait pod --for=condition=ready --selector=app.kubernetes.io/name=dynatrace-operator,app.kubernetes.io/component=webhook --timeout=300s
```
Replace the tenant URL placeholder in the DynaKube manifest (the `sed -i ''` form is for BSD/macOS; on Linux, use `sed -i` without the empty string), then create the secrets:

```shell
sed -i '' "s,TENANTURL_TOREPLACE,$DT_TENANT_URL," observability/dynatrace/dynakube.yaml

kubectl -n dynatrace create secret generic dynakube --from-literal="apiToken=$API_TOKEN" --from-literal="dataIngestToken=$DATA_INGEST_TOKEN"
kubectl create secret generic dynatrace --from-literal=dynatrace_oltp_url="$DT_TENANT_URL" --from-literal=dt_api_token="$DATA_INGEST_TOKEN"
```
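For reference, a minimal sketch of what a DynaKube custom resource can look like — field names follow the public DynaKube CRD, but the CRD version varies by operator release and the actual `observability/dynatrace/dynakube.yaml` in this repo may enable different capabilities:

```yaml
# Hedged sketch only; the repo's dynakube.yaml is authoritative.
apiVersion: dynatrace.com/v1beta3   # assumption: version depends on the operator release
kind: DynaKube
metadata:
  name: dynakube
  namespace: dynatrace
spec:
  apiUrl: TENANTURL_TOREPLACE/api   # rewritten by the sed command above
  oneAgent:
    cloudNativeFullStack: {}
  activeGate:
    capabilities:
      - routing
      - kubernetes-monitoring
      - metrics-ingest
```

The operator reads the `dynakube` secret created above (keys `apiToken` and `dataIngestToken`) because its name matches the DynaKube resource's name.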
```shell
kubectl apply -f observability/otel-collector/rbac.yaml
kubectl apply -f observability/otel-collector/openTelemetry-manifest_ds.yaml
kubectl apply -f observability/otel-collector/openTelemetry-manifest_statefulset.yaml
```

Enable debug logs for troubleshooting:
```shell
# Set log level to debug
kubectl set env deployment/gpu-operator \
  -n nvidia-gpu-operator LOG_LEVEL=debug

# View logs
kubectl logs -n nvidia-gpu-operator -l app=gpu-operator -f
```

```shell
# Check GPU resources on nodes
kubectl describe node gpu-worker-0 | grep -A5 "Allocatable"

# Verify NVIDIA driver installation
kubectl exec -it -n nvidia-gpu-operator \
  daemonset/nvidia-driver-daemonset -- nvidia-smi

# Check DCGM metrics
kubectl exec -n nvidia-gpu-operator \
  deploy/nvidia-dcgm-exporter -- curl -s localhost:9400/metrics | head -50
```

| Model | VRAM (Q4) | Use Case | Performance |
|---|---|---|---|
| DeepSeek Coder 6.7B | 3.14 GB | Code generation | 15-20 tok/s |
| Llama 3.1 8B | 4 GB | Strong reasoning | 12-18 tok/s |


