
Is it Observable

Is It observable Logo

Episode: Hosting Local LLMs in Kubernetes

This repository contains all configuration files and deployment manifests for running production-ready Large Language Models on Kubernetes with GPU support and full observability.

Prerequisites

  • Proxmox VE 7.x or later
  • NVIDIA Tesla T4 GPU (or compatible)
  • clusterctl CLI installed
  • A Cluster API management cluster deployed
  • kubectl and helm installed
  • Dynatrace tenant (for observability)
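
Before going further, it can help to verify the CLI tools are on your PATH; a minimal check:

```shell
# Quick prerequisite check: report any CLI tool that is missing
for tool in clusterctl kubectl helm; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```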

Clone the GitHub repo

git clone https://github.com/isItObservable/K8s-LLM
cd K8s-LLM

Getting started

Dynatrace Tenant

1. Dynatrace Tenant - start a trial

If you don't have a Dynatrace tenant, I suggest creating a trial using the following link: Dynatrace Trial. Once you have your tenant, save the Dynatrace tenant URL in the variable DT_TENANT_URL (for example: https://dedededfrf.live.dynatrace.com).

DT_TENANT_URL=<YOUR TENANT Host>
2. Create the Dynatrace API Tokens

The Dynatrace Operator requires several tokens:

  • Token to deploy and configure the various components
  • Token to ingest metrics and Traces
Operator Token

One for the operator having the following scope:

  • Create ActiveGate tokens
  • Read entities
  • Read Settings
  • Write Settings
  • Access problem and event feed, metrics and topology
  • Read configuration
  • Write configuration
  • Paas integration - installer downloader

operator token

Save the value of the token. We will store it in a Kubernetes secret later.

API_TOKEN=<YOUR TOKEN VALUE>
Ingest data token

Create a Dynatrace token with the following scope:

  • Ingest metrics (metrics.ingest)
  • Ingest logs (logs.ingest)
  • Ingest events (events.ingest)
  • Ingest OpenTelemetry
  • Read metrics

data token

Save the value of the token. We will store it in a Kubernetes secret later.

DATA_INGEST_TOKEN=<YOUR TOKEN VALUE>

Repository Structure

├── cluster-api/           # Cluster provisioning with Proxmox
│   └── capi_llm_cluster.yaml  # Cluster configuration
├── gpu-operator/          # NVIDIA GPU Operator setup
│   ├── values.yaml        # Helm values
│   └── time-slicing.yaml  # GPU sharing configuration
├── llm-serving/           # LLM deployment configurations
│   └── llm.yaml           # Ollama serving framework
└── observability/         # Monitoring and logging
    ├── otel-collector/    # OpenTelemetry Collector
    └── dynatrace/         # Dynatrace integration

Quick Start

1. Provision the Cluster

# Initialize Cluster API with Proxmox provider
clusterctl init --infrastructure proxmox

# Create the cluster
kubectl apply -f cluster-api/capi_llm_cluster.yaml

# Wait for cluster to be ready
kubectl get clusters -w

# Get kubeconfig
cp ~/.kube/config ~/.kube/config-bck
clusterctl get kubeconfig observable-llm > observable-llm.kubeconfig
export KUBECONFIG=~/.kube/config:./observable-llm.kubeconfig
kubectl config view
kubectl config view --flatten > one-config.yaml
mv one-config.yaml ~/.kube/config

Install CNI

To get the nodes ready, a CNI must be installed. Make sure kubectl points to the new cluster created by Cluster API.

kubectl create ns kube-flannel
kubectl label --overwrite ns kube-flannel pod-security.kubernetes.io/enforce=privileged

helm repo add flannel https://flannel-io.github.io/flannel/
helm install flannel --set podCidr="192.168.0.0/16" --namespace kube-flannel flannel/flannel


kubectl get configmap kube-proxy -n kube-system -o yaml | \
sed -e "s/strictARP: false/strictARP: true/" | \
kubectl apply -f - -n kube-system

kubectl rollout restart ds kube-proxy -n kube-system
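
The sed pipeline above flips a single field in the kube-proxy ConfigMap. The relevant fragment of the embedded configuration looks like this sketch (strictARP is required by load balancers such as MetalLB when kube-proxy runs in IPVS mode):

```yaml
# Fragment of the kube-proxy configuration embedded in the ConfigMap
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
ipvs:
  strictARP: true   # was false; only the node owning the IP answers ARP
```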

2. Install GPU Operator

# Add NVIDIA Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

# Install with custom values
helm install gpu-operator nvidia/gpu-operator \
  -n nvidia-gpu-operator --create-namespace \
  -f gpu-operator/values.yaml

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod \
  -l app=nvidia-driver-daemonset \
  -n nvidia-gpu-operator --timeout=300s

3. Configure GPU Time-Slicing

# Apply time-slicing configuration
kubectl apply -f gpu-operator/time-slicing.yaml

# Label GPU nodes
kubectl label node gpu-worker-0 \
  nvidia.com/device-plugin.config=time-slicing
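
For reference, a GPU Operator time-slicing ConfigMap typically follows this shape; the replica count here is an assumption, and the repo's gpu-operator/time-slicing.yaml is the source of truth. The node label above selects the `time-slicing` key of this ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing
  namespace: nvidia-gpu-operator
data:
  time-slicing: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical T4 is advertised as 4 schedulable GPUs
```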

4. Install the Gateway API, the Gateway API Inference Extension, and kgateway

# Step 1: Gateway API standard CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/standard-install.yaml

# Step 2: Inference Extension CRDs (separate repo)
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/1.3.0/manifests.yaml


helm upgrade -i --create-namespace \
  --namespace kgateway-system \
  --version v2.1.2 kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds

helm upgrade -i -n kgateway-system kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway \
  --set inferenceExtension.enabled=true \
  --version v2.1.2

Install the inference pool:

helm install ollama-pool \
  --namespace llm-inference \
  --set inferencePool.modelServers.matchLabels.app=ollama \
  --set inferencePool.modelServers.targetPort=11434 \
  --set provider.name=kgateway \
  --version v1.3.0 \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
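
To route traffic through kgateway to the pool, you also need a Gateway and an HTTPRoute whose backend references the InferencePool. A hypothetical sketch (the API group and kind of the backend reference may vary across inference-extension releases, so check the version you installed):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llm-gateway
  namespace: llm-inference
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
  namespace: llm-inference
spec:
  parentRefs:
    - name: llm-gateway
  rules:
    - backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: ollama-pool
```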

5. Deploy LLM Serving

# Deploy the Ollama serving framework
kubectl apply -f llm-serving/llm.yaml

Once the deployment is finished, let's load two models:

kubectl exec -n llm-inference deploy/ollama -- ollama pull llama3.1:8b
kubectl exec -n llm-inference deploy/ollama -- ollama pull deepseek-r1:7b

6. Set up Observability

#### Deploy the cert-manager
echo "Deploying Cert Manager (for OpenTelemetry Operator)"
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.10.0/cert-manager.yaml
# Wait for pod webhook started
kubectl wait pod -l app.kubernetes.io/component=webhook -n cert-manager --for=condition=Ready --timeout=2m
# Deploy the opentelemetry operator
sleep 10
echo "Deploying the OpenTelemetry Operator"
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

Deploy the Dynatrace Operator

helm upgrade dynatrace-operator oci://public.ecr.aws/dynatrace/dynatrace-operator \
--version 1.7.0 \
--create-namespace --namespace dynatrace \
--install \
--atomic
kubectl -n dynatrace wait pod --for=condition=ready --selector=app.kubernetes.io/name=dynatrace-operator,app.kubernetes.io/component=webhook --timeout=300s
sed -i '' "s,TENANTURL_TOREPLACE,$DT_TENANT_URL," observability/dynatrace/dynakube.yaml
kubectl -n dynatrace create secret generic dynakube --from-literal="apiToken=$API_TOKEN" --from-literal="dataIngestToken=$DATA_INGEST_TOKEN"
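
Note that `sed -i ''` is the BSD/macOS form; on GNU sed, use `sed -i` with no argument. The substitution itself can be sanity-checked on a scratch file (the file content below is hypothetical):

```shell
# Demonstrate the placeholder substitution on a scratch copy
printf 'apiUrl: TENANTURL_TOREPLACE/api\n' > /tmp/dynakube-test.yaml
DT_TENANT_URL="https://abc12345.live.dynatrace.com"
sed "s,TENANTURL_TOREPLACE,$DT_TENANT_URL," /tmp/dynakube-test.yaml
# → apiUrl: https://abc12345.live.dynatrace.com/api
```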

Deploy OpenTelemetry Collector

kubectl create secret generic dynatrace --from-literal=dynatrace_oltp_url="$DT_TENANT_URL" --from-literal=dt_api_token="$DATA_INGEST_TOKEN"
kubectl apply -f observability/otel-collector/rbac.yaml
kubectl apply -f observability/otel-collector/openTelemetry-manifest_ds.yaml
kubectl apply -f observability/otel-collector/openTelemetry-manifest_statefulset.yaml
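
The collector manifests consume the `dynatrace` secret created above. The exporter section typically looks like this sketch (the endpoint path and header follow Dynatrace's OTLP ingest API, but check the repo's manifests for the exact keys):

```yaml
# Sketch of an OpenTelemetry Collector exporter for Dynatrace OTLP ingest
exporters:
  otlphttp:
    endpoint: ${DT_TENANT_URL}/api/v2/otlp
    headers:
      Authorization: "Api-Token ${DT_API_TOKEN}"
```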

GPU Operator Logging

Enable debug logs for troubleshooting:

# Set log level to debug
kubectl set env deployment/gpu-operator \
  -n nvidia-gpu-operator LOG_LEVEL=debug

# View logs
kubectl logs -n nvidia-gpu-operator -l app=gpu-operator -f

Verification Commands

# Check GPU resources on nodes
kubectl describe node gpu-worker-0 | grep -A5 "Allocatable"

# Verify NVIDIA driver installation
kubectl exec -it -n nvidia-gpu-operator \
  daemonset/nvidia-driver-daemonset -- nvidia-smi

# Check DCGM metrics
kubectl exec -n nvidia-gpu-operator \
  deploy/nvidia-dcgm-exporter -- curl -s localhost:9400/metrics | head -50

Recommended Models for Tesla T4

| Model | VRAM (Q4) | Use Case | Performance |
|-------|-----------|----------|-------------|
| DeepSeek Coder 6.7B | 3.14 GB | Code generation | 15-20 tok/s |
| Llama 3.1 8B | 4 GB | Strong reasoning | 12-18 tok/s |
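
The Q4 VRAM figures follow a simple rule of thumb: 4-bit weights take roughly 0.5 bytes per parameter, with the KV cache and runtime overhead coming on top. For example:

```shell
# ~0.5 bytes per parameter at 4-bit quantization, weights only
awk 'BEGIN { printf "%.2f GiB\n", 6.7e9 * 0.5 / (1024 ^ 3) }'   # → 3.12 GiB
```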
