A comprehensive Large Language Model (LLM) deployment featuring vLLM inference, OpenWebUI frontend, and Llama Stack for AI applications. This setup includes model validation using Sigstore for enhanced security and integrity verification.
```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│    OpenWebUI    │──────│   Llama Stack   │──────│      vLLM       │
│   (Frontend)    │      │ (Orchestrator)  │      │   (Inference)   │
│   Port: 3000    │      │   Port: 8321    │      │   Port: 8000    │
└─────────────────┘      └─────────────────┘      └─────────────────┘
         │                        │                        │
 llm.klimlive.de          Tavily Search API          IBM Granite
                                Key                 3.3-2B Model
                                               + Sigstore Validation
```
- Model: IBM Granite 3.3-2B
- Features:
  - High-performance LLM inference
  - Sigstore-based model integrity validation
  - OpenTelemetry instrumentation
  - Automatic model validation on deployment
- Storage: Persistent volume for model cache
- Security: Model validation using Sigstore transparency logs
- Purpose: Web-based chat interface for LLM interaction
- Features:
  - Modern chat UI supporting multiple conversations
  - Integration with the Llama Stack API
  - OpenTelemetry tracing enabled
  - Persistent data storage
- Access: Available at llm.klimlive.de
- Purpose: Orchestration layer and API gateway
- Features:
  - OpenAI-compatible API endpoints
  - Tavily Search API integration for web search capabilities
  - Python instrumentation for observability
  - Configurable via YAML templates
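Because Llama Stack exposes OpenAI-compatible endpoints, any OpenAI-style client can talk to it over plain HTTP. A minimal sketch of building such a request with the standard library; the in-cluster service URL and the model id are assumptions, not taken from this deployment:

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request against the Llama Stack
# service (the in-cluster DNS name and model id below are illustrative).
payload = {
    "model": "granite-3.3-2b-instruct",  # hypothetical model id
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://llamastack.llm.svc.cluster.local:8321/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# From inside the cluster you would send it with:
#   resp = urllib.request.urlopen(req)
print(req.full_url)
```

The same request works from OpenWebUI's perspective, which is why `OPENAI_API_BASE_URL` can simply point at the Llama Stack service.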
This deployment implements Sigstore-based model validation for enhanced security:
- Models are automatically validated on pod startup
- Validation is triggered by the `validation.rhtas.redhat.com/ml: "true"` label
- Init containers verify model integrity before the main workload starts
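In manifest form, the opt-in looks roughly like the fragment below. The deployment name matches this setup, but where exactly the label must sit and how the validation init container is injected depend on the operator; treat this as an illustrative sketch:

```yaml
# Illustrative: labeling the vLLM workload for Sigstore model validation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: llm
spec:
  template:
    metadata:
      labels:
        # Opt-in label watched by the model validation operator;
        # it injects an init container that verifies the model
        # before vLLM starts.
        validation.rhtas.redhat.com/ml: "true"
```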
```bash
# Restart deployment to trigger validation
kubectl rollout restart deployment vllm -n llm

# Check validation status
kubectl get pods -n llm
kubectl logs <vllm-pod-name> -c model-validation -n llm
```

A debug container is available for manual model operations:
```bash
# Sign a model
kubectl exec -it <debug-pod> -- model_signing sign sigstore /models/...

# Verify a model
kubectl exec -it <debug-pod> -- model_signing verify sigstore /models/...

# View debug container
kubectl get pod -l app=model-validation-debug -n llm
```

- vLLM:
  - `HF_TOKEN`: Hugging Face access token (from secret)
  - Model validation settings in `granite-validation.yaml`
- Llama Stack:
  - `VLLM_URL`: Connection to the vLLM service
  - `TAVILY_SEARCH_API_KEY`: Web search integration
  - `CUSTOM_OTEL_TRACE_ENDPOINT`: Observability endpoint
- OpenWebUI:
  - `OPENAI_API_BASE_URL`: Points to the Llama Stack API
  - `ENABLE_OTEL`: OpenTelemetry integration
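In the deployment manifests, these settings surface as container environment variables. An illustrative stanza for the Llama Stack container; the service DNS names, secret name, and collector endpoint are assumptions:

```yaml
# Illustrative env block for the Llama Stack container
env:
  - name: VLLM_URL
    value: http://vllm.llm.svc.cluster.local:8000
  - name: TAVILY_SEARCH_API_KEY
    valueFrom:
      secretKeyRef:
        name: tavily-credentials   # hypothetical SOPS-managed secret
        key: api-key
  - name: CUSTOM_OTEL_TRACE_ENDPOINT
    value: http://signoz-otel-collector.observability:4317
```

Sourcing the Tavily key via `secretKeyRef` keeps it out of the manifest itself, matching the SOPS-encrypted secret handling described below.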
- OpenWebUI: 3Gi persistent volume (openebs-cache)
- vLLM: 100Gi persistent volume for model storage (openebs-cache)
- Llama Stack: EmptyDir volumes for temporary storage
- External Access: HTTPRoute via Envoy Gateway at llm.klimlive.de
- Internal Communication:
  - OpenWebUI → Llama Stack (port 8321)
  - Llama Stack → vLLM (port 8000)
- Load Balancing: ClusterIP services for internal traffic
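Using the Gateway API, the external route can be sketched roughly as follows; the Gateway name and its namespace are assumptions, only the hostname and backend port come from this setup:

```yaml
# Illustrative HTTPRoute exposing OpenWebUI through Envoy Gateway
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: open-webui
  namespace: llm
spec:
  parentRefs:
    - name: envoy-gateway            # hypothetical Gateway name
      namespace: envoy-gateway-system
  hostnames:
    - llm.klimlive.de
  rules:
    - backendRefs:
        - name: open-webui           # ClusterIP service for the frontend
          port: 3000
```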
All components are instrumented with OpenTelemetry:
- Traces: Sent to SigNoz backend in observability namespace
- Metrics: ServiceMonitor for Prometheus scraping (vLLM)
- Health Checks: Kubernetes readiness and liveness probes
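The vLLM metrics scraping can be expressed with a Prometheus Operator ServiceMonitor along these lines; the label selector and port name are assumptions:

```yaml
# Illustrative ServiceMonitor for scraping vLLM metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: llm
spec:
  selector:
    matchLabels:
      app: vllm          # assumed label on the vLLM service
  endpoints:
    - port: http         # assumed port name; vLLM exposes /metrics
      path: /metrics
```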
The application is deployed via Flux CD GitOps:
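On the Flux side, the app is typically reconciled by a Kustomization resembling the sketch below; the repository path and resource names are assumptions:

```yaml
# Illustrative Flux Kustomization reconciling the llm namespace
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: llm
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps/llm       # hypothetical path in the Git repository
  prune: true
  targetNamespace: llm
```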
```bash
# Check deployment status
kubectl get pods -n llm
kubectl get svc -n llm
kubectl get httproute -n llm

# View logs
kubectl logs -f deployment/vllm -n llm
kubectl logs -f deployment/open-webui -n llm
kubectl logs -f deployment/llamastack -n llm
```

- Model Integrity: Sigstore validation ensures model authenticity
- Secret Management: SOPS-encrypted secrets for API keys
- Network Security: Internal service communication only
- Resource Limits: CPU and memory constraints on all components
- Minimal Privileges: Non-root containers where possible
- Model validation failures: Check granite-validation.yaml configuration
- vLLM startup issues: Verify GPU availability and model download
- OpenWebUI connection errors: Check Llama Stack service connectivity
- Search functionality: Verify Tavily API key configuration
```bash
# Check model validation operator
kubectl get modelvalidation -n llm

# View vLLM model loading
kubectl logs deployment/vllm -n llm

# Test API connectivity
kubectl exec -it <openwebui-pod> -- curl http://llamastack:8321/health
```

- vLLM: Requires GPU nodes for optimal performance
- Total CPU: ~3 cores
- Total Memory: ~10Gi
- Storage: ~103Gi total persistent storage
- GPU: AMD/Intel GPU support via device plugins
- vLLM Component README - Detailed vLLM configuration and model validation
- Sigstore Model Transparency - Model validation details