See also the documentation index and ARCHITECTURE.md.
This document outlines the steps required to deploy ChaosLabs on a Kubernetes cluster. It covers setting up namespaces, deploying the controller, agent, and dashboard components, as well as verifying and troubleshooting your deployment.
- Prerequisites
- Overview of Components
- Deployment Instructions
- Exposing Services
- Verification & Troubleshooting
- Observability & Scaling
- Cleanup
- Additional Resources
- A working Kubernetes cluster (local or cloud-based, e.g., Minikube, Docker Desktop Kubernetes, GKE, EKS, etc.)
- kubectl installed and configured for your cluster.
- Access to your container images. Ensure that your images (for the controller, agent, and dashboard) are pushed to a registry accessible by your cluster.
- (Optional) Prometheus, Grafana, and an OTLP-compatible collector or tracing backend (OpenTelemetry Collector, Jaeger v2, Tempo, etc.).
Controller:
- Receives experiment requests via HTTP endpoints.
- Schedules experiments (immediately or at a future time) and dispatches them to one or more agents.
- Exposes Prometheus metrics and OpenTelemetry traces via OTLP/HTTP (OTEL_EXPORTER_OTLP_ENDPOINT on the Deployment).
Agent:
- Listens for fault injection commands on its /inject endpoint.
- Implements various fault injection techniques (network latency/loss via tc, CPU/memory stress using stress-ng, process kill).
- Exposes Prometheus metrics and OTLP traces (same env vars as the controller).
Dashboard:
- Provides a web interface for monitoring experiments in real time.
- Visualizes metrics using Grafana.
It is good practice to deploy ChaosLabs in its own namespace:
kubectl create namespace chaoslab
Apply the controller deployment manifest:
kubectl apply -f infrastructure/k8s/controller-deployment.yaml -n chaoslab
This YAML file contains the necessary configuration and Prometheus annotations (e.g., prometheus.io/scrape: "true") to allow for metrics scraping.
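To confirm that the scrape annotations mentioned above are actually present on the running Deployment, a small helper along these lines can print them. This is a minimal sketch: the function name show_scrape_annotations is hypothetical, and it assumes the annotations sit on the pod template rather than on the Deployment object itself.

```shell
# Print the pod-template annotations of a Deployment so the
# prometheus.io/* entries can be inspected by eye.
# Assumption: the annotations are set under spec.template.metadata.
show_scrape_annotations() {
    deployment="$1"
    namespace="${2:-chaoslab}"
    kubectl get deployment "$deployment" -n "$namespace" \
        -o jsonpath='{.spec.template.metadata.annotations}'
}
```

For example, `show_scrape_annotations chaos-controller` prints the annotation map for the controller in the chaoslab namespace.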
Apply the agent deployment manifest:
kubectl apply -f infrastructure/k8s/agent-deployment.yaml -n chaoslab
The agent is configured to run with multiple replicas (for horizontal scaling) and includes the privileges needed for fault injection commands.
Apply the dashboard deployment manifest:
kubectl apply -f infrastructure/k8s/dashboard-deployment.yaml -n chaoslab
This deploys the web UI used to monitor experiments in real time.
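The three apply steps above can be combined into one script that also waits for each rollout to finish. The sketch below is illustrative only: deploy_chaoslabs is a hypothetical helper, the 120s timeout is arbitrary, and the Deployment name chaos-agent is an assumption (only chaos-controller and chaos-dashboard appear in this guide).

```shell
#!/bin/sh
# Apply the controller, agent, and dashboard manifests, then wait for rollouts.
NAMESPACE="${NAMESPACE:-chaoslab}"
MANIFEST_DIR="${MANIFEST_DIR:-infrastructure/k8s}"

deploy_chaoslabs() {
    for component in controller agent dashboard; do
        # Manifest paths follow the steps described in this guide.
        kubectl apply -f "${MANIFEST_DIR}/${component}-deployment.yaml" \
            -n "$NAMESPACE" || return 1
    done
    for component in controller agent dashboard; do
        # Assumption: Deployments are named chaos-<component>.
        kubectl rollout status "deployment/chaos-${component}" \
            -n "$NAMESPACE" --timeout=120s || return 1
    done
}
```

Calling `deploy_chaoslabs` after creating the namespace stops at the first failing apply or rollout, so a non-zero exit status points at the component that needs attention.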
To simulate scaling under load, you can deploy a Horizontal Pod Autoscaler (HPA) for the agent:
kubectl apply -f infrastructure/k8s/agent-hpa.yaml -n chaoslab
This resource will automatically adjust the number of agent replicas based on CPU utilization (or other configured metrics).
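For readers who want to see what a CPU-based HPA of this kind typically looks like, the sketch below prints a minimal autoscaling/v2 manifest. It is not the content of agent-hpa.yaml: the helper name render_agent_hpa, the Deployment name chaos-agent, the replica bounds, and the 70% CPU target are all illustrative assumptions.

```shell
# Print a minimal autoscaling/v2 HPA manifest targeting a Deployment.
# Arguments (all optional, all illustrative defaults):
#   name, minReplicas, maxReplicas, target CPU utilization in percent
render_agent_hpa() {
    name="${1:-chaos-agent}"
    min="${2:-2}"
    max="${3:-10}"
    cpu="${4:-70}"
    cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ${name}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${name}
  minReplicas: ${min}
  maxReplicas: ${max}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: ${cpu}
EOF
}
```

The output can be piped to `kubectl apply -n chaoslab -f -`. Utilization-based scaling only works when the target containers declare CPU requests and the cluster serves resource metrics (for example through metrics-server).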
For local testing, you can reach the services from your workstation using kubectl port-forward:
- Controller:
kubectl port-forward deployment/chaos-controller 8080:8080 -n chaoslab
- Dashboard:
kubectl port-forward deployment/chaos-dashboard 5000:5000 -n chaoslab
Alternatively, you can create Kubernetes Service objects (LoadBalancer or NodePort) if you need external access.
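One way to create such a Service without writing a manifest is kubectl expose. The helper below is a sketch; expose_component is a hypothetical name, and it assumes the container port matches the port used in the port-forward commands above.

```shell
# Create a Service in front of a Deployment.
# Usage: expose_component <deployment> <port> [service-type]
# service-type defaults to NodePort; LoadBalancer is the other common choice.
expose_component() {
    deployment="$1"
    port="$2"
    service_type="${3:-NodePort}"
    kubectl expose deployment "$deployment" \
        --type="$service_type" --port="$port" --target-port="$port" \
        -n "${NAMESPACE:-chaoslab}"
}
```

For example, `expose_component chaos-controller 8080` creates a NodePort Service, and `expose_component chaos-dashboard 5000 LoadBalancer` requests a load balancer where the cluster supports one.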
Check the status of your deployments:
kubectl get deployments -n chaoslab
kubectl get pods -n chaoslab
Ensure all pods are in the Running state.
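Instead of polling by hand, the readiness check can be scripted with kubectl wait. This is a minimal sketch; wait_for_pods is a hypothetical helper and the 120s default timeout is an arbitrary choice.

```shell
# Block until every pod in the namespace reports the Ready condition,
# or print the pod list and fail once the timeout expires.
wait_for_pods() {
    namespace="${1:-chaoslab}"
    timeout="${2:-120s}"
    if ! kubectl wait --for=condition=Ready pods --all \
            -n "$namespace" --timeout="$timeout"; then
        # Show current pod states to help locate the failing pod.
        kubectl get pods -n "$namespace"
        return 1
    fi
}
```

Note that Ready is stricter than Running: a pod can be Running while its readiness probe is still failing, so this check may flag pods that the plain status listing would not.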
If any pod is not running as expected, check its logs:
kubectl logs <pod-name> -n chaoslab
For example, to check the controller logs:
kubectl logs deployment/chaos-controller -n chaoslab
Access the /metrics endpoints for the controller and agent (via port-forward or Service) to ensure Prometheus metrics are exposed:
curl http://localhost:8080/metrics
curl http://localhost:9090/metrics
- Image Pull Errors: Ensure your images are correctly tagged and accessible from your container registry.
- Fault Injection Failures: Verify that the agent pods are running in privileged mode if required (check your deployment YAML).
- Service Connectivity: Confirm that the controller can reach the agent endpoints (verify the AGENT_ENDPOINTS environment variable in the controller).
For more detailed troubleshooting, refer to the TROUBLESHOOTING.md document.
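The /metrics check described earlier in this section can be turned into a pass/fail test. The sketch below assumes the endpoints return the Prometheus text exposition format, which normally includes "# HELP" or "# TYPE" comment lines; both function names are hypothetical.

```shell
# Succeed when stdin contains at least one Prometheus "# HELP" or "# TYPE" line.
looks_like_prometheus() {
    grep -E '^# (HELP|TYPE) ' >/dev/null
}

# Fetch a URL and report whether the body looks like Prometheus metrics.
check_metrics() {
    url="$1"
    if curl -fsS --max-time 5 "$url" | looks_like_prometheus; then
        echo "OK   $url"
    else
        echo "FAIL $url"
        return 1
    fi
}
```

With the port-forwards from this guide active, `check_metrics http://localhost:8080/metrics` and `check_metrics http://localhost:9090/metrics` cover the two endpoints used in the curl commands above.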
- Prometheus & Grafana: With the annotations in place, Prometheus should automatically scrape metrics from the controller and agent. Grafana dashboards (provided in the repository) can be imported to visualize these metrics.
- Distributed tracing: Both components export traces with OTLP/HTTP. Point OTEL_EXPORTER_OTLP_ENDPOINT at your OpenTelemetry Collector or compatible backend (see sample env in infrastructure/k8s/*-deployment.yaml).
- Scaling: The Horizontal Pod Autoscaler for the agent will help you test scalability under load. Monitor scaling behavior using:
kubectl get hpa -n chaoslab
To remove all ChaosLabs resources from your cluster:
kubectl delete namespace chaoslab
- Kubernetes Official Documentation
- Prometheus Documentation
- Grafana Documentation
- OpenTelemetry Collector
For questions or issues, open a ticket on github.com/fraware/chaoslabs/issues.