Observability Hub is a self-hosted platform engineering lab built with Kubernetes, GitOps, Terraform, OpenTelemetry, Prometheus, Grafana, Cilium/eBPF, PostgreSQL, and Go services.
It demonstrates an end-to-end platform ownership loop. Declarative infrastructure runs host and cluster services, telemetry exposes their behavior, operators and agents diagnose issues, bounded remediation applies fixes, and ADRs and RCAs preserve operational memory.
Project Portal | Full Documentation
| Case Study | Problem | How it was diagnosed | Result |
|---|---|---|---|
| Rust Telemetry Summarization Processor | Raw logs and metrics returned too much data for agent workflows | Added a Rust obs-processor and validated the language choice with ADR 023 benchmark evidence | Reduced token load while preserving investigation pivots |
| Worker Ingestion Blocked from MongoDB Atlas | Scheduled ingestion could not reach Atlas | Used worker logs and Cilium policy review to identify blocked egress | Added Atlas egress policy and documented prevention |
| Loki Gateway DNS Timeout | Grafana and agents could not reliably query logs | Traced the request path through gateway DNS resolution and Loki service routing | Fixed resolver config and added operational checks |
| SSH Lockout via Cilium IPAM Collision | Host access failed after networking drift | Correlated Cilium/IPAM state, pod readiness, and host reachability | Restored access and documented recovery path |
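The idea behind the first case study (shrinking raw telemetry while keeping a way back to the detail) can be illustrated with a minimal sketch. This is not the project's Rust obs-processor; it is written in Go for illustration only, and the `LogLine` and `Summary` types and the grouping by service and level are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// LogLine is a hypothetical, simplified log record.
type LogLine struct {
	Service string
	Level   string
	Message string
}

// Summary is one group of similar lines with a count and a sample message.
type Summary struct {
	Service string
	Level   string
	Count   int
	Sample  string
}

// Summarize collapses lines that share a service and level into one entry.
// The first message of each group is kept as a sample so an investigator
// still has a pivot back into the raw logs.
func Summarize(lines []LogLine) []Summary {
	index := map[string]int{} // group key -> position in out
	var out []Summary
	for _, l := range lines {
		level := strings.ToLower(l.Level)
		key := l.Service + "|" + level
		if i, ok := index[key]; ok {
			out[i].Count++
			continue
		}
		index[key] = len(out)
		out = append(out, Summary{Service: l.Service, Level: level, Count: 1, Sample: l.Message})
	}
	// Most frequent groups first; ties are ordered by service, then level.
	sort.Slice(out, func(a, b int) bool {
		if out[a].Count != out[b].Count {
			return out[a].Count > out[b].Count
		}
		if out[a].Service != out[b].Service {
			return out[a].Service < out[b].Service
		}
		return out[a].Level < out[b].Level
	})
	return out
}

func main() {
	lines := []LogLine{
		{Service: "worker", Level: "ERROR", Message: "dial timeout"},
		{Service: "worker", Level: "error", Message: "dial timeout again"},
		{Service: "proxy", Level: "info", Message: "ok"},
	}
	for _, s := range Summarize(lines) {
		fmt.Printf("%s %s x%d sample=%q\n", s.Service, s.Level, s.Count, s.Sample)
	}
}
```

An agent that receives one line per group instead of every raw line spends fewer tokens, while the service, level, and sample message still identify where to look next.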
The main system flow starts from declarative source, runs through host and cluster runtimes, emits telemetry, drives diagnosis, and feeds remediations and lessons back into source control.
| Path | Use case | Flow |
|---|---|---|
| Platform reconciliation | Keep host and cluster state aligned with Git | Git/Terraform/Kustomize/systemd -> Argo CD/Proxy -> Kubernetes/systemd runtime |
| Telemetry pipeline | Capture behavior across services and infrastructure | Go services/Kubernetes/Cilium -> OpenTelemetry/Prometheus/Loki/Tempo/Hubble -> Grafana/MCP |
| Agent diagnosis | Let operators query and repair live systems through bounded tools | MCP Hub -> telemetry/pod/network providers -> diagnosis or controlled remediation |
| Batch analytics | Convert runtime metrics and ingestion inputs into stored operational insight | Worker CronJobs -> Prometheus/Postgres/OpenBao -> analytics and ingestion records |
| Operational memory | Preserve the reasoning behind decisions and failures | Workflows/incidents -> ADRs/RCAs/notes -> future source changes |
```mermaid
flowchart TB
    Source["Source of Truth<br/>Git, Terraform, Kustomize"]
    Runtime["Runtime<br/>Kubernetes, host services, databases"]
    Signals["Signals<br/>OTel, Prometheus"]
    Decisions["Decisions<br/>Grafana, MCP tools, workflows"]
    Actions["Actions<br/>GitOps sync, pod repair, service restart"]
    Memory["Memory<br/>ADRs, RCAs, notes, workflows"]

    Source --> Runtime
    Runtime --> Signals
    Signals --> Decisions
    Decisions --> Actions
    Actions --> Source
    Decisions --> Memory
    Memory --> Source
```
| Layer | Tools |
|---|---|
| Language | Go, Rust |
| Infrastructure | Kubernetes, Terraform, Helm, Docker, Argo CD |
| Data stores | PostgreSQL, Azure Blob Storage |
| Observability | OpenTelemetry, Prometheus, Grafana, Cilium |
| Security | Trivy, Tailscale |
| Testing | Go testing package, table-driven tests |
| CI/CD | GitHub Actions, Argo CD |
- Architecture
- Ownership Model
- Deployment
- Observability
- Security
- Decisions
- Incidents
- Operations and CI/CD
```bash
cp .env.example .env
make web-build
make proxy-build
make mcp-build
```

Run checks:

```bash
make test
make lint
make lint-configs
```

Plan infrastructure:

```bash
cd tofu
tofu init
tofu plan
```