victoriacheng15/observability-hub

Observability Hub

Observability Hub is a self-hosted platform engineering lab built with Kubernetes, GitOps, Terraform, OpenTelemetry, Prometheus, Grafana, Cilium/eBPF, PostgreSQL, and Go services.

It demonstrates an end-to-end platform ownership loop: declarative infrastructure runs host and cluster services, telemetry exposes their behavior, operators and agents diagnose issues, bounded remediation applies fixes, and ADRs and RCAs preserve operational memory.

Project Portal | Full Documentation


Case Studies

| Case Study | Problem | How it was diagnosed | Result |
| --- | --- | --- | --- |
| Rust Telemetry Summarization Processor | Raw logs and metrics returned too much data for agent workflows | Added a Rust obs-processor and validated the language choice with ADR 023 benchmark evidence | Reduced token load while preserving investigation pivots |
| Worker Ingestion Blocked from MongoDB Atlas | Scheduled ingestion could not reach Atlas | Used worker logs and Cilium policy review to identify blocked egress | Added Atlas egress policy and documented prevention |
| Loki Gateway DNS Timeout | Grafana and agents could not reliably query logs | Traced the request path through gateway DNS resolution and Loki service routing | Fixed resolver config and added operational checks |
| SSH Lockout via Cilium IPAM Collision | Host access failed after networking drift | Correlated Cilium/IPAM state, pod readiness, and host reachability | Restored access and documented recovery path |
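
The first case study describes reducing token load while keeping investigation pivots. The project's obs-processor is written in Rust and its internals are not shown here, so the following is only a minimal Go sketch of the general idea, assuming a simplified log record. The `LogLine`, `Group`, and `Summarize` names are hypothetical and do not come from the repository. Lines that share a level and a digit-normalized message are collapsed into one group, and a few trace IDs are kept per group as pivots.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// LogLine is a hypothetical, simplified log record (not the project's schema).
type LogLine struct {
	Level   string
	Message string
	TraceID string
}

// Group is one summarized bucket of similar log lines.
type Group struct {
	Level    string
	Pattern  string
	Count    int
	TraceIDs []string // a few sample trace IDs kept as investigation pivots
}

var digits = regexp.MustCompile(`\d+`)

// Summarize collapses lines that share a level and a normalized message,
// keeping at most maxPivots trace IDs per group.
func Summarize(lines []LogLine, maxPivots int) []Group {
	index := map[string]*Group{}
	for _, l := range lines {
		level := strings.ToLower(l.Level)
		// Replace every digit run so "retry 1" and "retry 2" share a pattern.
		pattern := digits.ReplaceAllString(l.Message, "<n>")
		key := level + "|" + pattern
		g, ok := index[key]
		if !ok {
			g = &Group{Level: level, Pattern: pattern}
			index[key] = g
		}
		g.Count++
		if l.TraceID != "" && len(g.TraceIDs) < maxPivots {
			g.TraceIDs = append(g.TraceIDs, l.TraceID)
		}
	}
	out := make([]Group, 0, len(index))
	for _, g := range index {
		out = append(out, *g)
	}
	// Largest groups first; ties are broken by level and pattern so the
	// order does not depend on map iteration.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Level+out[i].Pattern < out[j].Level+out[j].Pattern
	})
	return out
}

func main() {
	lines := []LogLine{
		{Level: "error", Message: "retry 1 failed", TraceID: "trace-a"},
		{Level: "error", Message: "retry 2 failed", TraceID: "trace-b"},
		{Level: "info", Message: "worker started"},
	}
	for _, g := range Summarize(lines, 2) {
		fmt.Printf("%s x%d %q pivots=%v\n", g.Level, g.Count, g.Pattern, g.TraceIDs)
	}
}
```

Capping the pivots per group bounds the summary size, while the retained trace IDs still let an operator or agent jump from a group back to the raw telemetry.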

Architecture

The main system flow starts from declarative source, runs through host and cluster runtimes, emits telemetry, drives diagnosis, and feeds remediations and lessons back into source control.

| Path | Use case | Flow |
| --- | --- | --- |
| Platform reconciliation | Keep host and cluster state aligned with Git | Git/Terraform/Kustomize/systemd -> Argo CD/Proxy -> Kubernetes/systemd runtime |
| Telemetry pipeline | Capture behavior across services and infrastructure | Go services/Kubernetes/Cilium -> OpenTelemetry/Prometheus/Loki/Tempo/Hubble -> Grafana/MCP |
| Agent diagnosis | Let operators query and repair live systems through bounded tools | MCP Hub -> telemetry/pod/network providers -> diagnosis or controlled remediation |
| Batch analytics | Convert runtime metrics and ingestion inputs into stored operational insight | Worker CronJobs -> Prometheus/Postgres/OpenBao -> analytics and ingestion records |
| Operational memory | Preserve the reasoning behind decisions and failures | Workflows/incidents -> ADRs/RCAs/notes -> future source changes |

flowchart TB
    Source["Source of Truth<br/>Git, Terraform, Kustomize"]
    Runtime["Runtime<br/>Kubernetes, host services, databases"]
    Signals["Signals<br/>OTel, Prometheus"]
    Decisions["Decisions<br/>Grafana, MCP tools, workflows"]
    Actions["Actions<br/>GitOps sync, pod repair, service restart"]
    Memory["Memory<br/>ADRs, RCAs, notes, workflows"]

    Source --> Runtime
    Runtime --> Signals
    Signals --> Decisions
    Decisions --> Actions
    Actions --> Source
    Decisions --> Memory
    Memory --> Source

Tech Stack

| Layer | Tools |
| --- | --- |
| Language | Go, Rust |
| Infrastructure | Kubernetes, Terraform, Helm, Docker, Argo CD |
| Data stores | PostgreSQL, Azure Blob Storage |
| Observability | OpenTelemetry, Prometheus, Grafana, Cilium |
| Security | Trivy, Tailscale |
| Testing | Go testing package, table-driven tests |
| CI/CD | GitHub Actions, Argo CD |

Documentation


Local Setup

cp .env.example .env
make web-build
make proxy-build
make mcp-build

Run checks:

make test
make lint
make lint-configs

Plan infrastructure:

cd tofu
tofu init
tofu plan
