A Production-Grade Microservices Platform for Site Reliability Engineering
- Project Overview
- What Was Built
- Architecture
- Project Structure
- Technology Decisions
- Getting Started
- API Endpoints
- Observability
- Deployment
- Project Status
This repository is the application layer of a comprehensive SRE Portfolio project. It demonstrates real-world Site Reliability Engineering practices:
| Principle | Implementation |
|---|---|
| Clean Architecture | Separated concerns into cmd/, internal/, with clear boundaries |
| Observability | Structured logging (Zerolog), distributed tracing (OpenTelemetry), metrics (Prometheus) |
| Reliability | Graceful shutdown, health probes, circuit breakers, rate limiting |
| Security | Distroless containers, non-root execution, minimal attack surface |
| Infrastructure | Kubernetes-native with Helm charts, HPA, PDB |
- Infrastructure (Terraform): sre-platform-infra – provisions the GKE cluster, VPC, and Cloud DNS on GCP
- Infrastructure Setup – created a GKE Autopilot cluster using Terraform
- Networking – configured VPC, subnets, and firewall rules
- Remote State – Terraform state stored in a GCS bucket
- Clean Architecture – structured codebase following Go best practices
- Microservices – built api-service and worker-service
- Configuration – environment-based config loading with Viper
- Containerization – multi-stage Dockerfiles with distroless images
- Local Development – Docker Compose for full-stack testing
- Structured Logging – JSON logs via Zerolog with request correlation
- Distributed Tracing – OpenTelemetry integration with Jaeger
- Metrics – Prometheus endpoints with custom business metrics
- Health Probes – liveness (/healthz), readiness (/ready), debug (/debug/info)
- Helm Charts – Kubernetes deployment automation
- Rate Limiting – token bucket algorithm protecting API endpoints
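The rate limiting entry above refers to the token bucket algorithm: tokens refill at a steady rate up to a burst capacity, and each request consumes one. A minimal stdlib-only sketch of the idea follows; the project's actual middleware may be implemented differently (e.g. with golang.org/x/time/rate), so treat this as illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// TokenBucket admits a request only when a token is available; tokens
// refill at rps per second up to a burst capacity. This mirrors the
// RATE_LIMIT_RPS / RATE_LIMIT_BURST settings described later.
type TokenBucket struct {
	tokens, burst, rps float64
	last               time.Time
}

func NewTokenBucket(rps, burst float64) *TokenBucket {
	return &TokenBucket{tokens: burst, burst: burst, rps: rps, last: time.Now()}
}

func (b *TokenBucket) Allow() bool {
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rps // refill for elapsed time
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	bucket := NewTokenBucket(1, 3) // 1 request/sec, burst of 3
	for i := 1; i <= 5; i++ {
		fmt.Printf("request %d allowed: %v\n", i, bucket.Allow())
	}
}
```

With a burst of 3 and immediate back-to-back calls, the first three requests pass and the rest are rejected until tokens refill.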
┌──────────────────────────────────────────────────────────────────┐
│                       KUBERNETES CLUSTER                         │
│                                                                  │
│  ┌─────────────────┐    ┌─────────────────┐   ┌──────────────┐   │
│  │   api-service   │───▶│      Redis      │◀──│    worker    │   │
│  │   (Gin HTTP)    │    │     (Queue)     │   │  (Consumer)  │   │
│  └────────┬────────┘    └─────────────────┘   └──────────────┘   │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │     Jaeger      │ ◀── OpenTelemetry Traces                    │
│  │  (Tracing UI)   │                                             │
│  └─────────────────┘                                             │
└──────────────────────────────────────────────────────────────────┘
sequenceDiagram
participant User
participant API as api-service
participant Redis
participant Worker as worker-service
User->>API: POST /jobs {"payload": "data"}
API->>API: Validate + Generate Request ID
API->>Redis: LPUSH job (with trace context)
API-->>User: 202 Accepted {job_id}
Worker->>Redis: BRPOP (blocking pop)
Redis-->>Worker: Job data
Worker->>Worker: Process job
Worker->>Worker: Log with correlated Request ID
sre-platform-app/
├── cmd/                        # Application entrypoints
│   ├── api-service/            # HTTP API server
│   │   └── main.go             # Bootstraps server, middleware, graceful shutdown
│   ├── worker-service/         # Background job processor
│   │   └── main.go             # Consumes Redis queue, processes jobs
│   └── platform-healthcheck/   # Lightweight healthcheck binary
│       └── main.go             # Used in Dockerfile HEALTHCHECK
│
├── internal/                   # Private application code
│   ├── api/                    # HTTP handlers and middleware
│   │   ├── server.go           # Route definitions (/healthz, /ready, /metrics, etc.)
│   │   └── middleware.go       # RequestID, RateLimit, Metrics, Logger middleware
│   ├── config/                 # Configuration loading
│   │   └── config.go           # Viper-based env/flag config
│   ├── logger/                 # Structured logging setup
│   │   └── logger.go           # Zerolog initialization
│   ├── metadata/               # Build information
│   │   └── metadata.go         # Version, CommitSHA, BuildTime (injected at build)
│   ├── queue/                  # Redis queue abstraction
│   │   └── producer.go         # Job enqueueing with circuit breaker
│   ├── telemetry/              # Observability setup
│   │   └── tracing.go          # OpenTelemetry tracer initialization
│   └── worker/                 # Job processing logic
│       └── consumer.go         # Redis consumer with graceful shutdown
│
├── argocd-app.yaml             # ArgoCD Application manifest (GitOps)
├── charts/                     # Helm charts for Kubernetes deployment
│   └── sre-platform/
│       ├── Chart.yaml
│       ├── values.yaml
│       └── templates/
│           ├── api-deployment.yaml
│           ├── api-service.yaml
│           ├── api-hpa.yaml
│           ├── worker-deployment.yaml
│           ├── worker-hpa.yaml
│           ├── pdb.yaml
│           └── redis.yaml
│
├── k8s_legacy/                 # Legacy raw Kubernetes manifests (deprecated)
├── Dockerfile                  # Multi-stage build for both services
├── docker-compose.yaml         # Local development stack
├── go.mod / go.sum             # Go module dependencies
└── SRE.txt                     # Master project plan (6 phases)
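The tree notes that internal/metadata's Version, CommitSHA, and BuildTime are "injected at build". That pattern relies on the Go linker's -X flag overwriting package-level string variables. A minimal sketch follows; the variable names match the project's description, but the package path here (main) is simplified for illustration.

```go
package main

import "fmt"

// Defaults match the /version output of a local "dev" build. A CI pipeline
// would override them at link time, e.g.:
//
//	go build -ldflags "-X main.Version=v1.2.0 -X main.CommitSHA=$(git rev-parse --short HEAD)"
//
// The real project injects into its internal/metadata package, not main.
var (
	Version   = "dev"
	CommitSHA = "none"
	BuildTime = "unknown"
)

func main() {
	fmt.Printf(`{"version":%q,"commit_sha":%q,"build_time":%q}`+"\n",
		Version, CommitSHA, BuildTime)
}
```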
| Directory | Purpose | SRE Benefit |
|---|---|---|
| cmd/ | Thin entrypoints only | Easy-to-understand startup sequence |
| internal/ | Business logic hidden | Prevents accidental external imports |
| internal/api/ | HTTP layer isolated | Can test handlers without a full server |
| internal/queue/ | Queue abstraction | Can swap Redis for SQS/Kafka later |
| charts/ | Helm-based deployment | Reproducible, parameterized releases |
Go:
- Performance: compiled, statically typed, low memory footprint
- Concurrency: goroutines for handling thousands of connections
- Small Binaries: ~10MB final image size
- Cloud Native: first-class Kubernetes, Prometheus, and OTel support

Gin (HTTP framework):
- Fast: one of the fastest Go HTTP routers
- Middleware Ecosystem: easy to add logging, tracing, auth
- Production Proven: used by companies like Grab and Riot Games

Zerolog (logging):
- Zero Allocation: one of the fastest structured loggers for Go
- JSON Output: machine-parseable for log aggregation
- Context Integration: easy request ID propagation

OpenTelemetry (tracing):
- Vendor Neutral: export to Jaeger, Zipkin, Google Cloud Trace, etc.
- Industry Standard: CNCF project that superseded OpenTracing/OpenCensus
- Auto-instrumentation: middleware for Gin included

Distroless (base image):
- Security: no shell, no package manager, minimal attack surface
- Size: ~3MB base vs ~5MB Alpine vs ~100MB Debian
- Fewer CVEs: almost no OS packages to patch
- Go 1.21+
- Docker & Docker Compose
- kubectl (for Kubernetes deployment)
- helm (for Helm deployment)
# Clone the repository
git clone https://github.com/Sanjeevliv/sre-platform-app.git
cd sre-platform-app
# Start the full stack (API, Worker, Redis, Jaeger)
docker-compose up --build
# In another terminal, test the API
curl http://localhost:8080/healthz
# Output: ok
curl http://localhost:8080/version
# Output: {"version":"dev","commit_sha":"none","build_time":"unknown","go_version":"go1.25"}
# Submit a job
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{"payload": "Hello SRE World"}'
# Output: {"job_id":"uuid-here","status":"queued"}
# View traces
open http://localhost:16686   # Jaeger UI

| Variable | Default | Description |
|---|---|---|
| API_PORT | 8080 | HTTP server port |
| REDIS_ADDR | localhost:6379 | Redis connection string |
| GIN_MODE | debug | Gin mode (debug/release) |
| OTEL_EXPORTER_OTLP_ENDPOINT | localhost:4318 | OpenTelemetry collector endpoint |
| RATE_LIMIT_RPS | 100 | Requests-per-second limit |
| RATE_LIMIT_BURST | 200 | Burst capacity |
| Endpoint | Method | Purpose | Response |
|---|---|---|---|
| / | GET | Root handler | SRE Platform API Service |
| /healthz | GET | Liveness probe | ok |
| /ready | GET | Readiness probe | ready |
| /version | GET | Build metadata | {"version":"...","commit_sha":"..."} |
| /debug/info | GET | Runtime diagnostics | {"goroutines":5,"memory_alloc":...} |
| /metrics | GET | Prometheus metrics | Prometheus text format |
| /jobs | POST | Submit background job | {"job_id":"...","status":"queued"} |
# Kubernetes uses these probes:
livenessProbe:
  httpGet:
    path: /healthz   # "Am I alive?" - restart the container if this fails
readinessProbe:
  httpGet:
    path: /ready     # "Can I serve traffic?" - remove from LB if this fails

{
"level": "info",
"request_id": "abc-123",
"method": "POST",
"path": "/jobs",
"status": 202,
"latency_ms": 15,
"message": "request completed"
}

- Every request gets a trace ID
- Spans created for HTTP handlers, Redis operations
- View in Jaeger UI at http://localhost:16686
# Custom business metrics
http_requests_total{method="POST",path="/jobs",status="202"} 150
http_request_duration_seconds_bucket{le="0.1"} 145
- Service Level Indicators (SLIs): Defined via Prometheus recording rules (Availability, Latency).
- Service Level Objectives (SLOs): 99.9% Availability, <300ms Latency (p99).
- Alerting: Multi-window burn rate alerts to protect the Error Budget.
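A multi-window burn-rate alert of the kind described above pages only when the error budget is being consumed quickly over both a long and a short window, which filters out brief blips. The rule below is a sketch for a 99.9% availability SLO (0.1% error budget); the metric name matches the Prometheus example shown earlier, but the rule names and thresholds are assumptions, not taken from this repo.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Page when the error ratio burns the budget at 14.4x over both
      # the 1h and 5m windows (exhausts a 30-day budget in ~2 days).
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
```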
All three pillars share the same request_id:
- Log: "request_id": "abc-123"
- Trace: trace_id in Jaeger
- Metric labels: (future: exemplars)
We use ArgoCD for continuous deployment. The cluster state automatically syncs with the charts/sre-platform directory in this repository.
- Merge a Pull Request to main.
- ArgoCD detects the change (new commit hash).
- The cluster is automatically synced to the new state (self-healing enabled).
# From project root
helm upgrade --install sre-platform ./charts/sre-platform \
--set api.image.repository=... \
--set api.image.tag=latest

# Or run the full stack locally:
docker-compose up --build

- Clean Architecture (/cmd, /internal)
- Gin HTTP framework with middleware stack
- Graceful shutdown with context cancellation
- Configuration via environment variables (Viper)
- Multi-stage Dockerfile with distroless base
- Docker Compose for local development
- Zerolog structured JSON logging
- OpenTelemetry distributed tracing
- Prometheus metrics endpoint
- Health probes (/healthz, /ready, /version, /debug/info)
- Request ID middleware for log correlation
- Rate limiting middleware
- Helm charts with HPA
- Inject trace_id into all logs
- Define SLIs/SLOs in documentation
- Create Grafana dashboards
- GitOps Deployment (ArgoCD)
- GitHub Actions CI (Test & Build)
- cert-manager for automatic HTTPS
- Network Policies (deny-all default)
- External Secrets Operator
- Chaos engineering endpoints
- Load testing with k6
- Portfolio website at sanjeevsethi.in
MIT License - See LICENSE for details.