SRE Platform Application

A Production-Grade Microservices Platform for Site Reliability Engineering

Go Kubernetes OpenTelemetry Prometheus ArgoCD


🎯 Project Overview

This repository is the application layer of a comprehensive SRE Portfolio project. It demonstrates real-world Site Reliability Engineering practices:

| Principle | Implementation |
| --- | --- |
| Clean Architecture | Separated concerns into cmd/, internal/, with clear boundaries |
| Observability | Structured logging (Zerolog), distributed tracing (OpenTelemetry), metrics (Prometheus) |
| Reliability | Graceful shutdown, health probes, circuit breakers, rate limiting |
| Security | Distroless containers, non-root execution, minimal attack surface |
| Infrastructure | Kubernetes-native with Helm charts, HPA, PDB |

Related Repository

  • Infrastructure (Terraform): sre-platform-infra, which provisions the GKE cluster, VPC, and Cloud DNS on GCP

🔨 What Was Built

The Journey (In Sequence)

Phase 1: Foundation ✅

  1. Infrastructure Setup: Created GKE Autopilot cluster using Terraform
  2. Networking: Configured VPC, subnets, and firewall rules
  3. Remote State: Terraform state stored in GCS bucket

Phase 2: Application Development ✅

  1. Clean Architecture: Structured codebase following Go best practices
  2. Microservices: Built api-service and worker-service
  3. Configuration: Environment-based config loading with Viper
  4. Containerization: Multi-stage Dockerfiles with distroless images
  5. Local Development: Docker Compose for full-stack testing

Phase 3: Observability 🟡 (Partial)

  1. Structured Logging: JSON logs via Zerolog with request correlation
  2. Distributed Tracing: OpenTelemetry integration with Jaeger
  3. Metrics: Prometheus endpoints with custom business metrics
  4. Health Probes: Liveness (/healthz), Readiness (/ready), Debug (/debug/info)

Phase 4: Production Hardening 🟡 (In Progress)

  1. Helm Charts: Kubernetes deployment automation
  2. Rate Limiting: Token bucket algorithm protecting API endpoints

πŸ—οΈ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        KUBERNETES CLUSTER                       │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────┐  │
│  │   api-service   │───▶│      Redis      │◀───│   worker    │  │
│  │    (Gin HTTP)   │    │    (Queue)      │    │  (Consumer) │  │
│  └────────┬────────┘    └─────────────────┘    └─────────────┘  │
│           │                                                     │
│           ▼                                                     │
│  ┌─────────────────┐                                            │
│  │     Jaeger      │  ◀── OpenTelemetry Traces                  │
│  │   (Tracing UI)  │                                            │
│  └─────────────────┘                                            │
└─────────────────────────────────────────────────────────────────┘

Data Flow

sequenceDiagram
    participant User
    participant API as api-service
    participant Redis
    participant Worker as worker-service

    User->>API: POST /jobs {"payload": "data"}
    API->>API: Validate + Generate Request ID
    API->>Redis: LPUSH job (with trace context)
    API-->>User: 202 Accepted {job_id}
    
    Worker->>Redis: BRPOP (blocking pop)
    Redis-->>Worker: Job data
    Worker->>Worker: Process job
    Worker->>Worker: Log with correlated Request ID

📁 Project Structure

sre-platform-app/
├── cmd/                          # Application entrypoints
│   ├── api-service/              # HTTP API server
│   │   └── main.go               # Bootstraps server, middleware, graceful shutdown
│   ├── worker-service/           # Background job processor
│   │   └── main.go               # Consumes Redis queue, processes jobs
│   └── platform-healthcheck/     # Lightweight healthcheck binary
│       └── main.go               # Used in Dockerfile HEALTHCHECK
│
├── internal/                     # Private application code
│   ├── api/                      # HTTP handlers and middleware
│   │   ├── server.go             # Route definitions (/healthz, /ready, /metrics, etc.)
│   │   └── middleware.go         # RequestID, RateLimit, Metrics, Logger middleware
│   ├── config/                   # Configuration loading
│   │   └── config.go             # Viper-based env/flag config
│   ├── logger/                   # Structured logging setup
│   │   └── logger.go             # Zerolog initialization
│   ├── metadata/                 # Build information
│   │   └── metadata.go           # Version, CommitSHA, BuildTime (injected at build)
│   ├── queue/                    # Redis queue abstraction
│   │   └── producer.go           # Job enqueueing with circuit breaker
│   ├── telemetry/                # Observability setup
│   │   └── tracing.go            # OpenTelemetry tracer initialization
│   └── worker/                   # Job processing logic
│       └── consumer.go           # Redis consumer with graceful shutdown
│
├── argocd-app.yaml               # ArgoCD Application Manifest (GitOps)
├── charts/                       # Helm charts for Kubernetes deployment
│   └── sre-platform/
│       ├── Chart.yaml
│       ├── values.yaml
│       └── templates/
│           ├── api-deployment.yaml
│           ├── api-service.yaml
│           ├── api-hpa.yaml
│           ├── worker-deployment.yaml
│           ├── worker-hpa.yaml
│           ├── pdb.yaml
│           └── redis.yaml
│
├── k8s_legacy/                   # Legacy raw Kubernetes manifests (deprecated)
├── Dockerfile                    # Multi-stage build for both services
├── docker-compose.yaml           # Local development stack
├── go.mod / go.sum               # Go module dependencies
└── SRE.txt                       # Master project plan (6 phases)

Why This Structure?

| Directory | Purpose | SRE Benefit |
| --- | --- | --- |
| cmd/ | Thin entrypoints only | Easy to understand startup sequence |
| internal/ | Business logic hidden | Prevents accidental external imports |
| internal/api/ | HTTP layer isolated | Can test handlers without full server |
| internal/queue/ | Queue abstraction | Can swap Redis for SQS/Kafka later |
| charts/ | Helm-based deployment | Reproducible, parameterized releases |

🛠️ Technology Decisions

Why Go?

  • Performance: Compiled, statically typed, low memory footprint
  • Concurrency: Goroutines for handling thousands of connections
  • Small Binaries: ~10MB final image size
  • Cloud Native: First-class Kubernetes, Prometheus, OTel support

Why Gin Framework?

  • Fast: One of the fastest Go HTTP routers
  • Middleware Ecosystem: Easy to add logging, tracing, auth
  • Production Proven: Used by companies like Grab, Riot Games

Why Zerolog for Logging?

  • Zero Allocation: One of the fastest structured loggers for Go
  • JSON Output: Machine-parseable for log aggregation
  • Context Integration: Easy request ID propagation

Why OpenTelemetry?

  • Vendor Neutral: Export to Jaeger, Zipkin, Google Cloud Trace, etc.
  • Emerging Standard: CNCF project that supersedes OpenTracing and OpenCensus
  • Instrumentation: Gin middleware is available (otelgin)

Why Distroless Containers?

  • Security: No shell and no package manager, which keeps the attack surface minimal
  • Size: ~3MB base vs ~5MB Alpine vs ~100MB Debian
  • Fewer CVEs: Very few OS packages to patch

🚀 Getting Started

Prerequisites

  • Go 1.21+
  • Docker & Docker Compose
  • kubectl (for Kubernetes deployment)
  • helm (for Helm deployment)

Local Development

# Clone the repository
git clone https://github.com/Sanjeevliv/sre-platform-app.git
cd sre-platform-app

# Start the full stack (API, Worker, Redis, Jaeger)
docker-compose up --build

# In another terminal, test the API
curl http://localhost:8080/healthz
# Output: ok

curl http://localhost:8080/version
# Output: {"version":"dev","commit_sha":"none","build_time":"unknown","go_version":"go1.25"}

# Submit a job
curl -X POST http://localhost:8080/jobs \
  -H "Content-Type: application/json" \
  -d '{"payload": "Hello SRE World"}'
# Output: {"job_id":"uuid-here","status":"queued"}

# View traces
open http://localhost:16686  # Jaeger UI

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| API_PORT | 8080 | HTTP server port |
| REDIS_ADDR | localhost:6379 | Redis address (host:port) |
| GIN_MODE | debug | Gin mode (debug/release) |
| OTEL_EXPORTER_OTLP_ENDPOINT | localhost:4318 | OpenTelemetry collector endpoint (OTLP) |
| RATE_LIMIT_RPS | 100 | Requests per second limit |
| RATE_LIMIT_BURST | 200 | Burst capacity |

📡 API Endpoints

| Endpoint | Method | Purpose | Response |
| --- | --- | --- | --- |
| / | GET | Root handler | SRE Platform API Service |
| /healthz | GET | Liveness probe | ok |
| /ready | GET | Readiness probe | ready |
| /version | GET | Build metadata | {"version":"...","commit_sha":"..."} |
| /debug/info | GET | Runtime diagnostics | {"goroutines":5,"memory_alloc":...} |
| /metrics | GET | Prometheus metrics | Prometheus text format |
| /jobs | POST | Submit background job | {"job_id":"...","status":"queued"} |

Health Probes Explained

# Kubernetes uses these probes:
livenessProbe:
  httpGet:
    path: /healthz    # "Am I alive?" - restart if fails
readinessProbe:
  httpGet:
    path: /ready      # "Can I serve traffic?" - remove from LB if fails
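
The difference between the two probes can be shown with a small handler pair. The service itself uses Gin; this sketch uses net/http so it runs with the standard library alone, and the `ready` flag, `newMux`, and `probe` helper are hypothetical names. It assumes readiness is reported with 503 Service Unavailable until dependencies are reachable, which this README does not spell out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// ready would be set once dependencies (for example Redis) are reachable.
var ready atomic.Bool

// healthz answers the liveness probe: the process is up and able to respond.
func healthz(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprint(w, "ok")
}

// readyz answers the readiness probe: 503 keeps the pod out of load balancing.
func readyz(w http.ResponseWriter, _ *http.Request) {
	if !ready.Load() {
		http.Error(w, "not ready", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprint(w, "ready")
}

func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthz)
	mux.HandleFunc("/ready", readyz)
	return mux
}

// probe issues an in-process GET and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newMux()
	fmt.Println("liveness:", probe(mux, "/healthz"), "readiness:", probe(mux, "/ready"))
	ready.Store(true) // dependencies became reachable
	fmt.Println("readiness after startup:", probe(mux, "/ready"))
}
```

Keeping liveness independent of downstream dependencies avoids restart loops when Redis is briefly unavailable; only readiness should reflect dependency state.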

📊 Observability

Observability Stack

1. Logs (Structured JSON)

{
  "level": "info",
  "request_id": "abc-123",
  "method": "POST",
  "path": "/jobs",
  "status": 202,
  "latency_ms": 15,
  "message": "request completed"
}

2. Traces (OpenTelemetry → Jaeger)

  • Every request gets a trace ID
  • Spans created for HTTP handlers, Redis operations
  • View in Jaeger UI at http://localhost:16686

3. Metrics (Prometheus)

# Custom business metrics
http_requests_total{method="POST",path="/jobs",status="202"} 150
http_request_duration_seconds_bucket{le="0.1"} 145

4. Reliability Targets (SLIs/SLOs)

  • Service Level Indicators (SLIs): Defined via Prometheus recording rules (Availability, Latency).
  • Service Level Objectives (SLOs): 99.9% Availability, <300ms Latency (p99).
  • Alerting: Multi-window burn rate alerts to protect the Error Budget.

Correlation

The three pillars are linked per request:

  • Log: "request_id": "abc-123"
  • Trace: trace_id in Jaeger (injecting trace_id into every log line is in progress)
  • Metrics: not yet correlated (future: exemplars)

🚒 Deployment

🚒 Deployment (GitOps)

We use ArgoCD for continuous deployment. The cluster state automatically syncs with the charts/sre-platform directory in this repository.

Option 1: GitOps (Automatic & Recommended)

  1. Merge a Pull Request to main.
  2. ArgoCD detects the change (commit hash).
  3. Cluster is automatically synced to the new state (Self-Healing enabled).

Option 2: Helm (Manual / Debug)

# From project root
helm upgrade --install sre-platform ./charts/sre-platform \
  --set api.image.repository=... \
  --set api.image.tag=latest

Option 3: Docker Compose (Local)

docker-compose up --build

📋 Project Status

Completed ✅

  • Clean Architecture (/cmd, /internal)
  • Gin HTTP framework with middleware stack
  • Graceful shutdown with context cancellation
  • Configuration via environment variables (Viper)
  • Multi-stage Dockerfile with distroless base
  • Docker Compose for local development
  • Zerolog structured JSON logging
  • OpenTelemetry distributed tracing
  • Prometheus metrics endpoint
  • Health and diagnostic endpoints (/healthz, /ready, /version, /debug/info)
  • Request ID middleware for log correlation
  • Rate limiting middleware
  • Helm charts with HPA

In Progress 🟡

  • Inject trace_id into all logs
  • Define SLIs/SLOs in documentation
  • Create Grafana dashboards
  • GitOps Deployment (ArgoCD)
  • GitHub Actions CI (Test & Build)
  • cert-manager for automatic HTTPS

Planned 📝

  • Network Policies (deny-all default)
  • External Secrets Operator
  • Chaos engineering endpoints
  • Load testing with k6
  • Portfolio website at sanjeevsethi.in

📄 License

MIT License - See LICENSE for details.

CI/CD Pipeline Enabled

Key rotated Tue Dec 9 10:40:23 IST 2025
