A Production-Grade Microservices Platform for Site Reliability Engineering
- Project Overview
- What Was Built
- Architecture
- Project Structure
- Technology Decisions
- Getting Started
- API Endpoints
- Observability
- Deployment
- Project Status
This repository is the application layer of a comprehensive SRE Portfolio project. It demonstrates real-world Site Reliability Engineering practices:
| Principle | Implementation |
|---|---|
| Clean Architecture | Separated concerns into cmd/, internal/, with clear boundaries |
| Observability | Structured logging (Zerolog), distributed tracing (OpenTelemetry), metrics (Prometheus) |
| Reliability | Graceful shutdown, health probes, circuit breakers, rate limiting |
| Security | Distroless containers, non-root execution, minimal attack surface |
| Infrastructure | Kubernetes-native with Helm charts, HPA, PDB |
- Infrastructure (Terraform): sre-platform-infra – provisions the GKE cluster, VPC, and Cloud DNS on GCP
- Infrastructure Setup – created a GKE Autopilot cluster using Terraform
- Networking – configured VPC, subnets, and firewall rules
- Remote State – Terraform state stored in a GCS bucket
- Clean Architecture – structured codebase following Go best practices
- Microservices – built api-service and worker-service
- Configuration – environment-based config loading with Viper
- Containerization – multi-stage Dockerfiles with distroless images
- Local Development – Docker Compose for full-stack testing
- Structured Logging – JSON logs via Zerolog with request correlation
- Distributed Tracing – OpenTelemetry integration with Jaeger
- Metrics – Prometheus endpoints with custom business metrics
- Health Probes – liveness (/healthz), readiness (/ready), debug (/debug/info)
- Helm Charts – Kubernetes deployment automation
- Rate Limiting – token bucket algorithm protecting API endpoints
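The rate limiting entry above refers to the token bucket algorithm: tokens refill at a steady rate up to a burst capacity, and each request consumes one. A minimal stdlib-only sketch of the idea follows; the project's actual middleware may be implemented differently (e.g. with golang.org/x/time/rate), so treat this as illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// TokenBucket admits a request only when a token is available; tokens
// refill at rps per second up to a burst capacity. This mirrors the
// RATE_LIMIT_RPS / RATE_LIMIT_BURST settings described later.
type TokenBucket struct {
	tokens, burst, rps float64
	last               time.Time
}

func NewTokenBucket(rps, burst float64) *TokenBucket {
	return &TokenBucket{tokens: burst, burst: burst, rps: rps, last: time.Now()}
}

func (b *TokenBucket) Allow() bool {
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rps // refill for elapsed time
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	bucket := NewTokenBucket(1, 3) // 1 request/sec, burst of 3
	for i := 1; i <= 5; i++ {
		fmt.Printf("request %d allowed: %v\n", i, bucket.Allow())
	}
}
```

With a burst of 3 and immediate back-to-back calls, the first three requests pass and the rest are rejected until tokens refill.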
┌──────────────────────────────────────────────────────────────────┐
│                       KUBERNETES CLUSTER                         │
│                                                                  │
│  ┌─────────────────┐    ┌─────────────────┐   ┌──────────────┐   │
│  │   api-service   │───▶│      Redis      │◀──│    worker    │   │
│  │   (Gin HTTP)    │    │     (Queue)     │   │  (Consumer)  │   │
│  └────────┬────────┘    └─────────────────┘   └──────────────┘   │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐                                             │
│  │     Jaeger      │ ◀── OpenTelemetry Traces                    │
│  │  (Tracing UI)   │                                             │
│  └─────────────────┘                                             │
└──────────────────────────────────────────────────────────────────┘
sequenceDiagram
participant User
participant API as api-service
participant Redis
participant Worker as worker-service
User->>API: POST /jobs {"payload": "data"}
API->>API: Validate + Generate Request ID
API->>Redis: LPUSH job (with trace context)
API-->>User: 202 Accepted {job_id}
Worker->>Redis: BRPOP (blocking pop)
Redis-->>Worker: Job data
Worker->>Worker: Process job
Worker->>Worker: Log with correlated Request ID
sre-platform-app/
├── cmd/                        # Application entrypoints
│   ├── api-service/            # HTTP API server
│   │   └── main.go             # Bootstraps server, middleware, graceful shutdown
│   ├── worker-service/         # Background job processor
│   │   └── main.go             # Consumes Redis queue, processes jobs
│   └── platform-healthcheck/   # Lightweight healthcheck binary
│       └── main.go             # Used in Dockerfile HEALTHCHECK
│
├── internal/                   # Private application code
│   ├── api/                    # HTTP handlers and middleware
│   │   ├── server.go           # Route definitions (/healthz, /ready, /metrics, etc.)
│   │   └── middleware.go       # RequestID, RateLimit, Metrics, Logger middleware
│   ├── config/                 # Configuration loading
│   │   └── config.go           # Viper-based env/flag config
│   ├── logger/                 # Structured logging setup
│   │   └── logger.go           # Zerolog initialization
│   ├── metadata/               # Build information
│   │   └── metadata.go         # Version, CommitSHA, BuildTime (injected at build)
│   ├── queue/                  # Redis queue abstraction
│   │   └── producer.go         # Job enqueueing with circuit breaker
│   ├── telemetry/              # Observability setup
│   │   └── tracing.go          # OpenTelemetry tracer initialization
│   └── worker/                 # Job processing logic
│       └── consumer.go         # Redis consumer with graceful shutdown
│
├── argocd-app.yaml             # ArgoCD Application manifest (GitOps)
├── charts/                     # Helm charts for Kubernetes deployment
│   └── sre-platform/
│       ├── Chart.yaml
│       ├── values.yaml
│       └── templates/
│           ├── api-deployment.yaml
│           ├── api-service.yaml
│           ├── api-hpa.yaml
│           ├── worker-deployment.yaml
│           ├── worker-hpa.yaml
│           ├── pdb.yaml
│           └── redis.yaml
│
├── k8s_legacy/                 # Legacy raw Kubernetes manifests (deprecated)
├── Dockerfile                  # Multi-stage build for both services
├── docker-compose.yaml         # Local development stack
├── go.mod / go.sum             # Go module dependencies
└── SRE.txt                     # Master project plan (6 phases)
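The tree notes that internal/metadata's Version, CommitSHA, and BuildTime are "injected at build". That pattern relies on the Go linker's -X flag overwriting package-level string variables. A minimal sketch follows; the variable names match the project's description, but the package path here (main) is simplified for illustration.

```go
package main

import "fmt"

// Defaults match the /version output of a local "dev" build. A CI pipeline
// would override them at link time, e.g.:
//
//	go build -ldflags "-X main.Version=v1.2.0 -X main.CommitSHA=$(git rev-parse --short HEAD)"
//
// The real project injects into its internal/metadata package, not main.
var (
	Version   = "dev"
	CommitSHA = "none"
	BuildTime = "unknown"
)

func main() {
	fmt.Printf(`{"version":%q,"commit_sha":%q,"build_time":%q}`+"\n",
		Version, CommitSHA, BuildTime)
}
```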
| Directory | Purpose | SRE Benefit |
|---|---|---|
| cmd/ | Thin entrypoints only | Easy-to-understand startup sequence |
| internal/ | Business logic hidden | Prevents accidental external imports |
| internal/api/ | HTTP layer isolated | Can test handlers without a full server |
| internal/queue/ | Queue abstraction | Can swap Redis for SQS/Kafka later |
| charts/ | Helm-based deployment | Reproducible, parameterized releases |
Go:
- Performance: compiled, statically typed, low memory footprint
- Concurrency: goroutines for handling thousands of connections
- Small Binaries: ~10MB final image size
- Cloud Native: first-class Kubernetes, Prometheus, and OTel support

Gin (HTTP framework):
- Fast: one of the fastest Go HTTP routers
- Middleware Ecosystem: easy to add logging, tracing, auth
- Production Proven: used by companies like Grab and Riot Games

Zerolog (logging):
- Zero Allocation: one of the fastest structured loggers for Go
- JSON Output: machine-parseable for log aggregation
- Context Integration: easy request ID propagation

OpenTelemetry (tracing):
- Vendor Neutral: export to Jaeger, Zipkin, Google Cloud Trace, etc.
- Industry Standard: CNCF project that superseded OpenTracing/OpenCensus
- Auto-instrumentation: middleware for Gin included

Distroless (base image):
- Security: no shell, no package manager, minimal attack surface
- Size: ~3MB base vs ~5MB Alpine vs ~100MB Debian
- Fewer CVEs: almost no OS packages to patch
- Go 1.21+
- Docker & Docker Compose
- kubectl (for Kubernetes deployment)
- helm (for Helm deployment)
# Clone the repository
git clone https://github.com/Sanjeevliv/sre-platform-app.git
cd sre-platform-app
# Start the full stack (API, Worker, Redis, Jaeger)
docker-compose up --build
# In another terminal, test the API
curl http://localhost:8080/healthz
# Output: ok
curl http://localhost:8080/version
# Output: {"version":"dev","commit_sha":"none","build_time":"unknown","go_version":"go1.25"}
# Submit a job
curl -X POST http://localhost:8080/jobs \
-H "Content-Type: application/json" \
-d '{"payload": "Hello SRE World"}'
# Output: {"job_id":"uuid-here","status":"queued"}
# View traces
open http://localhost:16686   # Jaeger UI

| Variable | Default | Description |
|---|---|---|
| API_PORT | 8080 | HTTP server port |
| REDIS_ADDR | localhost:6379 | Redis connection string |
| GIN_MODE | debug | Gin mode (debug/release) |
| OTEL_EXPORTER_OTLP_ENDPOINT | localhost:4318 | OpenTelemetry collector endpoint |
| RATE_LIMIT_RPS | 100 | Requests-per-second limit |
| RATE_LIMIT_BURST | 200 | Burst capacity |
| Endpoint | Method | Purpose | Response |
|---|---|---|---|
| / | GET | Root handler | SRE Platform API Service |
| /healthz | GET | Liveness probe | ok |
| /ready | GET | Readiness probe | ready |
| /version | GET | Build metadata | {"version":"...","commit_sha":"..."} |
| /debug/info | GET | Runtime diagnostics | {"goroutines":5,"memory_alloc":...} |
| /metrics | GET | Prometheus metrics | Prometheus text format |
| /jobs | POST | Submit background job | {"job_id":"...","status":"queued"} |
# Kubernetes uses these probes:
livenessProbe:
  httpGet:
    path: /healthz   # "Am I alive?" - restart the container if this fails
readinessProbe:
  httpGet:
    path: /ready     # "Can I serve traffic?" - remove from LB if this fails

{
"level": "info",
"request_id": "abc-123",
"method": "POST",
"path": "/jobs",
"status": 202,
"latency_ms": 15,
"message": "request completed"
}

- Every request gets a trace ID
- Spans created for HTTP handlers, Redis operations
- View in Jaeger UI at http://localhost:16686
# Custom business metrics
http_requests_total{method="POST",path="/jobs",status="202"} 150
http_request_duration_seconds_bucket{le="0.1"} 145
- Service Level Indicators (SLIs): Defined via Prometheus recording rules (Availability, Latency).
- Service Level Objectives (SLOs): 99.9% Availability, <300ms Latency (p99).
- Alerting: Multi-window burn rate alerts to protect the Error Budget.
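A multi-window burn-rate alert of the kind described above pages only when the error budget is being consumed quickly over both a long and a short window, which filters out brief blips. The rule below is a sketch for a 99.9% availability SLO (0.1% error budget); the metric name matches the Prometheus example shown earlier, but the rule names and thresholds are assumptions, not taken from this repo.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Page when the error ratio burns the budget at 14.4x over both
      # the 1h and 5m windows (exhausts a 30-day budget in ~2 days).
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
```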
All three pillars share the same request_id:
- Log: "request_id": "abc-123"
- Trace: trace_id in Jaeger
- Metric labels: (future: exemplars)
We use ArgoCD for continuous deployment. The cluster state automatically syncs with the charts/sre-platform directory in this repository.
- Merge a Pull Request to main.
- ArgoCD detects the change (new commit hash).
- The cluster is automatically synced to the new state (self-healing enabled).
# From project root
helm upgrade --install sre-platform ./charts/sre-platform \
--set api.image.repository=... \
--set api.image.tag=latest

# Or run the full stack locally:
docker-compose up --build

- Clean Architecture (/cmd, /internal)
- Gin HTTP framework with middleware stack
- Graceful shutdown with context cancellation
- Configuration via environment variables (Viper)
- Multi-stage Dockerfile with distroless base
- Docker Compose for local development
- Zerolog structured JSON logging
- OpenTelemetry distributed tracing
- Prometheus metrics endpoint
- Health probes (/healthz, /ready, /version, /debug/info)
- Request ID middleware for log correlation
- Rate limiting middleware
- Helm charts with HPA
- Inject trace_id into all logs
- Define SLIs/SLOs in documentation
- Create Grafana dashboards
- GitOps Deployment (ArgoCD)
- GitHub Actions CI (Test & Build)
- cert-manager for automatic HTTPS
- Network Policies (deny-all default)
- External Secrets Operator
- Chaos engineering endpoints
- Load testing with k6
- Portfolio website at sanjeevsethi.in
MIT License - See LICENSE for details.