Site Reliability Engineer | Platform Engineering | Distributed Systems
Early-career engineer with senior-level systems thinking. I build production-grade cloud platforms that demonstrate the reliability, observability, and automation principles used at Google, Netflix, and Uber.
Current: Production-grade Kubernetes platforms, SRE observability, GitOps at scale
Approach: Build systems that teach industry patterns, not tutorials
Philosophy: Infrastructure should be boring (reliable), not exciting (breaking)
💼 Open to opportunities: DevOps Engineer | SRE | Platform Engineer | Cloud Engineer
📍 Location: Pune, India (Open to Remote & Relocation)
I specialize in cloud-native infrastructure and platform engineering, with hands-on experience building:
- Microservices platforms on AWS EKS with event-driven architecture (RabbitMQ, Kafka)
- Infrastructure as Code using Terraform with modular, reusable patterns
- GitOps workflows with ArgoCD for declarative, drift-free deployments
- Full observability stacks (Prometheus, Grafana, ELK, Jaeger)
- DevSecOps pipelines with automated security scanning (SonarQube, Trivy)
- 🔨 Building production-grade DevOps projects demonstrating enterprise patterns
- 📚 Deep-diving into Kubernetes (RBAC, Network Policies, Security, Operators)
- 🔐 Implementing DevSecOps practices (shift-left security, policy-as-code)
- 📊 Designing SRE observability systems (SLIs, SLOs, error budgets)
- 🤖 Exploring MLOps and AI-driven infrastructure automation
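The error-budget idea above is plain arithmetic on the SLO; a minimal sketch (the function name is my own, not from any SRE tooling):

```python
# Error budget: the allowed unreliability implied by an availability SLO.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted over the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes

# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
```

The budget is what lets teams trade reliability for velocity deliberately: ship risky changes while budget remains, freeze when it is spent.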
| Principle | What It Means | How I Apply It |
|---|---|---|
| Everything Fails | Design for failure, not success | Multi-AZ, circuit breakers, graceful degradation |
| Toil is the Enemy | Automate repetitive work | GitOps, drift detection, self-healing |
| Observability ≠ Monitoring | Understand unknowns | Distributed tracing, correlation IDs, SLOs |
| Security by Default | Zero trust | RBAC, Network Policies, no hardcoded secrets |
| Error Budgets | Balance velocity and reliability | SLI/SLO tracking, controlled risk |
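The circuit-breaker pattern from the table can be sketched in a few lines. This is an illustrative toy (the class name, thresholds, and half-open behaviour are my own simplifications), not a production library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    then allow a single trial call once a cooldown has elapsed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0      # success resets the failure count
        self.opened_at = None  # and closes the circuit
        return result
```

Failing fast while the circuit is open is what turns a slow, cascading dependency failure into a quick, bounded error the caller can degrade around.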
I don't believe in:
- Manual deployments ("works on my machine" syndrome)
- Infrastructure without monitoring
- Code without tests or automation without guardrails
AWS · Azure · GCP · Kubernetes · Docker · Terraform

Ansible · Helm · Linux · Jenkins · GitHub Actions · ArgoCD

Grafana · Prometheus · ELK · Go · Python · Git
Most engineers: "I know Docker, Kubernetes, Terraform"
Me: "I understand distributed systems failure modes and design infrastructure that degrades gracefully. I use Kubernetes for declarative state reconciliation and self-healing, not because it's trendy."
Most engineers: "My app works in testing"
Me: "I've tested:
- What happens when RabbitMQ goes down? (DLQ prevents message loss)
- What if Redis crashes? (Cache-aside handles misses)
- What if AWS loses an AZ? (Multi-AZ with auto-failover)"
Most engineers: "I built it"
Me: "I documented:
- WHY I chose RabbitMQ over Kafka (trade-off analysis)
- Architecture diagrams (system design)
- Runbooks (production operations)
- What I learned from failures"
I explore the trade-offs in distributed systems, documenting my journey from "how it works" to "why it breaks."
"I'm fascinated by systems that scale, self-heal, and never go down."
- 📝 Part 10/10: The SRE Mindset: Engineering Systems That Do Not Depend on You — 16 Jan 2026
- 📝 DevSecOps: Engineering Security as a Non-Negotiable Quality Gate — 14 Jan 2026
- 📝 FinOps — How SREs Turn “Cost Centers” into “Efficiency Engines” — 10 Jan 2026
- 📝 GitOps at Scale: Why “Sync” is the New “Apply” — Architecting a Self-Healing Multi-Cluster Platform — 07 Jan 2026
- 📝 Networking — The SRE’s Guide to the 504 Gateway Timeout — 04 Jan 2026
I document the "why" behind my code — deep dives into Engineering Systems, FinOps, Scalability, and SRE practices.
Active Participation:
- Google Developer Group (GDG) Pune - Cloud-native discussions, hands-on labs
- CNCF Community - Kubernetes, service mesh, observability
- AWS User Group Pune - Best practices, architecture patterns
- Atlassian Community Pune - CI/CD, DevOps automation
Technical:
- ✅ Build 4 production-grade cloud platforms (End-to-End)
- 🔄 Contribute to CNCF projects (Kubernetes, Prometheus, ArgoCD)
- 📚 Deep-dive into Kubernetes operators and CRDs
- 🔐 Master service mesh (Istio/Linkerd) and zero-trust networking
- 🤖 Explore MLOps and infrastructure for ML workloads
Professional:
- 📝 Publish 25+ in-depth technical articles
- 🎤 Present at CNCF Pune and AWS User Group
- 💼 Land first DevOps/SRE role as an early-career engineer
- 🌟 Contribute to open-source (Kubernetes, Terraform providers, Helm charts)
Learning:
- 📖 Complete AWS DevOps Professional certification
- 📖 Complete CKA and CKS certifications before 2027
- 📖 Study distributed systems papers (Raft, Paxos, CAP theorem)
→ How does Kubernetes handle split-brain in etcd?
→ What's the optimal error budget for a new service?
→ How do you design alerts that don't cause fatigue?
→ What's the CAP theorem trade-off in my architecture?
→ How would Netflix design this system?
→ What's the failure mode I haven't considered?
I don't just want to use tools. I want to understand the engineering decisions behind them.
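The etcd split-brain question above reduces to Raft's quorum rule: a partition can only make progress if it holds a strict majority of voting members, so two halves can never both accept writes. A minimal sketch of that arithmetic:

```python
def has_quorum(voting_members: int, reachable: int) -> bool:
    """etcd (via Raft) only serves writes when a strict majority of
    voting members is reachable, so at most one partition can win."""
    return reachable > voting_members // 2

# A 5-node cluster split 3/2: only the 3-node side keeps quorum;
# the 2-node side becomes read-degraded rather than split-brained.
print(has_quorum(5, 3), has_quorum(5, 2))
```

This is also why etcd clusters use odd member counts: a 4-node cluster tolerates the same single failure as a 3-node one but adds coordination cost.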
Currently Reading:
- 📖 Site Reliability Engineering (Google SRE Book)
- 📖 Designing Data-Intensive Applications (Martin Kleppmann)
- 📖 Kubernetes Patterns (Bilgin Ibryam)
- 📖 Raft Consensus Paper (understanding distributed systems)
⚡ "Automation is not about replacing humans, it's about freeing them to do what they do best."
