A comprehensive collection of commands, tools, and methodologies for troubleshooting DevOps environments - from Linux to Kubernetes and beyond.
- About This Project
- Quick Start
- Platform Guides
- Common Issues
- Troubleshooting Scenarios
- Useful Scripts
- Content Organization
- Installation and Usage
- Contributing
- Resources
- License
The DevOps Troubleshooting Toolkit is a reference for diagnosing and resolving issues across the DevOps stack. This repository provides structured, practical guidance for engineers working with modern infrastructure and applications.
As systems grow more complex and distributed, troubleshooting becomes increasingly challenging. This toolkit aims to:
- Provide structured approaches to solving common (and uncommon) problems
- Document real-world solutions tested in production environments
- Share institutional knowledge that typically takes years to accumulate
- Reduce mean time to resolution (MTTR) for critical incidents
- Offer copy-paste commands for immediate use
- DevOps Engineers - Infrastructure and deployment pipeline management
- Site Reliability Engineers (SREs) - Production system maintenance
- Platform Engineers - Internal developer platform building
- System Administrators - Linux environment management
- Cloud Engineers - Multi-cloud provider expertise
- Backend Developers - Application debugging in complex environments
# System Health Check
top -p $(pgrep -d',' -f your_app)
free -h && df -h
netstat -tulpn | grep LISTEN
# Container Quick Debug
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
docker logs --tail 100 -f container_name
# Kubernetes Emergency
kubectl get pods --all-namespaces | grep -v Running
kubectl top nodes && kubectl top pods
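The `grep -v Running` filter above also lists pods in the `Completed` state and hides pods that are `Running` but not ready. A minimal sketch of a stricter filter follows. The function name `unhealthy_pods` is hypothetical, and it assumes the default column order of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Filter `kubectl get pods --all-namespaces` output read from stdin.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
# unhealthy_pods is a hypothetical helper, not part of kubectl.
unhealthy_pods() {
  awk '
    NR == 1            { print; next }   # keep the header row
    $4 == "Completed"  { next }          # finished Jobs are not a problem
    $4 != "Running"    { print; next }   # Pending, CrashLoopBackOff, ...
    {
      split($3, ready, "/")              # READY looks like "1/2"
      if (ready[1] != ready[2]) print    # Running, but not all containers ready
    }
  '
}

# Usage:
#   kubectl get pods --all-namespaces | unhealthy_pods
```

Keeping the parsing in a function that reads stdin means it can be tried on saved output before it is pointed at a live cluster.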
# Clone the repository
git clone https://github.com/Osomudeya/DevOps-Troubleshooting-Toolkit.git
cd DevOps-Troubleshooting-Toolkit
# Make scripts executable
chmod +x scripts/*.sh
# Optional: Add to PATH for global access
echo "export PATH=\"\$PATH:$(pwd)/scripts\"" >> ~/.bashrc
source ~/.bashrc
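Running the PATH step more than once appends a duplicate line to `~/.bashrc` each time. The sketch below shows one way to make that step idempotent. The helper name `add_path_once` is an assumption for illustration and is not one of the repository's scripts.

```shell
#!/bin/sh
# Append a PATH export for a directory to an rc file, but only once.
# add_path_once DIR [RC_FILE]   (RC_FILE defaults to ~/.bashrc)
# add_path_once is a hypothetical helper name.
add_path_once() {
  dir=$1
  rc_file=${2:-"$HOME/.bashrc"}
  line="export PATH=\"\$PATH:$dir\""
  # -F: fixed string, -x: match the whole line, -q: no output
  if ! grep -Fqx "$line" "$rc_file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$rc_file"
  fi
}

# Usage from the repository root:
#   add_path_once "$(pwd)/scripts"
```

Quoting the directory inside the exported line keeps the entry intact when the clone path contains spaces.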
| Component | Quick Access | Description |
|---|---|---|
| Linux Commands | linux/linux-commands.md | Essential system diagnostics and troubleshooting |

| Platform | Quick Access | Key Features |
|---|---|---|
| Docker | containers/docker-troubleshooting.md | Container lifecycle, networking, volumes |

| Component | Quick Access | Coverage |
|---|---|---|
| Kubernetes | kubernetes/kubernetes-troubleshooting.md | Cluster management, workloads, networking |

| Provider | Quick Access | Specializations |
|---|---|---|
| AWS | cloud/aws-troubleshooting.md | EKS, Lambda, RDS, VPC troubleshooting |
| GCP | cloud/gcp-troubleshooting.md | GKE, Cloud Functions, networking |
| Azure | cloud/azure-troubleshooting.md | AKS, App Services, resource groups |
| Multi-Cloud | cloud/multi-cloud-strategies.md | Cross-platform strategies |

| Database | Quick Access | Focus Areas |
|---|---|---|
| Database Troubleshooting | databases/database-troubleshooting.md | Connection, performance, backup issues |

| Tool | Quick Access | Coverage |
|---|---|---|
| Prometheus & Grafana | observability/prometheus-and-grafana.md | Monitoring, alerting, dashboards |
# Quick diagnosis
uptime && cat /proc/loadavg
ps aux --sort=-%cpu | head -10
iostat -x 1 5
# Deep dive
sar -u 1 10 # CPU utilization
sar -d 1 10 # Disk activity
Detailed Guide: linux/linux-commands.md
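A raw load average is hard to judge without the core count. The sketch below normalizes it per core. The helper name `load_per_core` is hypothetical, and the threshold of 1.0 per core is a common rule of thumb rather than a hard limit.

```shell
#!/bin/sh
# Print the load average per CPU core, plus a rough verdict.
# load_per_core LOAD CORES
# Assumption: a ratio above 1.0 is treated as "saturated" (rule of thumb only).
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    ratio = load / cores
    verdict = (ratio > 1.0) ? "saturated" : "ok"
    printf "%.2f %s\n", ratio, verdict
  }'
}

# Usage on Linux (the first field of /proc/loadavg is the 1-minute average):
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

For example, `load_per_core 6.00 4` prints `1.50 saturated`, while `load_per_core 1.00 4` prints `0.25 ok`.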
# Check OOM killer logs
dmesg | grep -i "killed process"
journalctl -u your-service | grep -i oom
# Memory analysis
free -h && cat /proc/meminfo
ps aux --sort=-%mem | head -10
Detailed Guide: linux/linux-commands.md
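When the OOM killer fires repeatedly, it helps to see which process names are hit most often. The following is a rough sketch that counts victims from kernel log lines on stdin. The helper name `oom_victims` is hypothetical, and it assumes the log contains the phrase `Killed process PID (name)`, whose surrounding text varies between kernel versions.

```shell
#!/bin/sh
# Count OOM-killer victims by process name from kernel log lines on stdin.
# Assumes lines containing "Killed process <pid> (<name>)".
# oom_victims is a hypothetical helper name.
oom_victims() {
  # Extract the name between parentheses, then tally and sort by count.
  sed -n 's/.*[Kk]illed process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
    awk '{ count[$0]++ } END { for (name in count) print count[name], name }' |
    sort -rn
}

# Usage:
#   dmesg | oom_victims
#   journalctl -k | oom_victims
```

Each output line is a count followed by the process name, most frequent first.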
# Find large files and directories
df -h
du -sh /* 2>/dev/null | sort -hr | head -10
find / -type f -size +1G 2>/dev/null
# Log rotation check
journalctl --disk-usage
Detailed Guide: linux/linux-commands.md
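Scanning `df -h` by eye gets tedious on hosts with many mounts. As a sketch, the filter below reads `df -P` output (the POSIX format, which keeps one filesystem per line) and prints only mounts at or above a usage threshold. The helper name `full_filesystems` is hypothetical, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# Print mount points whose usage is at or above a threshold.
# full_filesystems THRESHOLD_PERCENT   (reads `df -P` output on stdin)
# Assumed columns: Filesystem 1024-blocks Used Available Capacity Mounted-on.
# full_filesystems is a hypothetical helper name.
full_filesystems() {
  awk -v limit="$1" '
    NR == 1 { next }                 # skip the header row
    {
      use = $5
      sub(/%/, "", use)              # "91%" -> "91"
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}

# Usage:
#   df -P | full_filesystems 80
```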
# Debug container startup
docker logs container_name
docker inspect container_name
docker run --rm -it image_name /bin/sh
# Resource constraints
docker stats container_name
Detailed Guide: containers/docker-troubleshooting.md
# Network debugging
docker network ls
docker network inspect network_name
docker exec container_name netstat -tulpn
Detailed Guide: containers/docker-troubleshooting.md
# Check pod status
kubectl describe pod pod_name
kubectl get events --sort-by=.metadata.creationTimestamp
# Resource availability
kubectl top nodes
kubectl describe nodes
Detailed Guide: kubernetes/kubernetes-troubleshooting.md
# Service debugging
kubectl get svc,ep service_name
kubectl describe svc service_name
kubectl get pods -l app=your_app -o wide
Detailed Guide: kubernetes/kubernetes-troubleshooting.md
# DNS troubleshooting
nslookup domain.com
dig domain.com
resolvectl status # on older systemd releases: systemd-resolve --status
# Network connectivity
telnet host port
nc -zv host port
traceroute host
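Connectivity checks like the ones above can fail transiently while a service is still starting. One possible approach is a small retry wrapper, sketched below. The function name `retry` and its argument order are assumptions for illustration.

```shell
#!/bin/sh
# Run a command until it succeeds or the attempt limit is reached.
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# retry is a hypothetical helper, not a standard utility.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0                        # success: stop retrying
    [ "$n" -ge "$attempts" ] && return 1    # out of attempts: report failure
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage (host and port are placeholders):
#   retry 5 2 nc -zv host port
```

The wrapper returns the success of the last attempt, so it can be used directly in `if` conditions or with `&&`.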
# Database connection check
mysql -h hostname -u username -p -e "SELECT 1"
psql -h hostname -U username -c "SELECT 1"
mongosh --host hostname --eval "db.stats()"
Detailed Guide: databases/database-troubleshooting.md
| Scenario | Difficulty | Description | Guide |
|---|---|---|---|
| Complete Troubleshooting Scenarios | All Levels | End-to-end troubleshooting examples | scenarios/scenarios.md |
- scripts/auto-clone-all-repos.sh - Clone all repositories from an organization
- scripts/auto-pull-all-repos.sh - Update all local repositories
- scripts/k8s-tailogs.sh - Stream logs from multiple Kubernetes pods
- scripts/kubernetes-events.sh - Monitor Kubernetes events in real-time
- scripts/kubernetes-tools.sh - Install essential Kubernetes tools
# Repository management
./scripts/auto-clone-all-repos.sh # Clone all org repositories
./scripts/auto-pull-all-repos.sh # Update all local repositories
# Kubernetes tools
./scripts/kubernetes-events.sh # Monitor K8s events real-time
./scripts/k8s-tailogs.sh # Stream logs from multiple pods
./scripts/kubernetes-tools.sh # Install essential K8s tools
DevOps-Troubleshooting-Toolkit/
├── linux/                              # Linux system troubleshooting
│   └── linux-commands.md               # Essential Linux commands
├── containers/                         # Container platform issues
│   └── docker-troubleshooting.md       # Docker troubleshooting guide
├── kubernetes/                         # K8s cluster and workload problems
│   └── kubernetes-troubleshooting.md   # Kubernetes troubleshooting
├── cloud/                              # Cloud provider specific guides
│   ├── aws-troubleshooting.md          # AWS troubleshooting
│   ├── azure-troubleshooting.md        # Azure troubleshooting
│   ├── gcp-troubleshooting.md          # GCP troubleshooting
│   └── multi-cloud-strategies.md       # Multi-cloud strategies
├── databases/                          # Database troubleshooting
│   └── database-troubleshooting.md     # Database issues
├── observability/                      # Monitoring, logging, and tracing
│   └── prometheus-and-grafana.md       # Prometheus & Grafana guide
├── scenarios/                          # End-to-end troubleshooting scenarios
│   └── scenarios.md                    # Real-world scenarios
├── scripts/                            # Automated troubleshooting scripts
│   ├── auto-clone-all-repos.sh
│   ├── auto-pull-all-repos.sh
│   ├── k8s-tailogs.sh
│   ├── kubernetes-events.sh
│   └── kubernetes-tools.sh
└── assets/
    ├── images/                         # Repository images and diagrams
    └── cheatsheets/                    # Printable reference materials
# MySQL/MariaDB
mysql -h hostname -u username -p -e "SELECT VERSION(), NOW();"
# PostgreSQL
psql -h hostname -U username -c "SELECT version();"
# MongoDB
mongosh --host hostname --eval "db.runCommand({ping: 1})"
# Redis
redis-cli -h hostname ping
# HTTP services
curl -I http://service-endpoint/health
wget --spider http://service-endpoint/health
# TCP services
nc -zv hostname port
telnet hostname port
# Docker Hub
docker pull hello-world
# Private registry
docker login registry.company.com
docker pull registry.company.com/app:latest
# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq .
# Grafana API health
curl -H "Authorization: Bearer $GRAFANA_TOKEN" http://localhost:3000/api/health
Full Guide: observability/prometheus-and-grafana.md
| File | Last Updated | Changes |
|---|---|---|
| kubernetes/kubernetes-troubleshooting.md | 2025-05-30 | Added EKS-specific troubleshooting scenarios |
| cloud/aws-troubleshooting.md | 2025-05-28 | Enhanced Lambda cold start debugging |
| observability/prometheus-and-grafana.md | 2025-05-25 | Updated for Prometheus 2.50+ features |
| scripts/kubernetes-tools.sh | 2025-05-20 | Added resource quota validation |
We welcome contributions from the community! Here's how you can help:
- Report bugs - Found an issue? Create an issue
- Improve docs - Fix typos, add examples, enhance explanations
- Add commands - Share your troubleshooting commands and techniques
- Real scenarios - Document actual production issues you've solved
# Fork and clone
git clone https://github.com/YOUR_USERNAME/DevOps-Troubleshooting-Toolkit.git
cd DevOps-Troubleshooting-Toolkit
# Create feature branch
git checkout -b feature/new-troubleshooting-guide
# Make changes and test
./scripts/validate-docs.sh
# Submit PR
git push origin feature/new-troubleshooting-guide
Detailed Guide: CONTRIBUTING.md
- Discussions
- Issues
- Medium Articles
- LinkedIn
- Twitter
This project is licensed under the MIT License - see the LICENSE file for details.