DevOps Troubleshooting Toolkit

A comprehensive collection of commands, tools, and methodologies for troubleshooting DevOps environments - from Linux to Kubernetes and beyond.

🔎 About This Project

The DevOps Troubleshooting Toolkit is a reference for diagnosing and resolving issues across the DevOps stack. This repository provides structured, practical guidance for engineers working with modern infrastructure and applications.

Why This Toolkit Exists

As systems grow more complex and distributed, troubleshooting becomes increasingly challenging. This toolkit aims to:

  • ✅ Provide structured approaches to solving common (and uncommon) problems
  • ✅ Document real-world solutions tested in production environments
  • ✅ Share institutional knowledge that typically takes years to accumulate
  • ✅ Reduce mean time to resolution (MTTR) for critical incidents
  • ✅ Offer copy-paste commands for immediate use

Who It's For

  • DevOps Engineers - Infrastructure and deployment pipeline management
  • Site Reliability Engineers (SREs) - Production system maintenance
  • Platform Engineers - Internal developer platform building
  • System Administrators - Linux environment management
  • Cloud Engineers - Multi-cloud provider expertise
  • Backend Developers - Application debugging in complex environments

🚀 Quick Start

Emergency Commands

# System Health Check
top -p $(pgrep -d',' -f your_app)
free -h && df -h
netstat -tulpn | grep LISTEN

# Container Quick Debug
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
docker logs --tail 100 -f container_name

# Kubernetes Emergency
kubectl get pods --all-namespaces | grep -v Running
kubectl top nodes && kubectl top pods
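
Filtering with `grep -v Running` hides pods that report Running while some of their containers are not ready, and it lists finished Completed jobs as if they were problems. The following is a minimal sketch of a stricter filter, assuming the default column layout of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, ...). The function name `unhealthy_pods` is hypothetical and not part of this repository.

```shell
# Hypothetical helper: reads `kubectl get pods --all-namespaces` output on stdin
# and keeps the header plus every pod that does not look healthy.
# Assumes default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    NR == 1 { print; next }             # keep the header row
    $4 == "Completed" { next }          # finished jobs are not failures
    $4 != "Running" { print; next }     # Pending, CrashLoopBackOff, Error, ...
    {
      split($3, ready, "/")             # READY looks like "1/2"
      if (ready[1] != ready[2]) print   # Running, but not every container is ready
    }
  '
}
```

Usage: `kubectl get pods --all-namespaces | unhealthy_pods`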

Installation

# Clone the repository
git clone https://github.com/Osomudeya/DevOps-Troubleshooting-Toolkit.git
cd DevOps-Troubleshooting-Toolkit

# Make scripts executable
chmod +x scripts/*.sh

# Optional: Add to PATH for global access
echo "export PATH=\"\$PATH:$(pwd)/scripts\"" >> ~/.bashrc
source ~/.bashrc
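
Running the PATH step more than once appends a duplicate line to `~/.bashrc` each time. Below is a minimal sketch of an idempotent variant. `add_to_path_once` is a hypothetical helper name, and the rc file is passed as an argument so the function can be tried on a scratch file first.

```shell
# Hypothetical helper: append an `export PATH=...` line to an rc file
# only if that exact line is not already present.
add_to_path_once() {
  dir="$1"   # directory to add, e.g. "$(pwd)/scripts"
  rc="$2"    # shell startup file, e.g. "$HOME/.bashrc"
  line="export PATH=\"\$PATH:$dir\""
  # -q quiet, -x whole-line match, -F fixed string (no regex)
  grep -qxF "$line" "$rc" 2>/dev/null || printf '%s\n' "$line" >> "$rc"
}
```

Usage: `add_to_path_once "$(pwd)/scripts" "$HOME/.bashrc"`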

🛠️ Platform Guides

Linux Systems

| Component | Quick Access | Description |
| --- | --- | --- |
| Linux Commands | linux/linux-commands.md | Essential system diagnostics and troubleshooting |

Container Platforms

| Platform | Quick Access | Key Features |
| --- | --- | --- |
| Docker | containers/docker-troubleshooting.md | Container lifecycle, networking, volumes |

Kubernetes

| Component | Quick Access | Coverage |
| --- | --- | --- |
| Kubernetes | kubernetes/kubernetes-troubleshooting.md | Cluster management, workloads, networking |

Cloud Providers

| Provider | Quick Access | Specializations |
| --- | --- | --- |
| AWS | cloud/aws-troubleshooting.md | EKS, Lambda, RDS, VPC troubleshooting |
| GCP | cloud/gcp-troubleshooting.md | GKE, Cloud Functions, networking |
| Azure | cloud/azure-troubleshooting.md | AKS, App Services, resource groups |
| Multi-Cloud | cloud/multi-cloud-strategies.md | Cross-platform strategies |

Databases

| Database | Quick Access | Focus Areas |
| --- | --- | --- |
| Database Troubleshooting | databases/database-troubleshooting.md | Connection, performance, backup issues |

Observability

| Tool | Quick Access | Coverage |
| --- | --- | --- |
| Prometheus & Grafana | observability/prometheus-and-grafana.md | Monitoring, alerting, dashboards |

🔥 Common Issues

🚨 Critical System Issues

High Load Average

# Quick diagnosis
uptime && cat /proc/loadavg
ps aux --sort=-%cpu | head -10
iostat -x 1 5

# Deep dive
sar -u 1 10  # CPU utilization
sar -d 1 10  # Disk activity
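
A load average is easier to judge relative to the number of CPU cores: a load of 8 is heavy on 4 cores and light on 32. This is a small sketch of that normalization; `load_per_core` is a hypothetical helper name, not a script shipped with this toolkit.

```shell
# Hypothetical helper: divide a load average by the core count.
# $1 = load average (e.g. the 1-minute value), $2 = number of CPU cores
load_per_core() {
  # LC_ALL=C keeps the decimal separator a dot regardless of locale
  LC_ALL=C awk -v la="$1" -v n="$2" 'BEGIN { printf "%.2f\n", la / n }'
}
```

Usage on Linux: `load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`. A result that stays well above 1.00 suggests more runnable or I/O-blocked tasks than cores, which is when the `iostat` and `sar` output above is worth reading closely.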

👉 Detailed Guide: linux/linux-commands.md

Out of Memory (OOM)

# Check OOM killer logs
dmesg | grep -i "killed process"
journalctl -u your-service | grep -i oom

# Memory analysis
free -h && cat /proc/meminfo
ps aux --sort=-%mem | head -10
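
When the OOM killer has fired several times, it helps to see which processes it keeps choosing. Here is a minimal sketch that tallies victims by process name, assuming the kernel log lines contain the usual `Killed process <pid> (<name>)` text and that process names contain no spaces. `oom_victims` is a hypothetical name.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints
# "<process name>: <number of OOM kills>" for each victim.
oom_victims() {
  # Extract the name in parentheses after "Killed process <pid>",
  # then count how often each name appears.
  sed -nE 's/.*[Kk]illed process [0-9]+ \(([^)]*)\).*/\1/p' |
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}
```

Usage: `dmesg | oom_victims`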

👉 Detailed Guide: linux/linux-commands.md

Disk Space Full

# Find large files and directories
df -h
du -sh /* 2>/dev/null | sort -hr | head -10
find / -type f -size +1G 2>/dev/null

# Log rotation check
journalctl --disk-usage
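
On hosts with many mounts, scanning `df -h` by eye is error-prone. Below is a minimal sketch that prints only the filesystems at or above a usage threshold, assuming POSIX `df -P` column order and mount points without spaces. `disk_over_threshold` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints mount points
# at or above a usage threshold (default 90 percent).
disk_over_threshold() {
  awk -v limit="${1:-90}" '
    NR > 1 {                       # skip the header line
      used = $5
      sub(/%/, "", used)           # "95%" -> "95"
      if (used + 0 >= limit + 0) print $6 " " used "%"
    }
  '
}
```

Usage: `df -P | disk_over_threshold 85`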

👉 Detailed Guide: linux/linux-commands.md

🐳 Container Issues

Container Won't Start

# Debug container startup
docker logs container_name
docker inspect container_name
docker run --rm -it image_name /bin/sh

# Resource constraints
docker stats container_name
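
The exit code reported by `docker inspect` often narrows down why a container stopped. What follows is a rough sketch of a lookup based on common shell conventions (codes above 128 usually mean 128 plus a signal number). `explain_exit_code` is a hypothetical helper, and an application can still choose any of these codes for its own reasons.

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited normally" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "likely killed by SIGKILL (signal 9), e.g. OOM killer or docker kill" ;;
    143) echo "likely terminated by SIGTERM (signal 15), e.g. docker stop" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "likely terminated by signal $((code - 128))"
      else
        echo "application error (exit code $code)"
      fi
      ;;
  esac
}
```

Usage: `explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' container_name)"`. For code 137, `docker inspect --format '{{.State.OOMKilled}}' container_name` shows whether Docker recorded an out-of-memory kill.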

👉 Detailed Guide: containers/docker-troubleshooting.md

Container Networking

# Network debugging
docker network ls
docker inspect network_name
docker exec container_name netstat -tulpn

👉 Detailed Guide: containers/docker-troubleshooting.md

☸️ Kubernetes Issues

Pods Stuck in Pending

# Check pod status
kubectl describe pod pod_name
kubectl get events --sort-by=.metadata.creationTimestamp

# Resource availability
kubectl top nodes
kubectl describe nodes

👉 Detailed Guide: kubernetes/kubernetes-troubleshooting.md

Service Not Accessible

# Service debugging
kubectl get svc,ep service_name
kubectl describe svc service_name
kubectl get pods -l app=your_app -o wide

👉 Detailed Guide: kubernetes/kubernetes-troubleshooting.md

🌐 Network Issues

DNS Resolution Failures

# DNS troubleshooting
nslookup domain.com
dig domain.com
resolvectl status  # on older systemd releases: systemd-resolve --status

Connection Timeouts

# Network connectivity
telnet host port
nc -zv host port
traceroute host

💾 Database Issues

Connection Problems

# Database connection check
mysql -h hostname -u username -p -e "SELECT 1"
psql -h hostname -U username -c "SELECT 1"
mongosh --host hostname --eval "db.stats()"  # use `mongo` on legacy installs without mongosh

👉 Detailed Guide: databases/database-troubleshooting.md

📊 Troubleshooting Scenarios

Real-World Examples

| Scenario | Difficulty | Description | Guide |
| --- | --- | --- | --- |
| Complete Troubleshooting Scenarios | 🟢-🔴 All Levels | End-to-end troubleshooting examples | scenarios/scenarios.md |

🛠️ Useful Scripts

Available Scripts

Usage Examples

# Repository management
./scripts/auto-clone-all-repos.sh    # Clone all org repositories
./scripts/auto-pull-all-repos.sh     # Update all local repositories

# Kubernetes tools
./scripts/kubernetes-events.sh       # Monitor K8s events real-time
./scripts/k8s-tailogs.sh            # Stream logs from multiple pods
./scripts/kubernetes-tools.sh       # Install essential K8s tools

📂 Content Organization

DevOps-Troubleshooting-Toolkit/
├── linux/                     # Linux system troubleshooting
│   └── linux-commands.md      # Essential Linux commands
├── containers/                 # Container platform issues
│   └── docker-troubleshooting.md # Docker troubleshooting guide
├── kubernetes/                 # K8s cluster and workload problems
│   └── kubernetes-troubleshooting.md # Kubernetes troubleshooting
├── cloud/                      # Cloud provider specific guides
│   ├── aws-troubleshooting.md  # AWS troubleshooting
│   ├── azure-troubleshooting.md # Azure troubleshooting
│   ├── gcp-troubleshooting.md  # GCP troubleshooting
│   └── multi-cloud-strategies.md # Multi-cloud strategies
├── databases/                  # Database troubleshooting
│   └── database-troubleshooting.md # Database issues
├── observability/              # Monitoring, logging, and tracing
│   └── prometheus-and-grafana.md # Prometheus & Grafana guide
├── scenarios/                  # End-to-end troubleshooting scenarios
│   └── scenarios.md           # Real-world scenarios
├── scripts/                    # Automated troubleshooting scripts
│   ├── auto-clone-all-repos.sh
│   ├── auto-pull-all-repos.sh
│   ├── k8s-tailogs.sh
│   ├── kubernetes-events.sh
│   └── kubernetes-tools.sh
└── assets/
    ├── images/                 # Repository images and diagrams
    └── cheatsheets/           # Printable reference materials

🧪 Quick Tests & Validation

Database Connectivity

# MySQL/MariaDB
mysql -h hostname -u username -p -e "SELECT VERSION(), NOW();"

# PostgreSQL
psql -h hostname -U username -c "SELECT version();"

# MongoDB
mongosh --host hostname --eval "db.runCommand({ping: 1})"

# Redis
redis-cli -h hostname ping

Service Health Checks

# HTTP services
curl -I http://service-endpoint/health
wget --spider http://service-endpoint/health

# TCP services
nc -zv hostname port
telnet hostname port
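
A single failed probe can be a blip rather than an outage, so health checks are often retried a few times before being treated as a failure. This is a minimal sketch of a retry wrapper; `retry` is a hypothetical helper and the attempt count and delay are arbitrary choices.

```shell
# Hypothetical helper: run a command up to N times, sleeping between attempts.
# $1 = max attempts, $2 = delay in seconds, remaining args = command to run
retry() {
  attempts="$1"; delay="$2"; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1                  # every attempt failed
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 0                      # the command succeeded
}
```

Usage: `retry 5 2 curl -fsS -o /dev/null http://service-endpoint/health`. The `-f` flag makes `curl` return a non-zero status on HTTP error responses, which is what lets the wrapper detect a failed check.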

Container Registry Access

# Docker Hub
docker pull hello-world

# Private registry
docker login registry.company.com
docker pull registry.company.com/app:latest

📊 Observability & Monitoring

Prometheus & Grafana

# Check Prometheus targets
curl -s http://localhost:9090/api/v1/targets | jq .

# Grafana API health
curl -H "Authorization: Bearer $GRAFANA_TOKEN" http://localhost:3000/api/health
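
The full targets response is verbose, and usually only the failing scrape targets matter. Below is a minimal sketch of a `jq` filter over the `/api/v1/targets` response, assuming `jq` is installed and that targets carry a `job` label. `down_targets` is a hypothetical helper name.

```shell
# Hypothetical helper: reads the /api/v1/targets JSON on stdin and prints
# job, scrape URL, and last error (tab-separated) for targets that are not "up".
down_targets() {
  jq -r '.data.activeTargets[]
         | select(.health != "up")
         | "\(.labels.job)\t\(.scrapeUrl)\t\(.lastError)"'
}
```

Usage: `curl -s http://localhost:9090/api/v1/targets | down_targets`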

👉 Full Guide: observability/prometheus-and-grafana.md

🔄 Recently Updated

| File | Last Updated | Changes |
| --- | --- | --- |
| kubernetes/kubernetes-troubleshooting.md | 2025-05-30 | Added EKS-specific troubleshooting scenarios |
| cloud/aws-troubleshooting.md | 2025-05-28 | Enhanced Lambda cold start debugging |
| observability/prometheus-and-grafana.md | 2025-05-25 | Updated for Prometheus 2.50+ features |
| scripts/kubernetes-tools.sh | 2025-05-20 | Added resource quota validation |

🌟 How to Contribute

We welcome contributions from the community! Here's how you can help:

Quick Contributions

  • 🐛 Report bugs - Found an issue? Create an issue
  • 📖 Improve docs - Fix typos, add examples, enhance explanations
  • 🔧 Add commands - Share your troubleshooting commands and techniques
  • 🎯 Real scenarios - Document actual production issues you've solved

Development Setup

# Fork and clone
git clone https://github.com/YOUR_USERNAME/DevOps-Troubleshooting-Toolkit.git
cd DevOps-Troubleshooting-Toolkit

# Create feature branch
git checkout -b feature/new-troubleshooting-guide

# Make changes and test
./scripts/validate-docs.sh

# Submit PR
git push origin feature/new-troubleshooting-guide

👉 Detailed Guide: CONTRIBUTING.md

📋 Resources

Downloadable Materials

External Resources

Community & Support

📱 Connect & Follow

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


⭐ If this toolkit helped you solve a problem, please star the repository! ⭐

"The best troubleshooters aren't those who know all the answers, but those who know where to find them."

Happy Troubleshooting! 🚀
