| 1 | +# Week 10: Observability Challenge with Prometheus and Grafana on KIND/EKS |
| 2 | + |
| 3 | +This challenge is part of the 90DaysOfDevOps program and focuses on production-grade observability scenarios using Prometheus and Grafana. You will deploy, configure, and fine-tune monitoring and alerting on a KIND cluster and, as a bonus, set up monitoring and logging for an AWS EKS cluster. The exercise covers advanced configurations, custom queries, dynamic dashboards, and robust alerting, and doubles as preparation for technical interviews. |
| 4 | + |
| 5 | +**Important:** |
| 6 | +1. Fork the [online_shop repository](https://github.com/Amitabh-DevOps/online_shop) and implement all tasks on your fork. |
| 7 | +2. Document all steps, commands, screenshots, and observations in a file named `solution.md` within your fork. |
| 8 | +3. Submit your `solution.md` file in the Week 10 (Observability) task folder of the 90DaysOfDevOps repository. |
| 9 | + |
| 10 | +--- |
| 11 | + |
| 12 | +## Task 1: Setup a KIND Cluster for Observability |
| 13 | + |
| 14 | +**Real-World Scenario:** |
| 15 | +Simulate a production-like Kubernetes environment locally by creating a KIND cluster to serve as the foundation for your monitoring setup. |
| 16 | + |
| 17 | +**Steps:** |
| 18 | +1. **Install KIND:** |
| 19 | + - Follow the official KIND installation guide. |
| 20 | +2. **Create a KIND Cluster:** |
| 21 | + - Run: |
| 22 | + ```bash |
| 23 | + kind create cluster --name observability-cluster |
| 24 | + ``` |
| 25 | +3. **Verify the Cluster:** |
| 26 | + - Run `kubectl get nodes` and capture the output. |
| 27 | +4. **Document in `solution.md`:** |
| 28 | + - Include installation steps, the commands used, and output from `kubectl get nodes`. |
| 29 | + |
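
A single-node cluster is enough to pass this task, but a multi-node layout behaves more like production (for example, per-node exporters become visible as separate targets). The sketch below writes a KIND cluster config with one control-plane node and two workers; the file name `kind-config.yaml` and the node count are assumptions, not requirements of the challenge.

```bash
# Write a hypothetical multi-node KIND config: one control-plane node, two workers.
cat > kind-config.yaml <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
EOF

# Then create the cluster from it (requires kind and Docker on the host):
#   kind create cluster --name observability-cluster --config kind-config.yaml
#   kubectl get nodes
```

With this config, `kubectl get nodes` should list three nodes once the cluster is ready.
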
| 30 | +**Interview Questions:** |
| 31 | +- What are the benefits and limitations of using KIND for production-like testing? |
| 32 | +- How can you simulate production scenarios using a local KIND cluster? |
| 33 | + |
| 34 | +--- |
| 35 | + |
| 36 | +## Task 2: Deploy Prometheus on KIND with Advanced Configurations |
| 37 | + |
| 38 | +**Real-World Scenario:** |
| 39 | +Deploy Prometheus on your KIND cluster with a custom configuration that includes advanced scrape settings and relabeling rules to ensure high-quality metric collection. |
| 40 | + |
| 41 | +**Steps:** |
| 42 | +1. **Create a Custom Prometheus Configuration:** |
| 43 | + - Write a `prometheus.yml` with custom scrape configurations targeting cluster components (e.g., kube-state-metrics, Node Exporter) and advanced relabeling rules to clean up metric labels. |
| 44 | +2. **Deploy Prometheus:** |
| 45 | + - Deploy Prometheus using a Kubernetes Deployment or via a Helm chart. |
| 46 | +3. **Verify and Tune:** |
| 47 | + - Access the Prometheus UI to verify that metrics are being scraped as expected. |
| 48 | + - Adjust relabeling rules and scrape intervals to optimize performance. |
| 49 | +4. **Document in `solution.md`:** |
| 50 | + - Include your `prometheus.yml` and screenshots of the Prometheus UI showing active targets and effective relabeling. |
| 51 | + |
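
As a starting point for the custom configuration, here is a minimal sketch of a `prometheus.yml` with Kubernetes service discovery, target relabeling, and metric relabeling. The Service names (`node-exporter`, `kube-state-metrics`), the `kube-system` namespace, and the 30s interval are assumptions for illustration; adjust them to match what you actually deploy.

```bash
# Write an illustrative prometheus.yml; names and intervals are assumptions.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 30s       # assumed default; tune per job if needed
  evaluation_interval: 30s

scrape_configs:
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only endpoints of a Service named "node-exporter" (assumed name).
      - source_labels: [__meta_kubernetes_endpoints_name]
        regex: node-exporter
        action: keep
      # Expose the Kubernetes node name as a stable "node" label.
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
    metric_relabel_configs:
      # Drop the exporter's own Go runtime metrics to reduce series count.
      - source_labels: [__name__]
        regex: go_.*
        action: drop

  - job_name: kube-state-metrics
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']   # assumed Service address
EOF

# Validate before deploying (requires promtool, shipped with Prometheus):
#   promtool check config prometheus.yml
```

The `keep` rule filters targets before scraping, while `metric_relabel_configs` runs after the scrape and only affects which series are stored. Prometheus also needs RBAC permissions to list endpoints for the service discovery job to work.
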
| 52 | +**Interview Questions:** |
| 53 | +- How do advanced relabeling rules refine metric collection in Prometheus? |
| 54 | +- What performance issues might you encounter when scraping targets on a KIND cluster, and how would you address them? |
| 55 | + |
| 56 | +--- |
| 57 | + |
| 58 | +## Task 3: Deploy Grafana and Build Production-Grade Dashboards |
| 59 | + |
| 60 | +**Real-World Scenario:** |
| 61 | +Deploy Grafana on your KIND cluster and configure it to use Prometheus as a data source. Then, create dashboards that reflect real production metrics, including custom queries and complex visualizations. |
| 62 | + |
| 63 | +**Steps:** |
| 64 | +1. **Deploy Grafana:** |
| 65 | + - Create a Kubernetes Deployment and Service for Grafana. |
| 66 | +2. **Configure the Data Source:** |
| 67 | + - In the Grafana UI, add Prometheus as a data source. |
| 68 | +3. **Design Production Dashboards:** |
| 69 | + - Create dashboards with panels that display key metrics (e.g., CPU, memory, disk I/O, network latency) using advanced PromQL queries. |
| 70 | + - Customize panel visualizations (e.g., graphs, tables, heatmaps) to present data effectively. |
| 71 | +4. **Document in `solution.md`:** |
| 72 | + - Include configuration details, screenshots of dashboards, and an explanation of the queries and visualization choices. |
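
The data source can be added by hand in the UI as described above; an alternative that is easier to reproduce is Grafana's file-based provisioning. The sketch below writes a provisioning file that Grafana reads from `/etc/grafana/provisioning/datasources/` at startup. The Prometheus URL (Service `prometheus` in namespace `monitoring`, port 9090) is an assumption.

```bash
# Write a Grafana data source provisioning file (URL is an assumed in-cluster address).
cat > grafana-datasource.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090
    isDefault: true
EOF

# One way to mount it: store it in a ConfigMap and mount that ConfigMap at
# /etc/grafana/provisioning/datasources/ in the Grafana Deployment, e.g.
#   kubectl create configmap grafana-datasources -n monitoring --from-file=grafana-datasource.yaml
```
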
| 73 | + |
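
Before building panels, it can help to try the queries directly against the Prometheus HTTP API. The following sketch defines three example PromQL queries based on standard Node Exporter metrics and a small helper that runs an instant query with `curl`. The helper name `prom_query`, the variable names, and the default URL are illustrative assumptions.

```bash
# Base URL of Prometheus, e.g. after port-forwarding its Service to localhost (assumed).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# CPU utilization per node (%), derived from idle time over a 5-minute window.
CPU_QUERY='100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))'

# Memory utilization per node (%).
MEM_QUERY='100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)'

# Disk read throughput per device (bytes per second).
DISK_QUERY='sum by (instance, device) (rate(node_disk_read_bytes_total[5m]))'

# Run an instant query against the Prometheus HTTP API and print the raw JSON.
prom_query() {
  curl -s --get --data-urlencode "query=$1" "${PROM_URL}/api/v1/query"
}

# Example (needs a reachable Prometheus):
#   prom_query "$CPU_QUERY"
```

The same expressions can be pasted into a Grafana panel's query editor. Aggregating with `avg by (instance)` keeps the number of returned series small, which tends to make dashboards faster and easier to read.
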
| 74 | +**Interview Questions:** |
| 75 | +- What factors are critical when designing dashboards for production monitoring? |
| 76 | +- How do you optimize PromQL queries for performance and clarity in Grafana? |
| 77 | + |
| 78 | +--- |
| 79 | + |
| 80 | +## Task 4: Configure Alerting and Notification Rules |
| 81 | + |
| 82 | +**Real-World Scenario:** |
| 83 | +Establish robust alerting to detect critical issues (e.g., resource exhaustion, node failures) and notify the operations team immediately. |
| 84 | + |
| 85 | +**Steps:** |
| 86 | +1. **Define Alerting Rules:** |
| 87 | + - Define alerting rules in a rule file referenced from the `rule_files` section of `prometheus.yml`, and configure Alertmanager to route the alerts those rules fire. |
| 88 | +2. **Configure Notification Channels:** |
| 89 | + - Set up Grafana (or Alertmanager) to send notifications via email, Slack, or another channel. |
| 90 | +3. **Test Alerts:** |
| 91 | + - Simulate alert conditions (e.g., by generating CPU load or temporarily reducing a pod's resource limits) to verify that notifications are sent. |
| 92 | +4. **Document in `solution.md`:** |
| 93 | + - Include your alerting configuration, screenshots of triggered alerts, and a brief rationale for chosen thresholds. |
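
A minimal sketch of a Prometheus rule file for the conditions mentioned in this task (resource exhaustion and node failures) might look like the following. The alert names, the 85% threshold, the `for` durations, and the `node-exporter` job label are assumptions chosen for illustration; pick values that fit your own workload.

```bash
# Write an illustrative alerting rule file; thresholds and names are assumptions.
cat > alert-rules.yml <<'EOF'
groups:
  - name: node-alerts
    rules:
      - alert: HighNodeCPU
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage has been above 85% for 10 minutes."
      - alert: NodeExporterDown
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node Exporter target {{ $labels.instance }} is down"
EOF

# Reference the file from prometheus.yml:
#   rule_files:
#     - alert-rules.yml
# Validate with: promtool check rules alert-rules.yml
```

The `for` clause requires the condition to hold continuously before the alert fires, which is one common way to reduce false positives from short spikes.
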
| 94 | + |
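
If you choose Alertmanager for notifications, a minimal routing configuration for Slack could be sketched as below. The receiver name, the channel, the grouping intervals, and the webhook URL are placeholders and assumptions; the webhook in particular must be replaced with one from your own Slack workspace.

```bash
# Write an illustrative Alertmanager config; webhook URL and channel are placeholders.
cat > alertmanager.yml <<'EOF'
route:
  receiver: slack-notifications
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'
        channel: '#alerts'
        send_resolved: true
EOF

# Point Prometheus at Alertmanager in prometheus.yml (address is an assumption):
#   alerting:
#     alertmanagers:
#       - static_configs:
#           - targets: ['alertmanager.monitoring.svc:9093']
# Validate with: amtool check-config alertmanager.yml
```

Grouping by `alertname` and `instance` batches related alerts into one message, and `repeat_interval` limits how often an unresolved alert is re-sent.
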
| 95 | +**Interview Questions:** |
| 96 | +- How do you design effective alerting rules to minimize false positives in production? |
| 97 | +- What challenges do you face in configuring notifications for a dynamic environment? |
| 98 | + |
| 99 | +--- |
| 100 | + |
| 101 | +## Task 5: Deploy Node Exporter for Enhanced System Metrics |
| 102 | + |
| 103 | +**Real-World Scenario:** |
| 104 | +Enhance system monitoring by deploying Node Exporter on your KIND cluster to collect detailed metrics such as CPU, memory, disk, and network usage, which are critical for troubleshooting production issues. |
| 105 | + |
| 106 | +**Steps:** |
| 107 | +1. **Deploy Node Exporter:** |
| 108 | + - Create a DaemonSet so that one Node Exporter pod runs on every node in your KIND cluster (a Deployment does not guarantee per-node placement). |
| 109 | +2. **Verify Metrics Collection:** |
| 110 | + - Ensure Node Exporter endpoints are correctly scraped by Prometheus. |
| 111 | +3. **Document in `solution.md`:** |
| 112 | + - Include your Node Exporter YAML configuration and screenshots showing metrics collected in Prometheus. |
| 113 | + - Explain the importance of system-level metrics in production monitoring. |
| 114 | + |
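
One possible shape for the Node Exporter manifest is a DaemonSet plus a headless Service that Prometheus can discover. In the sketch below, the `monitoring` namespace, the `app: node-exporter` label, and the unpinned `latest` image tag are assumptions; pin a specific image version for anything beyond a local test.

```bash
# Write an illustrative Node Exporter manifest (namespace, labels, and tag are assumptions).
cat > node-exporter.yaml <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      hostPID: true
      tolerations:
        - operator: Exists          # also schedule on tainted (control-plane) nodes
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - name: metrics
              containerPort: 9100
          volumeMounts:
            - { name: proc, mountPath: /host/proc, readOnly: true }
            - { name: sys, mountPath: /host/sys, readOnly: true }
      volumes:
        - { name: proc, hostPath: { path: /proc } }
        - { name: sys, hostPath: { path: /sys } }
---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  clusterIP: None
  selector:
    app: node-exporter
  ports:
    - name: metrics
      port: 9100
EOF

# Apply it (requires a running cluster):
#   kubectl create namespace monitoring
#   kubectl apply -f node-exporter.yaml
```

After applying, `kubectl get pods -n monitoring -o wide` should show one `node-exporter` pod per node, and each pod should appear as a target on the Prometheus targets page once a matching scrape job exists.
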
| 115 | +**Interview Questions:** |
| 116 | +- What additional system metrics does Node Exporter provide that are crucial for production? |
| 117 | +- How would you integrate Node Exporter metrics into your existing Prometheus setup? |
| 118 | + |
| 119 | +--- |
| 120 | + |
| 121 | +## Bonus Task: Monitor and Log an AWS EKS Cluster |
| 122 | + |
| 123 | +**Real-World Scenario:** |
| 124 | +For an added challenge, provision a new AWS EKS cluster or use an existing one, then set up Prometheus and Grafana to monitor it and, optionally, a logging solution to capture its logs. This task simulates observability for a production cloud environment. |
| 125 | + |
| 126 | +**Steps:** |
| 127 | +1. **Provision an EKS Cluster:** |
| 128 | + - Use Terraform to deploy an EKS cluster (or leverage an existing one) and document key configuration settings. |
| 129 | +2. **Deploy Prometheus and Grafana on EKS:** |
| 130 | + - Configure Prometheus with appropriate scrape targets for the EKS cluster. |
| 131 | + - Deploy Grafana and integrate it with Prometheus. |
| 132 | +3. **Integrate Logging (Optional):** |
| 133 | + - Configure a logging solution (e.g., Fluentd or CloudWatch) to capture EKS logs. |
| 134 | +4. **Document in `solution.md`:** |
| 135 | + - Summarize your EKS provisioning steps, Prometheus and Grafana configurations, and any logging integration. |
| 136 | + - Explain how monitoring and logging improve observability in a cloud environment. |
| 137 | + |
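
The task does not prescribe how to install the stack on EKS. One common option is the community `kube-prometheus-stack` Helm chart, which bundles Prometheus, Alertmanager, Grafana, Node Exporter, and kube-state-metrics. The sketch below writes a small values file for that chart; the release name `monitoring`, the 7-day retention, and exposing Grafana through a LoadBalancer are all assumptions, and persistent storage is deliberately left out.

```bash
# Write illustrative Helm values for kube-prometheus-stack (all settings are assumptions).
cat > eks-values.yaml <<'EOF'
prometheus:
  prometheusSpec:
    retention: 7d
    scrapeInterval: 30s
grafana:
  service:
    type: LoadBalancer
EOF

# Install against the EKS cluster (requires aws CLI, kubectl, and helm;
# <region> and <cluster-name> are placeholders for your own values):
#   aws eks update-kubeconfig --region <region> --name <cluster-name>
#   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
#   helm repo update
#   helm install monitoring prometheus-community/kube-prometheus-stack \
#     --namespace monitoring --create-namespace -f eks-values.yaml
```

A LoadBalancer Service creates a billable AWS load balancer and may expose Grafana publicly, so for a short exercise `kubectl port-forward` is a cheaper and safer alternative.
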
| 138 | +**Interview Questions:** |
| 139 | +- What are the key challenges of monitoring an EKS cluster versus a local KIND cluster? |
| 140 | +- How would you integrate logging with monitoring tools to ensure comprehensive observability? |
| 141 | + |
| 142 | +--- |
| 143 | + |
| 144 | +## How to Submit |
| 145 | + |
| 146 | +1. **Push Your Final Work to GitHub:** |
| 147 | + - Ensure all files (Prometheus and Grafana configurations, Node Exporter YAML, Terraform files for the bonus task, `solution.md`, etc.) are committed and pushed to your fork of the [online_shop repository](https://github.com/Amitabh-DevOps/online_shop). |
| 148 | + |
| 149 | +2. **Create a Pull Request (PR):** |
| 150 | + - Open a PR from your branch (e.g., `observability-challenge`) to the main repository. |
| 151 | + - **Title:** |
| 152 | + ``` |
| 153 | + Week 10 Challenge - Observability Challenge (Prometheus & Grafana on KIND/EKS) |
| 154 | + ``` |
| 155 | + - **PR Description:** |
| 156 | + - Summarize your approach, list key commands/configurations, and include screenshots or logs as evidence. |
| 157 | + |
| 158 | +3. **Submit Your Documentation:** |
| 159 | + - **Important:** Place your `solution.md` file in the Week 10 (Observability) task folder of the 90DaysOfDevOps repository. |
| 160 | + |
| 161 | +4. **Share Your Experience on LinkedIn:** |
| 162 | + - Write a post summarizing your Observability challenge experience. |
| 163 | + - Include key takeaways, challenges faced, and insights (e.g., KIND/EKS setup, advanced configurations, dashboard creation, alerting strategies, and Node Exporter integration). |
| 164 | + - Use the hashtags: **#90DaysOfDevOps #Prometheus #Grafana #KIND #EKS #Observability #DevOps #InterviewPrep** |
| 165 | + - Optionally, provide links to your repository or blog posts detailing your journey. |
| 166 | + |
| 167 | +--- |
| 168 | + |
| 169 | +## TrainWithShubham Resources for Observability |
| 170 | + |
| 171 | +- **[Prometheus & Grafana One-Shot Video](https://youtu.be/DXZUunEeHqM?si=go1m-THyng7Ipyu6)** |
| 172 | + |
| 173 | +--- |
| 174 | + |
| 175 | +## Additional Resources |
| 176 | + |
| 177 | +- **[Prometheus Official Documentation](https://prometheus.io/docs/)** |
| 178 | +- **[Grafana Official Documentation](https://grafana.com/docs/)** |
| 179 | +- **[Alertmanager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)** |
| 180 | +- **[Kubernetes Resource Metrics Pipeline](https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/)** |
| 181 | +- **[Grafana Dashboards](https://grafana.com/grafana/dashboards/)** |
| 182 | + |
| 183 | +--- |
| 184 | + |
| 185 | +Complete these tasks, answer the interview questions in your documentation, and use your work as a reference to prepare for real-world DevOps challenges and technical interviews. |