Phase 4.3 Monitoring Configuration
Date: 2025-11-16
Status: Ready for deployment
# 1. Apply ConfigMap
kubectl apply -f k8s/22-grafana-alerts.yaml
# 2. Restart Grafana
kubectl rollout restart deployment/grafana -n insightlearn
# 3. Verify in UI
# URL: http://localhost:3000
# Navigate: Alerting → Alert rules
# Expected: 5 rules in "InsightLearn Critical Alerts"

| # | Alert Name | Severity | Trigger | Duration | Action |
|---|---|---|---|---|---|
| 1 | API Health Check Failed | CRITICAL | API pod down | 2 min | Page on-call (15 min SLA) |
| 2 | High API Error Rate | WARNING | 5xx errors > 5% | 5 min | Review & investigate (30 min SLA) |
| 3 | Database Connection Failed | CRITICAL | SQL Server down | 1 min | Page on-call (15 min SLA) |
| 4 | High Memory Usage | WARNING | Memory > 85% | 5 min | Review logs (30 min SLA) |
| 5 | Slow API Response Time | WARNING | p95 > 2 seconds | 5 min | Check performance (30 min SLA) |
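If you prefer to confirm the provisioning from the command line instead of the UI, the sketch below asks Grafana's provisioning API how many rules carry the group name above. This is an assumption-laden check: it assumes Grafana 9+ (which exposes `/api/v1/provisioning/alert-rules`), that Grafana is reachable on localhost:3000 (see the port-forward commands later in this file), and that `GRAFANA_TOKEN` holds an admin or service-account token you created yourself.

```bash
# Hedged sketch: count provisioned rules in the "InsightLearn Critical Alerts" group.
# GRAFANA_TOKEN is assumed to hold an admin/service-account token; adjust as needed.
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  http://localhost:3000/api/v1/provisioning/alert-rules \
  | grep -o '"InsightLearn Critical Alerts"' | wc -l
# Expected: 5
```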
API_POD=$(kubectl get pods -n insightlearn -l app=insightlearn-api -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod $API_POD -n insightlearn
# Wait 2 minutes → Alert fires
# Pod restarts → Alert resolves

for i in {1..100}; do
curl -s -X POST http://localhost:7001/api/invalid &
done
wait
# Wait 5 minutes with elevated error rate → Alert fires

kubectl delete pod -n insightlearn $(kubectl get pods -n insightlearn -l app=sqlserver -o jsonpath='{.items[0].metadata.name}')
# Wait 1 minute → Alert fires
# Pod restarts → Alert resolves

# Generate load
for i in {1..1000}; do
curl -s http://localhost:7001/api/courses &
done
wait
# Monitor: kubectl top pods -n insightlearn
# If memory > 85% for 5 min → Alert fires

# Generate concurrent load
ab -n 1000 -c 50 http://localhost:7001/api/courses
# If p95 latency > 2s for 5 min → Alert fires

| Alert | Threshold | Why This Value? |
|---|---|---|
| API Health | 2 min down | Allows pod restart time, prevents false positives |
| Error Rate | 5% errors | Baseline 0-1%, 5% = clear systemic issue |
| Database | 1 min down | Shorter window than API because a database outage cascades to the API; faster detection |
| Memory | 85% of limit | Leaves 15% for GC spikes, before OOMKilled at 100% |
| Response Time | p95 > 2s | Baseline p95 400-500ms, 2s = 4x slowdown (user-noticeable) |
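For the response-time threshold in particular, it helps to see the p95 value Prometheus is actually reporting before judging whether 2 s fits your baseline. A minimal sketch, assuming the `http_request_duration_seconds_bucket` histogram listed in the metrics table later in this file and the Prometheus port-forward on 9091:

```bash
# Query the current p95 API latency (in seconds) from Prometheus.
curl -s 'http://localhost:9091/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
# A value consistently above 2 for 5 minutes is what the Slow API Response Time alert looks for.
```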
| Alert | Check | Command |
|---|---|---|
| API Health | Pod status | `kubectl get pods -n insightlearn \| grep api` |
| API Health | Logs | `kubectl logs <pod> -n insightlearn \| tail -100` |
| Error Rate | Error timeline | Grafana → API Performance dashboard |
| Error Rate | Dependency health | `kubectl get pods -n insightlearn \| grep -E "sql\|redis\|mongo"` |
| Database | Pod status | `kubectl get pods -n insightlearn \| grep sqlserver` |
| Database | Disk space | `kubectl exec <pod> -it -- df -h` |
| Memory | Current usage | `kubectl top pods -n insightlearn` |
| Memory | Increase limit | `kubectl patch deployment ... --type json ...` |
| Response Time | By endpoint | Grafana → API Performance → breakdown by path |
| Response Time | Database queries | `kubectl logs <api-pod> -n insightlearn \| grep Duration` |
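When an alert pages you, running these checks one by one is slow; a small wrapper script can dump the most useful state in one pass. A sketch under the same namespace and label assumptions used earlier in this file (the `app=insightlearn-api` selector comes from the test commands above):

```bash
#!/usr/bin/env bash
# triage.sh - quick first-response snapshot (sketch; adjust labels to your deployment)
set -euo pipefail
NS=insightlearn

echo "== Pod status =="
kubectl get pods -n "$NS"

echo "== Resource usage =="
kubectl top pods -n "$NS"

echo "== Last 100 API log lines =="
API_POD=$(kubectl get pods -n "$NS" -l app=insightlearn-api -o jsonpath='{.items[0].metadata.name}')
kubectl logs "$API_POD" -n "$NS" --tail=100
```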
# Grafana (3000)
kubectl port-forward -n insightlearn svc/grafana 3000:3000 &
# Prometheus (9091)
kubectl port-forward -n insightlearn svc/prometheus 9091:9090 &
# Kill all port-forwards
pkill -f "kubectl port-forward"

# Run comprehensive test
chmod +x test-grafana-alerts.sh
./test-grafana-alerts.sh
# Output: /tmp/grafana-alerts-test-<timestamp>.log
# Shows: Config verification, metric availability, alert structure, testing instructions

Notification channels:
Current Status: Console logging (development)
Production Setup (requires manual configuration):

- Webhook (recommended for API integration):
  - Create endpoint: POST /api/webhooks/alerts/{severity}
  - Receives JSON payload with alert details
  - Useful for custom alerting logic
- Email (for WARNING alerts):
  - SMTP server configured
  - Recipients: platform-team@insightlearn.com
  - Digest: All alerts grouped, sent hourly
- Slack (for critical incidents):
  - Webhook URL: Insert your Slack webhook
  - Channel: #insightlearn-critical (CRITICAL alerts)
  - Channel: #insightlearn-alerts (WARNING alerts)
- PagerDuty (for on-call escalation):
  - Service key: Insert PagerDuty service key
  - Severity mapping: CRITICAL → page, WARNING → incident
  - Auto-resolve when alert resolves
Configuration Steps:
Grafana UI:
1. Alerting → Notification channels
2. New channel
3. Type: Webhook/Email/Slack/PagerDuty
4. Configure details
5. Test notification
6. Save
7. Assign to alert rule
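Once a webhook channel is saved, it is worth exercising the receiver with a payload shaped like the one Grafana sends, so parsing problems surface before a real incident. The endpoint path below comes from the webhook item above; the payload fields are only an approximation of Grafana's webhook format and may need adjusting for your Grafana version.

```bash
# Hypothetical smoke test for the /api/webhooks/alerts/{severity} receiver.
# The JSON body approximates a Grafana webhook notification; verify against your version.
curl -s -X POST http://localhost:7001/api/webhooks/alerts/critical \
  -H "Content-Type: application/json" \
  -d '{
        "status": "firing",
        "title": "API Health Check Failed",
        "alerts": [
          {
            "labels": { "alertname": "API Health Check Failed", "severity": "critical" },
            "annotations": { "summary": "API pod down for more than 2 minutes" },
            "startsAt": "2025-11-16T10:00:00Z"
          }
        ]
      }'
```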
Resource impact:

- CPU: +1% (evaluation of 5 rules every minute)
- Memory: <10 MB (rule metadata + query cache)
- Network: <1 MB/min (Prometheus scraping)
- Latency: <200ms evaluation time per rule cycle
Safe to scale to: 50+ alert rules before considering optimization
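These figures are estimates; the actual overhead is easy to spot-check once the rules are live. A quick check, assuming Grafana's pods carry an `app=grafana` label (adjust the selector to match your manifests):

```bash
# Compare Grafana's CPU/memory usage before and after enabling the alert rules.
kubectl top pods -n insightlearn -l app=grafana
```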
Ongoing maintenance:

- Review alert summary in Grafana
- Check false positive rate (target: 0-2 per week)
- Update runbooks if procedures changed
- Comprehensive effectiveness report
- Analyze response time baselines
- Correlate alerts with user outages
- Adjust thresholds if needed
- Trend analysis (error rate, response time, memory)
- Capacity forecasting
- Review and update alert rules
| Issue | Cause | Fix |
|---|---|---|
| Alerts not showing | ConfigMap not loaded | kubectl rollout restart deployment/grafana -n insightlearn |
| "Metric not found" | Job not scraped | Verify Prometheus targets: http://localhost:9091/targets |
| Alerts flapping | Threshold too close to baseline | Increase threshold by 20-50% |
| Too many false positives | Threshold too low | Increase "for" duration or threshold value |
| No alerts firing | Threshold too high | Lower threshold to 1.5x baseline |
| Notifications not working | Channel not configured | Setup in Grafana UI + test notification |
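The targets check in the table above can also be done from the shell, which is handy when the Grafana UI itself is misbehaving. A sketch that assumes the Prometheus port-forward on 9091 and that `jq` is installed:

```bash
# List each scrape target's job and health; a "down" target explains "Metric not found".
curl -s http://localhost:9091/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.health)"'
```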
| File | Purpose |
|---|---|
| `k8s/22-grafana-alerts.yaml` | Kubernetes ConfigMaps with alert rules |
| `test-grafana-alerts.sh` | Comprehensive testing & verification script |
| `docs/GRAFANA-ALERTS-CONFIGURATION.md` | Detailed configuration documentation |
| `ALERT-RULES-QUICK-REFERENCE.md` | This file (quick reference) |
| Metric | Source | Update Interval |
|---|---|---|
| `up{job="..."}` | Prometheus scrape success | 15 seconds |
| `http_requests_total{status="..."}` | API metrics | Every request |
| `http_request_duration_seconds_bucket` | API metrics histogram | Every request |
| `container_memory_usage_bytes` | cgroup metrics | 15 seconds |
| `container_spec_memory_limit_bytes` | cgroup metrics | Static |
Note: These metrics depend on the API exposing a /metrics endpoint or on sidecar exporters.
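To confirm each of these metrics is actually being scraped rather than assuming the exporters are wired up, you can ask Prometheus how many series exist per metric. A sketch assuming the 9091 port-forward and `jq`; zero series for any name means the corresponding alert can never fire:

```bash
# Count time series per alerting metric; 0 means the exporter or /metrics endpoint is missing.
for m in up http_requests_total http_request_duration_seconds_bucket \
         container_memory_usage_bytes container_spec_memory_limit_bytes; do
  series=$(curl -s "http://localhost:9091/api/v1/query?query=count(${m})" \
    | jq -r '.data.result[0].value[1] // "0"')
  echo "${m}: ${series} series"
done
```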
Known limitations:

- Prometheus Retention: Currently undefined (uses default ~15 days)
  - Consider: `--storage.tsdb.retention.time=2w` for 2-week history
- Alert State: Stored in Grafana database (not persistent across pod restarts)
  - For production: Use AlertManager for persistence
- PromQL Limitations: No alert for "metric missing"
  - Implement: Separate metric availability check
- Alert Deduplication: Recommend implementing inhibit rules
  - Example: If DB down, suppress "API error rate" alert
- Notification Retry: Webhook failures are not retried
  - Implement: Message queue (RabbitMQ, Kafka) for reliability
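For the retention item in particular, the flag can be added without editing manifests by hand. Illustrative only: this assumes Prometheus runs as a plain Deployment named `prometheus` in the same namespace and that its container already has an `args` list (check the manifest first; a StatefulSet or operator-managed install needs a different approach).

```bash
# Append a 2-week retention window to the Prometheus container args (sketch; verify the
# container index and args with: kubectl get deployment prometheus -n insightlearn -o yaml).
kubectl patch deployment prometheus -n insightlearn --type json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--storage.tsdb.retention.time=2w"}]'
kubectl rollout status deployment/prometheus -n insightlearn
```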
Next steps:

- Deploy: `kubectl apply -f k8s/22-grafana-alerts.yaml`
- Test: `./test-grafana-alerts.sh`
- Configure: Set up notification channels in Grafana UI
- Monitor: Watch for false positives over 2 weeks
- Tune: Adjust thresholds based on baseline data
- Document: Create team runbooks for each alert
- Automate: Implement webhook receiver in API
- Integrate: Connect to incident management (Jira, PagerDuty)
Questions? See docs/GRAFANA-ALERTS-CONFIGURATION.md for detailed explanations