|
| 1 | +--- |
| 2 | +name: check-alerts |
| 3 | +description: Check currently firing Grafana alerts, analyze alert status, and investigate alert issues in the Kagenti platform |
| 4 | +--- |
| 5 | + |
| 6 | +# Check Alerts Skill |
| 7 | + |
| 8 | +This skill helps you check and analyze Grafana alerts in the Kagenti platform. |
| 9 | + |
| 10 | +## When to Use |
| 11 | + |
| 12 | +- User asks "what alerts are firing?" |
| 13 | +- User wants to check alert status |
| 14 | +- After platform changes or deployments |
| 15 | +- During incident investigation |
| 16 | +- When troubleshooting platform issues |
| 17 | + |
| 18 | +## What This Skill Does |
| 19 | + |
| 20 | +1. **List Firing Alerts**: Show all currently active alerts |
| 21 | +2. **Alert Details**: Display alert severity, component, and description |
| 22 | +3. **Alert History**: Check recent alert state changes |
| 23 | +4. **Query Alert Rules**: Verify alert configuration |
| 24 | +5. **Test Alert Queries**: Validate PromQL queries |
| 25 | + |
| 26 | +## Examples |
| 27 | + |
| 28 | +### Check Firing Alerts |
| 29 | + |
| 30 | +```bash |
| 31 | +# Get all currently firing alerts from Grafana |
| 32 | +kubectl exec -n observability deployment/grafana -- \ |
| 33 | + curl -s 'http://localhost:3000/api/alertmanager/grafana/api/v2/alerts' \ |
| 34 | + -u admin:admin123 | python3 -c " |
| 35 | +import sys, json |
| 36 | +alerts = json.load(sys.stdin) |
| 37 | +firing = [a for a in alerts if a.get('status', {}).get('state') == 'active'] |
| 38 | +print(f'Firing alerts: {len(firing)}') |
| 39 | +for alert in firing: |
| 40 | +    labels = alert.get('labels', {}) |
| 41 | +    annotations = alert.get('annotations', {}) |
| 42 | +    print(f\"\\n• {labels.get('alertname')} ({labels.get('severity')})\") |
| 43 | +    print(f\" Component: {labels.get('component')}\") |
| 44 | +    print(f\" Description: {annotations.get('description', 'N/A')[:100]}...\") |
| 45 | +" |
| 46 | +``` |
| 47 | + |
| 48 | +### List All Alert Rules |
| 49 | + |
| 50 | +```bash |
| 51 | +# Get all configured alert rules |
| 52 | +kubectl exec -n observability deployment/grafana -- \ |
| 53 | + curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \ |
| 54 | + -u admin:admin123 | python3 -c " |
| 55 | +import sys, json |
| 56 | +rules = json.load(sys.stdin) |
| 57 | +print(f'Total alert rules: {len(rules)}') |
| 58 | +for rule in rules: |
| 59 | +    print(f\" • {rule.get('title')} ({rule.get('labels', {}).get('severity')})\") |
| 60 | +" |
| 61 | +``` |
| 62 | + |
| 63 | +### Check Specific Alert Configuration |
| 64 | + |
| 65 | +```bash |
| 66 | +# Get configuration for a specific alert |
| 67 | +kubectl exec -n observability deployment/grafana -- \ |
| 68 | + curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \ |
| 69 | + -u admin:admin123 | python3 -c " |
| 70 | +import sys, json |
| 71 | +rules = json.load(sys.stdin) |
| 72 | +alert_uid = 'prometheus-down' # Change this to the alert UID |
| 73 | +rule = next((r for r in rules if r.get('uid') == alert_uid), None) |
| 74 | +if rule: |
| 75 | +    print(f\"Alert: {rule.get('title')}\") |
| 76 | +    print(f\"Query: {(rule.get('data') or [{}])[0].get('model', {}).get('expr')}\") |
| 77 | +    print(f\"noDataState: {rule.get('noDataState')}\") |
| 78 | +    print(f\"execErrState: {rule.get('execErrState')}\") |
| 79 | +" |
| 80 | +``` |
| 81 | + |
| 82 | +### Test Alert Query Against Prometheus |
| 83 | + |
| 84 | +```bash |
| 85 | +# Test an alert's PromQL query |
| 86 | +QUERY='up{job="kubernetes-pods",app="prometheus"} == 0' |
| 87 | + |
| 88 | +kubectl exec -n observability deployment/grafana -- \ |
| 89 | + curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \ |
| 90 | + --data-urlencode "query=${QUERY}" | python3 -m json.tool |
| 91 | +``` |
| 92 | + |
| 93 | +### Check Alert Evaluation State |
| 94 | + |
| 95 | +```bash |
| 96 | +# Check why an alert is firing or not firing |
| 97 | +kubectl exec -n observability deployment/grafana -- \ |
| 98 | + curl -s 'http://localhost:3000/api/prometheus/grafana/api/v1/rules' \ |
| 99 | + -u admin:admin123 | python3 -m json.tool |
| 100 | +``` |
| 101 | + |
| 102 | +## Alert Locations in Grafana UI |
| 103 | + |
| 104 | +**Access Grafana**: https://grafana.localtest.me:9443 |
| 105 | +**Credentials**: admin / admin123 |
| 106 | + |
| 107 | +**Navigation**: |
| 108 | +1. **Alerting** → **Alert rules** - View all configured alerts |
| 109 | +2. **Alerting** → **Alert list** - See firing/pending alerts |
| 110 | +3. **Alerting** → **Silences** - Manage alert silences |
| 111 | +4. **Alerting** → **Contact points** - Check notification settings |
| 112 | +5. **Alerting** → **Notification policies** - View routing rules |
| 113 | + |
| 114 | +## Common Alert Issues |
| 115 | + |
| 116 | +### False Positives |
| 117 | +- Check the `noDataState` configuration (should be `OK` for most alerts) |
| 118 | +- Verify the query matches the actual resource type (Deployment vs StatefulSet) |
| 119 | +- Test that the query returns the expected results |
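
The `noDataState` check in the list above could be automated roughly as follows. This is a minimal sketch, assuming the rule list has already been fetched from the `/api/v1/provisioning/alert-rules` endpoint used in the examples and parsed from JSON. The helper name `rules_with_unexpected_no_data_state` and the sample rules are hypothetical and only illustrate the shape of the data.

```python
import json


def rules_with_unexpected_no_data_state(rules, expected="OK"):
    """Return (uid, title, noDataState) for rules whose noDataState differs from `expected`."""
    flagged = []
    for rule in rules:
        state = rule.get("noDataState")
        if state != expected:
            flagged.append((rule.get("uid"), rule.get("title"), state))
    return flagged


# Hypothetical sample shaped like the provisioning API response (not real platform data)
sample_rules = json.loads("""[
  {"uid": "prometheus-down", "title": "Prometheus Down", "noDataState": "OK"},
  {"uid": "example-rule", "title": "Example Rule", "noDataState": "Alerting"}
]""")

for uid, title, state in rules_with_unexpected_no_data_state(sample_rules):
    print(f"{uid}: {title} has noDataState={state}")
```

A flagged rule is not necessarily wrong: some alerts may intentionally fire on missing data, so treat the output as a list of rules to review.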
| 120 | + |
| 121 | +### Alert Not Firing When It Should |
| 122 | +- Verify metric exists in Prometheus |
| 123 | +- Check alert threshold is appropriate |
| 124 | +- Verify `for` duration isn't too long |
| 125 | +- Check `noDataState` isn't masking the issue |
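
One way to work through the checks above is to run two instant queries against Prometheus: the bare metric selector, and the full alert expression. The sketch below interprets the two parsed JSON responses. It assumes the standard Prometheus `/api/v1/query` response shape; the function names `series_count` and `diagnose` and the sample responses are hypothetical.

```python
def series_count(response):
    """Count series in a Prometheus instant-query response; raise if the query failed."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return len(response.get("data", {}).get("result", []))


def diagnose(metric_response, alert_response):
    """Compare the bare-metric query with the full alert expression."""
    if series_count(metric_response) == 0:
        # No series at all: the rule sees no data, so noDataState decides its state
        return "metric missing"
    if series_count(alert_response) == 0:
        # Metric exists but the comparison filters everything out
        return "condition not met"
    # Expression returns series: if the alert is still not firing, check the `for` duration
    return "condition met"


# Hypothetical responses for `up{...}` and `up{...} == 0`
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}
up_is_one = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"__name__": "up", "app": "prometheus"}, "value": [1700000000, "1"]}
]}}

print(diagnose(empty, empty))
print(diagnose(up_is_one, empty))
```

The "metric missing" case is the one where `noDataState: OK` can hide a real outage, since the rule never sees a value to compare against its threshold.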
| 126 | + |
| 127 | +### Alert Configuration Not Loading |
| 128 | +- Restart Grafana: `kubectl rollout restart deployment/grafana -n observability` |
| 129 | +- Check ConfigMap applied: `kubectl get configmap grafana-alerting -n observability` |
| 130 | +- Verify no YAML syntax errors |
| 131 | + |
| 132 | +## Related Documentation |
| 133 | + |
| 134 | +- [Alert Runbooks](../../../docs/runbooks/alerts/) |
| 135 | +- [Alert Testing Guide](../../../docs/04-observability/ALERT_TESTING_GUIDE.md) |
| 136 | +- [CLAUDE.md Alert Monitoring](../../../CLAUDE.md#alert-monitoring) |
| 137 | +- [TODO_INCIDENTS.md](../../../TODO_INCIDENTS.md) - Current incident tracking |
| 138 | + |
| 139 | +## Runbooks by Alert |
| 140 | + |
| 141 | +When an alert fires, consult its runbook: |
| 142 | +- `docs/runbooks/alerts/<alert-uid>.md` |
| 143 | + |
| 144 | +Example: If "Prometheus Down" alert fires → `docs/runbooks/alerts/prometheus-down.md` |
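
The naming convention above keys on the alert UID rather than the display title. A small sketch of that mapping, assuming runbooks live under `docs/runbooks/alerts/` relative to the repository root; the helper name `runbook_path` is hypothetical.

```python
from pathlib import PurePosixPath

# Runbook directory as given in this document, relative to the repository root
RUNBOOK_DIR = PurePosixPath("docs/runbooks/alerts")


def runbook_path(alert_uid):
    """Build the runbook path for an alert UID, e.g. 'prometheus-down'."""
    return str(RUNBOOK_DIR / f"{alert_uid}.md")


print(runbook_path("prometheus-down"))
```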
| 145 | + |
| 146 | +🤖 Generated with [Claude Code](https://claude.com/claude-code) |