Commit 6a60f0c

Merge pull request #2 from Ladas/argocd-gitops-dev-phase-1
Argocd gitops dev phase 1
2 parents d68618e + d84d85b commit 6a60f0c

194 files changed: +45872 -11837 lines changed

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
---
name: check-alerts
description: Check currently firing Grafana alerts, analyze alert status, and investigate alert issues in the Kagenti platform
---

# Check Alerts Skill

This skill helps you check and analyze Grafana alerts in the Kagenti platform.

## When to Use

- User asks "what alerts are firing?"
- User wants to check alert status
- After platform changes or deployments
- During incident investigation
- When troubleshooting platform issues

## What This Skill Does

1. **List Firing Alerts**: Show all currently active alerts
2. **Alert Details**: Display alert severity, component, and description
3. **Alert History**: Check recent alert state changes
4. **Query Alert Rules**: Verify alert configuration
5. **Test Alert Queries**: Validate PromQL queries

## Examples

### Check Firing Alerts

```bash
# Get all currently firing alerts from Grafana
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/alertmanager/grafana/api/v2/alerts' \
  -u admin:admin123 | python3 -c "
import sys, json
alerts = json.load(sys.stdin)
firing = [a for a in alerts if a.get('status', {}).get('state') == 'active']
print(f'Firing alerts: {len(firing)}')
for alert in firing:
    labels = alert.get('labels', {})
    annotations = alert.get('annotations', {})
    print(f\"\\n• {labels.get('alertname')} ({labels.get('severity')})\")
    print(f\"  Component: {labels.get('component')}\")
    print(f\"  Description: {annotations.get('description', 'N/A')[:100]}...\")
"
```
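
During incident triage it can help to see how many firing alerts fall under each severity before reading them one by one. The following is a minimal sketch, assuming the alerts have already been parsed from the JSON returned by the command above; `count_by_severity` and `sample_alerts` are hypothetical names used for illustration, and the sample data is made up.

```python
from collections import Counter


def count_by_severity(alerts):
    """Count active alerts per `severity` label.

    Alerts without a severity label are counted under 'unknown'.
    """
    firing = [a for a in alerts if a.get("status", {}).get("state") == "active"]
    return Counter(a.get("labels", {}).get("severity", "unknown") for a in firing)


# Hypothetical sample in the same shape the firing-alerts command parses
sample_alerts = [
    {"labels": {"alertname": "PrometheusDown", "severity": "critical"},
     "status": {"state": "active"}},
    {"labels": {"alertname": "HighMemory", "severity": "warning"},
     "status": {"state": "active"}},
    {"labels": {"alertname": "DiskFull", "severity": "critical"},
     "status": {"state": "suppressed"}},
    {"labels": {"alertname": "NoSeverity"},
     "status": {"state": "active"}},
]

for severity, count in sorted(count_by_severity(sample_alerts).items()):
    print(f"{severity}: {count}")
```

Suppressed (silenced or inhibited) alerts are left out on purpose, matching the `state == 'active'` filter used in the example above.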

### List All Alert Rules

```bash
# Get all configured alert rules
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
print(f'Total alert rules: {len(rules)}')
for rule in rules:
    print(f\"  • {rule.get('title')} ({rule.get('labels', {}).get('severity')})\")
"
```

### Check Specific Alert Configuration

```bash
# Get configuration for a specific alert
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this to the alert UID
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    print(f\"Alert: {rule.get('title')}\")
    print(f\"Query: {rule.get('data', [{}])[0].get('model', {}).get('expr')}\")
    print(f\"noDataState: {rule.get('noDataState')}\")
    print(f\"execErrState: {rule.get('execErrState')}\")
else:
    # Report a missing rule instead of exiting silently
    print(f\"No alert rule found with UID '{alert_uid}'\")
"
```

### Test Alert Query Against Prometheus

```bash
# Test an alert's PromQL query
QUERY='up{job="kubernetes-pods",app="prometheus"} == 0'

kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode "query=${QUERY}" | python3 -m json.tool
```

### Check Alert Evaluation State

```bash
# Check why an alert is firing or not firing
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/eval/rules' \
  -u admin:admin123 | python3 -m json.tool
```

## Alert Locations in Grafana UI

**Access Grafana**: https://grafana.localtest.me:9443
**Credentials**: admin / admin123

**Navigation**:
1. **Alerting** → **Alert rules** - View all configured alerts
2. **Alerting** → **Alert list** - See firing/pending alerts
3. **Alerting** → **Silences** - Manage alert silences
4. **Alerting** → **Contact points** - Check notification settings
5. **Alerting** → **Notification policies** - View routing rules

## Common Alert Issues

### False Positives
- Check the `noDataState` configuration (should be `OK` for most alerts)
- Verify that the query matches the actual resource type (Deployment vs StatefulSet)
- Test that the query returns the expected results
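
The `noDataState` check can be run across all rules at once. The following is a minimal sketch, assuming the rules have been parsed from the provisioning endpoint used in the examples earlier; `rules_not_matching_no_data_state` and `sample_rules` are hypothetical names, and the second sample rule is invented for illustration.

```python
def rules_not_matching_no_data_state(rules, expected_state="OK"):
    """Return (uid, title, noDataState) for rules whose noDataState differs
    from expected_state, so they can be reviewed for false positives."""
    return [
        (r.get("uid"), r.get("title"), r.get("noDataState"))
        for r in rules
        if r.get("noDataState") != expected_state
    ]


# Hypothetical sample using the fields the earlier examples already read
sample_rules = [
    {"uid": "prometheus-down", "title": "Prometheus Down", "noDataState": "OK"},
    {"uid": "example-rule", "title": "Example Rule", "noDataState": "Alerting"},
]

for uid, title, state in rules_not_matching_no_data_state(sample_rules):
    print(f"{title} ({uid}): noDataState={state}")
```

A rule appearing in this list is not necessarily wrong; some alerts may intentionally fire when data is missing.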

### Alert Not Firing When It Should
- Verify that the metric exists in Prometheus
- Check that the alert threshold is appropriate
- Verify that the `for` duration isn't too long
- Check that `noDataState` isn't masking the issue
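
To confirm that a metric exists, query it without the alert's comparison (for example `up{app="prometheus"}` rather than `... == 0`) and count the returned series. The following is a minimal sketch, assuming the response is a parsed instant-query result of type `vector` from the Prometheus HTTP API; `series_count` and both sample responses are hypothetical.

```python
def series_count(response):
    """Return the number of series in a Prometheus instant-query response.

    Raises ValueError when Prometheus reports that the query itself failed.
    Assumes a vector result; a scalar result has a different shape.
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return len(response.get("data", {}).get("result", []))


# Hypothetical responses: one with no matching series, one with a single series
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}
present = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "app": "prometheus"},
             "value": [1700000000, "1"]},
        ],
    },
}

print(series_count(empty))
print(series_count(present))
```

A count of zero for the bare metric suggests the alert has no data to evaluate, which is where `noDataState` decides the outcome.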

### Alert Configuration Not Loading
- Restart Grafana: `kubectl rollout restart deployment/grafana -n observability`
- Check that the ConfigMap is applied: `kubectl get configmap grafana-alerting -n observability`
- Verify that there are no YAML syntax errors

## Related Documentation

- [Alert Runbooks](../../../docs/runbooks/alerts/)
- [Alert Testing Guide](../../../docs/04-observability/ALERT_TESTING_GUIDE.md)
- [CLAUDE.md Alert Monitoring](../../../CLAUDE.md#alert-monitoring)
- [TODO_INCIDENTS.md](../../../TODO_INCIDENTS.md) - Current incident tracking

## Runbooks by Alert

When an alert fires, consult its runbook:
- `docs/runbooks/alerts/<alert-uid>.md`

Example: If "Prometheus Down" alert fires → `docs/runbooks/alerts/prometheus-down.md`
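
The UID-to-runbook convention can be expressed as a small helper. This is a minimal sketch: `runbook_path` and `RUNBOOK_DIR` are hypothetical names, and the restriction of UIDs to letters, digits, `-`, and `_` is an assumption made here to keep the resulting path safe, not something the convention above states.

```python
import re

# Directory taken from the runbook convention above
RUNBOOK_DIR = "docs/runbooks/alerts"


def runbook_path(alert_uid):
    """Map an alert UID to its runbook path under RUNBOOK_DIR.

    Assumption: UIDs contain only letters, digits, '-' and '_';
    anything else is rejected to avoid building an unexpected path.
    """
    if not re.fullmatch(r"[A-Za-z0-9_-]+", alert_uid):
        raise ValueError(f"unexpected characters in alert UID: {alert_uid!r}")
    return f"{RUNBOOK_DIR}/{alert_uid}.md"


print(runbook_path("prometheus-down"))
```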

🤖 Generated with [Claude Code](https://claude.com/claude-code)
