Merged
184 commits
a0a0068
chore: Update ArgoCD applications to use argocd-gitops-dev-phase-1 br…
Ladas Nov 14, 2025
a81379e
fix: Exclude Kubernetes API server from Istio traffic capture in kage…
Ladas Nov 14, 2025
97a99d1
fix: Exclude Kubernetes API server from Istio traffic capture in kage…
Ladas Nov 14, 2025
cbd1d83
docs: Streamline README.md for clarity and accuracy
Ladas Nov 14, 2025
2d18486
Fix OAuth2-Proxy mTLS TLS_WRONG_VERSION_NUMBER errors
Ladas Nov 14, 2025
b03d93d
feat: Add placeholder GitHub token secret for dev environment
Ladas Nov 14, 2025
372a022
chore: Remove *-secret.yaml from .gitignore
Ladas Nov 14, 2025
47c216c
fix: Add kagenti-enabled label to team1 namespace
Ladas Nov 14, 2025
1aa8998
fix: Move github-token-secret to team1 namespace
Ladas Nov 14, 2025
bf18c2e
refactor: Remove duplicate Gateway API CRD file (10,612 lines)
Ladas Nov 14, 2025
3ec19be
feat: Add environments ConfigMap for team1 namespace
Ladas Nov 14, 2025
693e513
feat: Add Grafana Tempo Query (Jaeger UI) for trace visualization
Ladas Nov 14, 2025
5221dc0
fix: Add agent.kagenti.dev RBAC permissions to kagenti-ui
Ladas Nov 14, 2025
4f90570
:sparkles: Enhance quick-redeploy.sh with GitHub Actions PR detection
Ladas Nov 14, 2025
bb7b9d3
refactor: Use Grafana Explore for Tempo traces instead of Jaeger UI
Ladas Nov 14, 2025
b7d965a
:bug: Fix ArgoCD App State Validation workflow
Ladas Nov 14, 2025
239cb2c
:bug: Make ArgoCD install CI-friendly - tolerate login failure
Ladas Nov 14, 2025
1544a2a
fix: Configure working Grafana datasources for OTEL and Tempo
Ladas Nov 14, 2025
d90f498
:recycle: Simplify CI workflow to use quick-redeploy.sh
Ladas Nov 14, 2025
0292260
:memo: Update docs with CI mode and branch detection
Ladas Nov 14, 2025
fa78040
:bug: Make ArgoCD CLI login optional in install script
Ladas Nov 14, 2025
15f23ae
feat: Add Loki log aggregation with Promtail and Grafana integration
Ladas Nov 14, 2025
21792fc
:bug: Fix fork PR detection for ArgoCD deployment
Ladas Nov 14, 2025
8d62da9
fix: Fix Loki config and Promtail resource limits
Ladas Nov 14, 2025
3663828
:bug: Make manual ArgoCD sync optional, rely on automated sync
Ladas Nov 14, 2025
542da68
:alarm_clock: Increase CI wait time to 15 minutes for ArgoCD sync
Ladas Nov 14, 2025
46a9c8d
fix: Configure Loki compactor and reduce Promtail file watchers
Ladas Nov 14, 2025
7d74c87
:wrench: Increase file descriptor limits in Kind cluster
Ladas Nov 14, 2025
9adb328
:bug: Add WAL volume mount for Loki 3.0
Ladas Nov 14, 2025
eb5cecf
chore: Temporarily disable SPIRE and agents applications
Ladas Nov 14, 2025
751c66b
:white_check_mark: Trigger updated Platform Validation & E2E Tests wo…
Ladas Nov 14, 2025
54c00f1
:fire: Reduce Promtail file watchers by excluding infrastructure name…
Ladas Nov 14, 2025
74f2b15
:bug: Remove Grafana DestinationRule causing mTLS conflict
Ladas Nov 14, 2025
747d182
:alarm_clock: Replace fixed wait with intelligent ArgoCD app monitoring
Ladas Nov 14, 2025
c8825a4
:lock: Add PERMISSIVE mTLS policies for observability stack
Ladas Nov 14, 2025
fb3a1ca
:lock: Add DestinationRule for Grafana to disable mTLS from Gateway
Ladas Nov 14, 2025
be07872
:lock: Add NetworkPolicy egress rules for Grafana datasources
Ladas Nov 14, 2025
342ad7d
Add NetworkPolicy ingress rules for Grafana datasource access
Ladas Nov 14, 2025
2071d1d
:package: Add operator image tar files and improve CI workflow
Ladas Nov 14, 2025
08cc197
:bug: Fix operator tar loading in CI
Ladas Nov 14, 2025
f4edb95
Add DestinationRules to disable mTLS for OAuth2-Proxy services
Ladas Nov 14, 2025
f6067ac
:ambulance: Temporarily allow platform to be Progressing
Ladas Nov 14, 2025
0db3000
Add Grafana RBAC support with realm roles and protocol mappers
Ladas Nov 14, 2025
942c519
Add Prometheus server for metrics collection
Ladas Nov 14, 2025
bfaafe8
Add Prometheus server and Loki logs dashboard
Ladas Nov 14, 2025
dab2f76
:bug: Fix REQUIRED_APPS list - remove non-existent 'infrastructure' app
Ladas Nov 14, 2025
95930a0
:white_check_mark: Fix OTEL ConfigMap key in signal tests
Ladas Nov 14, 2025
7d96aa7
:white_check_mark: Fix Phoenix routing test to match actual architecture
Ladas Nov 14, 2025
ffb521f
:bug: Fix operator Applications not created in CI - move to base/
Ladas Nov 14, 2025
943d3d4
:white_check_mark: Fix kubectl exec to use subprocess for reliable st…
Ladas Nov 14, 2025
7b80436
:bug: Fix Prometheus K8s API connectivity by excluding port 443 from …
Ladas Nov 14, 2025
f2aad7c
:bug: Fix operator overlay remote kustomize URLs
Ladas Nov 14, 2025
d3e5a77
:lock: Add PERMISSIVE mTLS policy for Prometheus to accept HTTP conne…
Ladas Nov 14, 2025
26292e8
:bug: Disable Istio sidecar for Prometheus to allow plain HTTP from G…
Ladas Nov 14, 2025
5f83cab
:bug: Fix Grafana NetworkPolicy to allow egress to Prometheus
Ladas Nov 14, 2025
56c3dba
:bug: Add Prometheus NetworkPolicy to allow scraping OTEL Collector
Ladas Nov 14, 2025
107caaa
:rewind: Revert operator overlay URL changes - original paths were co…
Ladas Nov 14, 2025
6602743
:bug: Fix OTEL Collector NetworkPolicy to allow Prometheus ingress
Ladas Nov 14, 2025
e7e4d98
:bug: Fix Grafana admin password to match integration tests
Ladas Nov 14, 2025
1ee8816
:white_check_mark: Add operators back to REQUIRED_APPS list
Ladas Nov 14, 2025
d3c54f1
:white_check_mark: Fix Loki query timestamps for Alpine compatibility
Ladas Nov 14, 2025
dcde1f3
:bug: Fix regex matching bug in wait loop
Ladas Nov 14, 2025
07ce99a
:memo: Update TODO_CI.md with regex fix tracking
Ladas Nov 14, 2025
decc5d8
:bug: Fix Phoenix NetworkPolicy to allow Grafana ingress
Ladas Nov 14, 2025
36e96ac
fix(phoenix): Disable Istio sidecar injection for Phoenix
Ladas Nov 14, 2025
359f597
:bug: Fix operator exclusion bug in wait loop
Ladas Nov 14, 2025
f31f9f8
:lock: Add Phoenix egress rule to Grafana NetworkPolicy
Ladas Nov 14, 2025
32a1230
:memo: Document operator exclusion fix and CI progress
Ladas Nov 14, 2025
47e2a08
Add Phoenix DestinationRule to disable mTLS
Ladas Nov 14, 2025
00f20ea
:bug: Fix operator deployment timing issue in CI
Ladas Nov 14, 2025
3a01487
:memo: Mark ConfigMap key and kubectl exec issues as already fixed
Ladas Nov 14, 2025
e257729
:white_check_mark: Fix Prometheus query parsing in metrics end-to-end…
Ladas Nov 14, 2025
0c12841
:memo: Update TODO_TRACING.md to reflect 100% test pass rate
Ladas Nov 14, 2025
62cc636
:gear: Remove operators from REQUIRED_APPS in CI validation
Ladas Nov 14, 2025
36bf50b
:memo: Document OTEL signal flow testing completion (19/19 tests - 100%)
Ladas Nov 14, 2025
fcd27ad
Add Korrel8r signal correlation engine to observability stack
Ladas Nov 14, 2025
50252d3
Add Korrel8r to observability kustomization
Ladas Nov 14, 2025
fbca008
:white_check_mark: Fix pytest health check to match workflow logic
Ladas Nov 14, 2025
d1cb231
fix: Update Korrel8r image from v0.7.1 to latest
Ladas Nov 14, 2025
8f32f4f
:goal_net: Remove operators from pytest CRITICAL_APPS to match workflow
Ladas Nov 14, 2025
824a572
:bug: Fix pytest sync status check mismatch with workflow
Ladas Nov 14, 2025
dcbfdf3
docs: Add Alertmanager integration to TODO_TRACING.md
Ladas Nov 14, 2025
fe3de20
docs: Document Korrel8r CrashLoopBackOff issue and disable deployment
Ladas Nov 14, 2025
36f4d8d
feat: Add OTEL Collector processors for resource detection and baggage
Ladas Nov 14, 2025
98e4119
fix: Remove invalid kubernetes detector from OTEL resourcedetection
Ladas Nov 14, 2025
b4dc160
docs: Mark Phase 1.2 OTEL processors as completed
Ladas Nov 14, 2025
23ab6e0
feat: Add OpenTelemetry auto-instrumentation infrastructure
Ladas Nov 14, 2025
349e57f
fix: Add required collectorImage.repository to OTEL Operator config
Ladas Nov 14, 2025
3111fdb
fix: Move replicas out of manager section for OTEL Operator v0.72.0
Ladas Nov 14, 2025
16640a8
fix: Remove invalid featureGates for OTEL Operator v0.72.0
Ladas Nov 14, 2025
bdd0b4c
:bug: Add debug logging to capture crash logs from failing pods
Ladas Nov 15, 2025
fce1b67
:arrow_up: Upgrade Kind cluster to Kubernetes 1.28.0 to fix Tekton
Ladas Nov 15, 2025
2407d44
:sparkles: Add weather-service agent deployment
Ladas Nov 15, 2025
756c9f4
:sparkles: Enable agents ArgoCD application
Ladas Nov 15, 2025
fdc46c7
:bug: Fix weather-service UV cache permission error
Ladas Nov 16, 2025
6e7288b
:hammer: Build operator images natively in CI (fix AMD64 architecture)
Ladas Nov 16, 2025
f2b690a
:sparkles: Add HTTPRoutes for weather-service and research-agent
Ladas Nov 16, 2025
2835de2
:sparkles: Add Tekton pipeline templates for Component builds
Ladas Nov 16, 2025
ef8ba96
:bug: Fix operator images - build AMD64 BEFORE quick-redeploy
Ladas Nov 16, 2025
133f687
:bug: Fix operator tar export path - use GITHUB_WORKSPACE
Ladas Nov 16, 2025
6b65d5a
:bug: Fix kagenti-ui Streamlit multipage app routing and missing config
Ladas Nov 16, 2025
a17438e
:sparkles: Add Ollama LLM service with Qwen model
Ladas Nov 16, 2025
76c1351
:clock: Increase CI wait timeout from 10 to 15 minutes
Ladas Nov 16, 2025
17f9ab5
:clock: Increase CI wait timeout from 10 to 20 minutes
Ladas Nov 16, 2025
145e014
:memo: Update investigation summary with 20-minute timeout
Ladas Nov 16, 2025
fafa275
:white_check_mark: Add comprehensive e2e tests for weather agent with…
Ladas Nov 16, 2025
db5c696
Add Prometheus kubelet/cAdvisor scraping for container metrics
Ladas Nov 16, 2025
0744b1f
Add comprehensive Loki logs integration tests
Ladas Nov 16, 2025
256aaed
Design AlertManager + Korrel8r + Grafana alerting architecture
Ladas Nov 16, 2025
4772af3
Implement AlertManager for alert aggregation and routing
Ladas Nov 16, 2025
b6a36c3
Document complete observability implementation summary
Ladas Nov 16, 2025
b595cac
:tada: Final investigation results - SUCCESS!
Ladas Nov 16, 2025
da2461d
:memo: Correct investigation results - 17/19 apps healthy, root cause…
Ladas Nov 16, 2025
a777805
:memo: Add CI investigation final status report
Ladas Nov 16, 2025
07a76bf
:memo: Add operator rebuild instructions and SPIRE investigation notes
Ladas Nov 16, 2025
8883e31
feat(observability): Configure Grafana unified alerting with AlertMan…
Ladas Nov 16, 2025
ea2e0d0
:white_check_mark: Exclude resource-intensive apps from CI validation
Ladas Nov 16, 2025
7ae56b8
feat(observability): Add comprehensive platform health alerts and doc…
Ladas Nov 16, 2025
53c2373
docs(observability): Update implementation summary with Session 2 work
Ladas Nov 16, 2025
44a2ca5
:tada: Complete CI investigation and fixes - All critical apps passing
Ladas Nov 16, 2025
b4f0e68
feat(observability): Provision 24 platform health alerts in Grafana
Ladas Nov 16, 2025
7e4d794
:white_check_mark: Update E2E tests to support port-forwarding
Ladas Nov 16, 2025
03be70a
:checkered_flag: Archive TODO_CI.md - Investigation complete
Ladas Nov 16, 2025
1cbad82
test(observability): Add comprehensive alerting integration tests
Ladas Nov 16, 2025
3c9d8ed
:gear: Add Ollama LLM configuration to agents ConfigMap
Ladas Nov 16, 2025
1a1a76d
Fix OAuth2-Proxy DestinationRules - Remove incorrect TLS DISABLE
Ladas Nov 16, 2025
c80a51d
:white_check_mark: Add app exclusion support to E2E platform tests
Ladas Nov 16, 2025
8cf2671
Fix false positive alerts in Grafana alerting
Ladas Nov 16, 2025
36f6254
Fix final false positive alert (Istio Gateway) + comprehensive docs
Ladas Nov 16, 2025
2bcddb8
docs(alerts): Add comprehensive alert monitoring section to CLAUDE.md
Ladas Nov 16, 2025
d69c49a
:rocket: Extend CI timeout to 60min & enhance monitoring for complete…
Ladas Nov 16, 2025
d02be80
Fix Loki dashboard variables to use .+ instead of .*
Ladas Nov 16, 2025
95f4244
feat(alerts): Add comprehensive alert testing framework with mock data
Ladas Nov 16, 2025
154a39e
Fix pytest configuration issues preventing test discovery
Ladas Nov 16, 2025
15788b6
test(dashboard): Add comprehensive TDD tests for Loki dashboard data …
Ladas Nov 16, 2025
5734ef8
fix(dashboard): Replace $__rate_interval with fixed 5m interval for L…
Ladas Nov 16, 2025
72839f8
:bug: Fix CI monitoring to use enhanced script for complete platform …
Ladas Nov 16, 2025
8efc9ff
:bug: Fix monitoring script for CI - handle missing TERM variable
Ladas Nov 16, 2025
c01823c
Remove app exclusions from CI - test all platform apps
Ladas Nov 16, 2025
2f39201
:wrench: Add GitHub permissions and RAM monitoring to CI
Ladas Nov 17, 2025
09c9546
:camera_flash: Add platform status snapshot capture tool
Ladas Nov 17, 2025
372df7f
fix(alerts): Fix Keycloak false positive - use StatefulSet instead of…
Ladas Nov 17, 2025
f5f2a03
feat(grafana): Enhance Kubernetes Cluster dashboard with comprehensiv…
Ladas Nov 17, 2025
7e2ac8c
:sparkles: Add comprehensive monitoring agents implementation plan
Ladas Nov 17, 2025
80243ae
:bug: Fix ARM64 compatibility for Kind cluster creation
Ladas Nov 17, 2025
9990f76
:stopwatch: Add 5-minute grace period for Degraded apps in monitoring
Ladas Nov 17, 2025
d2df087
:wrench: Fix bash 3.2 compatibility in monitor script
Ladas Nov 17, 2025
dd5a2fe
fix: Correct agent registry port from localhost:5000 to localhost:5001
Ladas Nov 17, 2025
fbc9884
fix: Change AlertManager alert noDataState to OK
Ladas Nov 17, 2025
a65d150
:bug: Fix Keycloak readinessProbe to check token endpoint
Ladas Nov 17, 2025
9731d74
:bug: Revert Keycloak readinessProbe to use /health/ready
Ladas Nov 17, 2025
563436e
:memo: Refocus monitoring agents TODO on basic agent e2e testing first
Ladas Nov 18, 2025
24e2619
:sparkles: Add observability skills, monitoring improvements, and age…
Ladas Nov 18, 2025
a178947
:sparkles: Add agent import script via AgentBuild CRDs
Ladas Nov 18, 2025
4dae7fd
fix(loki): Add persistent storage and reduce namespace filtering
Ladas Nov 18, 2025
3bfdee9
:fire: Remove static agents ArgoCD application - agents will be impor…
Ladas Nov 18, 2025
5ca288f
:memo: Add Phase 0.5 status tracking document
Ladas Nov 18, 2025
b8bcaf5
:bug: Document webhook certificate validation bug - blocks AgentBuild…
Ladas Nov 18, 2025
0fe19e8
fix(loki): Add initContainer to fix hostPath permissions
Ladas Nov 18, 2025
c86aa20
:memo: Add comprehensive agent import documentation
Ladas Nov 18, 2025
d5d8661
:memo: Add webhook bypass workarounds for development testing
Ladas Nov 18, 2025
d5d3b75
:sparkles: Enhance monitoring script with degraded pod details
Ladas Nov 18, 2025
e24d7a9
:mag: Add error log extraction to degraded pod monitoring
Ladas Nov 18, 2025
e18f017
feat(grafana): Enable bi-directional trace↔log correlation
Ladas Nov 18, 2025
0ff88df
:bug: Fix webhook certificate validation by adding --webhook-cert-pat…
Ladas Nov 18, 2025
9f435b9
feat(grafana): Reorganize Kubernetes dashboard with collapsible secti…
Ladas Nov 18, 2025
5ceaf15
:bug: Fix webhook certificate by adding --webhook-cert-path to operat…
Ladas Nov 18, 2025
1b043a1
feat(grafana): Add error/warning analytics and alert correlation to L…
Ladas Nov 18, 2025
00c38ea
fix(grafana): Add missing tooltip/legend options to timeseries panels
Ladas Nov 18, 2025
e994065
:rocket: Reduce CPU resource requests to allow overcommit
Ladas Nov 18, 2025
046623c
feat(korrel8r): Re-enable with stable version 0.7.6
Ladas Nov 18, 2025
59b07bf
:memo: Update Korrel8r investigation - all versions fail with Go runt…
Ladas Nov 18, 2025
b962908
:bug: Re-disable Korrel8r after testing stable version 0.7.6
Ladas Nov 18, 2025
c8e906d
:sparkles: Add Korrel8r Operator ArgoCD application
Ladas Nov 18, 2025
1575c0b
:bug: Disable Korrel8r Operator - same Go runtime panic as Korrel8r s…
Ladas Nov 18, 2025
6c0d159
:bug: Disable Istio sidecar injection for Ollama
Ladas Nov 18, 2025
7858913
:memo: Update ArgoCD password retrieval command
Ladas Nov 18, 2025
4ee2bac
:bug: Fix Promtail 'too many open files' error
Ladas Nov 18, 2025
e7fb9ab
:bug: Improve monitor script failure reporting
Ladas Nov 18, 2025
bec1c0b
:wrench: Optimize Promtail file watching and add monitoring
Ladas Nov 18, 2025
179525d
:rocket: Replace deprecated Promtail with Grafana Alloy
Ladas Nov 18, 2025
418082c
:bug: Fix Alloy CrashLoopBackOff - add writable data volume
Ladas Nov 18, 2025
ca88268
:bug: Fix Alloy config syntax - update to v1.5.1 retry parameters
Ladas Nov 18, 2025
d84d85b
:bug: Fix Alloy config - remove unsupported timeout attribute
Ladas Nov 18, 2025
146 changes: 146 additions & 0 deletions .claude/skills/check-alerts/SKILL.md
@@ -0,0 +1,146 @@
---
name: check-alerts
description: Check currently firing Grafana alerts, analyze alert status, and investigate alert issues in the Kagenti platform
---

# Check Alerts Skill

This skill helps you check and analyze Grafana alerts in the Kagenti platform.

## When to Use

- User asks "what alerts are firing?"
- User wants to check alert status
- After platform changes or deployments
- During incident investigation
- When troubleshooting platform issues

## What This Skill Does

1. **List Firing Alerts**: Show all currently active alerts
2. **Alert Details**: Display alert severity, component, and description
3. **Alert History**: Check recent alert state changes
4. **Query Alert Rules**: Verify alert configuration
5. **Test Alert Queries**: Validate PromQL queries

## Examples

### Check Firing Alerts

```bash
# Get all currently firing alerts from Grafana
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/alertmanager/grafana/api/v2/alerts' \
  -u admin:admin123 | python3 -c "
import sys, json
alerts = json.load(sys.stdin)
firing = [a for a in alerts if a.get('status', {}).get('state') == 'active']
print(f'Firing alerts: {len(firing)}')
for alert in firing:
    labels = alert.get('labels', {})
    annotations = alert.get('annotations', {})
    print(f\"\\n• {labels.get('alertname')} ({labels.get('severity')})\")
    print(f\" Component: {labels.get('component')}\")
    print(f\" Description: {annotations.get('description', 'N/A')[:100]}...\")
"
```
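When many alerts fire at once, it can help to group them by severity before reading them one by one. The following is a minimal sketch, assuming the response has already been parsed into a list of alert dicts with the same `labels` and `status.state` fields used above. The function name `group_firing_by_severity` and the sample alerts are hypothetical.

```python
from collections import defaultdict


def group_firing_by_severity(alerts):
    """Return {severity: [alertname, ...]} for alerts whose state is 'active'."""
    grouped = defaultdict(list)
    for alert in alerts:
        if alert.get("status", {}).get("state") != "active":
            continue  # skip alerts that are not currently firing
        labels = alert.get("labels", {})
        severity = labels.get("severity", "unknown")
        grouped[severity].append(labels.get("alertname", "<unnamed>"))
    return dict(grouped)


# Hypothetical sample data shaped like the alerts response used above.
sample_alerts = [
    {"labels": {"alertname": "Prometheus Down", "severity": "critical"},
     "status": {"state": "active"}},
    {"labels": {"alertname": "High Memory", "severity": "warning"},
     "status": {"state": "suppressed"}},
    {"labels": {"alertname": "Loki Down", "severity": "critical"},
     "status": {"state": "active"}},
]

for severity, names in sorted(group_firing_by_severity(sample_alerts).items()):
    print(f"{severity}: {', '.join(names)}")
```

Alerts without a `severity` label end up under `unknown`, which makes missing labels visible instead of silently dropping them.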

### List All Alert Rules

```bash
# Get all configured alert rules
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
print(f'Total alert rules: {len(rules)}')
for rule in rules:
    print(f\" • {rule.get('title')} ({rule.get('labels', {}).get('severity')})\")
"
```

### Check Specific Alert Configuration

```bash
# Get configuration for a specific alert
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this to the alert UID
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    print(f\"Alert: {rule.get('title')}\")
    print(f\"Query: {rule.get('data', [{}])[0].get('model', {}).get('expr')}\")
    print(f\"noDataState: {rule.get('noDataState')}\")
    print(f\"execErrState: {rule.get('execErrState')}\")
else:
    # Report a missing UID instead of exiting silently
    print(f\"No alert rule found with UID '{alert_uid}'\")
"
```

### Test Alert Query Against Prometheus

```bash
# Test an alert's PromQL query
QUERY='up{job="kubernetes-pods",app="prometheus"} == 0'

kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode "query=${QUERY}" | python3 -m json.tool
```
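Reading the raw JSON from the query above can be error-prone. As a rough sketch, the helper below interprets a parsed Prometheus instant-query response: a comparison expression such as `up{...} == 0` returns only the series that satisfy it, so a non-empty result means the alert condition currently holds. The function name `matching_series` and the sample response are hypothetical; the response layout follows the Prometheus HTTP API.

```python
def matching_series(response):
    """Return the label sets of all series in an instant-query response.

    For a comparison expression, these are the series for which the
    alert condition is currently true.
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    data = response.get("data", {})
    if data.get("resultType") != "vector":
        raise ValueError(f"expected an instant vector, got {data.get('resultType')}")
    return [series.get("metric", {}) for series in data.get("result", [])]


# Hypothetical response for: up{job="kubernetes-pods",app="prometheus"} == 0
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "kubernetes-pods", "app": "prometheus"},
             "value": [1763300000, "0"]},
        ],
    },
}

matched = matching_series(sample_response)
print("condition holds" if matched else "condition does not hold", matched)
```

An empty `result` list is not an error. It only means no series matched, which is the normal state for a healthy target.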

### Check Alert Evaluation State

```bash
# Check why an alert is firing or not firing: list each rule's state, health, and last evaluation
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/prometheus/grafana/api/v1/rules' \
  -u admin:admin123 | python3 -m json.tool
```

## Alert Locations in Grafana UI

- **Access Grafana**: https://grafana.localtest.me:9443
- **Credentials**: admin / admin123

**Navigation**:
1. **Alerting** → **Alert rules** - View all configured alerts
2. **Alerting** → **Alert list** - See firing/pending alerts
3. **Alerting** → **Silences** - Manage alert silences
4. **Alerting** → **Contact points** - Check notification settings
5. **Alerting** → **Notification policies** - View routing rules

## Common Alert Issues

### False Positives
- Check `noDataState` configuration (should be `OK` for most alerts)
- Verify the query matches the actual resource type (Deployment vs. StatefulSet)
- Test that the query returns the expected results
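The `noDataState` check can be run across all rules at once. This is a minimal sketch, assuming `rules` is the parsed list returned by the provisioning alert-rules endpoint used in the examples above, with `uid`, `title`, and `noDataState` fields. The function name `rules_not_ok_on_no_data` is hypothetical, and every sample rule except `prometheus-down` is invented for illustration.

```python
def rules_not_ok_on_no_data(rules, expected="OK"):
    """Return (uid, title, noDataState) for rules whose noDataState differs from `expected`."""
    return [
        (rule.get("uid"), rule.get("title"), rule.get("noDataState"))
        for rule in rules
        if rule.get("noDataState") != expected
    ]


# Hypothetical sample rules; only the fields needed for the check are shown.
sample_rules = [
    {"uid": "prometheus-down", "title": "Prometheus Down", "noDataState": "OK"},
    {"uid": "example-rule-a", "title": "Example Rule A", "noDataState": "Alerting"},
    {"uid": "example-rule-b", "title": "Example Rule B", "noDataState": "NoData"},
]

for uid, title, state in rules_not_ok_on_no_data(sample_rules):
    print(f"{title} ({uid}): noDataState={state}")
```

A rule listed here is not necessarily misconfigured. Some alerts should fire when data disappears, so treat the output as a review list, not a failure list.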

### Alert Not Firing When It Should
- Verify metric exists in Prometheus
- Check alert threshold is appropriate
- Verify `for` duration isn't too long
- Check `noDataState` isn't masking the issue

### Alert Configuration Not Loading
- Restart Grafana: `kubectl rollout restart deployment/grafana -n observability`
- Check that the ConfigMap is applied: `kubectl get configmap grafana-alerting -n observability`
- Verify the ConfigMap contains no YAML syntax errors

## Related Documentation

- [Alert Runbooks](../../../docs/runbooks/alerts/)
- [Alert Testing Guide](../../../docs/04-observability/ALERT_TESTING_GUIDE.md)
- [CLAUDE.md Alert Monitoring](../../../CLAUDE.md#alert-monitoring)
- [TODO_INCIDENTS.md](../../../TODO_INCIDENTS.md) - Current incident tracking

## Runbooks by Alert

When an alert fires, consult its runbook:
- `docs/runbooks/alerts/<alert-uid>.md`

Example: If "Prometheus Down" alert fires → `docs/runbooks/alerts/prometheus-down.md`
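The naming convention above can be expressed as a small helper. This is only a sketch of the `docs/runbooks/alerts/<alert-uid>.md` pattern; the function name `runbook_path` is hypothetical, and it does not check whether the runbook file actually exists.

```python
from pathlib import PurePosixPath

# Directory taken from the convention described above.
RUNBOOK_DIR = PurePosixPath("docs/runbooks/alerts")


def runbook_path(alert_uid):
    """Map an alert UID to its runbook path, relative to the repository root."""
    if not alert_uid or "/" in alert_uid:
        raise ValueError(f"invalid alert UID: {alert_uid!r}")
    return str(RUNBOOK_DIR / f"{alert_uid}.md")


print(runbook_path("prometheus-down"))
```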

🤖 Generated with [Claude Code](https://claude.com/claude-code)