A simple Todo CRUD API built with Flask and PostgreSQL, designed for demonstrating AI SRE capabilities including:
- Observability: Monitored via Grafana Cloud
- GitOps: Deployed via Argo CD
- Failure Simulation: Endpoints to trigger and remediate failures
- Database Operations: Full CRUD against PostgreSQL

| Method | Endpoint | Description |
|---|---|---|
| GET | `/todos` | List all todos |
| POST | `/todos` | Create a new todo |
| GET | `/todos/<id>` | Get a specific todo |
| PUT | `/todos/<id>` | Update a todo |
| DELETE | `/todos/<id>` | Delete a todo |
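
The todo endpoints above can also be called from Python. The following is a minimal sketch using only the standard library. It assumes the service listens on `http://localhost:8080` and accepts the same JSON fields used in the curl examples in this README (`title`, `description`, `completed`). The helper names `build_request` and `send` are hypothetical and not part of this project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: default PORT from this README


def build_request(method, path, payload=None):
    """Build (but do not send) an HTTP request for the Todo API."""
    data = None
    headers = {}
    if payload is not None:
        # JSON bodies need an explicit Content-Type, as in the curl examples.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


def send(request):
    """Send a request and decode the body, assuming the API responds with JSON."""
    with urllib.request.urlopen(request) as response:
        body = response.read()
        return json.loads(body) if body else None


# Requests mirroring the curl examples; call send(...) against a running instance.
create = build_request(
    "POST",
    "/todos",
    {"title": "Learn AI SRE", "description": "Demo the AI SRE capabilities"},
)
update = build_request("PUT", "/todos/1", {"completed": True})
delete = build_request("DELETE", "/todos/1")
```

Building the request separately from sending it makes it easy to inspect the method, URL, and body without a running server.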

| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Service info and status |
| GET | `/health` | Health check (used by K8s probes) |

| Method | Endpoint | Description |
|---|---|---|
| POST | `/trigger-failure` | Put app in failure mode |
| POST | `/remediate` | Reset failure state |
| POST | `/crash` | Immediately crash the app |

    # Create a todo
    curl -X POST http://localhost:8080/todos \
      -H "Content-Type: application/json" \
      -d '{"title": "Learn AI SRE", "description": "Demo the AI SRE capabilities"}'

    # List all todos
    curl http://localhost:8080/todos

    # Update a todo
    curl -X PUT http://localhost:8080/todos/1 \
      -H "Content-Type: application/json" \
      -d '{"completed": true}'

    # Delete a todo
    curl -X DELETE http://localhost:8080/todos/1

| Variable | Description | Default |
|---|---|---|
| `PORT` | HTTP server port | `8080` |
| `FORCE_HEALTHY` | Force healthy status (overrides failure state) | `false` |
| `ENABLE_BUGGY_FEATURE` | Enable buggy code that crashes app on startup | `false` |
| `DATABASE_HOST` | PostgreSQL host | `ai-sre-postgres` |
| `DATABASE_PORT` | PostgreSQL port | `5432` |
| `DATABASE_NAME` | Database name | `todos` |
| `DATABASE_USER` | Database user | `postgres` |
| `DATABASE_PASSWORD` | Database password | `postgres` |
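
As an illustration of how these variables and defaults might be read at startup, here is a minimal sketch. The `Settings` class, `load_settings` function, and the rule for parsing boolean values are assumptions made for this example; the application's actual configuration code may differ.

```python
import os
from dataclasses import dataclass


def _as_bool(value):
    # Assumption: "true", "1", and "yes" (case-insensitive) enable a flag.
    return value.strip().lower() in {"true", "1", "yes"}


@dataclass(frozen=True)
class Settings:
    port: int
    force_healthy: bool
    enable_buggy_feature: bool
    database_host: str
    database_port: int
    database_name: str
    database_user: str
    database_password: str


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), using the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        port=int(env.get("PORT", "8080")),
        force_healthy=_as_bool(env.get("FORCE_HEALTHY", "false")),
        enable_buggy_feature=_as_bool(env.get("ENABLE_BUGGY_FEATURE", "false")),
        database_host=env.get("DATABASE_HOST", "ai-sre-postgres"),
        database_port=int(env.get("DATABASE_PORT", "5432")),
        database_name=env.get("DATABASE_NAME", "todos"),
        database_user=env.get("DATABASE_USER", "postgres"),
        database_password=env.get("DATABASE_PASSWORD", "postgres"),
    )


defaults = load_settings({})
overridden = load_settings({"PORT": "9090", "FORCE_HEALTHY": "true"})
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test without touching the real environment.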

- `ENABLE_BUGGY_FEATURE=true`: Loads `buggy_feature.py`, which contains an intentional bug. The app crashes immediately on startup with an `AttributeError`, which puts the pod into CrashLoopBackOff. This simulates a "bad deployment" scenario.

  To remediate, an AI agent must:
  - Investigate the logs to find the error (`AttributeError: 'NoneType' object has no attribute 'upper'`)
  - Locate `buggy_feature.py` in the repository
  - Fix the bug (initialize the `data` variable properly)
  - Push the fix to trigger an Argo CD deployment
- `FORCE_HEALTHY=true`: Forces health checks to pass regardless of failure state. Use for manual remediation.

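
The contents of `buggy_feature.py` are not reproduced in this README. The following hypothetical sketch only shows the class of bug described above: calling `.upper()` on a `data` variable that was never given a real value, and the corresponding fix. The function names and the `"todo"` value are invented for illustration.

```python
def format_label_buggy():
    # Hypothetical bug: data is left as None instead of being initialized.
    data = None
    return data.upper()  # raises AttributeError at call time


def format_label_fixed():
    # Hypothetical fix: initialize data with a real string before using it.
    data = "todo"
    return data.upper()


try:
    format_label_buggy()
except AttributeError as exc:
    # This is the kind of message an agent would find in the pod logs.
    error_message = str(exc)

fixed_label = format_label_fixed()
```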
This application is deployed via Argo CD from the `k8s/` directory.

Use the Makefile to build and push the Docker image (built for `linux/amd64` for GKE compatibility):

    # Build only
    make build

    # Push only
    make push

    # Build and push in one step
    make build-push

    # Show all available commands
    make help

This project uses Kustomize's `configMapGenerator` to restart pods automatically when configuration changes. The generated ConfigMap name includes a content hash, so any change to the ConfigMap's data in `k8s/kustomization.yaml` produces a new name. The pod template that references the ConfigMap changes with it, which triggers a rolling update.

    # View generated ConfigMap name (includes hash suffix)
    kubectl kustomize k8s/ | grep "name: ai-sre-demo-config"
    # Output: name: ai-sre-demo-config-c58hct92t6

To change configuration:

- Edit `k8s/kustomization.yaml` → update `configMapGenerator.literals`
- Commit and push
- Argo CD syncs → new ConfigMap created → pods automatically restart
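
To make the content-hash idea concrete, here is a toy sketch of why editing a literal yields a new ConfigMap name. It is not Kustomize's actual hashing algorithm: the SHA-256 digest, the 10-character suffix, and the `generated_name` helper are assumptions chosen only to show the principle.

```python
import hashlib


def generated_name(base_name, literals):
    """Toy stand-in for a content-hashed name (not Kustomize's real algorithm)."""
    # Sort the literals so the result depends on content, not on ordering.
    canonical = "\n".join(sorted(literals)).encode("utf-8")
    suffix = hashlib.sha256(canonical).hexdigest()[:10]
    return f"{base_name}-{suffix}"


before = generated_name(
    "ai-sre-demo-config",
    ["ENABLE_BUGGY_FEATURE=false", "FORCE_HEALTHY=false"],
)
after = generated_name(
    "ai-sre-demo-config",
    ["ENABLE_BUGGY_FEATURE=true", "FORCE_HEALTHY=false"],
)
unchanged = generated_name(
    "ai-sre-demo-config",
    ["FORCE_HEALTHY=false", "ENABLE_BUGGY_FEATURE=false"],
)
```

Identical content always maps to the same name, so unrelated commits do not restart pods, while any edit to a literal changes the name and rolls the workload.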

Demo scenario: runtime health check failure

- Normal Operation: App serves todo CRUD, health checks pass
- Trigger Failure: `POST /trigger-failure` causes health checks to fail
- Observe: Grafana alerts fire, notifications sent to MS Teams via OnCall
- Remediate: `POST /remediate` or set `FORCE_HEALTHY=true` in ConfigMap

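
The interplay between failure mode and `FORCE_HEALTHY` in this flow can be sketched as a small state object. This is an illustrative model, not the application's code: the `HealthState` class, the 200/503 status codes, and the response body are assumptions. Kubernetes HTTP probes treat any status from 200 to 399 as success, so a 503 would fail the probe.

```python
class HealthState:
    """Hypothetical model of the failure flag behind /health."""

    def __init__(self, force_healthy=False):
        self.force_healthy = force_healthy  # mirrors FORCE_HEALTHY
        self.failure_mode = False

    def trigger_failure(self):
        # Corresponds to POST /trigger-failure.
        self.failure_mode = True

    def remediate(self):
        # Corresponds to POST /remediate.
        self.failure_mode = False

    def health(self):
        # FORCE_HEALTHY overrides the failure state, as the variable table describes.
        healthy = self.force_healthy or not self.failure_mode
        status_code = 200 if healthy else 503  # assumed status codes
        return status_code, {"status": "healthy" if healthy else "unhealthy"}


state = HealthState()
initial = state.health()
state.trigger_failure()
failing = state.health()
state.remediate()
recovered = state.health()

forced = HealthState(force_healthy=True)
forced.trigger_failure()
forced_result = forced.health()
```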

Demo scenario: bad deployment

- Normal Operation: App is running healthy
- Trigger Failure: Set `ENABLE_BUGGY_FEATURE=true` in `k8s/kustomization.yaml`, commit and push
- Argo CD Syncs: New ConfigMap hash triggers rolling update
- App Crashes: Pod enters CrashLoopBackOff due to the bug in `buggy_feature.py`
- Alert Fires: Grafana detects CrashLoopBackOff, notifies MS Teams via OnCall
- AI Investigation: Agent uses MCP tools to check logs, find the error, and check config
- AI Fix: Agent sets `ENABLE_BUGGY_FEATURE=false` in `k8s/kustomization.yaml`, commits and pushes
- Argo CD Deploy: Argo CD syncs, new ConfigMap triggers pod restart
- Recovery: App starts successfully, alerts resolve
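
The configuration change in the fix step amounts to flipping one literal. Below is a rough sketch of that edit as a text transformation. The sample `kustomization.yaml` fragment is hypothetical (only the ConfigMap name and variable names come from this README), and the `set_literal` helper assumes each literal sits on its own unquoted `- KEY=value` line.

```python
import re

# Hypothetical fragment; the real k8s/kustomization.yaml may contain more fields.
SAMPLE = """\
configMapGenerator:
  - name: ai-sre-demo-config
    literals:
      - ENABLE_BUGGY_FEATURE=true
      - FORCE_HEALTHY=false
"""


def set_literal(kustomization_text, key, value):
    """Replace the value of a `- KEY=value` literal line, keeping indentation."""
    pattern = re.compile(
        rf"^([ \t]*-[ \t]*{re.escape(key)}=).*$", flags=re.MULTILINE
    )
    updated, count = pattern.subn(
        lambda match: match.group(1) + value, kustomization_text
    )
    if count == 0:
        raise KeyError(f"literal {key!r} not found")
    return updated


fixed = set_literal(SAMPLE, "ENABLE_BUGGY_FEATURE", "false")
```

A YAML-aware editor would be more robust for real repositories; the regular expression here only keeps the example dependency-free.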