Goal: Deploy Azure SRE Agent to monitor the Three Rivers Bank application (backend + frontend on Azure Container Apps), connect to GitHub for code-aware root cause analysis, and enable the full agentic loop: detect → diagnose → create issue → auto-fix.
- Prerequisites
- Deploy the Application
- Deploy the SRE Agent
- Configure the SRE Agent (Post-Provision)
- Verify Setup
- Scheduled Tasks
- Enable Copilot Coding Agent Auto-Fix
- Chaos Engineering Scenarios
- Demo Walkthrough
- Troubleshooting
- Reference Links
| Requirement | Details |
|---|---|
| Azure Subscription | Active subscription with billing enabled |
| Owner Role | Required on the subscription for RBAC role assignments |
| SRE Agent Region | eastus2, swedencentral, or australiaeast |
| Resource Provider | Microsoft.App registered: az provider register -n Microsoft.App --wait |
| Deployed Application | Three Rivers Bank backend + frontend on Azure Container Apps (see Section 2) |
| Tool | Install |
|---|---|
| Azure CLI 2.60+ | winget install Microsoft.AzureCLI |
| Azure Developer CLI 1.9+ | winget install Microsoft.Azd |
| jq | winget install jqlang.jq |
| Git | winget install Git.Git |
| gh CLI (optional) | winget install GitHub.cli |
| Requirement | Details |
|---|---|
| Repository Access | Push access to agentic-devops-demo repository |
| Copilot License | GitHub Copilot Enterprise or Business (for Copilot Coding Agent) |
| Actions PR Permission | "Allow GitHub Actions to create and approve pull requests" (see below) |
gh api repos/<owner>/<repo>/actions/permissions/workflow \
-X PUT \
-f default_workflow_permissions=read \
-F can_approve_pull_request_reviews=trueOr via UI: Settings → Actions → General → Workflow permissions → check "Allow GitHub Actions to create and approve pull requests".
The application is deployed from the repository root using the existing Terraform infrastructure:
# From the repository root
az login
azd auth login
azd upAfter deployment, export the required environment variables for the SRE Agent:
# Get the app resource group name from the deployment
APP_RESOURCE_GROUP=$(azd env get-value AZURE_RESOURCE_GROUP)
echo "App Resource Group: $APP_RESOURCE_GROUP"Note: The application deployment creates the backend/frontend Container Apps, Container Registry, Log Analytics Workspace, and Container App Environment. The SRE Agent deployment (next section) creates only the agent itself in a separate resource group.
The SRE Agent is deployed from the sre/ directory using its own azd project with Bicep infrastructure.
| Resource | Type | Purpose |
|---|---|---|
| SRE Agent | Microsoft.App/agents |
AI agent for incident investigation |
| User-Assigned Managed Identity | Microsoft.ManagedIdentity |
Agent identity for Azure access |
| Action Group | Microsoft.Insights/actionGroups |
Alert notification target |
| Alert: HTTP 5xx Spike | Microsoft.Insights/metricAlerts |
Severity 2 — backend error rate |
| Alert: Container Restart | Microsoft.Insights/metricAlerts |
Severity 1 — OOM/crash detection |
| Alert: High Response Time | Microsoft.Insights/metricAlerts |
Severity 3 — latency monitoring |
| Role | Purpose |
|---|---|
| Reader | Agent can read all resources |
| Monitoring Reader | Agent can read metrics and alerts |
| Monitoring Contributor | Agent can manage alert rules |
| Log Analytics Reader | Agent can query logs via KQL |
| Container Apps Contributor | Agent can manage container apps |
# Navigate to the SRE directory
cd sre
# Create a new azd environment
azd env new sre-three-rivers
# Set the application resource group (from Step 2)
azd env set APP_RESOURCE_GROUP "$APP_RESOURCE_GROUP"
# Set the backend container app name (from app deployment outputs)
azd env set BACKEND_CONTAINER_APP_NAME "$(cd .. && azd env get-value backend_service_name)"
# Set the azure location (must be eastus2, swedencentral, or australiaeast)
azd env set AZURE_LOCATION eastus2
# Deploy — select your subscription
azd upDeployment takes ~3-5 minutes. The output will include:
SRE_AGENT_NAME— the agent resource nameSRE_AGENT_ENDPOINT— the agent's data plane API URLAGENT_PORTAL_URL— link to https://sre.azure.com
By default, the script uses yortch/agentic-devops-demo. To use a different repo:
azd env set GITHUB_REPO "<owner>/<repo>"
bash scripts/post-provision.shAfter azd up completes, run the post-provision script to configure knowledge base, subagents, incident response plan, and GitHub integration:
# Still in the sre/ directory
bash scripts/post-provision.sh| Step | Component | Method |
|---|---|---|
| 1 | Knowledge Base — HTTP errors runbook + app architecture docs | Data plane API: POST /api/v1/AgentMemory/upload |
| 2 | Subagents — incident-handler + code-analyzer |
Data plane API: PUT /api/v2/extendedAgent/agents/{name} |
| 3 | Scheduled Tasks — Health check (30min), config drift (6hr), daily reliability report (8am UTC) | Data plane API: POST /api/v1/scheduledtasks |
| 4 | Azure Monitor — Incident platform + response plan routing to incident-handler |
ARM PATCH + data plane API |
| 5 | GitHub OAuth — Connector for code search and issue creation | Data plane + ARM API |
The script will print a GitHub OAuth URL at the end:
┌──────────────────────────────────────────────────────────────┐
│ Sign in to GitHub to authorize the SRE Agent: │
│ https://... │
│ Open this URL in your browser and click 'Authorize' │
└──────────────────────────────────────────────────────────────┘
Open this URL in your browser and authorize the agent. This is a one-time step.
# Re-run if any step failed
bash scripts/post-provision.sh --retry
# Check current status
bash scripts/post-provision.sh --statusOpen sre.azure.com and click Full setup. You should see green checkmarks on:
| Card | Expected Status |
|---|---|
| Code | ✅ 1 repository |
| Incidents | ✅ Connected to Azure Monitor |
| Azure resources | ✅ 1 resource group added |
| Knowledge files | ✅ 2 files |
| Scheduled tasks | ✅ 3 tasks active |
In the agent chat, ask verification questions:
What subscriptions and resource groups are you managing?
List all container apps in the monitored resource groups.
What is the health status of the Three Rivers Bank backend container app?
# From the sre/ directory or anywhere with az login
az monitor metrics alert list \
--resource-group "$(azd env get-value AZURE_RESOURCE_GROUP)" \
--query "[].{Name:name, Severity:severity, Enabled:enabled}" \
--output tableThe post-provision script creates three automated scheduled tasks that run proactively:
| Task | Schedule | Description |
|---|---|---|
three-rivers-health-check |
Every 30 minutes | Checks backend/frontend health, error rates, response times, container restarts, and infrastructure status. Escalates to code-analyzer + incident-handler if issues are found. |
three-rivers-config-drift |
Every 6 hours | Verifies environment variables, container resource limits, and image versions match expected values. Creates a GitHub issue via incident-handler if drift is detected. |
three-rivers-daily-reliability-report |
Daily at 8am UTC | Summarizes 24-hour metrics, checks 7-day degradation trends, correlates recent GitHub PRs with metric changes, and provides reliability recommendations. |
Task definitions live in sre/sre-config/tasks/ and are automatically picked up by the post-provision script.
In the SRE Agent Portal, navigate to Builder → Scheduled tasks and confirm all three tasks show status On.
gh label create "sre-agent-detected" \
--description "Issue detected by Azure SRE Agent" \
--color "d73a4a"
gh label create "chaos-engineering" \
--description "Issue introduced by chaos engineering workflow" \
--color "fbca04"- Go to Settings → General → Features → ensure Copilot is enabled.
- Enable Copilot Coding Agent under Copilot settings.
When the SRE Agent creates an issue and assigns it to @copilot:
- Copilot Coding Agent picks up the assigned issue
- Copilot reads the issue (including RCA and affected code)
- Copilot creates a fix PR
- You review and merge
The chaos engineering workflow uses GitHub Agentic Workflows to inject code-level breaking changes via PR.
| # | Scenario | Target | Symptoms |
|---|---|---|---|
| 1 | port-mismatch |
Backend | 503 errors, health check fails |
| 2 | bad-api-url |
Backend | BIAN API fails, degraded data |
| 3 | cors-broken |
Backend | Browser CORS errors, blank pages |
| 4 | low-resources |
Backend | OOM kills, restarts, slow responses |
| 5 | health-check-disabled |
Backend | Health probes fail, restart loop |
| 6 | bad-image-tag |
Backend | Image pull fails, container stuck |
| 7 | db-corruption |
Backend | Spring Boot fails, all 500 errors |
| 8 | profile-wrong |
Backend | Wrong profile, behavioral drift |
| 9 | circuit-breaker-disabled |
Backend | Requests hang, thread exhaustion |
| 10 | frontend-api-broken |
Frontend | API calls fail, no card data |
| 11 | backend-500 |
Backend | GET /api/cards returns 500, NPE |
| 12 | backend-image-tag |
Backend | Image pull fails, 503 |
| 13 | backend-slow-response |
Backend | GET /api/cards takes 0–9s |
gh extension install github/gh-awCreate an issue with the chaos-engineering label to specify a scenario:
gh issue create \
--title "chaos: backend-slow-response" \
--body "Inject backend-slow-response scenario to test SRE Agent detection" \
--label "chaos-engineering"Then trigger the workflow:
gh workflow run chaos-engineering.lock.ymlThe agentic workflow will:
- Find the open issue with
chaos-engineeringlabel - Extract the scenario from the issue title
- Apply the code-level breaking change
- Create a PR with the
chaosprefix
If no matching issue is found, a random scenario is selected.
┌─────────────────────────────────────────────────────┐
│ 1. Introduce Chaos │
│ └─ gh issue create → gh workflow run │
│ │
│ 2. Breaking Change Takes Effect │
│ └─ CI/CD deploys broken code from PR │
│ │
│ 3. SRE Agent Detects Issue │
│ └─ Alert fires → agent investigates │
│ └─ Agent queries logs + code │
│ │
│ 4. RCA + GitHub Issue Created │
│ └─ Assigned to @copilot, label: sre-agent-detected│
│ │
│ 5. Copilot Coding Agent Fixes │
│ └─ Reads issue → creates fix PR │
│ │
│ 6. Fix Deployed + Health Restored │
└─────────────────────────────────────────────────────┘
-
Verify app is healthy:
curl https://{backend-fqdn}/actuator/health curl https://{frontend-fqdn}/ -
Open these windows:
- Azure Portal → SRE Agent chat (sre.azure.com)
- GitHub → Repository issues tab
- Terminal → Ready to trigger chaos
Part 1: Show Healthy App (~2 min)
- Open Three Rivers Bank website, show card comparison working
- Start the load generator to establish a healthy baseline:
bash sre/scripts/generate-load.sh --rps 2 --forever
- In SRE Agent chat: "What is the current health of my container apps?"
Part 2: Introduce Chaos (~1 min)
gh issue create \
--title "chaos: backend-slow-response" \
--body "Inject backend-slow-response scenario" \
--label "chaos-engineering"
gh workflow run chaos-engineering.lock.yml- Wait for the PR to be created and merged
- Keep the load generator running — it will show errors as the chaos takes effect
Part 3: Wait for Detection (~3-5 min)
- Watch the load generator output for rising error counts
- Azure Monitor alert fires → SRE Agent investigates automatically
- Or ask manually: "Check the health of the Three Rivers Bank backend. Is anything wrong?"
Part 4: Show RCA + Issue (~2 min)
- The agent creates a GitHub issue with root cause, code references, and fix suggestion
- Show the issue assigned to
@copilotwithsre-agent-detectedlabel
Part 5: Copilot Fixes (~2-3 min)
- Copilot Coding Agent picks up the issue → creates fix PR
- Review and merge
Part 6: Recovery (~1 min)
- CI/CD deploys fix → verify app healthy again
- Watch the load generator output return to all green
- SRE Agent confirms: "Is the application healthy now?"
- Stop the load generator with
Ctrl+Cto see the summary
| Problem | Solution |
|---|---|
azd up fails in sre/ |
Ensure APP_RESOURCE_GROUP is set: azd env set APP_RESOURCE_GROUP <name> |
Microsoft.App not registered |
Run: az provider register -n Microsoft.App --wait |
| Post-provision script fails on response plan | Wait 30s and run: bash scripts/post-provision.sh --retry |
| SRE Agent can't see Container Apps | Verify APP_RESOURCE_GROUP matches your app deployment's resource group |
| GitHub OAuth URL not shown | Check SRE Agent Administrator role is assigned (script does this automatically) |
| GitHub connector shows disconnected | Re-run bash scripts/post-provision.sh --retry and re-authorize OAuth URL |
| Copilot doesn't pick up issues | Verify the issue is assigned to @copilot and Copilot Coding Agent is enabled |
| Alert rules not firing | Check alerts: az monitor metrics alert list --resource-group $(azd env get-value AZURE_RESOURCE_GROUP) |
roleAssignments/write denied |
Need Owner role on subscription |
| Resource | URL |
|---|---|
| Azure SRE Agent Overview | https://learn.microsoft.com/en-us/azure/sre-agent/overview |
| SRE Agent Portal | https://sre.azure.com |
| SRE Agent Lab (GitHub) | https://github.com/microsoft/sre-agent/tree/main/labs/starter-lab |
| Data Plane API Reference (post-provision.sh) | https://github.com/microsoft/sre-agent/blob/main/labs/starter-lab/scripts/post-provision.sh |
| Connect Source Code | https://learn.microsoft.com/en-us/azure/sre-agent/connect-source-code |
| Connectors | https://learn.microsoft.com/en-us/azure/sre-agent/connectors |
| Incident Response | https://learn.microsoft.com/en-us/azure/sre-agent/incident-response |
| GitHub Copilot Coding Agent | https://docs.github.com/en/copilot/using-github-copilot/using-copilot-coding-agent |
sre/
├── azure.yaml # azd project config
├── infra/
│ ├── main.bicep # Entry point (subscription scope)
│ ├── resources.bicep # Resource orchestration
│ └── modules/
│ ├── sre-agent.bicep # SRE Agent resource
│ ├── identity.bicep # Managed Identity
│ ├── subscription-rbac.bicep # RBAC role assignments
│ └── alert-rules.bicep # Azure Monitor alerts (3 rules)
├── scripts/
│ ├── post-provision.sh # Configures KB, subagents, tasks, response plan, GitHub
│ ├── chaos-engineering.sh # Inject/rollback infrastructure faults via az CLI
│ └── generate-load.sh # Generate light traffic against deployed app
├── sre-config/
│ ├── agents/
│ │ ├── incident-handler.yaml # Investigates incidents, creates GitHub issues
│ │ └── code-analyzer.yaml # Deep code root cause analysis
│ └── tasks/
│ ├── health-check.yaml # Comprehensive health check every 30 minutes
│ ├── config-drift.yaml # Configuration drift detection every 6 hours
│ └── daily-reliability-report.yaml # Daily reliability report at 8am UTC
└── knowledge-base/
├── error-investigation-runbook.md # Error investigation runbook
└── app-architecture.md # Application architecture reference