Azure SRE Agent Setup Guide — Three Rivers Bank Demo

Goal: Deploy Azure SRE Agent to monitor the Three Rivers Bank application (backend + frontend on Azure Container Apps), connect to GitHub for code-aware root cause analysis, and enable the full agentic loop: detect → diagnose → create issue → auto-fix.

Prerequisites
Deploy the Application
Deploy the SRE Agent
Configure the SRE Agent (Post-Provision)
Verify Setup
Scheduled Tasks
Enable Copilot Coding Agent Auto-Fix
Chaos Engineering Scenarios
Demo Walkthrough
Troubleshooting
Reference Links

1. Prerequisites

Azure Requirements

Requirement	Details
Azure Subscription	Active subscription with billing enabled
Owner Role	Required on the subscription for RBAC role assignments
SRE Agent Region	`eastus2`, `swedencentral`, or `australiaeast`
Resource Provider	`Microsoft.App` registered: `az provider register -n Microsoft.App --wait`
Deployed Application	Three Rivers Bank backend + frontend on Azure Container Apps (see Section 2)

Tools Required

Tool	Install
Azure CLI 2.60+	`winget install Microsoft.AzureCLI`
Azure Developer CLI 1.9+	`winget install Microsoft.Azd`
jq	`winget install jqlang.jq`
Git	`winget install Git.Git`
gh CLI (optional)	`winget install GitHub.cli`

GitHub Requirements

Requirement	Details
Repository Access	Push access to `agentic-devops-demo` repository
Copilot License	GitHub Copilot Enterprise or Business (for Copilot Coding Agent)
Actions PR Permission	"Allow GitHub Actions to create and approve pull requests" (see below)

Enable GitHub Actions PR Creation

gh api repos/<owner>/<repo>/actions/permissions/workflow \
  -X PUT \
  -f default_workflow_permissions=read \
  -F can_approve_pull_request_reviews=true

Or via UI: Settings → Actions → General → Workflow permissions → check "Allow GitHub Actions to create and approve pull requests".

2. Deploy the Application

The application is deployed from the repository root using the existing Terraform infrastructure:

# From the repository root
az login
azd auth login
azd up

After deployment, export the required environment variables for the SRE Agent:

# Get the app resource group name from the deployment
APP_RESOURCE_GROUP=$(azd env get-value AZURE_RESOURCE_GROUP)
echo "App Resource Group: $APP_RESOURCE_GROUP"

Note: The application deployment creates the backend/frontend Container Apps, Container Registry, Log Analytics Workspace, and Container App Environment. The SRE Agent deployment (next section) creates only the agent itself in a separate resource group.

3. Deploy the SRE Agent

The SRE Agent is deployed from the sre/ directory using its own azd project with Bicep infrastructure.

What Gets Deployed

Resource	Type	Purpose
SRE Agent	`Microsoft.App/agents`	AI agent for incident investigation
User-Assigned Managed Identity	`Microsoft.ManagedIdentity`	Agent identity for Azure access
Action Group	`Microsoft.Insights/actionGroups`	Alert notification target
Alert: HTTP 5xx Spike	`Microsoft.Insights/metricAlerts`	Severity 2 — backend error rate
Alert: Container Restart	`Microsoft.Insights/metricAlerts`	Severity 1 — OOM/crash detection
Alert: High Response Time	`Microsoft.Insights/metricAlerts`	Severity 3 — latency monitoring

RBAC Roles Assigned (Subscription Scope)

Role	Purpose
Reader	Agent can read all resources
Monitoring Reader	Agent can read metrics and alerts
Monitoring Contributor	Agent can manage alert rules
Log Analytics Reader	Agent can query logs via KQL
Container Apps Contributor	Agent can manage container apps

Deploy

# Navigate to the SRE directory
cd sre

# Create a new azd environment
azd env new sre-three-rivers

# Set the application resource group (from Step 2)
azd env set APP_RESOURCE_GROUP "$APP_RESOURCE_GROUP"

# Set the backend container app name (from app deployment outputs)
azd env set BACKEND_CONTAINER_APP_NAME "$(cd .. && azd env get-value backend_service_name)"

# Set the azure location (must be eastus2, swedencentral, or australiaeast)
azd env set AZURE_LOCATION eastus2

# Deploy — select your subscription
azd up

Deployment takes ~3-5 minutes. The output will include:

SRE_AGENT_NAME — the agent resource name
SRE_AGENT_ENDPOINT — the agent's data plane API URL
AGENT_PORTAL_URL — link to https://sre.azure.com

4. Configure the SRE Agent (Post-Provision)

Customizing the GitHub Repository

By default, the script uses yortch/agentic-devops-demo. To use a different repo:

azd env set GITHUB_REPO "<owner>/<repo>"
bash scripts/post-provision.sh

After azd up completes, run the post-provision script to configure knowledge base, subagents, incident response plan, and GitHub integration:

# Still in the sre/ directory
bash scripts/post-provision.sh

What the Script Configures

Step	Component	Method
1	Knowledge Base — HTTP errors runbook + app architecture docs	Data plane API: `POST /api/v1/AgentMemory/upload`
2	Subagents — `incident-handler` + `code-analyzer`	Data plane API: `PUT /api/v2/extendedAgent/agents/{name}`
3	Scheduled Tasks — Health check (30min), config drift (6hr), daily reliability report (8am UTC)	Data plane API: `POST /api/v1/scheduledtasks`
4	Azure Monitor — Incident platform + response plan routing to `incident-handler`	ARM PATCH + data plane API
5	GitHub OAuth — Connector for code search and issue creation	Data plane + ARM API

GitHub Authorization (One-Time)

The script will print a GitHub OAuth URL at the end:

┌──────────────────────────────────────────────────────────────┐
│  Sign in to GitHub to authorize the SRE Agent:               │
│  https://...                                                 │
│  Open this URL in your browser and click 'Authorize'         │
└──────────────────────────────────────────────────────────────┘

Open this URL in your browser and authorize the agent. This is a one-time step.

Re-running the Script

# Re-run if any step failed
bash scripts/post-provision.sh --retry

# Check current status
bash scripts/post-provision.sh --status

5. Verify Setup

5.1 Check the Portal

Open sre.azure.com and click Full setup. You should see green checkmarks on:

Card	Expected Status
Code	✅ 1 repository
Incidents	✅ Connected to Azure Monitor
Azure resources	✅ 1 resource group added
Knowledge files	✅ 2 files
Scheduled tasks	✅ 3 tasks active

5.2 Test the Agent

In the agent chat, ask verification questions:

What subscriptions and resource groups are you managing?

List all container apps in the monitored resource groups.

What is the health status of the Three Rivers Bank backend container app?

5.3 Verify Alerts

# From the sre/ directory or anywhere with az login
az monitor metrics alert list \
  --resource-group "$(azd env get-value AZURE_RESOURCE_GROUP)" \
  --query "[].{Name:name, Severity:severity, Enabled:enabled}" \
  --output table

6. Scheduled Tasks

The post-provision script creates three automated scheduled tasks that run proactively:

Task	Schedule	Description
`three-rivers-health-check`	Every 30 minutes	Checks backend/frontend health, error rates, response times, container restarts, and infrastructure status. Escalates to `code-analyzer` + `incident-handler` if issues are found.
`three-rivers-config-drift`	Every 6 hours	Verifies environment variables, container resource limits, and image versions match expected values. Creates a GitHub issue via `incident-handler` if drift is detected.
`three-rivers-daily-reliability-report`	Daily at 8am UTC	Summarizes 24-hour metrics, checks 7-day degradation trends, correlates recent GitHub PRs with metric changes, and provides reliability recommendations.

Task definitions live in sre/sre-config/tasks/ and are automatically picked up by the post-provision script.

Verify Scheduled Tasks

In the SRE Agent Portal, navigate to Builder → Scheduled tasks and confirm all three tasks show status On.

7. Enable Copilot Coding Agent Auto-Fix

6.1 Create Required Labels

gh label create "sre-agent-detected" \
  --description "Issue detected by Azure SRE Agent" \
  --color "d73a4a"

gh label create "chaos-engineering" \
  --description "Issue introduced by chaos engineering workflow" \
  --color "fbca04"

6.2 Enable Copilot Coding Agent

Go to Settings → General → Features → ensure Copilot is enabled.
Enable Copilot Coding Agent under Copilot settings.

6.3 How It Works

When the SRE Agent creates an issue and assigns it to @copilot:

Copilot Coding Agent picks up the assigned issue
Copilot reads the issue (including RCA and affected code)
Copilot creates a fix PR
You review and merge

8. Chaos Engineering Scenarios

The chaos engineering workflow uses GitHub Agentic Workflows to inject code-level breaking changes via PR.

Available Scenarios

#	Scenario	Target	Symptoms
1	`port-mismatch`	Backend	503 errors, health check fails
2	`bad-api-url`	Backend	BIAN API fails, degraded data
3	`cors-broken`	Backend	Browser CORS errors, blank pages
4	`low-resources`	Backend	OOM kills, restarts, slow responses
5	`health-check-disabled`	Backend	Health probes fail, restart loop
6	`bad-image-tag`	Backend	Image pull fails, container stuck
7	`db-corruption`	Backend	Spring Boot fails, all 500 errors
8	`profile-wrong`	Backend	Wrong profile, behavioral drift
9	`circuit-breaker-disabled`	Backend	Requests hang, thread exhaustion
10	`frontend-api-broken`	Frontend	API calls fail, no card data
11	`backend-500`	Backend	GET /api/cards returns 500, NPE
12	`backend-image-tag`	Backend	Image pull fails, 503
13	`backend-slow-response`	Backend	GET /api/cards takes 0–9s

Install gh-aw

gh extension install github/gh-aw

Trigger via GitHub Issue

Create an issue with the chaos-engineering label to specify a scenario:

gh issue create \
  --title "chaos: backend-slow-response" \
  --body "Inject backend-slow-response scenario to test SRE Agent detection" \
  --label "chaos-engineering"

Then trigger the workflow:

gh workflow run chaos-engineering.lock.yml

The agentic workflow will:

Find the open issue with chaos-engineering label
Extract the scenario from the issue title
Apply the code-level breaking change
Create a PR with the chaos prefix

If no matching issue is found, a random scenario is selected.

9. Demo Walkthrough

┌─────────────────────────────────────────────────────┐
│  1. Introduce Chaos                                 │
│     └─ gh issue create → gh workflow run            │
│                                                     │
│  2. Breaking Change Takes Effect                    │
│     └─ CI/CD deploys broken code from PR            │
│                                                     │
│  3. SRE Agent Detects Issue                         │
│     └─ Alert fires → agent investigates             │
│     └─ Agent queries logs + code                    │
│                                                     │
│  4. RCA + GitHub Issue Created                      │
│     └─ Assigned to @copilot, label: sre-agent-detected│
│                                                     │
│  5. Copilot Coding Agent Fixes                      │
│     └─ Reads issue → creates fix PR                 │
│                                                     │
│  6. Fix Deployed + Health Restored                  │
└─────────────────────────────────────────────────────┘

Before the Demo

Verify app is healthy:

curl https://{backend-fqdn}/actuator/health
curl https://{frontend-fqdn}/

Open these windows:
- Azure Portal → SRE Agent chat (sre.azure.com)
- GitHub → Repository issues tab
- Terminal → Ready to trigger chaos

During the Demo

Part 1: Show Healthy App (~2 min)

Open Three Rivers Bank website, show card comparison working
Start the load generator to establish a healthy baseline:
```
bash sre/scripts/generate-load.sh --rps 2 --forever
```
In SRE Agent chat: "What is the current health of my container apps?"

Part 2: Introduce Chaos (~1 min)

gh issue create \
  --title "chaos: backend-slow-response" \
  --body "Inject backend-slow-response scenario" \
  --label "chaos-engineering"
gh workflow run chaos-engineering.lock.yml

Wait for the PR to be created and merged
Keep the load generator running — it will show errors as the chaos takes effect

Part 3: Wait for Detection (~3-5 min)

Watch the load generator output for rising error counts
Azure Monitor alert fires → SRE Agent investigates automatically
Or ask manually: "Check the health of the Three Rivers Bank backend. Is anything wrong?"

Part 4: Show RCA + Issue (~2 min)

The agent creates a GitHub issue with root cause, code references, and fix suggestion
Show the issue assigned to @copilot with sre-agent-detected label

Part 5: Copilot Fixes (~2-3 min)

Copilot Coding Agent picks up the issue → creates fix PR
Review and merge

Part 6: Recovery (~1 min)

CI/CD deploys fix → verify app healthy again
Watch the load generator output return to all green
SRE Agent confirms: "Is the application healthy now?"
Stop the load generator with Ctrl+C to see the summary

10. Troubleshooting

Problem	Solution
`azd up` fails in `sre/`	Ensure `APP_RESOURCE_GROUP` is set: `azd env set APP_RESOURCE_GROUP <name>`
`Microsoft.App not registered`	Run: `az provider register -n Microsoft.App --wait`
Post-provision script fails on response plan	Wait 30s and run: `bash scripts/post-provision.sh --retry`
SRE Agent can't see Container Apps	Verify `APP_RESOURCE_GROUP` matches your app deployment's resource group
GitHub OAuth URL not shown	Check SRE Agent Administrator role is assigned (script does this automatically)
GitHub connector shows disconnected	Re-run `bash scripts/post-provision.sh --retry` and re-authorize OAuth URL
Copilot doesn't pick up issues	Verify the issue is assigned to `@copilot` and Copilot Coding Agent is enabled
Alert rules not firing	Check alerts: `az monitor metrics alert list --resource-group $(azd env get-value AZURE_RESOURCE_GROUP)`
`roleAssignments/write` denied	Need Owner role on subscription

11. Reference Links

Resource	URL
Azure SRE Agent Overview	https://learn.microsoft.com/en-us/azure/sre-agent/overview
SRE Agent Portal	https://sre.azure.com
SRE Agent Lab (GitHub)	https://github.com/microsoft/sre-agent/tree/main/labs/starter-lab
Data Plane API Reference (post-provision.sh)	https://github.com/microsoft/sre-agent/blob/main/labs/starter-lab/scripts/post-provision.sh
Connect Source Code	https://learn.microsoft.com/en-us/azure/sre-agent/connect-source-code
Connectors	https://learn.microsoft.com/en-us/azure/sre-agent/connectors
Incident Response	https://learn.microsoft.com/en-us/azure/sre-agent/incident-response
GitHub Copilot Coding Agent	https://docs.github.com/en/copilot/using-github-copilot/using-copilot-coding-agent

Appendix: Directory Structure

sre/
├── azure.yaml                          # azd project config
├── infra/
│   ├── main.bicep                      # Entry point (subscription scope)
│   ├── resources.bicep                 # Resource orchestration
│   └── modules/
│       ├── sre-agent.bicep             # SRE Agent resource
│       ├── identity.bicep              # Managed Identity
│       ├── subscription-rbac.bicep     # RBAC role assignments
│       └── alert-rules.bicep           # Azure Monitor alerts (3 rules)
├── scripts/
│   ├── post-provision.sh               # Configures KB, subagents, tasks, response plan, GitHub
│   ├── chaos-engineering.sh            # Inject/rollback infrastructure faults via az CLI
│   └── generate-load.sh               # Generate light traffic against deployed app
├── sre-config/
│   ├── agents/
│   │   ├── incident-handler.yaml       # Investigates incidents, creates GitHub issues
│   │   └── code-analyzer.yaml          # Deep code root cause analysis
│   └── tasks/
│       ├── health-check.yaml           # Comprehensive health check every 30 minutes
│       ├── config-drift.yaml           # Configuration drift detection every 6 hours
│       └── daily-reliability-report.yaml # Daily reliability report at 8am UTC
└── knowledge-base/
    ├── error-investigation-runbook.md   # Error investigation runbook
    └── app-architecture.md             # Application architecture reference

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Azure SRE Agent Setup Guide — Three Rivers Bank Demo

Table of Contents

1. Prerequisites

Azure Requirements

Tools Required

GitHub Requirements

Enable GitHub Actions PR Creation

2. Deploy the Application

3. Deploy the SRE Agent

What Gets Deployed

RBAC Roles Assigned (Subscription Scope)

Deploy

4. Configure the SRE Agent (Post-Provision)

Customizing the GitHub Repository

What the Script Configures

GitHub Authorization (One-Time)

Re-running the Script

5. Verify Setup

5.1 Check the Portal

5.2 Test the Agent

5.3 Verify Alerts

6. Scheduled Tasks

Verify Scheduled Tasks

7. Enable Copilot Coding Agent Auto-Fix

6.1 Create Required Labels

6.2 Enable Copilot Coding Agent

6.3 How It Works

8. Chaos Engineering Scenarios

Available Scenarios

Install gh-aw

Trigger via GitHub Issue

9. Demo Walkthrough

Before the Demo

During the Demo

10. Troubleshooting

11. Reference Links

Appendix: Directory Structure

FilesExpand file tree

SRE-AGENT-SETUP.md

Latest commit

History

SRE-AGENT-SETUP.md

File metadata and controls

Azure SRE Agent Setup Guide — Three Rivers Bank Demo

Table of Contents

1. Prerequisites

Azure Requirements

Tools Required

GitHub Requirements

Enable GitHub Actions PR Creation

2. Deploy the Application

3. Deploy the SRE Agent

What Gets Deployed

RBAC Roles Assigned (Subscription Scope)

Deploy

4. Configure the SRE Agent (Post-Provision)

Customizing the GitHub Repository

What the Script Configures

GitHub Authorization (One-Time)

Re-running the Script

5. Verify Setup

5.1 Check the Portal

5.2 Test the Agent

5.3 Verify Alerts

6. Scheduled Tasks

Verify Scheduled Tasks

7. Enable Copilot Coding Agent Auto-Fix

6.1 Create Required Labels

6.2 Enable Copilot Coding Agent

6.3 How It Works

8. Chaos Engineering Scenarios

Available Scenarios

Install gh-aw

Trigger via GitHub Issue

9. Demo Walkthrough

Before the Demo

During the Demo

10. Troubleshooting

11. Reference Links

Appendix: Directory Structure