Observability Agent with Amazon Bedrock AgentCore

An AI-powered observability agent that helps Site Reliability Engineers (SREs) investigate incidents and reduce Mean Time to Resolution (MTTR). Built with Amazon Bedrock AgentCore and the Strands Agent SDK.

This sample implements the architecture described in "Reduce Mean Time to Resolution with an observability agent".

Architecture

The agent queries three data sources to investigate incidents:

Logs - Application logs stored in Amazon OpenSearch Serverless
Traces - Distributed traces stored in Amazon OpenSearch Serverless
Metrics - Infrastructure metrics stored in Amazon Managed Service for Prometheus

Choose Your Path

Option A: Quick Start (Build from Scratch)

Best for: Learning, proofs of concept, and testing the agent with sample data.

This path creates all required AWS resources and populates them with test data simulating a payment service failure.

Time: ~15 minutes | Cost: ~$5/day for OpenSearch Serverless

→ Quick Start Guide

Option B: Integrate with Existing Infrastructure

Best for: Production use with your existing observability stack.

This path connects the agent to your existing OpenSearch and Prometheus deployments.

Time: ~10 minutes | Prerequisites: Existing OpenSearch + Prometheus

→ Integration Guide

Quick Start (Build from Scratch)

Prerequisites

AWS account with appropriate permissions
Python 3.11+
AWS CLI configured with credentials

Step 1: Clone and Setup

git clone https://github.com/aws-samples/observability-agent-bedrock-agentcore.git
cd observability-agent-bedrock-agentcore

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Step 2: Create OpenSearch Serverless Collection

# Set your AWS account ID and Region
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_REGION=us-east-1

# Create encryption policy
aws opensearchserverless create-security-policy \
  --name observability-enc \
  --type encryption \
  --policy '{"Rules":[{"ResourceType":"collection","Resource":["collection/observability-agent"]}],"AWSOwnedKey":true}'

# Create network policy
aws opensearchserverless create-security-policy \
  --name observability-net \
  --type network \
  --policy '[{"Rules":[{"ResourceType":"collection","Resource":["collection/observability-agent"]},{"ResourceType":"dashboard","Resource":["collection/observability-agent"]}],"AllowFromPublic":true}]'

# Create data access policy
aws opensearchserverless create-access-policy \
  --name observability-access \
  --type data \
  --policy "[{\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/observability-agent\"],\"Permission\":[\"aoss:*\"]},{\"ResourceType\":\"index\",\"Resource\":[\"index/observability-agent/*\"],\"Permission\":[\"aoss:*\"]}],\"Principal\":[\"arn:aws:iam::${AWS_ACCOUNT_ID}:root\"]}]"

# Create collection
aws opensearchserverless create-collection \
  --name observability-agent \
  --type SEARCH
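
The hand-escaped JSON passed to create-access-policy is easy to mistype. As an alternative, the sketch below builds the same data access policy with json.dumps so the result can be pasted into --policy. The helper name build_data_access_policy and the account ID 123456789012 are placeholders for illustration and are not part of this repository.

```python
import json


def build_data_access_policy(collection: str, principals: list[str]) -> str:
    """Return the Step 2 data access policy as a compact JSON string."""
    policy = [
        {
            "Rules": [
                {
                    "ResourceType": "collection",
                    "Resource": [f"collection/{collection}"],
                    "Permission": ["aoss:*"],
                },
                {
                    "ResourceType": "index",
                    "Resource": [f"index/{collection}/*"],
                    "Permission": ["aoss:*"],
                },
            ],
            "Principal": principals,
        }
    ]
    return json.dumps(policy, separators=(",", ":"))


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID.
    print(build_data_access_policy("observability-agent", ["arn:aws:iam::123456789012:root"]))
```

Generating the document this way avoids backslash escaping entirely; the shell only needs to wrap the printed string in single quotes.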

Wait for the collection to become ACTIVE (~2-3 minutes):

aws opensearchserverless batch-get-collection --names observability-agent \
  --query 'collectionDetails[0].status' --output text
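
Instead of re-running the status command by hand, the wait can be scripted. The following is a minimal polling sketch, not part of this repository: wait_until_active and collection_status are hypothetical helper names, the timeout and interval values are arbitrary, and collection_status assumes boto3 is installed and AWS credentials are configured.

```python
import time
from typing import Callable


def wait_until_active(
    fetch_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns ACTIVE; raise on FAILED or timeout."""
    waited = 0.0
    while True:
        status = fetch_status()
        if status == "ACTIVE":
            return status
        if status == "FAILED":
            raise RuntimeError("collection creation failed")
        if waited >= timeout_s:
            raise TimeoutError(f"collection still {status} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


def collection_status(name: str = "observability-agent") -> str:
    """Read the collection status with boto3 (imported lazily; needs AWS credentials)."""
    import boto3

    client = boto3.client("opensearchserverless")
    details = client.batch_get_collection(names=[name])["collectionDetails"]
    # Treat a collection that is not listed yet as still being created.
    return details[0]["status"] if details else "CREATING"
```

Calling wait_until_active(collection_status) blocks until the collection is ready. Passing the status lookup in as a function keeps the loop testable without AWS access.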

Step 3: Configure Environment

# Get collection endpoint
export OPENSEARCH_HOST=$(aws opensearchserverless batch-get-collection \
  --names observability-agent \
  --query 'collectionDetails[0].collectionEndpoint' \
  --output text | sed 's|https://||')

echo "OpenSearch Host: $OPENSEARCH_HOST"
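
The sed call strips the https:// prefix because OPENSEARCH_HOST is expected to be a bare host name. If the value comes from somewhere else (a config file, a console copy-paste), a small normalization step can guard against a stray scheme or trailing slash. This is an illustrative sketch; endpoint_to_host is a hypothetical helper, not a function from this repository.

```python
from urllib.parse import urlparse


def endpoint_to_host(endpoint: str) -> str:
    """Return the bare host name for an endpoint given with or without a scheme."""
    endpoint = endpoint.strip()
    # urlparse only recognizes the host when a scheme or a leading // is present.
    parsed = urlparse(endpoint if "://" in endpoint else f"//{endpoint}")
    if not parsed.hostname:
        raise ValueError(f"no host found in {endpoint!r}")
    return parsed.hostname
```

Note that hostname drops any port and lowercases the result, which matches what the sed expression produces for a standard collection endpoint.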

Step 4: Generate Test Data

python scripts/generate_test_data.py

This creates sample logs and traces simulating a payment service failure with ~40% error rate.
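
The contents of scripts/generate_test_data.py are not reproduced here. To illustrate the idea of a failure scenario with a fixed error rate, the sketch below generates synthetic log records where roughly 40% are errors. The function name, field names, and messages are assumptions chosen for the example and may differ from what the script actually writes.

```python
import random
from datetime import datetime, timedelta, timezone


def make_payment_logs(count: int, error_rate: float = 0.4, seed: int = 0) -> list[dict]:
    """Generate synthetic payment-service log records; roughly error_rate are errors."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary fixed start time
    logs = []
    for i in range(count):
        failed = rng.random() < error_rate
        logs.append(
            {
                "timestamp": (start + timedelta(seconds=i)).isoformat(),
                "service": "payment-service",
                "severity": "ERROR" if failed else "INFO",
                "message": "payment authorization failed" if failed else "payment processed",
            }
        )
    return logs
```

Seeding the generator makes the data reproducible, which helps when comparing agent answers across runs.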

Step 5: Deploy the Agent

pip install bedrock-agentcore-starter-toolkit

agentcore configure --entrypoint agent/main.py --non-interactive
agentcore deploy

Step 6: Grant Permissions

Get the AgentCore role name from the deploy output, then:

# Replace with your actual role name from deploy output
export AGENTCORE_ROLE=AmazonBedrockAgentCoreSDKRuntime-us-east-1-XXXXXX
export COLLECTION_ID=$(aws opensearchserverless batch-get-collection \
  --names observability-agent \
  --query 'collectionDetails[0].id' \
  --output text)

# Add OpenSearch permissions
aws iam put-role-policy \
  --role-name $AGENTCORE_ROLE \
  --policy-name OpenSearchServerlessAccess \
  --policy-document "{
    \"Version\": \"2012-10-17\",
    \"Statement\": [{
      \"Effect\": \"Allow\",
      \"Action\": [\"aoss:APIAccessAll\"],
      \"Resource\": \"arn:aws:aoss:${AWS_REGION}:${AWS_ACCOUNT_ID}:collection/${COLLECTION_ID}\"
    }]
  }"

# Update OpenSearch data access policy
POLICY_VERSION=$(aws opensearchserverless get-access-policy \
  --name observability-access --type data \
  --query 'accessPolicyDetail.policyVersion' --output text)

aws opensearchserverless update-access-policy \
  --name observability-access \
  --type data \
  --policy-version $POLICY_VERSION \
  --policy "[{\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/observability-agent\"],\"Permission\":[\"aoss:*\"]},{\"ResourceType\":\"index\",\"Resource\":[\"index/observability-agent/*\"],\"Permission\":[\"aoss:*\"]}],\"Principal\":[\"arn:aws:iam::${AWS_ACCOUNT_ID}:root\",\"arn:aws:iam::${AWS_ACCOUNT_ID}:role/${AGENTCORE_ROLE}\"]}]"
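
The update above restates the whole policy just to add one principal. A sketch of the same change as a JSON transformation follows, assuming you pass in the current policy document as a JSON string. add_principal is a hypothetical helper and is not part of this repository.

```python
import json


def add_principal(policy_json: str, principal_arn: str) -> str:
    """Return policy_json with principal_arn added to each block's Principal list."""
    policy = json.loads(policy_json)
    for block in policy:
        principals = block.setdefault("Principal", [])
        if principal_arn not in principals:  # keep the operation idempotent
            principals.append(principal_arn)
    return json.dumps(policy, separators=(",", ":"))
```

Because the existing rules are left untouched, this approach also works when the policy has been customized since Step 2.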

Step 7: Test the Agent

# Wait for IAM propagation
sleep 30

# Health check — uses get_red_metrics to show Rate/Error/Duration per service
agentcore invoke '{"prompt": "Give me a health overview of all my services"}'

# Error investigation — uses search_logs + get_red_metrics to find errors
agentcore invoke '{"prompt": "Are there any errors in my application?"}'

# Service deep dive — uses get_spans + search_logs to trace failures
agentcore invoke '{"prompt": "What is wrong with the payment service? Show me the error traces"}'

# Trace correlation — uses get_spans + search_logs to follow a request across services
agentcore invoke '{"prompt": "Find a failed trace and show me the full request flow across all services"}'

# Metrics query — uses query_metrics to check infrastructure health
agentcore invoke '{"prompt": "Query Prometheus for CPU and memory metrics"}'

# Root cause analysis — agent combines all tools to investigate
agentcore invoke '{"prompt": "Our checkout is failing for some users. Investigate the root cause and suggest fixes"}'
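
The single-quoted payloads above break as soon as a prompt contains an apostrophe or a double quote. One way around this is to let Python produce both the JSON and the shell quoting. This is a minimal sketch; invoke_command is a hypothetical helper that only builds the command line shown in this step.

```python
import json
import shlex


def invoke_command(prompt: str) -> str:
    """Return a shell-safe `agentcore invoke` command line for the given prompt."""
    payload = json.dumps({"prompt": prompt})
    return shlex.join(["agentcore", "invoke", payload])


if __name__ == "__main__":
    print(invoke_command("What's wrong with the payment service?"))
```

The printed line can be pasted into a POSIX shell as-is.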

Integrate with Existing Infrastructure

Prerequisites

Existing Amazon OpenSearch Service or OpenSearch Serverless with logs/traces
(Optional) Amazon Managed Service for Prometheus with metrics
Python 3.11+

Step 1: Clone and Setup

git clone https://github.com/aws-samples/observability-agent-bedrock-agentcore.git
cd observability-agent-bedrock-agentcore

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Step 2: Configure Your Endpoints

Edit agent/main.py or set environment variables:

# For OpenSearch Serverless
export OPENSEARCH_HOST=your-collection-id.us-east-1.aoss.amazonaws.com

# For Amazon Managed Service for Prometheus (optional)
export AMP_WORKSPACE_ID=ws-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

export AWS_REGION=us-east-1
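
For orientation, AMP_WORKSPACE_ID and AWS_REGION together identify the Prometheus-compatible query endpoint of the workspace. How agent/main.py builds its requests is not shown in this README, so the sketch below is only an illustration of the URL shape; amp_query_url is a hypothetical helper, and real requests must additionally be signed with SigV4.

```python
from urllib.parse import urlencode


def amp_query_url(workspace_id: str, region: str, promql: str) -> str:
    """Build the instant-query URL for an Amazon Managed Service for Prometheus workspace."""
    base = (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{workspace_id}/api/v1/query"
    )
    return f"{base}?{urlencode({'query': promql})}"
```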

Step 3: Customize Index Patterns (if needed)

If your indices use different naming conventions, update the index patterns in agent/main.py:

# Default patterns (OpenTelemetry standard)
"otel-v1-apm-span-*"  # For traces
"otel-logs-*"          # For logs

# Example: Custom patterns
"your-traces-*"
"your-logs-*"
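
If you would rather not edit the source for each environment, the patterns could be read from environment variables with the OpenTelemetry defaults as a fallback. This is a sketch of that idea, not the current behavior of agent/main.py; the variable names TRACE_INDEX_PATTERN and LOG_INDEX_PATTERN are assumptions made up for this example.

```python
import os
from typing import Mapping

DEFAULT_TRACE_INDEX = "otel-v1-apm-span-*"
DEFAULT_LOG_INDEX = "otel-logs-*"


def index_patterns(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Resolve index patterns, falling back to the OpenTelemetry defaults."""
    return {
        # Empty values are treated the same as unset variables.
        "traces": env.get("TRACE_INDEX_PATTERN") or DEFAULT_TRACE_INDEX,
        "logs": env.get("LOG_INDEX_PATTERN") or DEFAULT_LOG_INDEX,
    }
```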

Step 4: Deploy and Configure Permissions

pip install bedrock-agentcore-starter-toolkit
agentcore configure --entrypoint agent/main.py --non-interactive
agentcore deploy

Then add the AgentCore role to your OpenSearch data access policy (see Step 6 in Quick Start).

Step 5: Test

agentcore invoke '{"prompt": "Show me the health of my services"}'

Agent Tools

The agent exposes four tools for querying observability data:

Tool	Data Source	Description
`get_red_metrics`	OpenSearch (traces)	Rate, Error, Duration metrics by service
`search_logs`	OpenSearch (logs)	Search logs by service, severity
`get_spans`	OpenSearch (traces)	Search distributed trace spans
`query_metrics`	Prometheus	Query metrics using PromQL
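
To make the RED terminology concrete, the sketch below derives Rate, Error, and Duration figures per service from a list of span records. It illustrates the concept only and is not the implementation of get_red_metrics; the span field names (service, error, duration_ms) are assumptions for the example.

```python
from collections import defaultdict


def red_metrics(spans: list[dict], window_s: float) -> dict[str, dict[str, float]]:
    """Compute request rate, error ratio, and mean duration per service."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for span in spans:
        grouped[span["service"]].append(span)

    result = {}
    for service, items in grouped.items():
        errors = sum(1 for s in items if s["error"])
        result[service] = {
            "rate_per_s": len(items) / window_s,  # Rate
            "error_ratio": errors / len(items),  # Error
            "avg_duration_ms": sum(s["duration_ms"] for s in items) / len(items),  # Duration
        }
    return result
```

In practice these aggregations would typically run inside OpenSearch rather than in Python; the function just shows what each number means.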

Security

This sample follows AWS security best practices:

No hardcoded credentials - Uses IAM roles for authentication
TLS everywhere - All connections use HTTPS with certificate verification
Input validation - All tool inputs are validated and sanitized
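Least privilege - The agent role's IAM policy is scoped to a single collection. The Quick Start data access policy (aoss:*) and network policy (AllowFromPublic) are broad; narrow them before production use.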

See CONTRIBUTING.md for security issue reporting.
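
As an illustration of the input validation point above, a tool that accepts a service name and severity might check them before building a query. This is a hedged sketch, not the validation code in agent/main.py; the function name, allowed characters, severity set, and limit range are all assumptions.

```python
import re

_SERVICE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,62}")
_SEVERITIES = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}


def validate_log_query(service: str, severity: str, limit: int = 100) -> dict:
    """Validate search_logs-style inputs and return a normalized copy."""
    if not _SERVICE_NAME.fullmatch(service):
        raise ValueError("service must be 1-63 characters: letters, digits, '.', '_' or '-'")
    level = severity.strip().upper()
    if level not in _SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(_SEVERITIES)}")
    if not 1 <= limit <= 1000:
        raise ValueError("limit must be between 1 and 1000")
    return {"service": service, "severity": level, "limit": limit}
```

Rejecting unexpected characters up front keeps model-generated arguments from being interpolated into a query unchecked.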

Clean Up

To avoid ongoing charges, delete the resources when done:

# Delete AgentCore resources
agentcore destroy

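# Delete OpenSearch Serverless resources (if created in the Quick Start)
# Replace YOUR_COLLECTION_ID with the collection ID (the COLLECTION_ID value from Step 6)
aws opensearchserverless delete-collection --id YOUR_COLLECTION_ID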
aws opensearchserverless delete-security-policy --name observability-enc --type encryption
aws opensearchserverless delete-security-policy --name observability-net --type network
aws opensearchserverless delete-access-policy --name observability-access --type data
