An AI-powered observability agent that helps Site Reliability Engineers (SREs) investigate incidents and reduce Mean Time to Resolution (MTTR). Built with Amazon Bedrock AgentCore and the Strands Agent SDK.
This sample implements the architecture from the AWS blog post "Reduce Mean Time to Resolution with an observability agent".
The agent queries three data sources to investigate incidents:
- Logs - Application logs stored in Amazon OpenSearch Serverless
- Traces - Distributed traces stored in Amazon OpenSearch Serverless
- Metrics - Infrastructure metrics stored in Amazon Managed Service for Prometheus
Best for: Learning, POC, testing the agent with sample data.
This path creates all required AWS resources and populates them with test data simulating a payment service failure.
Time: ~15 minutes | Cost: ~$5/day for OpenSearch Serverless
Best for: Production use with your existing observability stack.
This path connects the agent to your existing OpenSearch and Prometheus deployments.
Time: ~10 minutes | Prerequisites: Existing OpenSearch + Prometheus
- AWS account with appropriate permissions
- Python 3.11+
- AWS CLI configured with credentials
git clone https://github.com/aws-samples/observability-agent-bedrock-agentcore.git
cd observability-agent-bedrock-agentcore
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
# Set your AWS account ID
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_REGION=us-east-1
# Create encryption policy
aws opensearchserverless create-security-policy \
--name observability-enc \
--type encryption \
--policy '{"Rules":[{"ResourceType":"collection","Resource":["collection/observability-agent"]}],"AWSOwnedKey":true}'
# Create network policy
aws opensearchserverless create-security-policy \
--name observability-net \
--type network \
--policy '[{"Rules":[{"ResourceType":"collection","Resource":["collection/observability-agent"]},{"ResourceType":"dashboard","Resource":["collection/observability-agent"]}],"AllowFromPublic":true}]'
# Create data access policy
aws opensearchserverless create-access-policy \
--name observability-access \
--type data \
--policy "[{\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/observability-agent\"],\"Permission\":[\"aoss:*\"]},{\"ResourceType\":\"index\",\"Resource\":[\"index/observability-agent/*\"],\"Permission\":[\"aoss:*\"]}],\"Principal\":[\"arn:aws:iam::${AWS_ACCOUNT_ID}:root\"]}]"
# Create collection
aws opensearchserverless create-collection \
--name observability-agent \
--type SEARCH
Wait for the collection to become ACTIVE (~2-3 minutes):
aws opensearchserverless batch-get-collection --names observability-agent
# Get collection endpoint
export OPENSEARCH_HOST=$(aws opensearchserverless batch-get-collection \
--names observability-agent \
--query 'collectionDetails[0].collectionEndpoint' \
--output text | sed 's|https://||')
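The wait-then-read-endpoint sequence above can also be scripted instead of re-running the status command by hand. The following is a minimal sketch, not part of this sample: the helper names `wait_for_status`, `strip_scheme`, and `collection_status` are hypothetical, and the timeout and interval values are arbitrary assumptions. It relies on the boto3 `opensearchserverless` client's `batch_get_collection` call, imported lazily so the polling logic can be exercised without AWS access.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "ACTIVE",
    failed: str = "FAILED",
    timeout_s: float = 600.0,   # assumed upper bound, well above the ~2-3 minutes quoted
    interval_s: float = 15.0,   # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll get_status() until it returns `target`; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return
        if status == failed:
            raise RuntimeError(f"collection entered {failed} state")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"collection still {status} after {timeout_s}s")
        sleep(interval_s)


def strip_scheme(endpoint: str) -> str:
    """Mirror the sed step above: drop a leading https:// from the endpoint."""
    return endpoint.removeprefix("https://")


def collection_status(name: str = "observability-agent") -> str:
    """Read the collection status with boto3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("opensearchserverless")
    details = client.batch_get_collection(names=[name])["collectionDetails"]
    return details[0]["status"] if details else "NOT_FOUND"
```

Calling `wait_for_status(collection_status)` blocks until the collection reports ACTIVE. Passing the status reader in as a callable keeps the loop independent of AWS, so it can be tested with a stub.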
echo "OpenSearch Host: $OPENSEARCH_HOST"
python scripts/generate_test_data.py
This creates sample logs and traces simulating a payment service failure with ~40% error rate.
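To give a feel for what "~40% error rate" means in the generated data, here is a small illustrative generator. It is not scripts/generate_test_data.py, whose contents are not shown here; the function name `make_sample_logs`, the record fields, the messages, and the fixed start date are all assumptions made for this sketch.

```python
import random
from datetime import datetime, timedelta, timezone


def make_sample_logs(count: int, error_rate: float = 0.4, seed: int = 7) -> list[dict]:
    """Build synthetic payment-service log records; about error_rate of them are errors."""
    rng = random.Random(seed)  # seeded so repeated runs produce identical data
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary start time
    records = []
    for i in range(count):
        failed = rng.random() < error_rate
        records.append({
            "timestamp": (start + timedelta(seconds=i)).isoformat(),
            "service": "payment-service",          # hypothetical service name
            "severity": "ERROR" if failed else "INFO",
            "message": "payment authorization failed" if failed else "payment processed",
        })
    return records
```

Because each record fails independently, the observed error ratio only approximates `error_rate`, which is consistent with the "~" in the description above.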
pip install bedrock-agentcore-starter-toolkit
agentcore configure --entrypoint agent/main.py --non-interactive
agentcore deploy
Get the AgentCore role name from the deploy output, then:
# Replace with your actual role name from deploy output
export AGENTCORE_ROLE=AmazonBedrockAgentCoreSDKRuntime-us-east-1-XXXXXX
export COLLECTION_ID=$(aws opensearchserverless batch-get-collection \
--names observability-agent \
--query 'collectionDetails[0].id' \
--output text)
# Add OpenSearch permissions
aws iam put-role-policy \
--role-name $AGENTCORE_ROLE \
--policy-name OpenSearchServerlessAccess \
--policy-document "{
\"Version\": \"2012-10-17\",
\"Statement\": [{
\"Effect\": \"Allow\",
\"Action\": [\"aoss:APIAccessAll\"],
\"Resource\": \"arn:aws:aoss:${AWS_REGION}:${AWS_ACCOUNT_ID}:collection/${COLLECTION_ID}\"
}]
}"
# Update OpenSearch data access policy
POLICY_VERSION=$(aws opensearchserverless get-access-policy \
--name observability-access --type data \
--query 'accessPolicyDetail.policyVersion' --output text)
aws opensearchserverless update-access-policy \
--name observability-access \
--type data \
--policy-version $POLICY_VERSION \
--policy "[{\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/observability-agent\"],\"Permission\":[\"aoss:*\"]},{\"ResourceType\":\"index\",\"Resource\":[\"index/observability-agent/*\"],\"Permission\":[\"aoss:*\"]}],\"Principal\":[\"arn:aws:iam::${AWS_ACCOUNT_ID}:root\",\"arn:aws:iam::${AWS_ACCOUNT_ID}:role/${AGENTCORE_ROLE}\"]}]"
# Wait for IAM propagation
sleep 30
# Health check — uses get_red_metrics to show Rate/Error/Duration per service
agentcore invoke '{"prompt": "Give me a health overview of all my services"}'
# Error investigation — uses search_logs + get_red_metrics to find errors
agentcore invoke '{"prompt": "Are there any errors in my application?"}'
# Service deep dive — uses get_spans + search_logs to trace failures
agentcore invoke '{"prompt": "What is wrong with the payment service? Show me the error traces"}'
# Trace correlation — uses get_spans + search_logs to follow a request across services
agentcore invoke '{"prompt": "Find a failed trace and show me the full request flow across all services"}'
# Metrics query — uses query_metrics to check infrastructure health
agentcore invoke '{"prompt": "Query prometheus for CPU and memory metrics"}'
# Root cause analysis — agent combines all tools to investigate
agentcore invoke '{"prompt": "Our checkout is failing for some users. Investigate the root cause and suggest fixes"}'
- Existing Amazon OpenSearch Service or OpenSearch Serverless with logs/traces
- (Optional) Amazon Managed Service for Prometheus with metrics
- Python 3.11+
git clone https://github.com/aws-samples/observability-agent-bedrock-agentcore.git
cd observability-agent-bedrock-agentcore
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Edit agent/main.py or set environment variables:
# For OpenSearch Serverless
export OPENSEARCH_HOST=your-collection-id.us-east-1.aoss.amazonaws.com
# For Amazon Managed Prometheus (optional)
export AMP_WORKSPACE_ID=ws-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export AWS_REGION=us-east-1
If your indices use different naming conventions, update the index patterns in agent/main.py:
# Default patterns (OpenTelemetry standard)
"otel-v1-apm-span-*" # For traces
"otel-logs-*" # For logs
# Example: Custom patterns
"your-traces-*"
"your-logs-*"
pip install bedrock-agentcore-starter-toolkit
agentcore configure --entrypoint agent/main.py --non-interactive
agentcore deploy
Then add the AgentCore role to your OpenSearch data access policy (see Step 6 in Quick Start).
agentcore invoke '{"prompt": "Show me the health of my services"}'
The agent exposes four tools for querying observability data:
| Tool | Data Source | Description |
|---|---|---|
| get_red_metrics | OpenSearch (traces) | Rate, Error, Duration metrics by service |
| search_logs | OpenSearch (logs) | Search logs by service, severity |
| get_spans | OpenSearch (traces) | Search distributed trace spans |
| query_metrics | Prometheus | Query metrics using PromQL |
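To illustrate what Rate, Error, Duration (RED) metrics by service means, the sketch below derives them from a list of spans in memory. It is not the implementation of get_red_metrics. The function `red_by_service`, the `Red` record, and the simplified span fields (`service`, `duration_ms`, `error`) are assumptions for illustration, and the choice of p95 as the duration statistic is also an assumption.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Red:
    rate_per_s: float   # requests per second over the window
    error_ratio: float  # fraction of spans marked as errors
    p95_ms: float       # 95th percentile duration (nearest-rank)


def red_by_service(spans: list[dict], window_s: float) -> dict[str, Red]:
    """Group spans by service and compute rate, error ratio, and p95 duration."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for span in spans:
        groups[span["service"]].append(span)

    result = {}
    for service, items in groups.items():
        durations = sorted(s["duration_ms"] for s in items)
        rank = max(1, math.ceil(0.95 * len(durations)))  # nearest-rank percentile
        errors = sum(1 for s in items if s["error"])
        result[service] = Red(
            rate_per_s=len(items) / window_s,
            error_ratio=errors / len(items),
            p95_ms=durations[rank - 1],
        )
    return result
```

In a real deployment this aggregation would more likely run inside OpenSearch than in Python; the point here is only the definition of the three numbers.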
This sample follows AWS security best practices:
- No hardcoded credentials - Uses IAM roles for authentication
- TLS everywhere - All connections use HTTPS with certificate verification
- Input validation - All tool inputs are validated and sanitized
- Least privilege - IAM policies grant minimal required permissions
See CONTRIBUTING.md for security issue reporting.
To avoid ongoing charges, delete the resources when done:
# Delete AgentCore resources
agentcore destroy
# Delete OpenSearch Serverless (if created)
aws opensearchserverless delete-collection --id YOUR_COLLECTION_ID
aws opensearchserverless delete-security-policy --name observability-enc --type encryption
aws opensearchserverless delete-security-policy --name observability-net --type network
aws opensearchserverless delete-access-policy --name observability-access --type data
- AWS Blog: Reduce MTTR with an Observability Agent
- Amazon Bedrock AgentCore Documentation
- Strands Agents SDK
- Amazon OpenSearch Serverless
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
