aws-samples/sample-observability-agent-bedrock-agentcore

Observability Agent with Amazon Bedrock AgentCore

License: MIT-0 | Python 3.11+ | AWS Strands SDK

An AI-powered observability agent that helps Site Reliability Engineers (SREs) investigate incidents and reduce Mean Time to Resolution (MTTR). Built with Amazon Bedrock AgentCore and the Strands Agent SDK.

This sample implements the architecture described in "Reduce Mean Time to Resolution with an observability agent".

Architecture

[Figure: Observability Agent architecture]

The agent queries three data sources to investigate incidents:

  • Logs - Application logs stored in Amazon OpenSearch Serverless
  • Traces - Distributed traces stored in Amazon OpenSearch Serverless
  • Metrics - Infrastructure metrics stored in Amazon Managed Service for Prometheus

Choose Your Path

Option A: Quick Start (Build from Scratch)

Best for: Learning, POC, testing the agent with sample data.

This path creates all required AWS resources and populates them with test data simulating a payment service failure.

Time: ~15 minutes | Cost: ~$5/day for OpenSearch Serverless

→ Quick Start Guide

Option B: Integrate with Existing Infrastructure

Best for: Production use with your existing observability stack.

This path connects the agent to your existing OpenSearch and Prometheus deployments.

Time: ~10 minutes | Prerequisites: Existing OpenSearch + Prometheus

→ Integration Guide


Quick Start (Build from Scratch)

Prerequisites

  • AWS account with appropriate permissions
  • Python 3.11+
  • AWS CLI configured with credentials

Step 1: Clone and Setup

git clone https://github.com/aws-samples/sample-observability-agent-bedrock-agentcore.git
cd sample-observability-agent-bedrock-agentcore

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Step 2: Create OpenSearch Serverless Collection

# Set your AWS account ID
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export AWS_REGION=us-east-1

# Create encryption policy
aws opensearchserverless create-security-policy \
  --name observability-enc \
  --type encryption \
  --policy '{"Rules":[{"ResourceType":"collection","Resource":["collection/observability-agent"]}],"AWSOwnedKey":true}'

# Create network policy
aws opensearchserverless create-security-policy \
  --name observability-net \
  --type network \
  --policy '[{"Rules":[{"ResourceType":"collection","Resource":["collection/observability-agent"]},{"ResourceType":"dashboard","Resource":["collection/observability-agent"]}],"AllowFromPublic":true}]'

# Create data access policy
aws opensearchserverless create-access-policy \
  --name observability-access \
  --type data \
  --policy "[{\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/observability-agent\"],\"Permission\":[\"aoss:*\"]},{\"ResourceType\":\"index\",\"Resource\":[\"index/observability-agent/*\"],\"Permission\":[\"aoss:*\"]}],\"Principal\":[\"arn:aws:iam::${AWS_ACCOUNT_ID}:root\"]}]"

# Create collection
aws opensearchserverless create-collection \
  --name observability-agent \
  --type SEARCH
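
The hand-escaped JSON in the data access policy above is easy to mistype. As a minimal sketch (the helper name `build_data_access_policy` is hypothetical and not part of this repository), the same policy document can be generated in Python and passed to `--policy`:

```python
import json


def build_data_access_policy(collection: str, principals: list) -> str:
    """Return a compact JSON string suitable for the --policy argument."""
    policy = [
        {
            "Rules": [
                {
                    "ResourceType": "collection",
                    "Resource": [f"collection/{collection}"],
                    "Permission": ["aoss:*"],
                },
                {
                    "ResourceType": "index",
                    "Resource": [f"index/{collection}/*"],
                    "Permission": ["aoss:*"],
                },
            ],
            "Principal": principals,
        }
    ]
    return json.dumps(policy, separators=(",", ":"))


if __name__ == "__main__":
    account_id = "123456789012"  # placeholder account ID, replace with your own
    print(build_data_access_policy(
        "observability-agent", [f"arn:aws:iam::{account_id}:root"]
    ))
```

The printed string matches the structure of the policy used in the command above, so it can be substituted for the escaped literal.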

Wait for the collection to become ACTIVE (~2-3 minutes):

aws opensearchserverless batch-get-collection --names observability-agent
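
Instead of re-running the command by hand, the wait can be scripted. The following is a sketch, not part of the sample: `wait_for_status` is a hypothetical helper that polls any status callable, and `collection_status` assumes boto3 is installed and credentials are configured.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "ACTIVE",
    failed: str = "FAILED",
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns target; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return status
        if status == failed:
            raise RuntimeError(f"collection entered {failed} state")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout_s}s")
        sleep(interval_s)


def collection_status(name: str = "observability-agent") -> str:
    """Read the collection status via boto3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so wait_for_status has no AWS dependency

    client = boto3.client("opensearchserverless")
    details = client.batch_get_collection(names=[name])["collectionDetails"]
    return details[0]["status"] if details else "NOT_FOUND"
```

Calling `wait_for_status(collection_status)` blocks until the collection reports ACTIVE. The ten-minute timeout and ten-second interval are arbitrary defaults.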

Step 3: Configure Environment

# Get collection endpoint
export OPENSEARCH_HOST=$(aws opensearchserverless batch-get-collection \
  --names observability-agent \
  --query 'collectionDetails[0].collectionEndpoint' \
  --output text | sed 's|https://||')

echo "OpenSearch Host: $OPENSEARCH_HOST"
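
The `sed` call above strips the scheme so that `OPENSEARCH_HOST` holds a bare hostname. A Python equivalent might look like the sketch below (`host_from_endpoint` is a hypothetical helper). It also rejects the literal `None` that the AWS CLI prints with `--output text` when the query matches nothing, for example while the collection is still being created.

```python
from urllib.parse import urlparse


def host_from_endpoint(endpoint: str) -> str:
    """Return the bare hostname for a collection endpoint URL or hostname."""
    value = endpoint.strip()
    if value in ("", "None"):
        raise ValueError("collection endpoint is not available yet")
    # Add a scheme when missing so urlparse treats the value as a network location.
    parsed = urlparse(value if "://" in value else f"https://{value}")
    if not parsed.hostname:
        raise ValueError(f"cannot extract a hostname from {endpoint!r}")
    return parsed.hostname
```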

Step 4: Generate Test Data

python scripts/generate_test_data.py

This creates sample logs and traces simulating a payment service failure with a ~40% error rate.
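
The contents of `scripts/generate_test_data.py` are not reproduced here. To illustrate the idea only, the sketch below generates log records in which roughly 40% are errors. The function name, field names, and messages are all assumptions and may not match what the script actually writes.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional


def generate_logs(n: int, error_rate: float = 0.4, seed: Optional[int] = None) -> list:
    """Return n synthetic log records; each fails with probability error_rate."""
    rng = random.Random(seed)
    start = datetime.now(timezone.utc) - timedelta(minutes=30)
    logs = []
    for i in range(n):
        failed = rng.random() < error_rate
        logs.append({
            "timestamp": (start + timedelta(seconds=i)).isoformat(),
            "service": "payment-service",  # hypothetical service name
            "severity": "ERROR" if failed else "INFO",
            "message": "payment authorization failed" if failed else "payment authorized",
            "trace_id": f"{rng.getrandbits(128):032x}",  # 32 hex characters
        })
    return logs
```

Because each record fails independently, the observed error rate only approximates the requested one, which is why the figure above is stated as ~40%.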

Step 5: Deploy the Agent

pip install bedrock-agentcore-starter-toolkit

agentcore configure --entrypoint agent/main.py --non-interactive
agentcore deploy

Step 6: Grant Permissions

Get the AgentCore role name from the deploy output, then:

# Replace with your actual role name from deploy output
export AGENTCORE_ROLE=AmazonBedrockAgentCoreSDKRuntime-us-east-1-XXXXXX
export COLLECTION_ID=$(aws opensearchserverless batch-get-collection \
  --names observability-agent \
  --query 'collectionDetails[0].id' \
  --output text)

# Add OpenSearch permissions
aws iam put-role-policy \
  --role-name $AGENTCORE_ROLE \
  --policy-name OpenSearchServerlessAccess \
  --policy-document "{
    \"Version\": \"2012-10-17\",
    \"Statement\": [{
      \"Effect\": \"Allow\",
      \"Action\": [\"aoss:APIAccessAll\"],
      \"Resource\": \"arn:aws:aoss:${AWS_REGION}:${AWS_ACCOUNT_ID}:collection/${COLLECTION_ID}\"
    }]
  }"

# Update OpenSearch data access policy
POLICY_VERSION=$(aws opensearchserverless get-access-policy \
  --name observability-access --type data \
  --query 'accessPolicyDetail.policyVersion' --output text)

aws opensearchserverless update-access-policy \
  --name observability-access \
  --type data \
  --policy-version $POLICY_VERSION \
  --policy "[{\"Rules\":[{\"ResourceType\":\"collection\",\"Resource\":[\"collection/observability-agent\"],\"Permission\":[\"aoss:*\"]},{\"ResourceType\":\"index\",\"Resource\":[\"index/observability-agent/*\"],\"Permission\":[\"aoss:*\"]}],\"Principal\":[\"arn:aws:iam::${AWS_ACCOUNT_ID}:root\",\"arn:aws:iam::${AWS_ACCOUNT_ID}:role/${AGENTCORE_ROLE}\"]}]"
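
Both documents in this step can also be produced programmatically. This is a sketch with hypothetical helper names: `aoss_api_access_policy` builds the IAM policy shown above, and `add_principal` appends the AgentCore role to an existing data access policy without duplicating it.

```python
import json


def aoss_api_access_policy(region: str, account_id: str, collection_id: str) -> str:
    """Return the IAM policy document granting aoss:APIAccessAll on one collection."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["aoss:APIAccessAll"],
            "Resource": f"arn:aws:aoss:{region}:{account_id}:collection/{collection_id}",
        }],
    }, indent=2)


def add_principal(policy_json: str, principal_arn: str) -> str:
    """Add principal_arn to every statement of a data access policy, idempotently."""
    policy = json.loads(policy_json)
    for statement in policy:
        principals = statement.setdefault("Principal", [])
        if principal_arn not in principals:
            principals.append(principal_arn)
    return json.dumps(policy, separators=(",", ":"))
```

Running `add_principal` on the current policy text, rather than retyping the full policy, keeps any rules added since the collection was created.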

Step 7: Test the Agent

# Wait for IAM propagation
sleep 30

# Health check — uses get_red_metrics to show Rate/Error/Duration per service
agentcore invoke '{"prompt": "Give me a health overview of all my services"}'

# Error investigation — uses search_logs + get_red_metrics to find errors
agentcore invoke '{"prompt": "Are there any errors in my application?"}'

# Service deep dive — uses get_spans + search_logs to trace failures
agentcore invoke '{"prompt": "What is wrong with the payment service? Show me the error traces"}'

# Trace correlation — uses get_spans + search_logs to follow a request across services
agentcore invoke '{"prompt": "Find a failed trace and show me the full request flow across all services"}'

# Metrics query — uses query_metrics to check infrastructure health
agentcore invoke '{"prompt": "Query prometheus for CPU and memory metrics"}'

# Root cause analysis — agent combines all tools to investigate
agentcore invoke '{"prompt": "Our checkout is failing for some users. Investigate the root cause and suggest fixes"}'
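
When scripting several prompts, building the payload with `json.dumps` avoids quoting mistakes, for example when a prompt contains an apostrophe. The sketch below only constructs the `agentcore invoke` command shown above; the helper names are hypothetical.

```python
import json
import shlex


def invoke_command(prompt: str) -> list:
    """Return the argv list for one `agentcore invoke` call."""
    return ["agentcore", "invoke", json.dumps({"prompt": prompt})]


def shell_line(prompt: str) -> str:
    """Return the same command as a single, safely quoted shell line."""
    return shlex.join(invoke_command(prompt))


if __name__ == "__main__":
    for prompt in [
        "Give me a health overview of all my services",
        "What's wrong with the payment service?",
    ]:
        print(shell_line(prompt))
```

To execute a command rather than print it, pass the list to `subprocess.run(invoke_command(prompt), check=True)`.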

Integrate with Existing Infrastructure

Prerequisites

  • Existing Amazon OpenSearch Service or OpenSearch Serverless with logs/traces
  • (Optional) Amazon Managed Service for Prometheus with metrics
  • Python 3.11+

Step 1: Clone and Setup

git clone https://github.com/aws-samples/sample-observability-agent-bedrock-agentcore.git
cd sample-observability-agent-bedrock-agentcore

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Step 2: Configure Your Endpoints

Edit agent/main.py or set environment variables:

# For OpenSearch Serverless
export OPENSEARCH_HOST=your-collection-id.us-east-1.aoss.amazonaws.com

# For Amazon Managed Service for Prometheus (optional)
export AMP_WORKSPACE_ID=ws-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

export AWS_REGION=us-east-1
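
How `agent/main.py` reads these variables is not shown here. One possible approach is sketched below, where the `AgentConfig` class, the `us-east-1` fallback, and the validation rules are assumptions. It loads the three variables once and derives the Amazon Managed Service for Prometheus query URL from the workspace ID.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AgentConfig:
    opensearch_host: str
    region: str
    amp_workspace_id: Optional[str] = None

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "AgentConfig":
        env = os.environ if env is None else env
        host = env.get("OPENSEARCH_HOST", "").strip()
        if not host:
            raise ValueError("OPENSEARCH_HOST is required")
        if host.startswith(("http://", "https://")):
            raise ValueError("OPENSEARCH_HOST must be a bare hostname without a scheme")
        return cls(
            opensearch_host=host,
            region=env.get("AWS_REGION", "us-east-1"),  # assumed default
            amp_workspace_id=env.get("AMP_WORKSPACE_ID") or None,
        )

    @property
    def amp_query_url(self) -> Optional[str]:
        """PromQL instant-query endpoint, or None when Prometheus is not configured."""
        if self.amp_workspace_id is None:
            return None
        return (
            f"https://aps-workspaces.{self.region}.amazonaws.com"
            f"/workspaces/{self.amp_workspace_id}/api/v1/query"
        )
```

Failing early on a missing or malformed `OPENSEARCH_HOST` surfaces configuration mistakes at startup instead of at the first query.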

Step 3: Customize Index Patterns (if needed)

If your indices use different naming conventions, update the index patterns in agent/main.py:

# Default patterns (OpenTelemetry standard)
"otel-v1-apm-span-*"  # For traces
"otel-logs-*"          # For logs

# Example: Custom patterns
"your-traces-*"
"your-logs-*"

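As an alternative to editing the source, the patterns could be made configurable. The sketch below assumes two environment variables, `TRACE_INDEX_PATTERN` and `LOG_INDEX_PATTERN`, which are hypothetical and not read by the sample as shipped. It falls back to the defaults listed above.

```python
import os
import re
from typing import Mapping, Optional

DEFAULT_TRACE_PATTERN = "otel-v1-apm-span-*"
DEFAULT_LOG_PATTERN = "otel-logs-*"

# Lowercase letters, digits, and a few separators, with an optional trailing wildcard.
_PATTERN_RE = re.compile(r"[a-z0-9][a-z0-9._-]*\*?")


def index_patterns(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the trace and log index patterns, validated against a simple rule."""
    env = os.environ if env is None else env
    patterns = {
        "traces": env.get("TRACE_INDEX_PATTERN", DEFAULT_TRACE_PATTERN),
        "logs": env.get("LOG_INDEX_PATTERN", DEFAULT_LOG_PATTERN),
    }
    for kind, pattern in patterns.items():
        if not _PATTERN_RE.fullmatch(pattern):
            raise ValueError(f"invalid {kind} index pattern: {pattern!r}")
    return patterns
```
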
Step 4: Deploy and Configure Permissions

pip install bedrock-agentcore-starter-toolkit
agentcore configure --entrypoint agent/main.py --non-interactive
agentcore deploy

Then add the AgentCore role to your OpenSearch data access policy (see Step 6 in Quick Start).

Step 5: Test

agentcore invoke '{"prompt": "Show me the health of my services"}'

Agent Tools

The agent exposes four tools for querying observability data:

Tool             Data Source          Description
---------------  -------------------  ----------------------------------------
get_red_metrics  OpenSearch (traces)  Rate, Error, Duration metrics by service
search_logs      OpenSearch (logs)    Search logs by service, severity
get_spans        OpenSearch (traces)  Search distributed trace spans
query_metrics    Prometheus           Query metrics using PromQL
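
To make the RED (Rate, Error, Duration) idea concrete, here is a minimal sketch that derives the three values per service from a list of span documents. It is not the implementation of `get_red_metrics`. The field names `serviceName`, `durationInNanos`, and `status.code` (with 2 meaning error) are assumed to follow the OpenTelemetry span layout and may differ in your indices.

```python
import math
from collections import defaultdict


def red_metrics(spans: list, window_s: float) -> dict:
    """Compute request rate, error rate, and p95 duration per service."""
    by_service = defaultdict(list)
    for span in spans:
        by_service[span["serviceName"]].append(span)

    result = {}
    for service, items in by_service.items():
        durations_ms = sorted(s["durationInNanos"] / 1e6 for s in items)
        errors = sum(1 for s in items if s.get("status.code") == 2)
        # Nearest-rank percentile: the smallest value covering 95% of samples.
        p95_index = max(0, math.ceil(0.95 * len(durations_ms)) - 1)
        result[service] = {
            "rate_per_s": len(items) / window_s,
            "error_rate": errors / len(items),
            "p95_ms": durations_ms[p95_index],
        }
    return result
```

In practice this aggregation would more likely run inside OpenSearch than on the client. The Python version is only meant to show what each number represents.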

Security

This sample follows AWS security best practices:

  • No hardcoded credentials - Uses IAM roles for authentication
  • TLS everywhere - All connections use HTTPS with certificate verification
  • Input validation - All tool inputs are validated and sanitized
  • Least privilege - IAM policies grant minimal required permissions
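
As an illustration of the input validation point, tool arguments can be checked against an allowlist before they reach a query. The rules below (the service-name pattern, the severity set, and the result cap) are assumptions for this sketch and are not taken from the sample's code.

```python
import re

ALLOWED_SEVERITIES = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}
_SERVICE_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,62}")


def validate_service(name: str) -> str:
    """Accept only short names made of letters, digits, dots, dashes, underscores."""
    if not isinstance(name, str) or not _SERVICE_RE.fullmatch(name):
        raise ValueError(f"invalid service name: {name!r}")
    return name


def validate_severity(value: str) -> str:
    """Normalize case and whitespace, then require a known severity level."""
    upper = value.strip().upper()
    if upper not in ALLOWED_SEVERITIES:
        raise ValueError(f"invalid severity: {value!r}")
    return upper


def validate_limit(value: int, maximum: int = 500) -> int:
    """Require a positive integer and cap it to bound the result size."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("limit must be an integer")
    if value < 1:
        raise ValueError("limit must be at least 1")
    return min(value, maximum)
```

Rejecting unexpected characters up front matters here because the arguments originate from model output, which should be treated as untrusted input.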

See CONTRIBUTING.md for security issue reporting.

Clean Up

To avoid ongoing charges, delete the resources when done:

# Delete AgentCore resources
agentcore destroy

# Delete OpenSearch Serverless (if created)
aws opensearchserverless delete-collection --id $COLLECTION_ID  # collection ID captured in Quick Start Step 6
aws opensearchserverless delete-security-policy --name observability-enc --type encryption
aws opensearchserverless delete-security-policy --name observability-net --type network
aws opensearchserverless delete-access-policy --name observability-access --type data

License

This library is licensed under the MIT-0 License. See the LICENSE file.
