AI-powered monitoring and observability stack for AWS using CloudWatch, Lambda-based anomaly detection, Grafana dashboards, and intelligent alerting
- Overview
- Architecture
- Features
- Prerequisites
- Quick Start
- Modules
- Lambda Functions
- Dashboards
- Cost Estimation
- Contributing
The AWS AIOps Monitoring Stack provides a comprehensive, production-ready solution for AI-powered IT operations on AWS. This Terraform-based stack combines CloudWatch metrics, intelligent log analysis, anomaly detection, and automated alerting to help you proactively identify and resolve infrastructure issues.
- Intelligent Log Analysis: AI-powered log pattern detection and error analysis
- Anomaly Detection: Statistical and ML-based anomaly scoring for metrics
- Cost Monitoring: Automated cost anomaly detection and budget alerts
- Multi-Channel Alerting: Slack, PagerDuty, and email notifications
- Comprehensive Dashboards: Pre-built CloudWatch and Grafana dashboards
- Production Ready: Fully tested Terraform modules with best practices
┌─────────────────┐
│  CloudWatch     │
│  Log Groups     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  CloudWatch     │
│  Metrics        │
└────────┬────────┘
         │
         ├──────────────────┐
         │                  │
         ▼                  ▼
┌─────────────────┐  ┌─────────────────┐
│  Log Analyzer   │  │ Anomaly Scorer  │
│  Lambda         │  │  Lambda         │
└────────┬────────┘  └────────┬────────┘
         │                    │
         │                    │
         └────────┬───────────┘
                  │
                  ▼
         ┌─────────────────┐
         │   SNS Topic     │
         └────────┬────────┘
                  │
         ┌────────┴────────┐
         │                 │
         ▼                 ▼
┌─────────────────┐  ┌─────────────────┐
│  Slack          │  │  PagerDuty      │
│  Integration    │  │  Integration    │
└─────────────────┘  └─────────────────┘
- Logs flow from AWS services (Lambda, ECS, etc.) into CloudWatch Log Groups
- Metrics are collected by CloudWatch from various AWS services
- Log Analyzer Lambda processes logs every 5 minutes, detecting patterns and errors
- Anomaly Scorer Lambda analyzes metrics using statistical methods (Z-score, percentiles, trends)
- Alerts are published to SNS when anomalies or errors are detected
- Notifications are sent to Slack, PagerDuty, or email based on configuration
- CloudWatch Dashboards: Pre-configured dashboards for infrastructure and cost monitoring
- Intelligent Alarms: Threshold-based and anomaly detection alarms
- Composite Alarms: Combine multiple alarms for complex alerting logic
- Cost Anomaly Detection: AWS Cost Anomaly Detection integration
- Lambda-Based Analysis: Python Lambda functions for log and metric analysis
- Multi-Channel Notifications: Slack, PagerDuty, and SNS email support
- Bedrock Integration: Optional AWS Bedrock for advanced AI-powered insights
- Grafana Dashboards: JSON configurations for Grafana visualization
- Statistical Anomaly Detection: Z-score, percentile analysis, and trend detection
- Pattern Recognition: Automatic detection of error patterns in logs
- Custom Metrics: Publishes anomaly scores and analysis results as CloudWatch metrics
- EventBridge Integration: Scheduled execution of analysis functions
- IAM Best Practices: Least-privilege IAM roles and policies
- Resource Tagging: Consistent tagging strategy across all resources
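The statistical methods named above can be illustrated with a short, standard-library-only sketch. This is not the stack's actual Lambda code; the function names `z_score` and `percentile_rank` and the sample values are assumptions made for illustration.

```python
import statistics


def z_score(history, latest):
    """Return how many standard deviations `latest` lies from the mean of `history`."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # A flat history has no spread, so no deviation can be measured.
        return 0.0
    return (latest - mean) / stdev


def percentile_rank(history, latest):
    """Return the fraction of historical points that are <= `latest`."""
    return sum(1 for value in history if value <= latest) / len(history)


# Made-up sample: eight recent datapoints for some metric.
history = [100, 102, 98, 101, 99, 100, 103, 97]

print(z_score(history, 110))          # far above the historical mean
print(percentile_rank(history, 110))  # higher than every historical point
```

A large absolute Z-score or a percentile rank near 0 or 1 both suggest the latest datapoint is unusual; which cutoff counts as an anomaly is a tuning decision.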
Before deploying this stack, ensure you have:
- AWS Account with appropriate permissions
- Terraform >= 1.0 installed
- AWS CLI configured with credentials
- Python 3.11 (for local Lambda testing, optional)
- GitHub CLI (gh) for repository creation (optional)
The AWS credentials used must have permissions for:
- CloudWatch (metrics, logs, alarms, dashboards)
- Lambda (create, update, invoke)
- SNS (create topics, subscriptions)
- IAM (create roles and policies)
- Cost Explorer (for cost anomaly detection)
- EventBridge (for scheduled rules)
- Slack: Webhook URL for Slack notifications
- PagerDuty: Integration key for PagerDuty alerts
- AWS Bedrock: Access to Bedrock models for advanced AI analysis
git clone https://github.com/hammadhaqqani/aws-aiops-monitoring-stack.git
cd aws-aiops-monitoring-stack
Copy the example variables file and customize:
cd examples/complete
cp terraform.tfvars.example terraform.tfvars
Edit terraform.tfvars with your values:
region = "us-east-1"
environment = "prod"
project_name = "my-aiops-stack"
slack_webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
pagerduty_integration_key = "your-pagerduty-key"
sns_email_addresses = ["admin@example.com"]
log_groups = [
"/aws/lambda/my-function-1",
"/aws/lambda/my-function-2"
]
terraform init
terraform plan
terraform apply
After deployment, you'll receive outputs including:
- SNS Topic ARN
- Lambda function ARNs
- CloudWatch Dashboard URLs
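To consume these outputs from a script, one option is to parse the JSON that `terraform output -json` prints, where each output is an object with `sensitive`, `type`, and `value` fields. The sketch below assumes that shape; the output names and ARN in the sample are placeholders, not the stack's confirmed output names.

```python
import json


def flatten_outputs(raw_json):
    """Map each Terraform output name to its plain value."""
    parsed = json.loads(raw_json)
    return {name: entry["value"] for name, entry in parsed.items()}


# Hypothetical sample in the shape printed by `terraform output -json`.
# The names and values below are placeholders for illustration only.
SAMPLE = json.dumps({
    "sns_topic_arn": {
        "sensitive": False,
        "type": "string",
        "value": "arn:aws:sns:us-east-1:123456789012:example-alerts",
    },
    "dashboard_urls": {
        "sensitive": False,
        "type": ["map", "string"],
        "value": {"main": "https://example.invalid/main"},
    },
})

outputs = flatten_outputs(SAMPLE)
print(outputs["sns_topic_arn"])
```

In practice the input would come from running `terraform output -json` in examples/complete, for example through `subprocess.run(["terraform", "output", "-json"], capture_output=True, text=True, check=True).stdout`.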
Access the dashboards:
- Main Dashboard: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=aiops-monitoring-main-prod
- Cost Dashboard: https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=aiops-monitoring-cost-prod
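For a different region or environment, the console links can be derived from the same pattern. The helper names below are hypothetical; the URL layout and the `aiops-monitoring-{kind}-{env}` naming are taken from the links shown here.

```python
def dashboard_name(kind, environment):
    """Build a dashboard name following the aiops-monitoring-{kind}-{env} pattern."""
    return f"aiops-monitoring-{kind}-{environment}"


def dashboard_url(region, name):
    """Build the CloudWatch console URL for a named dashboard."""
    return (
        "https://console.aws.amazon.com/cloudwatch/home"
        f"?region={region}#dashboards:name={name}"
    )


print(dashboard_url("us-east-1", dashboard_name("main", "prod")))
print(dashboard_url("eu-west-1", dashboard_name("cost", "staging")))
```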
Creates pre-configured CloudWatch dashboards for infrastructure and cost monitoring.
Usage:
module "cloudwatch_dashboards" {
source = "./modules/cloudwatch-dashboards"
project_name = "my-project"
environment = "prod"
log_groups = ["/aws/lambda/function1"]
}
Outputs:
- dashboard_urls: Map of dashboard names to URLs
Creates threshold-based alarms and composite alarms for infrastructure monitoring.
Usage:
module "cloudwatch_alarms" {
source = "./modules/cloudwatch-alarms"
project_name = "my-project"
environment = "prod"
sns_topic_arn = aws_sns_topic.alerts.arn
log_groups = ["/aws/lambda/function1"]
}
Features:
- Lambda error rate alarms
- Lambda duration alarms
- Lambda throttle alarms
- Log error pattern detection
- Composite alarms combining multiple conditions
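A composite alarm is driven by a rule expression over other alarms, such as `ALARM("a") AND ALARM("b")`. The sketch below builds such an expression in Python; the helper name and the alarm names are hypothetical, and how the module itself assembles its rules is not shown in this README.

```python
def composite_rule(alarm_names, operator="AND"):
    """Join child alarm names into a CloudWatch composite alarm rule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


# Hypothetical alarm names: fire only when errors and throttles occur together.
rule = composite_rule(["my-project-prod-lambda-errors", "my-project-prod-lambda-throttles"])
print(rule)
```

Requiring several child alarms with AND tends to reduce noisy pages, while OR is useful for a single roll-up alarm over many conditions.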
Enables CloudWatch anomaly detection for key metrics using ML-based algorithms.
Usage:
module "anomaly_detection" {
source = "./modules/anomaly-detection"
project_name = "my-project"
environment = "prod"
sns_topic_arn = aws_sns_topic.alerts.arn
}
Features:
- Lambda duration anomaly detection
- Lambda error anomaly detection
- ALB response time anomaly detection
- Automatic baseline learning
Integrates with AWS Cost Anomaly Detection for automated cost monitoring.
Usage:
module "cost_anomaly" {
source = "./modules/cost-anomaly"
project_name = "my-project"
environment = "prod"
sns_topic_arn = aws_sns_topic.alerts.arn
account_id = "123456789012"
threshold = 50 # USD
}
Features:
- Dimensional cost anomaly monitoring
- Daily anomaly reports
- Threshold-based alerts
- Billing alarm integration
Configures Slack and PagerDuty integrations for alerting.
Usage:
module "notifications" {
source = "./modules/notifications"
project_name = "my-project"
environment = "prod"
sns_topic_arn = aws_sns_topic.alerts.arn
slack_webhook_url = var.slack_webhook_url
pagerduty_integration_key = var.pagerduty_integration_key
}
Features:
- Slack webhook integration
- PagerDuty event API integration
- Rich message formatting
- Automatic incident creation/resolution
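As a rough illustration of incident creation and resolution, the sketch below maps a CloudWatch alarm state to a Slack incoming-webhook body and a PagerDuty Events API v2 event. The function names, the message format, and the choice of the alarm name as `dedup_key` are assumptions, not the module's confirmed behavior.

```python
def slack_payload(alarm_name, state, reason):
    """Build a minimal Slack incoming-webhook body (a JSON object with `text`)."""
    return {"text": f"[{state}] {alarm_name}: {reason}"}


def pagerduty_event(routing_key, alarm_name, state, reason, source="aws-cloudwatch"):
    """Build a PagerDuty Events API v2 event, or None when no action applies."""
    if state == "ALARM":
        return {
            "routing_key": routing_key,
            "event_action": "trigger",
            # Reusing the alarm name as dedup_key lets a later resolve match this incident.
            "dedup_key": alarm_name,
            "payload": {"summary": reason, "source": source, "severity": "critical"},
        }
    if state == "OK":
        return {
            "routing_key": routing_key,
            "event_action": "resolve",
            "dedup_key": alarm_name,
        }
    # INSUFFICIENT_DATA and unknown states neither open nor close an incident here.
    return None


print(slack_payload("lambda-errors", "ALARM", "5 errors in 5 minutes"))
print(pagerduty_event("your-pagerduty-key", "lambda-errors", "OK", "recovered"))
```

The resulting dictionaries would be sent as JSON in an HTTP POST to the Slack webhook URL and to the PagerDuty events endpoint respectively; the network call is left out so the sketch stays self-contained.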
Analyzes CloudWatch Logs for patterns, errors, and anomalies.
Capabilities:
- Error pattern detection (ERROR, FATAL, EXCEPTION, etc.)
- Statistical analysis (error rates, message patterns)
- Anomaly scoring (0-100 scale)
- Optional AWS Bedrock integration for AI-powered insights
- Custom CloudWatch metrics publication
Trigger: EventBridge rule (every 5 minutes)
Input:
{
"log_groups": ["/aws/lambda/function1"],
"hours": 1
}
Output:
- Anomaly scores published to CloudWatch
- SNS alerts for high-severity issues
- AI insights (if Bedrock enabled)
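The pattern detection and 0-100 scoring described above might look roughly like the following. This is a simplified sketch rather than the function's real logic: it only matches uppercase level tokens, and it assumes the score is simply the error rate scaled to 0-100.

```python
import re

# Matches uppercase level tokens only, so "no errors found" is not counted.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|EXCEPTION)\b")


def analyze_messages(messages):
    """Count error-level messages and derive a 0-100 anomaly score from the error rate."""
    counts = {}
    for message in messages:
        match = ERROR_PATTERN.search(message)
        if match:
            level = match.group(1)
            counts[level] = counts.get(level, 0) + 1
    total = len(messages)
    error_rate = sum(counts.values()) / total if total else 0.0
    # Assumed scoring rule: the error rate expressed on a 0-100 scale.
    score = round(min(100.0, error_rate * 100), 1)
    return {"total": total, "counts": counts, "error_rate": error_rate, "score": score}


# Made-up log lines for illustration.
sample = [
    "INFO request handled",
    "ERROR timeout calling upstream",
    "INFO no errors found",
    "FATAL out of memory",
]
result = analyze_messages(sample)
print(result)
```

In a Lambda handler, the `log_groups` and `hours` fields of the input event would determine which log events are fetched before being passed to a function like this.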
Calculates anomaly scores for CloudWatch metrics using statistical methods.
Capabilities:
- Z-score calculation
- Percentile analysis
- Trend detection (increasing/decreasing/stable)
- Multi-metric analysis
- Baseline learning from historical data
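Trend detection can be approximated by fitting a least-squares line through recent datapoints and looking at its slope. The sketch below is one such approach under stated assumptions: the 5% relative tolerance and the function names are illustrative, not values taken from the Lambda source.

```python
def least_squares_slope(values):
    """Return the slope of the best-fit line through evenly spaced `values`."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def classify_trend(values, tolerance=0.05):
    """Label a series as increasing, decreasing, or stable."""
    if len(values) < 2:
        return "stable"
    # Normalize by the mean magnitude so the tolerance is relative, not absolute.
    mean_magnitude = abs(sum(values) / len(values)) or 1.0
    relative_slope = least_squares_slope(values) / mean_magnitude
    if relative_slope > tolerance:
        return "increasing"
    if relative_slope < -tolerance:
        return "decreasing"
    return "stable"


print(classify_trend([10, 12, 14, 16, 18]))
print(classify_trend([10, 10.1, 9.9, 10, 10.05]))
```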
Trigger: EventBridge rule or manual invocation
Input:
{
"metrics": [
{
"namespace": "AWS/Lambda",
"metric_name": "Duration",
"statistic": "Average"
}
]
}
Output:
- Anomaly scores (0-1 scale)
- Severity classification (LOW/MEDIUM/HIGH/CRITICAL)
- SNS alerts for anomalies above threshold
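To make the 0-1 score and severity labels concrete, here is a hedged sketch of how a Z-score could be mapped to a score and then to a severity band. The cap of 4 standard deviations, the band boundaries, and the 0.7 alert threshold are all assumptions chosen for illustration.

```python
# Assumed band floors, checked from most to least severe.
SEVERITY_BANDS = [(0.9, "CRITICAL"), (0.7, "HIGH"), (0.4, "MEDIUM")]


def score_from_z(z, z_cap=4.0):
    """Map an absolute Z-score to a 0-1 anomaly score, saturating at `z_cap`."""
    return min(abs(z) / z_cap, 1.0)


def classify(score):
    """Return the severity label for a 0-1 anomaly score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    for floor, label in SEVERITY_BANDS:
        if score >= floor:
            return label
    return "LOW"


def should_alert(score, threshold=0.7):
    """Decide whether a score is high enough to publish an SNS alert."""
    return score >= threshold


score = score_from_z(3.8)
print(score, classify(score), should_alert(score))
```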
Pre-built dashboards are automatically created:
- Main Dashboard (aiops-monitoring-main-{env})
  - Lambda metrics overview
  - Error logs
  - ALB metrics
  - ECS container metrics
- Cost Dashboard (aiops-monitoring-cost-{env})
  - Daily AWS charges
  - Cost trends
  - Lambda cost drivers
JSON configurations are provided in dashboards/grafana/:
- Infrastructure Overview (infrastructure-overview.json)
  - Lambda invocations and errors
  - Anomaly scores
  - ALB response times
  - Error rates and active alarms
- Cost Analysis (cost-analysis.json)
  - Daily charges
  - Cost by service
  - Cost anomaly detection
  - Monthly cost forecast
Import Instructions:
- Open Grafana
- Go to Dashboards → Import
- Upload the JSON file
- Configure data source (CloudWatch or Prometheus)
| Service | Usage | Monthly cost |
|---|---|---|
| CloudWatch Metrics | ~100 metrics | $0.30 |
| CloudWatch Logs | 5 GB ingestion | $2.50 |
| CloudWatch Alarms | 20 alarms | $6.00 |
| Lambda Invocations | 8,640/month (5-min schedule) | $0.17 |
| Lambda Compute | 512 MB, 5-min runs | $2.00 |
| SNS | 1,000 notifications | $0.50 |
| Cost Anomaly Detection | Included | $0.00 |
| Total | | ~$11.50/month |
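The arithmetic behind the table can be checked with a few lines of Python. The line items are copied from the table; the helper `monthly_invocations` is a hypothetical name, and a 30-day month is assumed.

```python
from decimal import Decimal

# Monthly line items in USD, copied from the estimate table.
LINE_ITEMS = {
    "CloudWatch Metrics": Decimal("0.30"),
    "CloudWatch Logs": Decimal("2.50"),
    "CloudWatch Alarms": Decimal("6.00"),
    "Lambda Invocations": Decimal("0.17"),
    "Lambda Compute": Decimal("2.00"),
    "SNS": Decimal("0.50"),
    "Cost Anomaly Detection": Decimal("0.00"),
}


def monthly_invocations(interval_minutes, days=30):
    """Number of scheduled runs per month for one function at a fixed interval."""
    runs_per_day = (24 * 60) // interval_minutes
    return runs_per_day * days


total = sum(LINE_ITEMS.values(), Decimal("0"))
print(total)                   # exact sum of the line items
print(monthly_invocations(5))  # runs per month on a 5-minute schedule
```

The line items add up to $11.47, which the table rounds to roughly $11.50, and a 5-minute schedule over 30 days gives the 8,640 invocations listed.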
- Reduce Log Retention: Adjust log retention periods based on needs
- Optimize Lambda Memory: Tune memory size based on actual usage
- Filter Logs: Use metric filters to reduce log ingestion
- Consolidate Alarms: Use composite alarms to reduce alarm count
- Schedule Analysis: Adjust Lambda schedule frequency based on requirements
AWS Free Tier allowances that may offset part of this estimate:
- CloudWatch: 10 custom metrics, 5 GB log ingestion
- Lambda: 1M requests, 400,000 GB-seconds
- SNS: 1M requests
Contributions are welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes with tests if applicable
- Follow Terraform best practices:
  - Use terraform fmt before committing
  - Validate with terraform validate
  - Document new variables and outputs
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
# Install pre-commit hooks (optional)
pre-commit install
# Format Terraform code
terraform fmt -recursive
# Validate Terraform
terraform validate
# Run security scan
tfsec .
- Use meaningful variable and resource names
- Add descriptions to all variables and outputs
- Follow the Terraform style guide
- Include examples in module documentation
- Update README for new features
This project is licensed under the MIT License - see the LICENSE file for details.
- AWS CloudWatch team for comprehensive monitoring capabilities
- HashiCorp for Terraform
- The open-source community for inspiration and feedback
For issues, questions, or contributions:
- Open an issue on GitHub
- Check existing documentation
- Review example configurations
If you find this useful, consider buying me a coffee!