| name | aws-log-analyzer |
|---|---|
| description | Troubleshoot and analyze logs from AWS CloudWatch Logs. This skill should be used when the user asks to investigate logs, troubleshoot application issues, query log groups, analyze error patterns, or perform log analysis for machines writing to AWS CloudWatch. Uses the AWS CLI for CloudWatch Logs operations. |
| allowed-tools | Bash(aws logs describe-log-groups:*),Bash(aws logs describe-log-streams:*),Bash(aws logs filter-log-events:*),Bash(aws logs start-query:*),Bash(aws logs get-query-results:*),Bash(aws logs tail:*),Bash(*/aws-log-analyzer/scripts/*.sh:*),Bash(*/tools/*/install.sh:*) |
Troubleshoot and analyze logs from AWS CloudWatch Logs - AWS's centralized logging service for applications and infrastructure.
Prerequisites:
- aws CLI is installed and configured
- User is already authenticated (via IAM credentials, SSO, or instance profile)
- Appropriate IAM permissions for CloudWatch Logs read operations
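The prerequisites above can be checked up front before running any analysis. The following is a minimal sketch, not part of the skill's own scripts: `have_tool` and `check_prereqs` are hypothetical helper names, and `aws sts get-caller-identity` is used here only as a generic way to confirm that some valid credentials are configured. It does not verify CloudWatch Logs read permissions.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the named command is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: report missing prerequisites on stderr.
check_prereqs() {
  have_tool aws || { echo "aws CLI not found" >&2; return 1; }
  have_tool jq  || echo "jq not found (optional, recommended)" >&2
  # sts get-caller-identity fails when no valid credentials are configured.
  aws sts get-caller-identity >/dev/null 2>&1 \
    || { echo "AWS credentials are missing or expired" >&2; return 1; }
}

# Usage (not executed here):
#   check_prereqs && ./scripts/list_log_groups.sh
```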
Installation: Use the centralized tool installation scripts:
# Check and install AWS CLI (required)
../../../tools/aws-cli/install.sh
# Check and install jq (optional, recommended)
../../../tools/jq/install.sh
- Log Group: Container for log streams (typically one per application/service)
- Log Stream: Sequence of log events from a single source (e.g., instance, container)
- Log Event: Individual log entry with timestamp and message
- CloudWatch Logs Insights: SQL-like query language for advanced log analysis
Always follow this pattern:
- Start broad → identify the problem scope
- Narrow down → focus on specific errors or patterns
- Filter noise → exclude known non-critical errors
- Analyze distribution → understand when errors occur
Use CloudWatch Logs Insights for all error analysis - it supports case-insensitive regex, which is essential because logs may contain "error", "Error", or "ERROR" in different formats.
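To illustrate what a case-insensitive Insights query looks like when run directly through the AWS CLI, here is a rough sketch. It assumes nothing about how the bundled scripts are implemented: `build_error_query` and `run_error_query` are hypothetical names, and the keyword list is borrowed from the error-rate query used later in this document. `aws logs start-query` and `aws logs get-query-results` are the standard CLI commands; start and end times are epoch seconds.

```shell
#!/bin/sh
# Hypothetical helper: build a case-insensitive error query.
# Optional argument: maximum number of rows (default 100).
build_error_query() {
  limit="${1:-100}"
  printf '%s' "fields @timestamp, @message | filter @message like /(?i)(error|fail|exception|critical)/ | sort @timestamp desc | limit ${limit}"
}

# Hypothetical helper: start the query and poll until it finishes.
# Arguments: log group name, start epoch seconds, end epoch seconds.
run_error_query() {
  query_id=$(aws logs start-query \
    --log-group-name "$1" \
    --start-time "$2" \
    --end-time "$3" \
    --query-string "$(build_error_query)" \
    --query queryId --output text) || return 1
  while :; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query status --output text) || return 1
    case "$status" in
      Scheduled|Running) sleep 1 ;;  # still in progress
      *) break ;;                    # Complete, Failed, Cancelled, ...
    esac
  done
  aws logs get-query-results --query-id "$query_id"
}

# Usage (not executed here):
#   now=$(date +%s)
#   run_error_query /aws/app/myapp "$((now - 3600))" "$now"
```

The `(?i)` prefix is what makes the match cover "error", "Error", and "ERROR" alike.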
All scripts output JSON by default to make results easy to parse programmatically by AI assistants and automation tools.
For most use cases, direct JSON output is more efficient than state management:
- ✅ Single round-trip - get all data in one call
- ✅ No state lookup complexity - data is immediately available
- ✅ Reliable - no session ID or file path issues
- ✅ For typical analyses (even 10K+ errors), JSON output is manageable (~30KB)
Only use --save-state (via state management scripts) if:
- You're analyzing 100K+ log entries
- The JSON output exceeds 100KB
- You need to reference the same data across multiple analysis steps over time
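The size rule above can be expressed as a small check. This is only a sketch: `output_strategy` is a hypothetical helper, the 100KB threshold is taken from the list above, and the character count of the captured output is used as an approximation of its size in bytes.

```shell
#!/bin/sh
# Hypothetical helper: choose how to handle analysis output by size.
# Argument: output size in bytes (or characters, as an approximation).
output_strategy() {
  bytes="$1"
  if [ "$bytes" -gt 102400 ]; then   # more than 100KB
    echo "save-to-file"
  else
    echo "direct-json"
  fi
}

# Usage (not executed here):
#   OUTPUT=$(./scripts/analyze_errors.sh /aws/app/myapp 24)
#   output_strategy "${#OUTPUT}"
```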
Recommended approach - get all data in one call:
# Run analysis and capture full JSON output
OUTPUT=$(./scripts/analyze_errors.sh <log-group> 24)
# Parse specific fields as needed
echo "$OUTPUT" | jq '.total_errors'
echo "$OUTPUT" | jq '.by_severity'
echo "$OUTPUT" | jq '.top_errors[:5]'
# Default: JSON output to stdout, progress to stderr
./scripts/analyze_errors.sh /aws/app/myapp 24
# Output:
{
"log_group": "/aws/app/myapp",
"hours_analyzed": 24,
"total_errors": "1247",
"by_severity": {
"critical": 15,
"error": 1200,
"warning": 25,
"failed": 7
},
"top_errors": [
{
"message": "Connection timeout to database",
"count": 342,
"percentage": 27.43,
"pattern": "Connection timeout to database"
},
...
],
"critical_errors": [...],
"top_errors_by_pattern": [
{
"pattern": "Error at <TIMESTAMP>",
"total_count": 450,
"occurrences": 12,
"examples": [
{"message": "Error at 2026-02-06 15:30:45", "count": 120},
{"message": "Error at 2026-02-06 16:45:12", "count": 95}
]
},
...
],
"hourly_distribution": [...],
"comparison": null // or populated if --compare-previous is used
}
With additional flags:
# Exclude noise patterns and compare with previous period
./scripts/analyze_errors.sh /aws/app/myapp 24 --exclude-noise --compare-previous
# Output includes comparison data:
{
...
"comparison": {
"current_period": {"total_errors": 1247, "hours": 24},
"previous_period": {"total_errors": 1050, "hours": 24},
"change": "+18.76%",
"trend": "increasing"
}
}
Benefits:
- Structured data - Easy to parse and extract specific fields
- Clean separation - Progress messages go to stderr, results to stdout
- Consistent format - All scripts use the same JSON structure pattern
- AI-friendly - Models can easily process and reason about JSON
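For reference, the `change` value in the comparison output shown earlier is consistent with a plain percent change between the two period totals. The sketch below reproduces that arithmetic. `pct_change` and `trend` are hypothetical helpers, not the script's actual code, and the labels "decreasing" and "stable" are assumptions, since the example output only shows "increasing".

```shell
#!/bin/sh
# Hypothetical helper: percent change from the previous to the current count.
# Arguments: current count, previous count.
pct_change() {
  awk -v cur="$1" -v prev="$2" 'BEGIN {
    if (prev == 0) { print "n/a"; exit }   # avoid division by zero
    printf "%+.2f%%\n", (cur - prev) / prev * 100
  }'
}

# Hypothetical helper: label the direction of change.
trend() {
  if [ "$1" -gt "$2" ]; then echo "increasing"
  elif [ "$1" -lt "$2" ]; then echo "decreasing"
  else echo "stable"
  fi
}

pct_change 1247 1050   # prints +18.76%
trend 1247 1050        # prints increasing
```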
Add --human flag for human-readable table/text format:
./scripts/analyze_errors.sh /aws/app/myapp 24 --human
# Output:
=== Error Analysis Results ===
Log Group: /aws/app/myapp
Total Errors: 1247
Top Errors by Frequency:
342x: Connection timeout to database
125x: Authentication failed...
...
In the model context:
# Extract specific field
./scripts/analyze_errors.sh /aws/app/myapp 24 | jq '.total_errors'
# Count distinct error types
./scripts/analyze_errors.sh /aws/app/myapp 24 | jq '.top_errors | length'
# Get top 3 errors
./scripts/analyze_errors.sh /aws/app/myapp 24 | jq '.top_errors[:3]'
The state management system (~/.aws-log-analyzer/state/) is available for very large datasets or multi-step workflows, but is NOT recommended for typical error analysis.
Shared Library: This skill uses claudio-plugin/tools/memory/scripts/state.sh - a shared state management library used across multiple skills.
Why direct JSON output is better:
- ✅ Simpler - no session IDs or file paths to manage
- ✅ Faster - single round-trip instead of save → view → parse
- ✅ More reliable - no file system dependencies
- ✅ Efficient even for 10K+ errors (~30KB JSON)
Note: analyze_errors.sh no longer supports the --save-state flag. Use direct JSON output instead (see examples above).
If you need state management for very large datasets (100K+ entries), you can manually save/load data:
# Capture output and save manually if needed
OUTPUT=$(./scripts/analyze_errors.sh <log-group> 24)
echo "$OUTPUT" > /tmp/analysis_result.json
# Later, load and parse
jq '.top_errors[:10]' /tmp/analysis_result.json
View saved state:
# List all saved states
./scripts/view_state.sh
# View specific state by ID
./scripts/view_state.sh analyze_errors_1707224567
# View latest state for an operation
./scripts/view_state.sh analyze_errors
Use --save-state when:
- Working with the model and want to minimize token usage
- Results are large (thousands of log entries, many error types)
- Building multi-step workflows where later steps reference earlier results
Don't use --save-state when:
- Running scripts manually and want immediate full output
- Results are small and fit easily in context
- Doing one-off investigations
Once data is saved, you can extract specific information using jq:
# Get the total error count
./scripts/view_state.sh analyze_errors | jq '.total_errors'
# Get top 5 errors with their counts
./scripts/view_state.sh analyze_errors | jq '.top_errors[0:5][] | {message: .message, count: .count}'
# Get only critical errors (count > 10)
./scripts/view_state.sh analyze_errors | jq '.critical_errors[] | select(.count > 10)'
# Get severity breakdown
./scripts/view_state.sh analyze_errors | jq '.by_severity'
# Get errors from a specific time bucket
./scripts/view_state.sh analyze_errors | jq '.hourly_distribution[] | select(.time_bucket | contains("2026-02-06T15"))'
# Extract error patterns (grouped by similarity)
./scripts/view_state.sh analyze_errors | jq '.top_errors_by_pattern[0:5]'
# Get all errors matching a specific pattern
./scripts/view_state.sh analyze_errors | jq '.top_errors[] | select(.pattern | contains("Connection"))'
# Get percentage of errors that are critical
./scripts/view_state.sh analyze_errors | jq '(.by_severity.critical / (.total_errors | tonumber) * 100)'
# Compare current vs previous period (if --compare-previous was used)
./scripts/view_state.sh analyze_errors | jq '.comparison'
# Get examples of a specific error pattern
./scripts/view_state.sh analyze_errors | jq '.top_errors_by_pattern[0].examples'
Example workflow using saved state:
# Step 1: Analyze errors with state saving
./scripts/analyze_errors.sh /aws/app/myapp 24 --save-state --exclude-noise
# Output:
# {
# "operation": "analyze_errors",
# "state_saved": true,
# "state_id": "analyze_errors_1738858234",
# "summary": {
# "log_group": "/aws/app/myapp",
# "total_errors": 1247,
# "top_error_patterns": [...]
# }
# }
# Step 2: Query specific details without re-running analysis
./scripts/view_state.sh analyze_errors | jq '.by_severity'
# Output: {"critical": 15, "error": 1200, "warning": 25, "failed": 7}
# Step 3: Get top 3 error patterns
./scripts/view_state.sh analyze_errors | jq '.top_errors_by_pattern[0:3]'
# Step 4: Find all errors with high frequency (> 50 occurrences)
./scripts/view_state.sh analyze_errors | jq '.top_errors[] | select(.count > 50)'
All operations are performed through the following scripts:
- view_state.sh - View saved script outputs (for advanced workflows only)
  - Note: analyze_errors.sh no longer supports the --save-state flag
  - State management is available through manual save/load if needed for very large datasets
- list_log_groups.sh - List available log groups
- list_log_streams.sh - List log streams within a group
- analyze_errors.sh - Complete error analysis (recommended for most cases)
  - Flags: --human, --exclude-noise, --compare-previous
  - Features: Severity classification, pattern grouping, trend analysis
  - Output: Full JSON by default (efficient for typical datasets)
- find_recent_errors.sh - Quick search for recent errors
- run_insights_query.sh - Execute custom CloudWatch Logs Insights queries
- trace_request.sh - Trace a request ID across multiple log groups
- tail_logs.sh - Monitor logs in real-time
Pre-built CloudWatch Logs Insights queries are available in scripts/insights_queries.json:
Error Analysis:
- error_analysis.total_count - Count total errors
- error_analysis.by_message - Group errors by message
- error_analysis.unique_errors - Find unique errors (excludes noise)
- error_analysis.hourly_distribution - Hourly error distribution
- error_analysis.recent_errors - Last 100 errors
Performance Analysis:
- performance_analysis.slow_requests - Requests slower than 1s
- performance_analysis.latency_percentiles - P50, P90, P99 latencies
- performance_analysis.requests_per_minute - Request rate
Request Tracing:
- request_tracing.by_request_id - Trace by request ID
- request_tracing.by_user - Trace by user ID
Application Monitoring:
- application_monitoring.status_codes - HTTP status code distribution
- application_monitoring.error_rate - Error rate percentage
- application_monitoring.top_endpoints - Most accessed endpoints
User Request: "Analyze errors for <log-group-name> in the last 24 hours"
Execution Sequence:
# Step 1: Run complete error analysis
./scripts/analyze_errors.sh <log-group-name> 24
Output Provides:
- Total error count
- Top error messages by frequency
- Critical/unique errors (excludes noise)
- Hourly error distribution
Recommended approach - capture output and parse as needed:
# Step 1: Run analysis and capture full JSON output
OUTPUT=$(./scripts/analyze_errors.sh <log-group-name> 24)
# Step 2: Extract specific fields
echo "$OUTPUT" | jq '.total_errors'
# Output: "1247"
echo "$OUTPUT" | jq '.by_severity'
# Output: {"critical": 15, "error": 1200, "warning": 25, "failed": 7}
echo "$OUTPUT" | jq '.top_errors[:3]'
# Output: Array of top 3 errors with counts and percentages
# Step 3: Get more details on specific errors if needed
./scripts/find_recent_errors.sh <log-group-name> 1 50
Why this is efficient:
- Single execution of analyze_errors.sh gets all data
- No round-trips to view state
- No session ID management
- For typical datasets (even 10K errors), JSON is ~30KB - completely manageable
- Parse different fields from the same output as needed
User Request: "Check for errors in my application"
Execution Sequence:
# Step 1: Find the log group
./scripts/list_log_groups.sh /aws/application
# Step 2: Analyze errors in the identified log group
./scripts/analyze_errors.sh <log-group-name> 24
User Request: "Trace request ID abc-123 through all services"
Execution Sequence:
# Single command to search all log groups with a common prefix
./scripts/trace_request.sh abc-123 /aws/myapp 24
Output: Shows all log entries containing the request ID, sorted by timestamp, across all log groups.
User Request: "Watch for OutOfMemory errors in real-time"
Execution Sequence:
# Tail logs with filter pattern
./scripts/tail_logs.sh <log-group-name> "OutOfMemoryError" 1h
Time formats: 1h, 30m, 2d, 5s
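As a small illustration of these relative time formats, the sketch below converts a value such as `30m` into seconds. `to_seconds` is a hypothetical helper written for this example. How tail_logs.sh parses its argument is not shown in this document, though the formats match what `aws logs tail --since` accepts in AWS CLI v2.

```shell
#!/bin/sh
# Hypothetical helper: convert a relative time such as 1h, 30m, 2d, or 5s
# into a number of seconds.
to_seconds() {
  value="${1%?}"          # everything except the last character
  unit="${1#"$value"}"    # the last character
  case "$value" in
    ''|*[!0-9]*) echo "invalid time: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    s) echo "$value" ;;
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    d) echo $((value * 86400)) ;;
    *) echo "invalid unit in: $1" >&2; return 1 ;;
  esac
}

to_seconds 30m   # prints 1800
to_seconds 2d    # prints 172800
```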
User Request: "Find all authentication failures in the last 6 hours"
Execution Sequence:
# Step 1: Run custom Insights query
./scripts/run_insights_query.sh <log-group-name> 6 \
'fields @timestamp, @message | filter @message like /(?i)(auth|authentication)/ and @message like /(?i)(fail|denied)/ | sort @timestamp desc | limit 100'
Alternative using template query:
# Step 1: Load query from template (if you have one defined)
QUERY=$(jq -r '.custom_queries.auth_failures' scripts/insights_queries.json)
# Step 2: Run the query
./scripts/run_insights_query.sh <log-group-name> 6 "$QUERY"
User Request: "Find slow database queries in the last 24 hours"
Execution Sequence:
# Step 1: Use performance template query
QUERY=$(jq -r '.performance_analysis.slow_requests' scripts/insights_queries.json)
# Step 2: Run the query
./scripts/run_insights_query.sh <log-group-name> 24 "$QUERY"
For RDS slow query logs:
# Custom query for RDS slow query format
./scripts/run_insights_query.sh /aws/rds/instance/mydb/slowquery 24 \
'fields @timestamp, query_time, lock_time, rows_examined, @message | parse @message /Query_time: (?<qt>[0-9.]+)\s+Lock_time: (?<lt>[0-9.]+).*\n(?<query>.*)/ | filter qt > 1.0 | sort qt desc | limit 20'
User Request: "What's happening in my application right now?"
Execution Sequence:
# Step 1: List recent log streams to see activity
./scripts/list_log_streams.sh <log-group-name> 10
# Step 2: Tail recent logs
./scripts/tail_logs.sh <log-group-name> "" 10m
# Step 3: If errors are seen, analyze them
./scripts/analyze_errors.sh <log-group-name> 1
User Request: "Has the error rate increased in the last hour?"
Execution Sequence:
# Step 1: Get errors from last hour
./scripts/analyze_errors.sh <log-group-name> 1
# Step 2: Get errors from previous hour for comparison
./scripts/run_insights_query.sh <log-group-name> 2 \
'fields @timestamp | filter @message like /(?i)(error|fail|exception|critical)/ | stats count() as error_count by bin(1h)'
When combining this skill with other skills (especially gitlab-job-analyzer):
See the complete optimization guide in the main CLAUDE.md documentation under "Performance Optimization for Cross-Skill Analysis".
Key optimizations:
- Parallel execution - Run GitLab + AWS analysis simultaneously in one message
- Parse JSON once - Capture output, parse multiple times with jq (don't re-run scripts)
- Smart targeting - Analyze only log groups for failing runners/components identified in GitLab analysis
- Direct JSON output - For typical analyses (10K+ errors), direct JSON is more efficient than state management
Example - Optimized cross-skill analysis:
# SINGLE MESSAGE - Parallel execution:
# Tool 1: AWS log analysis for component 1
./aws-log-analyzer/scripts/analyze_errors.sh /aws/app/component1 24
# Tool 2: AWS log analysis for component 2 (runs in parallel)
./aws-log-analyzer/scripts/analyze_errors.sh /aws/app/component2 24
# Tool 3: GitLab analysis (runs in parallel)
./gitlab-job-analyzer/scripts/analyze_recent_jobs.sh owner/repo --hours 24
# Then parse the captured AWS output multiple ways without re-running
# (AWS_OUTPUT is assumed to hold the JSON returned by one of the analyze_errors.sh calls above):
echo "$AWS_OUTPUT" | jq '.total_errors'
echo "$AWS_OUTPUT" | jq '.by_severity'
echo "$AWS_OUTPUT" | jq '.top_errors[:5]'
echo "$AWS_OUTPUT" | jq '.hourly_distribution'
Expected performance:
- Optimized: 3-4 minutes, $0.75-0.85
- Non-optimized: 10+ minutes, $1.06+
Logs may contain "error", "Error", or "ERROR" - always use case-insensitive regex in CloudWatch Logs Insights queries:
- ✅ Use: /(?i)error/ in Insights queries
- ❌ Avoid: "ERROR" filter patterns (case-sensitive)
For most error investigations, use analyze_errors.sh first:
- Provides complete overview in one command
- Uses case-insensitive matching
- Excludes known noise patterns
- Shows time distribution
Leverage scripts/insights_queries.json for common analysis patterns:
- Pre-tested queries for common scenarios
- Easy to customize
- Consistent results
Noise patterns are defined in scripts/noise-patterns.txt:
- GitLab Runner: file already closed
- AWS SDK throttling: SlowDown, ThrottlingException, TooManyRequestsException
- Rate limiting: RequestLimitExceeded, Throttled, RequestThrottled
- Provisioning: ProvisionedThroughputExceededException
Usage:
# Enable noise filtering (uses patterns from noise-patterns.txt)
./scripts/analyze_errors.sh <log-group> 24 --exclude-noise
# View all noise patterns
cat scripts/noise-patterns.txt
# Add custom patterns (edit the file)
echo "MyCustomNoisePattern" >> scripts/noise-patterns.txt
Note: Patterns are applied as case-insensitive regex patterns in CloudWatch Logs Insights queries.
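One plausible way a pattern file could be turned into a case-insensitive Insights exclusion is sketched below. This is an assumption about the mechanism, not a copy of what analyze_errors.sh does: `build_noise_filter` is a hypothetical helper, and skipping blank lines and `#` comment lines is a convention chosen for this example. Patterns containing `/` would need escaping, which this sketch does not attempt.

```shell
#!/bin/sh
# Hypothetical helper: build an Insights filter clause that excludes every
# pattern listed in a file (one pattern per line).
# Blank lines and lines starting with '#' are skipped (assumed convention).
build_noise_filter() {
  joined=$(grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | paste -sd '|' -)
  [ -n "$joined" ] || return 0   # no patterns: emit nothing
  printf 'filter @message not like /(?i)(%s)/\n' "$joined"
}

# Usage (not executed here):
#   build_noise_filter scripts/noise-patterns.txt
```

The resulting clause can be appended to a query with `|`, for example after the main error filter.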
Time range guidelines:
- Initial investigation: 1-6 hours
- Trend analysis: 24 hours
- Historical analysis: 7 days maximum
Narrower time ranges:
- Reduce CloudWatch Logs Insights costs
- Improve query performance
- Faster results
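The hour-based ranges above map onto the epoch-second start and end times that `aws logs start-query` expects. A minimal sketch of that conversion follows. `time_window` is a hypothetical helper, and the warning at 168 hours simply mirrors the 7-day guideline in this section.

```shell
#!/bin/sh
# Hypothetical helper: print "<start> <end>" in epoch seconds for the
# last N hours. The optional second argument fixes "now" for repeatability.
time_window() {
  hours="$1"
  now="${2:-$(date +%s)}"
  if [ "$hours" -gt 168 ]; then
    echo "warning: range exceeds the 7-day guideline" >&2
  fi
  echo "$((now - hours * 3600)) $now"
}

time_window 24 1700000000   # prints 1699913600 1700000000
```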
Error messages often differ only in timestamps, IPs, or IDs:
- "Error at 2026-02-06 15:30:45" vs "Error at 2026-02-06 16:45:12"
- "Connection failed to 192.168.1.100" vs "Connection failed to 192.168.1.200"
Pattern normalization groups similar errors automatically:
# The output includes both individual errors and pattern-grouped errors
./scripts/analyze_errors.sh <log-group> 24
# Query pattern-grouped errors from saved state
./scripts/view_state.sh analyze_errors | jq '.top_errors_by_pattern'
Pattern output structure:
{
"pattern": "Error at <TIMESTAMP>",
"total_count": 450,
"occurrences": 12,
"examples": [
{"message": "Error at 2026-02-06 15:30:45", "count": 120},
{"message": "Error at 2026-02-06 16:45:12", "count": 95}
]
}
Benefits:
- See the "true" error count (not inflated by timestamp variations)
- Identify systemic issues vs one-time errors
- Reduce noise from UUID/IP variations
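The idea behind pattern normalization can be sketched with a few `sed` substitutions. This is an illustration only: `normalize_message` is a hypothetical helper, the regular expressions are simplified, and the `<IP>` and `<UUID>` placeholder names are assumptions (only `<TIMESTAMP>` appears in the example output above).

```shell
#!/bin/sh
# Hypothetical helper: replace variable parts of log messages read from
# stdin with placeholders so that similar messages collapse to one pattern.
normalize_message() {
  sed -E \
    -e 's/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/<UUID>/g' \
    -e 's/[0-9]{4}-[0-9]{2}-[0-9]{2}[ T][0-9]{2}:[0-9]{2}:[0-9]{2}/<TIMESTAMP>/g' \
    -e 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/<IP>/g'
}

# Two messages that differ only in their timestamp collapse to one pattern.
printf '%s\n' \
  "Error at 2026-02-06 15:30:45" \
  "Error at 2026-02-06 16:45:12" |
  normalize_message | sort -u   # prints: Error at <TIMESTAMP>
```

Piping the normalized messages through `sort | uniq -c` instead of `sort -u` gives a per-pattern count.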
For microservices/distributed architectures:
- Use trace_request.sh to follow requests across services
- Ensure request IDs are logged consistently
- Search across all related log groups with a common prefix
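For readers who want to see what a cross-group search involves at the CLI level, here is a rough sketch built on `aws logs describe-log-groups` and `aws logs filter-log-events`. It is not the implementation of trace_request.sh: `quote_term` and `trace_across_groups` are hypothetical helpers, and unlike the Insights-based scripts this uses filter patterns, which are case-sensitive. Note that `filter-log-events` takes its start time in epoch milliseconds.

```shell
#!/bin/sh
# Hypothetical helper: wrap a term in double quotes so the filter pattern
# treats it as one exact phrase (needed for terms containing '-').
quote_term() {
  printf '"%s"' "$1"
}

# Hypothetical helper: search every log group under a prefix for a term.
# Arguments: request ID, log group prefix, start time in epoch milliseconds.
trace_across_groups() {
  pattern=$(quote_term "$1")
  aws logs describe-log-groups --log-group-name-prefix "$2" \
    --query 'logGroups[].logGroupName' --output text | tr '\t' '\n' |
  while IFS= read -r group; do
    [ -n "$group" ] || continue
    aws logs filter-log-events --log-group-name "$group" \
      --filter-pattern "$pattern" --start-time "$3" \
      --query 'events[].[timestamp,logStreamName,message]' --output text
  done | sort -n   # merge results from all groups by timestamp
}

# Usage (not executed here):
#   start_ms=$(( ($(date +%s) - 24 * 3600) * 1000 ))
#   trace_across_groups abc-123 /aws/myapp "$start_ms"
```

Multi-line log messages would break the line-based sort, so this is best treated as a starting point.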
Cause: Case-sensitive filter patterns don't match logs
Solution:
# ✅ Use Insights-based scripts (case-insensitive)
./scripts/analyze_errors.sh <log-group-name> 24
./scripts/find_recent_errors.sh <log-group-name> 1
Cause: Broad search across large time range
Solution:
# Step 1: Narrow time range
./scripts/analyze_errors.sh <log-group-name> 1 # Last hour instead of 24
# Step 2: Filter by specific pattern
./scripts/run_insights_query.sh <log-group-name> 1 \
'fields @timestamp, @message | filter @message like /(?i)OutOfMemory/ | limit 50'
Cause: Query too complex or time range too large
Solution:
# Step 1: Reduce time range
./scripts/analyze_errors.sh <log-group-name> 1 # Instead of 24
# Step 2: Simplify query (remove complex parsing)
# Step 3: Add more specific filters early in the query
./scripts/run_insights_query.sh <log-group-name> 1 \
'fields @timestamp, @message | filter @message like /(?i)specific_error/ | stats count()'
Cause: Typo or wrong region
Solution:
# Step 1: List all log groups to verify name
./scripts/list_log_groups.sh
# Step 2: Search with prefix
./scripts/list_log_groups.sh /aws/application
# Step 3: Verify AWS region is correct (check AWS CLI config)
Workflow: Correlate K8s pod events with application logs
# Step 1: Get pod name from kubernetes skill
# (kubernetes skill command)
# Step 2: Search CloudWatch logs for that pod
./scripts/trace_request.sh <pod-name> /aws/application 24
Workflow: Investigate deployment-related errors
# Step 1: Get commit SHA from gitlab skill
# (gitlab skill command)
# Step 2: Search logs for that deployment
./scripts/trace_request.sh <commit-sha> /aws/application 24
# Step 3: Analyze errors during deployment window
./scripts/analyze_errors.sh /aws/application/myapp 1
AWS Services:
/aws/lambda/<function-name>
/aws/rds/instance/<instance-id>/*
/aws/ecs/containerinsights/<cluster>
/aws/eks/<cluster>/cluster
/aws/apigateway/<api-id>/<stage>
Application Logs:
/aws/application/<app-name>
/var/log/messages
/aws/containerinsights/<cluster>/*
All commands output JSON by default. Add --human for human-readable format.
# Analyze errors (JSON output) - RECOMMENDED
OUTPUT=$(./scripts/analyze_errors.sh <log-group> 24)
echo "$OUTPUT" | jq '.total_errors'
# Analyze errors (human-readable)
./scripts/analyze_errors.sh <log-group> 24 --human
# Analyze errors with noise filtering and comparison
./scripts/analyze_errors.sh <log-group> 24 --exclude-noise --compare-previous
# Find recent errors
./scripts/find_recent_errors.sh <log-group> 1
# Trace a request
./scripts/trace_request.sh <request-id> <log-group-prefix> 24
# Monitor in real-time
./scripts/tail_logs.sh <log-group>
# List log groups
./scripts/list_log_groups.sh
# Custom query
./scripts/run_insights_query.sh <log-group> 24 '<insights-query>'
# Parse JSON output
./scripts/analyze_errors.sh <log-group> 24 | jq '.total_errors'
./scripts/list_log_groups.sh | jq '.log_groups[].name'
For most scripts (hours):
- 1 = last 1 hour
- 24 = last 24 hours
- 168 = last 7 days
For tail_logs.sh (relative time):
- 1h = last hour
- 30m = last 30 minutes
- 2d = last 2 days
Required:
- aws CLI v2 (recommended) or v1
  - Install: ../../../tools/aws-cli/install.sh
  - Check: ../../../tools/aws-cli/install.sh --check
Optional (recommended):
- jq - JSON processor for parsing outputs
  - Install: ../../../tools/jq/install.sh
  - Check: ../../../tools/jq/install.sh --check
All installation scripts:
- Auto-detect architecture (Linux x86_64, ARM64)
- No root access required
- Idempotent (safe to run multiple times)
- Version tracking for Renovate updates