This document outlines the standards and best practices for implementing monitoring and observability across all Bayat projects. Following these guidelines ensures consistent, effective monitoring that provides actionable insights into system health and performance.
- Monitoring Principles
- Metrics Standards
- Logging Standards
- Tracing Standards
- Alerting Standards
- Dashboards and Visualization
- Tool Selection
- Monitoring as Code
- Health Checks
- SLIs, SLOs, and SLAs
- On-Call and Incident Response
- Capacity Planning
- Security Monitoring
- Cost Monitoring
- Monitoring Governance
All monitoring and observability at Bayat should adhere to these core principles:
- Actionable: Monitoring should provide actionable information
- Relevant: Focus on what matters to the business and users
- Timely: Detect and alert on issues promptly
- Transparent: Make monitoring accessible to all stakeholders
- Comprehensive: Cover all critical systems and components
- Automated: Automate monitoring setup and maintenance
- Contextual: Provide context to understand the significance of metrics
- Proportional: Balance monitoring effort with system importance
- Historical: Retain historical data for trend analysis
- Correlated: Connect related information across monitoring systems
Standardize on these core metrics for all systems:
| Category | Metrics | Description |
|---|---|---|
| Service Level | Availability, Error Rate, Latency | Overall service health |
| Resource | CPU, Memory, Disk, Network | Resource utilization |
| Application | Request Count, Success Rate, Duration | Application performance |
| Business | Transactions, User Activity, Conversions | Business impact |
| Dependencies | External Service Health, Response Time | Dependency health |
Follow a consistent naming convention:
- Use lowercase with underscores as separators
- Use the format: `[namespace]_[metric_name]_[unit]`
- Group related metrics with a common prefix
- Be specific about what is being measured
- Include the unit of measurement where applicable
Examples:
```text
http_requests_total
api_request_duration_seconds
database_connections_current
memory_usage_bytes
```
Use appropriate metric types (a code sketch follows this list):
- Counter: Ever-increasing values (e.g., request count, errors)
- Gauge: Values that can go up and down (e.g., memory usage, queue depth)
- Histogram: Distribution of values (e.g., request duration)
- Summary: Similar to histogram but with calculated quantiles
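For illustration, a minimal sketch of these metric types using the Python `prometheus_client` library (an assumption; any metrics library with equivalent types works). Names follow the `[namespace]_[metric_name]_[unit]` convention above; `db_query_duration_seconds` is a hypothetical metric added to show a Summary.

```python
# Minimal sketch, assuming the prometheus_client library is available.
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: ever-increasing values
http_requests_total = Counter(
    "http_requests_total", "Total HTTP requests served", ["method", "status"]
)

# Gauge: values that can go up and down
database_connections_current = Gauge(
    "database_connections_current", "Currently open database connections"
)
memory_usage_bytes = Gauge("memory_usage_bytes", "Resident memory usage in bytes")

# Histogram: distribution of values (e.g., request duration), exposed as buckets
api_request_duration_seconds = Histogram(
    "api_request_duration_seconds", "API request duration in seconds"
)

# Summary: like a histogram, but quantiles are calculated client-side
db_query_duration_seconds = Summary(
    "db_query_duration_seconds", "Database query duration in seconds"
)

if __name__ == "__main__":
    start_http_server(8080)  # expose /metrics for scraping
    http_requests_total.labels(method="GET", status="200").inc()
    database_connections_current.set(12)
    api_request_duration_seconds.observe(0.245)
```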
Manage metric cardinality:
- Limit high-cardinality labels (e.g., user IDs, IP addresses)
- Focus on meaningful aggregations
- Consider sampling for high-volume metrics
- Document cardinality limits for each system
Standardize collection frequencies:
- Critical Metrics: 10-30 second intervals
- Standard Metrics: 1 minute intervals
- Slow-changing Metrics: 5+ minute intervals
Document the rationale for any deviations from these standards.
Use standard log levels consistently:
- ERROR: System errors requiring immediate attention
- WARN: Potential issues that don't stop operation
- INFO: Normal operational information
- DEBUG: Detailed information for debugging
- TRACE: Very detailed debugging information
Structure logs in a consistent JSON format:
```json
{
  "timestamp": "2023-05-08T12:34:56.789Z",
  "level": "INFO",
  "service": "payment-service",
  "instance": "payment-api-67890",
  "trace_id": "abc123def456",
  "message": "Payment processed successfully",
  "context": {
    "user_id": "user-123",
    "payment_id": "pmt-456",
    "amount": 99.95,
    "currency": "USD"
  }
}
```
Include these standard fields in all logs (a formatter sketch follows this list):
- timestamp: ISO 8601 format with timezone
- level: Log level
- service: Service or component name
- instance: Instance identifier
- trace_id: Distributed tracing ID
- message: Human-readable message
- context: Relevant contextual information
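A minimal sketch of a formatter that emits this structure using only the Python standard library; the service and instance values shown are illustrative:

```python
# Minimal JSON log formatter sketch emitting the standard fields above.
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    def __init__(self, service: str, instance: str):
        super().__init__()
        self.service = service
        self.instance = instance

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "instance": self.instance,
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)


logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="payment-service", instance="payment-api-67890"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Payment processed successfully",
    extra={"trace_id": "abc123def456", "context": {"payment_id": "pmt-456"}},
)
```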
Handle sensitive information in logs (a masking sketch follows this list):
- Never log passwords, tokens, or credentials
- Mask or truncate sensitive personal data
- Comply with relevant privacy regulations
- Document what should and should not be logged
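A minimal masking sketch; the set of sensitive keys and the email-truncation rule are assumptions that should be adapted to your own data model and privacy requirements:

```python
# Minimal sketch: redact sensitive keys and truncate personal data before logging.
import re

SENSITIVE_KEYS = {"password", "token", "credential", "secret", "api_key"}
EMAIL_PATTERN = re.compile(r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*(@[A-Za-z0-9.-]+)")


def mask_context(context: dict) -> dict:
    """Return a copy of the log context with sensitive values redacted."""
    masked = {}
    for key, value in context.items():
        if key.lower() in SENSITIVE_KEYS:
            masked[key] = "***REDACTED***"
        elif isinstance(value, str):
            # Truncate email-like values to their first character
            masked[key] = EMAIL_PATTERN.sub(r"\1***\2", value)
        else:
            masked[key] = value
    return masked


print(mask_context({"user_email": "jane.doe@example.com", "token": "abc123", "amount": 99.95}))
# {'user_email': 'j***@example.com', 'token': '***REDACTED***', 'amount': 99.95}
```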
Define standard log retention periods:
- Production: Minimum 90 days
- Non-production: Minimum 30 days
- Security events: Minimum 1 year
- Compliance-related: As required by regulations
Include relevant context in logs:
- Link logs to specific requests or transactions
- Include business identifiers (user ID, order ID, etc.)
- Add relevant technical context (server, component, etc.)
- Tag logs with environment information
Implement distributed tracing across all services (a propagation sketch follows this list):
- Use W3C Trace Context or OpenTelemetry standards
- Propagate trace context in all service-to-service calls
- Ensure consistent trace ID format across all systems
- Include trace IDs in logs and metrics where relevant
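A minimal sketch of propagating a W3C `traceparent` header between services; the `requests` library and the downstream URL are assumptions, and in practice an OpenTelemetry propagator would usually handle this automatically:

```python
# Minimal sketch of W3C Trace Context propagation in service-to-service calls.
import secrets
from typing import Optional

import requests


def new_traceparent() -> str:
    """Build a W3C traceparent header value: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)       # 32 hex characters
    parent_span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{trace_id}-{parent_span_id}-01"  # 01 = sampled


def call_downstream(incoming_traceparent: Optional[str] = None) -> requests.Response:
    # Reuse the caller's trace context if present; otherwise start a new trace.
    headers = {"traceparent": incoming_traceparent or new_traceparent()}
    return requests.get("https://payment-service.internal/health", headers=headers)
```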
Use consistent span naming:
- Include the operation type and target
- Be specific but avoid high cardinality
- Use the format: `[operation].[target]`
- Document standard span names for common operations
Examples:
```text
http.request
db.query
cache.get
payment.process
```
Include these standard attributes in spans (an instrumentation sketch follows this list):
- service.name: Name of the service
- service.version: Version of the service
- host.name: Name of the host
- http.method: HTTP method for web requests
- http.url: URL for web requests
- http.status_code: Status code for HTTP responses
- db.system: Database system type
- db.operation: Database operation type
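A minimal instrumentation sketch with the OpenTelemetry Python SDK (assumed available), creating a `payment.process` span with attributes from the list above; the console exporter and attribute values are illustrative:

```python
# Minimal sketch, assuming the opentelemetry-sdk package is installed.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("payment-service")

with tracer.start_as_current_span("payment.process") as span:
    span.set_attribute("service.name", "payment-service")
    span.set_attribute("service.version", "1.2.3")
    span.set_attribute("http.method", "POST")
    span.set_attribute("http.url", "https://api.bayat.io/payments")
    span.set_attribute("http.status_code", 200)
```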
Implement a balanced sampling strategy (a configuration sketch follows this list):
- Sample 100% of traffic in development environments
- Use adaptive sampling in production
- Always trace errors and critical transactions
- Document sampling decisions and rationales
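A minimal configuration sketch showing parent-based ratio sampling with the OpenTelemetry Python SDK; the 10% ratio is an assumption, and fully adaptive or tail-based sampling typically requires collector-side support:

```python
# Minimal sampling-configuration sketch, assuming the opentelemetry-sdk package.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Respect the caller's sampling decision; otherwise sample 10% of new traces.
# In development environments the ratio would simply be set to 1.0 (100%).
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```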
Define clear severity levels:
| Level | Description | Response Time | Notification Method |
|---|---|---|---|
| P1 | Critical business impact, service down | Immediate (24/7) | Phone, SMS, dedicated channel |
| P2 | Significant impact, degraded service | < 30 minutes (business hours) | SMS, email, dedicated channel |
| P3 | Minor impact, non-critical issues | < 4 hours (business hours) | Email, general channel |
| P4 | Low impact, cosmetic issues | Next business day | Email, ticket |
Include essential information in alerts:
- Clear, actionable title
- Severity level
- Affected system and component
- Issue summary
- Impact assessment
- Suggested remediation steps
- Link to runbook
- Alert context and relevant metrics
Example alert template:
```text
[P1] Payment Service: High Error Rate (25%) exceeding threshold (5%)

Affected Service: Payment Processing API
Time Detected: 2023-05-08 12:34:56 UTC
Impact: Users unable to complete payments
Metric: error_rate = 25%

Possible Causes:
- Database connectivity issues
- External payment gateway failures
- Recent deployment issues

Suggested Actions:
1. Check payment gateway status
2. Verify database connectivity
3. Review recent deployments

Runbook: https://runbooks.bayat.io/payment-service/high-error-rate
Dashboard: https://dashboards.bayat.io/payment-service
```
Design effective alert rules:
- Alert on symptoms, not causes
- Avoid alert fatigue through proper thresholds
- Use dynamic thresholds where appropriate
- Implement alert grouping and correlation
- Include buffer periods for transient issues
- Document the rationale for each alert
Define clear alert routing (a routing sketch follows this list):
- Route alerts to the responsible team
- Use escalation policies for unacknowledged alerts
- Maintain up-to-date on-call schedules
- Provide multiple notification channels
- Document notification preferences
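A minimal routing sketch that maps the severity table above to notification channels and responsible teams; the team names, channel identifiers, and escalation timers are assumptions:

```python
# Minimal sketch of severity-based alert routing; values are illustrative.
NOTIFICATION_CHANNELS = {
    "P1": ["phone", "sms", "dedicated-channel"],
    "P2": ["sms", "email", "dedicated-channel"],
    "P3": ["email", "general-channel"],
    "P4": ["email", "ticket"],
}

TEAM_ROUTES = {
    "payment-service": "payments-oncall",
    "api-gateway": "platform-oncall",
}


def route_alert(alert: dict) -> dict:
    """Decide which team receives an alert and on which channels."""
    severity = alert.get("severity", "P3")
    team = TEAM_ROUTES.get(alert.get("service", ""), "default-oncall")
    return {
        "team": team,
        "channels": NOTIFICATION_CHANNELS.get(severity, ["email"]),
        "escalate_if_unacknowledged_minutes": 15 if severity in ("P1", "P2") else 120,
    }


print(route_alert({"service": "payment-service", "severity": "P1"}))
```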
Create consistent dashboards:
- Use a standard template for each service type
- Include service overview and drill-down views
- Display SLIs prominently
- Show correlation between metrics
- Provide links to related resources and documentation
- Include time range selectors
- Use consistent color schemes and layouts
Implement these standard dashboards for all services:
- Service Overview: Key health and performance metrics
- Resource Utilization: CPU, memory, disk, network
- SLI/SLO Dashboard: Service level indicators and objectives
- Dependencies: Health of dependencies and integration points
- Business Metrics: User activity and business outcomes
Follow these visualization guidelines:
- Use appropriate chart types for each metric
- Provide context with thresholds and historical trends
- Label axes and include units
- Use consistent time windows
- Include legends and explanations
- Avoid visual clutter
- Optimize for readability at a glance
Standardize on these monitoring tools:
| Category | Approved Tools | Purpose |
|---|---|---|
| Metrics | Prometheus, Datadog, CloudWatch | Metric collection and analysis |
| Logging | ELK Stack, Loki, CloudWatch Logs | Log aggregation and search |
| Tracing | Jaeger, Zipkin, X-Ray | Distributed tracing |
| Dashboards | Grafana, Datadog, CloudWatch | Visualization |
| Alerting | AlertManager, PagerDuty, OpsGenie | Alert management |
| Synthetic | Pingdom, Datadog Synthetics | Synthetic monitoring |
| APM | New Relic, Datadog APM, Dynatrace | Application performance monitoring |
| Real User | Google Analytics, Datadog RUM | Real user monitoring |
Choose monitoring tools based on:
- Integration: Compatibility with existing systems
- Scalability: Ability to handle expected load
- Functionality: Coverage of required capabilities
- Usability: Ease of use for all stakeholders
- Cost: Total cost of ownership
- Support: Vendor support and community
- Security: Security features and compliance
Define monitoring as code:
- Store monitoring definitions in version control
- Use infrastructure as code tools to deploy monitoring
- Apply the same review and approval process as application code
- Test monitoring changes before deployment
- Document monitoring code comprehensively
```hcl
# Define CloudWatch alarm for API Gateway
resource "aws_cloudwatch_metric_alarm" "api_5xx_error" {
  alarm_name          = "${var.service_name}-5xx-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "5XXError"
  namespace           = "AWS/ApiGateway"
  period              = 60
  statistic           = "Sum"
  threshold           = 5
  alarm_description   = "This alarm monitors for 5XX errors in the API"

  dimensions = {
    ApiName = var.api_name
    Stage   = var.environment
  }

  alarm_actions = [aws_sns_topic.alarm_topic.arn]
  ok_actions    = [aws_sns_topic.alarm_topic.arn]

  tags = local.common_tags
}
```
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'api_service'
    scrape_interval: 15s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['api-service:8080']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [job]
        target_label: service
```
```yaml
# alerts.yml
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
          team: api
        annotations:
          summary: "High HTTP error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the past 5 minutes"
          runbook: "https://runbooks.bayat.io/api/high-error-rate"
```
Implement multiple health check types:
- Liveness: Confirms the service is running
- Readiness: Confirms the service can handle requests
- Dependency: Confirms dependencies are available
- Functional: Confirms business functions work correctly
- Synthetic: Simulates user interactions
Standardize health check implementation:
- Expose health check endpoints for all services
- Use standard HTTP status codes (200 for healthy, non-200 for unhealthy)
- Include detailed health information in response body
- Set appropriate timeouts and intervals
- Document health check endpoints and expected responses
Example health check response:
```json
{
  "status": "healthy",
  "version": "1.2.3",
  "timestamp": "2023-05-08T12:34:56.789Z",
  "dependencies": {
    "database": {
      "status": "healthy",
      "responseTime": 15
    },
    "paymentGateway": {
      "status": "healthy",
      "responseTime": 120
    },
    "notificationService": {
      "status": "degraded",
      "responseTime": 450,
      "message": "High latency detected"
    }
  }
}
```
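A minimal sketch of a health endpoint that could produce a payload like the one above, assuming Flask; the dependency check shown is a placeholder to be replaced with real connectivity tests:

```python
# Minimal health-endpoint sketch; Flask and the check helper are assumptions.
import time
from flask import Flask, jsonify

app = Flask(__name__)


def check_database() -> dict:
    # Placeholder: replace with a real connectivity check (e.g., SELECT 1).
    start = time.monotonic()
    healthy = True
    return {
        "status": "healthy" if healthy else "unhealthy",
        "responseTime": round((time.monotonic() - start) * 1000),  # milliseconds
    }


@app.route("/health")
def health():
    dependencies = {"database": check_database()}
    overall = "healthy" if all(d["status"] == "healthy" for d in dependencies.values()) else "degraded"
    status_code = 200 if overall == "healthy" else 503  # 200 for healthy, non-200 otherwise
    return jsonify({
        "status": overall,
        "version": "1.2.3",
        "dependencies": dependencies,
    }), status_code
```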
Implement synthetic monitoring for critical paths:
- Define key user journeys to monitor
- Run synthetic tests at regular intervals
- Monitor from multiple geographic locations
- Alert on synthetic test failures
- Include detailed failure information
Define standard SLIs for all services:
- Availability: Percentage of successful requests
- Latency: Request duration at various percentiles
- Error Rate: Percentage of error responses
- Throughput: Requests per second
- Saturation: Resource utilization relative to capacity
Set clear SLOs for each SLI:
- Define measurement windows (e.g., 30 days)
- Set appropriate targets (e.g., 99.9% availability)
- Allocate error budgets
- Document SLO decision rationale
- Review and adjust SLOs regularly
Example SLO definition:
```text
Service: Payment API
SLI: Availability
Definition: Percentage of requests that return a valid response (non-5xx)
Measurement: 30-day rolling window
Target: 99.95%
Error Budget: 0.05% (21.6 minutes per 30 days)
```
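The error budget above follows directly from the target and the measurement window; a minimal sketch of the arithmetic in plain Python:

```python
# Minimal sketch: allowed downtime for an availability SLO over a rolling window.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the allowed downtime (in minutes) for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)


print(error_budget_minutes(0.9995))  # 21.6 minutes per 30 days
print(error_budget_minutes(0.999))   # 43.2 minutes per 30 days
```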
Structure SLAs based on SLOs:
- Set SLA targets below SLO targets (buffer)
- Define clear measurement methodologies
- Specify exclusions and limitations
- Include remediation and reporting requirements
- Document business impact of violations
Establish clear on-call practices:
- Define primary and secondary on-call roles
- Create fair and sustainable rotation schedules
- Document escalation paths
- Provide clear handoff procedures
- Ensure knowledge sharing across the team
Classify incidents by severity:
| Severity | Definition | Examples |
|---|---|---|
| Critical | Complete service outage with significant business impact | Payment system down, authentication failure for all users |
| Major | Partial service degradation affecting many users | Slow response times, subset of features unavailable |
| Minor | Limited impact affecting few users or non-critical features | Cosmetic issues, isolated errors for specific users |
Define a clear incident response process:
1. Detection: Identify the incident through monitoring or reports
2. Classification: Determine severity and impact
3. Notification: Alert appropriate stakeholders
4. Mitigation: Implement immediate fix or workaround
5. Resolution: Fully resolve the underlying issue
6. Post-mortem: Analyze root cause and implement preventative measures
Document incidents thoroughly:
- Incident summary and timeline
- Impact assessment
- Root cause analysis
- Resolution steps
- Preventative measures
- Lessons learned
Monitor these metrics for capacity planning:
- Resource utilization trends
- Growth rates for users and transactions
- Seasonal patterns
- Peak-to-average ratios
- Resource headroom
Implement capacity forecasting (a projection sketch follows this list):
- Analyze historical usage patterns
- Project future growth
- Plan for peak events and seasonality
- Document capacity requirements
- Schedule regular capacity reviews
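A minimal projection sketch using a linear fit over historical utilization samples (requires Python 3.10+ for `statistics.linear_regression`); the sample data and the 80% headroom threshold are assumptions:

```python
# Minimal capacity-projection sketch over illustrative historical samples.
from statistics import linear_regression

# (week number, average CPU utilization in %) taken from historical monitoring data
samples = [(1, 42.0), (2, 44.5), (3, 47.1), (4, 49.8), (5, 52.4)]
weeks = [w for w, _ in samples]
cpu = [u for _, u in samples]

slope, intercept = linear_regression(weeks, cpu)

for week in range(6, 14):
    projected = slope * week + intercept
    flag = "  <-- exceeds 80% headroom threshold" if projected > 80 else ""
    print(f"week {week}: projected CPU {projected:.1f}%{flag}")
```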
Define guidelines for capacity changes:
- Triggers for horizontal scaling
- Triggers for vertical scaling
- Lead time for capacity changes
- Approval process for significant changes
- Documentation of scaling decisions
Monitor these security-specific metrics:
- Authentication failures
- Authorization violations
- Rate limit breaches
- Unusual access patterns
- Security scan results
- Vulnerability counts
Implement security-specific alerts (a detection sketch follows this list):
- Excessive authentication failures
- Privilege escalation attempts
- Configuration changes
- Unusual network traffic
- Data exfiltration patterns
- Compliance violations
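A minimal detection sketch for excessive authentication failures using a sliding time window; the threshold and window size are assumptions to be tuned per system:

```python
# Minimal sliding-window detector for excessive authentication failures.
import time
from collections import deque


class AuthFailureMonitor:
    def __init__(self, threshold: int = 20, window_seconds: int = 300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.failures: deque[float] = deque()

    def record_failure(self) -> bool:
        """Record a failed login; return True when an alert should fire."""
        now = time.time()
        self.failures.append(now)
        # Drop events that have fallen out of the window
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        return len(self.failures) >= self.threshold
```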
Address compliance requirements:
- Define compliance-specific monitoring
- Generate required compliance reports
- Maintain audit trails for compliance activities
- Verify monitoring coverage of compliance controls
Track these cost-related metrics:
- Resource costs by service
- Cost per transaction
- Unutilized resources
- Cost anomalies
- Cost trends
Monitor for cost optimization opportunities:
- Overprovisioned resources
- Idle resources
- Inefficient architecture patterns
- Unexpected cost increases
- Resource usage outside business hours
Implement cost allocation monitoring (a budget-check sketch follows this list):
- Track costs by team, project, and environment
- Set up budget alerts
- Compare actual costs to budgets
- Identify cost outliers
- Report on cost efficiency metrics
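A minimal budget-check sketch comparing month-to-date spend to per-team budgets; the figures and the 90% alert threshold are illustrative assumptions:

```python
# Minimal sketch of a budget alert check; team names and amounts are illustrative.
BUDGETS = {"payments": 12_000, "platform": 8_000}  # monthly budget (USD)
ACTUALS = {"payments": 11_400, "platform": 5_200}  # month-to-date spend (USD)


def budget_alerts(threshold: float = 0.9) -> list[str]:
    alerts = []
    for team, budget in BUDGETS.items():
        spend = ACTUALS.get(team, 0)
        if spend >= budget * threshold:
            alerts.append(f"{team}: {spend / budget:.0%} of monthly budget consumed")
    return alerts


print(budget_alerts())  # ['payments: 95% of monthly budget consumed']
```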
Ensure compliance with monitoring standards:
- Conduct regular monitoring coverage reviews
- Verify SLO measurement and reporting
- Audit alert configurations
- Check dashboard accuracy and completeness
- Validate log coverage and retention
Provide monitoring as a service:
- Create self-service monitoring tools
- Document monitoring onboarding procedures
- Provide monitoring templates and examples
- Offer monitoring consultation for teams
- Establish monitoring support channels
Continuously improve monitoring practices:
- Collect feedback on monitoring effectiveness
- Analyze alert patterns and response times
- Identify monitoring gaps
- Stay current with monitoring technologies
- Update standards based on lessons learned