Observability and Telemetry Guidelines

Introduction

Observability is the ability to understand a system's internal state based on its external outputs. Telemetry is the process of collecting data about the performance and behavior of applications and systems. Together, they provide critical insights into system health, performance, and user experience.

Core Pillars of Observability

Logs

Logs are structured or unstructured records of events that occur in a system.

Log Levels

  • ERROR: Errors that prevent normal operation
  • WARN: Potentially harmful situations that don't prevent operation
  • INFO: Informational messages highlighting normal application flow
  • DEBUG: Detailed information for debugging purposes
  • TRACE: Very detailed information, typically only used for intensive debugging

Logging Standards

  • Use structured logging (JSON format) for machine readability
  • Include consistent metadata with each log event:
    • Timestamp (ISO 8601 format)
    • Service/component name
    • Instance ID
    • Correlation ID for tracing requests across services
    • User ID (where appropriate and privacy-compliant)
  • Sensitive information must never be logged (PII, passwords, tokens)
  • Use log rotation to manage log file sizes
  • Implement rate limiting for high-volume log events
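
As an illustration, the sketch below shows one way to emit structured JSON logs with the metadata above using only the Python standard library; the service name, instance ID, and correlation ID values are placeholders.

```python
# Minimal structured (JSON) logging sketch using only the standard library.
# Field names and values (service, instance_id, correlation_id) are illustrative.
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
            "level": record.levelname,
            "service": "checkout-service",                        # service/component name
            "instance_id": "i-0abc123",                           # instance ID
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(event)

logger = logging.getLogger("checkout-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Pass per-request metadata via `extra`; never log passwords, tokens, or PII.
logger.info("order created", extra={"correlation_id": "req-42"})
```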

Metrics

Metrics are numerical measurements collected at regular intervals.

Types of Metrics

  • Counters: Cumulative measurements that only increase (e.g., request count)
  • Gauges: Point-in-time measurements that can increase or decrease (e.g., memory usage)
  • Histograms: Distribution of measured values (e.g., request duration percentiles)
  • Summaries: Similar to histograms, but quantiles are computed on the client at collection time rather than aggregated from buckets on the backend
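
For example, with the prometheus_client library (assumed here; any metrics library with equivalent types works the same way), the four types look like this. Metric and label names are illustrative.

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["method", "status"])
IN_FLIGHT = Gauge("app_requests_in_flight", "Requests currently in flight")
LATENCY = Histogram("app_request_duration_seconds", "Request duration in seconds")
PAYLOAD = Summary("app_request_payload_bytes", "Request payload size in bytes")

def handle_request(payload_size):
    REQUESTS.labels(method="GET", status="200").inc()  # counters only increase
    PAYLOAD.observe(payload_size)                       # summaries record raw observations
    with IN_FLIGHT.track_inprogress():                  # gauges move up and down
        with LATENCY.time():                            # histograms bucket the durations
            ...                                         # request handling goes here

start_http_server(8000)  # expose the /metrics endpoint for scraping
```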

Standard Metrics to Collect

  • System Metrics:

    • CPU usage
    • Memory usage
    • Disk I/O
    • Network I/O
  • Application Metrics:

    • Request count
    • Error rate
    • Response time (median, 95th, 99th percentiles)
    • Queue depth
    • Active connections
    • Business-specific metrics (e.g., orders processed)
  • Dependency Metrics:

    • External service call latency
    • External service error rate
    • Database query time
    • Cache hit/miss rate
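
The dependency metrics in particular are easy to capture at the call site. A rough sketch with prometheus_client follows; the query_db function and cache object are hypothetical placeholders.

```python
import time
from prometheus_client import Counter, Histogram

DB_QUERY_SECONDS = Histogram("app_db_query_duration_seconds", "Database query time")
CACHE_LOOKUPS = Counter("app_cache_lookups_total", "Cache lookups", ["result"])

def get_user(user_id, cache, query_db):
    cached = cache.get(user_id)
    if cached is not None:
        CACHE_LOOKUPS.labels(result="hit").inc()   # cache hit
        return cached
    CACHE_LOOKUPS.labels(result="miss").inc()      # cache miss
    start = time.perf_counter()
    user = query_db(user_id)
    DB_QUERY_SECONDS.observe(time.perf_counter() - start)  # database query time
    cache.set(user_id, user)
    return user
```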

Metric Naming Conventions

  • Use lowercase with underscores
  • Include the measured entity and unit
  • Follow a consistent pattern: <namespace>_<entity>_<action>_<unit>
  • Examples:
    • app_requests_total
    • app_request_duration_seconds
    • system_memory_usage_bytes

Traces

Traces track the flow of requests through distributed systems.

Distributed Tracing Standards

  • Use W3C Trace Context standard for trace propagation
  • Every service-to-service call should propagate trace information
  • Instrument all API endpoints, service calls, database queries, and significant internal operations
  • Include relevant metadata with spans:
    • Service name
    • Operation name
    • Start/end time
    • Parent span ID
    • Tags for filtering
    • Log events for important points in the span
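
A minimal sketch with the OpenTelemetry Python SDK, showing a span that carries the metadata above and explicit W3C Trace Context injection; the service, operation, and attribute names are illustrative.

```python
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("process_order") as span:  # operation name
    span.set_attribute("order.id", "ord-1234")               # tag for filtering
    span.add_event("payment_authorized")                      # log event within the span
    headers = {}
    inject(headers)  # adds the W3C traceparent header for the downstream call
    # pass `headers` with the outbound HTTP request to propagate the trace
```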

Sampling Strategies

  • 100% sampling of error traces
  • Adaptive sampling based on system load
  • Representative sampling to reduce overhead in high-volume systems
  • Custom sampling rules for critical paths
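
Head-based sampling can be configured directly in the SDK; the sketch below keeps roughly 10% of new traces while honoring the caller's decision. Guaranteeing 100% of error traces generally requires tail-based sampling in a collector, since the outcome is not yet known when the root span starts.

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~10% of root traces; child spans inherit the parent's decision, so a
# given trace is either recorded end to end or not at all.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))
```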

Implementation Guidelines

Instrumentation

  • Use standard instrumentation libraries for your language/framework
  • Prefer automatic instrumentation where possible
  • Add custom instrumentation for business-critical paths
  • Ensure instrumentation has minimal performance impact
  • Test instrumentation in non-production environments
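
For instance, automatic instrumentation of a Flask service takes only a couple of lines, assuming the opentelemetry-instrumentation-flask package is installed; custom spans for business-critical paths can then be added with the tracer as shown earlier.

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # one server span per incoming request

@app.route("/orders")
def list_orders():
    return {"orders": []}  # handler is traced without further changes
```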

Data Collection

  • Use agents to collect telemetry data
  • Implement buffering and batching to reduce network overhead
  • Ensure fault tolerance in telemetry collection
  • Consider privacy implications and data retention policies
  • Filter sensitive data before collection
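
As a sketch of buffering and batching with the OpenTelemetry SDK (the collector endpoint and batch sizes are illustrative; sensitive attributes should be redacted at instrumentation time, before spans enter this pipeline):

```python
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

exporter = OTLPSpanExporter(endpoint="otel-collector.internal:4317", insecure=True)
processor = BatchSpanProcessor(
    exporter,
    max_queue_size=2048,          # in-memory buffer before spans are dropped
    schedule_delay_millis=5000,   # flush interval
    max_export_batch_size=512,    # spans per export request
)
provider = TracerProvider()
provider.add_span_processor(processor)
```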

Data Storage

  • Select appropriate storage solutions for different telemetry types:
    • Logs: Elasticsearch, Loki, Splunk
    • Metrics: Prometheus, InfluxDB, Graphite
    • Traces: Jaeger, Zipkin, X-Ray
  • Implement data retention policies
  • Consider hot/warm/cold storage strategies for cost optimization
  • Ensure scalability for increased telemetry volumes

Visualization and Dashboards

  • Create standardized dashboards for common service types
  • Include the following in standard dashboards:
    • Golden signals (latency, traffic, errors, saturation)
    • Service health
    • Business metrics
  • Use consistent layouts and naming
  • Include links between related dashboards
  • Ensure dashboards load quickly and keep their queries efficient

Alerts

Alert Design Principles

  • Alert on symptoms, not causes
  • Alert on conditions that require human intervention
  • Ensure alerts include actionable information
  • Design alerts to reduce alert fatigue
  • Include clear remediation steps

Alert Severity Levels

  • Critical: Immediate response required (24/7)
  • High: Response required within hours
  • Medium: Response required within a day
  • Low: Response can be scheduled

Alert Channels

  • Use appropriate channels for different severity levels
  • Implement escalation paths for unacknowledged alerts
  • Establish on-call rotations and handoff procedures
  • Document alert response procedures

Service Level Objectives (SLOs)

  • Define SLOs for critical user journeys
  • Monitor SLOs continuously
  • Set appropriate time windows for SLO measurement
  • Create alerting based on error budgets
  • Review and adjust SLOs regularly
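
A worked example of an error budget, with illustrative numbers: for a 99.9% availability SLO over a 30-day window, the budget is the 0.1% of requests that are allowed to fail.

```python
slo_target = 0.999            # 99.9% of requests must succeed in the window
window_requests = 10_000_000  # requests observed over 30 days
failed_requests = 4_200

error_budget = (1 - slo_target) * window_requests  # ~10,000 allowed failures
budget_consumed = failed_requests / error_budget   # ~0.42 -> 42% of budget used

# Alert on burn rate rather than raw error rate: page if the budget is being
# consumed faster than the window allows (the threshold here is illustrative).
if budget_consumed > 0.5:
    print("error budget burn is ahead of schedule")
```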

Tools and Platforms

Recommended Tooling

  • Open Source Stack:

    • Logs: ELK Stack, Loki
    • Metrics: Prometheus, Grafana
    • Tracing: Jaeger, Zipkin
    • Cross-signal instrumentation and collection: OpenTelemetry
  • Commercial Solutions:

    • Datadog
    • New Relic
    • Dynatrace
    • Splunk

Instrumentation Libraries

  • OpenTelemetry (preferred for new projects)
  • Language-specific alternatives if needed

Adoption Strategy

Maturity Model

  1. Basic: Essential system metrics and application logs
  2. Intermediate: Structured logging, application metrics, basic dashboards
  3. Advanced: Distributed tracing, custom business metrics, SLO monitoring
  4. Expert: Full-stack observability, automated anomaly detection, predictive monitoring

Implementation Checklist

  • Instrument application code
  • Configure telemetry collection
  • Set up visualization and dashboards
  • Configure alerts
  • Define and monitor SLOs
  • Document operational procedures
  • Train team on observability tools and practices

Special Considerations

Serverless and Ephemeral Environments

  • Use specialized tools for serverless monitoring
  • Focus on per-invocation metrics
  • Implement custom context propagation
  • Consider cold start monitoring
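
A rough sketch of per-invocation metrics and cold start detection in an AWS Lambda-style handler; the structured log line stands in for whatever metric emission mechanism is in use.

```python
import json
import time

_COLD_START = True  # module scope survives across warm invocations of the same instance

def handler(event, context):
    global _COLD_START
    start = time.perf_counter()
    cold = _COLD_START
    _COLD_START = False

    result = {"status": "ok"}  # actual business logic goes here

    print(json.dumps({  # per-invocation metrics emitted as a structured log line
        "cold_start": cold,
        "duration_ms": (time.perf_counter() - start) * 1000,
        "request_id": getattr(context, "aws_request_id", None),
    }))
    return result
```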

Container and Kubernetes Environments

  • Monitor both container and host-level metrics
  • Implement Kubernetes-aware monitoring
  • Use service mesh for enhanced observability
  • Consider control plane monitoring

Microservices Architecture

  • Implement consistent instrumentation across services
  • Use service maps for dependency visualization
  • Focus on cross-service request flows
  • Consider implementing API gateways for centralized telemetry

Resources