Observability is the ability to understand a system's internal state based on its external outputs. Telemetry is the process of collecting data about the performance and behavior of applications and systems. Together, they provide critical insights into system health, performance, and user experience.
Logs are structured or unstructured records of events that occur in a system.
- ERROR: Errors that prevent normal operation
- WARN: Potentially harmful situations that don't prevent operation
- INFO: Informational messages highlighting normal application flow
- DEBUG: Detailed information for debugging purposes
- TRACE: Very detailed information, typically only used for intensive debugging
- Use structured logging (JSON format) for machine readability (see the sketch after this list)
- Include consistent metadata with each log event:
- Timestamp (ISO 8601 format)
- Service/component name
- Instance ID
- Correlation ID for tracing requests across services
- User ID (where appropriate and privacy-compliant)
- Sensitive information must never be logged (PII, passwords, tokens)
- Use log rotation to manage log file sizes
- Implement rate limiting for high-volume log events
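
For illustration, the structured-logging guidance above could look like the following minimal sketch, using only the Python standard library; the service name, instance ID, and field names are hypothetical choices that mirror the metadata list above, not a required schema.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object with consistent metadata."""

    def __init__(self, service: str, instance_id: str):
        super().__init__()
        self.service = service
        self.instance_id = instance_id

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),  # ISO 8601
            "level": record.levelname,
            "service": self.service,
            "instance_id": self.instance_id,
            "message": record.getMessage(),
            # Correlation ID is read from the logging call's extra= kwargs if present.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(event)

logger = logging.getLogger("checkout")  # hypothetical service name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="checkout", instance_id="i-1234"))  # illustrative values
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One structured event per line; no PII, passwords, or tokens in the payload.
logger.info("order accepted", extra={"correlation_id": str(uuid.uuid4())})
```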
Metrics are numerical measurements collected at regular intervals.
- Counters: Cumulative measurements that only increase (e.g., request count)
- Gauges: Point-in-time measurements that can increase or decrease (e.g., memory usage)
- Histograms: Distribution of measured values (e.g., request duration percentiles)
- Summaries: Similar to histograms, but quantiles are pre-computed in the client rather than derived at query time
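
A brief sketch of the four metric types, assuming the Prometheus Python client (`prometheus_client`); the metric names are illustrative and follow the naming convention described later in this section.

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: cumulative, only goes up.
REQUESTS = Counter("app_requests_total", "Total requests handled")

# Gauge: point-in-time value that can go up or down.
ACTIVE_CONNECTIONS = Gauge("app_active_connections", "Currently open connections")

# Histogram: buckets a distribution; percentiles are derived at query time.
REQUEST_DURATION = Histogram("app_request_duration_seconds", "Request latency in seconds")

# Summary: like a histogram, but quantiles are computed in the client process.
PAYLOAD_SIZE = Summary("app_request_payload_bytes", "Request payload size in bytes")

def handle_request(payload: bytes) -> None:
    REQUESTS.inc()
    ACTIVE_CONNECTIONS.inc()
    PAYLOAD_SIZE.observe(len(payload))
    with REQUEST_DURATION.time():   # records elapsed seconds into the histogram
        ...                         # actual request handling goes here
    ACTIVE_CONNECTIONS.dec()

if __name__ == "__main__":
    start_http_server(8000)         # exposes /metrics for scraping
```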
System Metrics:
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
Application Metrics:
- Request count
- Error rate
- Response time (median, 95th, 99th percentiles)
- Queue depth
- Active connections
- Business-specific metrics (e.g., orders processed)
Dependency Metrics:
- External service call latency
- External service error rate
- Database query time
- Cache hit/miss rate
- Use lowercase with underscores
- Include the measured entity and unit
- Follow a consistent pattern:
  `<namespace>_<entity>_<action>_<unit>`
- Examples:
app_requests_total
app_request_duration_seconds
system_memory_usage_bytes
Traces track the flow of requests through distributed systems.
- Use W3C Trace Context standard for trace propagation
- Every service-to-service call should propagate trace information
- Instrument all API endpoints, service calls, database queries, and significant internal operations
- Include relevant metadata with spans:
- Service name
- Operation name
- Start/end time
- Parent span ID
- Tags for filtering
- Log events for important points in the span
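
The span metadata above maps directly onto the OpenTelemetry tracing API. A minimal sketch, assuming the `opentelemetry-sdk` package and a console exporter for illustration only; the service, operation, and attribute names are hypothetical.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Service name is attached as a resource so every span carries it.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def charge_card(order_id: str) -> None:
    # Operation name, start/end time, and parent span ID are handled by the SDK;
    # tags (attributes) and span events are added explicitly.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)   # tag for filtering
        span.add_event("payment_authorized")       # log event within the span
```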
- 100% sampling of error traces
- Adaptive sampling based on system load
- Representative sampling to reduce overhead in high-volume systems
- Custom sampling rules for critical paths
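
With the OpenTelemetry SDK, representative head-based sampling can be configured as in the sketch below (the 10% ratio is an arbitrary example). Note that keeping 100% of error traces generally requires tail-based sampling in a collector, since a head sampler decides before it can see whether the request fails.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~10% of new traces at the root, and always follow the parent's
# decision for child spans so traces are never partially recorded.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```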
- Use standard instrumentation libraries for your language/framework
- Prefer automatic instrumentation where possible (see the sketch after this list)
- Add custom instrumentation for business-critical paths
- Ensure instrumentation has minimal performance impact
- Test instrumentation in non-production environments
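
As an example of automatic instrumentation, the sketch below assumes a Flask application with the `opentelemetry-instrumentation-flask` and `opentelemetry-instrumentation-requests` packages; inbound routes and outbound HTTP calls get spans without changes to handler code.

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Automatic instrumentation: every inbound route and outbound `requests` call
# produces a span without touching handler code.
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route("/health")
def health():
    return {"status": "ok"}
```

The `opentelemetry-instrument` wrapper command offers a similar zero-code option applied at process startup.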
- Use agents to collect telemetry data
- Implement buffering and batching to reduce network overhead (see the sketch after this list)
- Ensure fault tolerance in telemetry collection
- Consider privacy implications and data retention policies
- Filter sensitive data before collection
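
A sketch of buffered, batched export to a local agent or collector over OTLP, assuming the `opentelemetry-exporter-otlp` package; the endpoint and tuning values are illustrative, not recommended settings.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()

# Spans are buffered in memory and flushed in batches to a local collector,
# which reduces per-span network overhead and tolerates brief outages.
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True),
        max_queue_size=2048,          # bound memory; drop telemetry rather than block the app
        max_export_batch_size=512,
        schedule_delay_millis=5000,
    )
)
trace.set_tracer_provider(provider)
```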
- Select appropriate storage solutions for different telemetry types:
- Logs: Elasticsearch, Loki, Splunk
- Metrics: Prometheus, InfluxDB, Graphite
- Traces: Jaeger, Zipkin, X-Ray
- Implement data retention policies
- Consider hot/warm/cold storage strategies for cost optimization
- Ensure scalability for increased telemetry volumes
- Create standardized dashboards for common service types
- Include the following in standard dashboards:
- Golden signals (latency, traffic, errors, saturation)
- Service health
- Business metrics
- Use consistent layouts and naming
- Include links between related dashboards
- Ensure dashboards load quickly and avoid expensive queries
- Alert on symptoms, not causes
- Alert on conditions that require human intervention
- Ensure alerts include actionable information
- Design alerts to reduce alert fatigue
- Include clear remediation steps
- Critical: Immediate response required (24/7)
- High: Response required within hours
- Medium: Response required within a day
- Low: Response can be scheduled
- Use appropriate channels for different severity levels
- Implement escalation paths for unacknowledged alerts
- Establish on-call rotations and handoff procedures
- Document alert response procedures
- Define SLOs for critical user journeys
- Monitor SLOs continuously
- Set appropriate time windows for SLO measurement
- Create alerting based on error budgets (see the worked example after this list)
- Review and adjust SLOs regularly
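
As a worked example of error-budget math (the SLO target, window, and thresholds are hypothetical): a 99.9% availability SLO over a 30-day window allows roughly 43 minutes of downtime, and alerts can fire when the budget is burning faster than the sustainable rate.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime (in minutes) for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60

def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast the budget is being consumed; >1.0 means the SLO is at risk."""
    return observed_error_ratio / (1.0 - slo)

print(error_budget_minutes(0.999, 30))   # ~43.2 minutes per 30-day window
print(burn_rate(0.004, 0.999))           # 4.0x burn rate: alert-worthy
```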
Open Source Stack:
- Logs: ELK Stack, Loki
- Metrics: Prometheus, Grafana
- Tracing: Jaeger, Zipkin
- All signals (instrumentation and collection): OpenTelemetry
Commercial Solutions:
- Datadog
- New Relic
- Dynatrace
- Splunk
Instrumentation Libraries:
- OpenTelemetry (preferred for new projects)
- Language-specific alternatives if needed
- Basic: Essential system metrics and application logs
- Intermediate: Structured logging, application metrics, basic dashboards
- Advanced: Distributed tracing, custom business metrics, SLO monitoring
- Expert: Full-stack observability, automated anomaly detection, predictive monitoring
- Instrument application code
- Configure telemetry collection
- Set up visualization and dashboards
- Configure alerts
- Define and monitor SLOs
- Document operational procedures
- Train team on observability tools and practices
- Use specialized tools for serverless monitoring
- Focus on per-invocation metrics
- Implement custom context propagation (see the sketch after this list)
- Consider cold start monitoring
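
Custom context propagation can reuse the W3C Trace Context propagator. The sketch below passes the context through an event payload field named `trace_context`, which is an assumed convention for how the functions exchange messages; it also assumes a TracerProvider is already configured as shown earlier.

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def producer(payload: dict) -> dict:
    """Attach trace context to an outgoing event (e.g., a queue message)."""
    carrier: dict = {}
    inject(carrier)                      # writes the W3C 'traceparent' entry into the dict
    payload["trace_context"] = carrier
    return payload

def handler(event: dict) -> None:
    """Serverless entry point: continue the caller's trace instead of starting a new one."""
    ctx = extract(event.get("trace_context", {}))
    with tracer.start_as_current_span("handle_event", context=ctx):
        ...                              # function body
```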
- Monitor both container and host-level metrics
- Implement Kubernetes-aware monitoring
- Use service mesh for enhanced observability
- Consider control plane monitoring
- Implement consistent instrumentation across services
- Use service maps for dependency visualization
- Focus on cross-service request flows
- Consider implementing API gateways for centralized telemetry