Observability is the ability to understand a system's internal state based on its external outputs. Telemetry is the process of collecting data about the performance and behavior of applications and systems. Together, they provide critical insights into system health, performance, and user experience.
Logs are structured or unstructured records of events that occur in a system.
- ERROR: Errors that prevent normal operation
- WARN: Potentially harmful situations that don't prevent operation
- INFO: Informational messages highlighting normal application flow
- DEBUG: Detailed information for debugging purposes
- TRACE: Very detailed information, typically only used for intensive debugging
- Use structured logging (JSON format) for machine readability (see the sketch after this list)
- Include consistent metadata with each log event:
- Timestamp (ISO 8601 format)
- Service/component name
- Instance ID
- Correlation ID for tracing requests across services
- User ID (where appropriate and privacy-compliant)
- Sensitive information must never be logged (PII, passwords, tokens)
- Use log rotation to manage log file sizes
- Implement rate limiting for high-volume log events
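
For illustration, the structured-logging guidance above could look like the following minimal sketch, using only the Python standard library; the service name, instance ID, and field names are hypothetical choices that mirror the metadata list above, not a required schema.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object with consistent metadata."""

    def __init__(self, service: str, instance_id: str):
        super().__init__()
        self.service = service
        self.instance_id = instance_id

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),  # ISO 8601
            "level": record.levelname,
            "service": self.service,
            "instance_id": self.instance_id,
            "message": record.getMessage(),
            # Correlation ID is read from the logging call's extra= kwargs if present.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(event)

logger = logging.getLogger("checkout")  # hypothetical service name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="checkout", instance_id="i-1234"))  # illustrative values
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One structured event per line; no PII, passwords, or tokens in the payload.
logger.info("order accepted", extra={"correlation_id": str(uuid.uuid4())})
```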
Metrics are numerical measurements collected at regular intervals.
- Counters: Cumulative measurements that only increase (e.g., request count)
- Gauges: Point-in-time measurements that can increase or decrease (e.g., memory usage)
- Histograms: Distribution of measured values (e.g., request duration percentiles)
- Summaries: Similar to histograms, but quantiles are pre-computed in the client rather than derived at query time
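
A brief sketch of the four metric types, assuming the Prometheus Python client (`prometheus_client`); the metric names are illustrative and follow the naming convention described later in this section.

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: cumulative, only goes up.
REQUESTS = Counter("app_requests_total", "Total requests handled")

# Gauge: point-in-time value that can go up or down.
ACTIVE_CONNECTIONS = Gauge("app_active_connections", "Currently open connections")

# Histogram: buckets a distribution; percentiles are derived at query time.
REQUEST_DURATION = Histogram("app_request_duration_seconds", "Request latency in seconds")

# Summary: like a histogram, but quantiles are computed in the client process.
PAYLOAD_SIZE = Summary("app_request_payload_bytes", "Request payload size in bytes")

def handle_request(payload: bytes) -> None:
    REQUESTS.inc()
    ACTIVE_CONNECTIONS.inc()
    PAYLOAD_SIZE.observe(len(payload))
    with REQUEST_DURATION.time():   # records elapsed seconds into the histogram
        ...                         # actual request handling goes here
    ACTIVE_CONNECTIONS.dec()

if __name__ == "__main__":
    start_http_server(8000)         # exposes /metrics for scraping
```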
System Metrics:
- CPU usage
- Memory usage
- Disk I/O
- Network I/O
Application Metrics:
- Request count
- Error rate
- Response time (median, 95th, 99th percentiles)
- Queue depth
- Active connections
- Business-specific metrics (e.g., orders processed)
Dependency Metrics:
- External service call latency
- External service error rate
- Database query time
- Cache hit/miss rate
- Use lowercase with underscores
- Include the measured entity and unit
- Follow a consistent pattern:
  `<namespace>_<entity>_<action>_<unit>`
- Examples:
app_requests_total
app_request_duration_seconds
system_memory_usage_bytes
Traces track the flow of requests through distributed systems.
- Use W3C Trace Context standard for trace propagation
- Every service-to-service call should propagate trace information
- Instrument all API endpoints, service calls, database queries, and significant internal operations
- Include relevant metadata with spans:
- Service name
- Operation name
- Start/end time
- Parent span ID
- Tags for filtering
- Log events for important points in the span
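
The span metadata above maps directly onto the OpenTelemetry tracing API. A minimal sketch, assuming the `opentelemetry-sdk` package and a console exporter for illustration only; the service, operation, and attribute names are hypothetical.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Service name is attached as a resource so every span carries it.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def charge_card(order_id: str) -> None:
    # Operation name, start/end time, and parent span ID are handled by the SDK;
    # tags (attributes) and span events are added explicitly.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)   # tag for filtering
        span.add_event("payment_authorized")       # log event within the span
```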
- 100% sampling of error traces
- Adaptive sampling based on system load
- Representative sampling to reduce overhead in high-volume systems
- Custom sampling rules for critical paths
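
With the OpenTelemetry SDK, representative head-based sampling can be configured as in the sketch below (the 10% ratio is an arbitrary example). Note that keeping 100% of error traces generally requires tail-based sampling in a collector, since a head sampler decides before it can see whether the request fails.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~10% of new traces at the root, and always follow the parent's
# decision for child spans so traces are never partially recorded.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```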
- Use standard instrumentation libraries for your language/framework
- Prefer automatic instrumentation where possible (see the sketch after this list)
- Add custom instrumentation for business-critical paths
- Ensure instrumentation has minimal performance impact
- Test instrumentation in non-production environments
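
As an example of automatic instrumentation, the sketch below assumes a Flask application with the `opentelemetry-instrumentation-flask` and `opentelemetry-instrumentation-requests` packages; inbound routes and outbound HTTP calls get spans without changes to handler code.

```python
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Automatic instrumentation: every inbound route and outbound `requests` call
# produces a span without touching handler code.
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route("/health")
def health():
    return {"status": "ok"}
```

The `opentelemetry-instrument` wrapper command offers a similar zero-code option applied at process startup.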
- Use agents to collect telemetry data
- Implement buffering and batching to reduce network overhead (see the sketch after this list)
- Ensure fault tolerance in telemetry collection
- Consider privacy implications and data retention policies
- Filter sensitive data before collection
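
A sketch of buffered, batched export to a local agent or collector over OTLP, assuming the `opentelemetry-exporter-otlp` package; the endpoint and tuning values are illustrative, not recommended settings.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()

# Spans are buffered in memory and flushed in batches to a local collector,
# which reduces per-span network overhead and tolerates brief outages.
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True),
        max_queue_size=2048,          # bound memory; drop telemetry rather than block the app
        max_export_batch_size=512,
        schedule_delay_millis=5000,
    )
)
trace.set_tracer_provider(provider)
```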
- Select appropriate storage solutions for different telemetry types:
- Logs: Elasticsearch, Loki, Splunk
- Metrics: Prometheus, InfluxDB, Graphite
- Traces: Jaeger, Zipkin, X-Ray
- Implement data retention policies
- Consider hot/warm/cold storage strategies for cost optimization
- Ensure scalability for increased telemetry volumes
- Create standardized dashboards for common service types
- Include the following in standard dashboards:
- Golden signals (latency, traffic, errors, saturation)
- Service health
- Business metrics
- Use consistent layouts and naming
- Include links between related dashboards
- Ensure dashboards load quickly and avoid expensive queries
- Alert on symptoms, not causes
- Alert on conditions that require human intervention
- Ensure alerts include actionable information
- Design alerts to reduce alert fatigue
- Include clear remediation steps
- Critical: Immediate response required (24/7)
- High: Response required within hours
- Medium: Response required within a day
- Low: Response can be scheduled
- Use appropriate channels for different severity levels
- Implement escalation paths for unacknowledged alerts
- Establish on-call rotations and handoff procedures
- Document alert response procedures
- Define SLOs for critical user journeys
- Monitor SLOs continuously
- Set appropriate time windows for SLO measurement
- Create alerting based on error budgets (see the worked example after this list)
- Review and adjust SLOs regularly
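
As a worked example of error-budget math (the SLO target, window, and thresholds are hypothetical): a 99.9% availability SLO over a 30-day window allows roughly 43 minutes of downtime, and alerts can fire when the budget is burning faster than the sustainable rate.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime (in minutes) for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60

def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast the budget is being consumed; >1.0 means the SLO is at risk."""
    return observed_error_ratio / (1.0 - slo)

print(error_budget_minutes(0.999, 30))   # ~43.2 minutes per 30-day window
print(burn_rate(0.004, 0.999))           # 4.0x burn rate: alert-worthy
```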
Open Source Stack:
- Logs: ELK Stack, Loki
- Metrics: Prometheus, Grafana
- Tracing: Jaeger, Zipkin
- All signals (instrumentation and collection): OpenTelemetry
Commercial Solutions:
- Datadog
- New Relic
- Dynatrace
- Splunk
Instrumentation Libraries:
- OpenTelemetry (preferred for new projects)
- Language-specific alternatives if needed
- Basic: Essential system metrics and application logs
- Intermediate: Structured logging, application metrics, basic dashboards
- Advanced: Distributed tracing, custom business metrics, SLO monitoring
- Expert: Full-stack observability, automated anomaly detection, predictive monitoring
- Instrument application code
- Configure telemetry collection
- Set up visualization and dashboards
- Configure alerts
- Define and monitor SLOs
- Document operational procedures
- Train team on observability tools and practices
- Use specialized tools for serverless monitoring
- Focus on per-invocation metrics
- Implement custom context propagation (see the sketch after this list)
- Consider cold start monitoring
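
Custom context propagation can reuse the W3C Trace Context propagator. The sketch below passes the context through an event payload field named `trace_context`, which is an assumed convention for how the functions exchange messages; it also assumes a TracerProvider is already configured as shown earlier.

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def producer(payload: dict) -> dict:
    """Attach trace context to an outgoing event (e.g., a queue message)."""
    carrier: dict = {}
    inject(carrier)                      # writes the W3C 'traceparent' entry into the dict
    payload["trace_context"] = carrier
    return payload

def handler(event: dict) -> None:
    """Serverless entry point: continue the caller's trace instead of starting a new one."""
    ctx = extract(event.get("trace_context", {}))
    with tracer.start_as_current_span("handle_event", context=ctx):
        ...                              # function body
```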
- Monitor both container and host-level metrics
- Implement Kubernetes-aware monitoring
- Use service mesh for enhanced observability
- Consider control plane monitoring
- Implement consistent instrumentation across services
- Use service maps for dependency visualization
- Focus on cross-service request flows
- Consider implementing API gateways for centralized telemetry