Guidance Document: Reliable Audit Logging with OpenTelemetry

This guidance paper explains recommended practices and an operational architecture for building a highly reliable audit logging pipeline using OpenTelemetry (OTel). Its primary goal is to help architects, platform engineers and application developers design audit log delivery that minimizes data loss, supports long retention, and remains operationally manageable.

Purpose and motivation:

  • Provide concise, actionable recommendations for each layer of the pipeline (Client SDK → Collector → Final Storage) so teams can make consistent design choices across environments.
  • Emphasize durability and predictable retention over low latency: audit logs often have legal and compliance requirements where losing events is unacceptable and duplicates are preferable to missing data.
  • Reduce operational complexity by recommending focused components (dedicated collector pipelines, node‑local buffering where appropriate, and clear monitoring/alerting points) rather than mixing audit logs with high‑volume telemetry.

Audience:

  • Platform and SRE teams building or operating OTel collection and processing infrastructure.
  • Application and SDK developers who implement audit logging clients and integrations.
  • Security, compliance, and data governance teams evaluating retention and immutable storage requirements.

Scope & Goals

In scope: reliable delivery patterns, buffering strategies (client vs. agent), collector configuration recommendations, monitoring and runbooks, and guidance for final sink durability and compliance controls.

Primary goals:

  • No audit event loss ("at least once" delivery – duplicates acceptable, loss is not).
  • Clear separation of audit logs from regular application telemetry (logs/traces/metrics) to avoid resource contention.
  • Predictable retention and compliance (e.g. 10+ years depending on regulation).
  • Operable at scale with simple failure recovery.

Out of scope: vendor‑specific implementation details for every storage backend, full legal retention policy text, and high‑performance telemetry optimizations that trade durability for throughput.

Non-goals:

  • Ultra low latency for audit logs (latency is secondary to durability).
  • Mixing audit and high-volume debug/info logs in the same pipeline.

Architecture Overview (3 Tiers)

  1. Client SDK Tier: Produces audit log records and sends them directly to a dedicated Audit OpenTelemetry (OTel) Collector endpoint.
  2. Collector Tier: Persists and forwards audit logs using durable queues to the Final Storage Sink. Acts only as a buffering and routing layer.
  3. Final Storage Sink Tier: Long-term storage (e.g. SIEM, Data Lake, search cluster) with backups and legal retention controls.

Rationale: Separating tiers isolates failure domains (client, transport, storage) and allows independent scaling and policy enforcement.

1. Client SDK Tier Guidelines

Use a dedicated logger / pipeline for audit logs distinct from your regular application logging/tracing pipeline.

Recommended components:

Why no batching on the client side for audit logs?

  • Batching trades reliability for throughput: events held in an in-memory batch are lost if the process crashes before the batch is flushed.

Key settings:

  • Enable persistent local buffering (e.g. file-based) where available.
  • Retry with exponential backoff, with an upper bound on the delay.
  • Prefer gRPC OTLP export for efficiency; HTTP may be used for constrained environments.
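
The retry setting above can be illustrated with a minimal sketch. It assumes the upper bound applies to the delay between attempts and that the event remains in the persistent buffer while retries continue. The names `backoff_delay` and `send_with_backoff`, the default values, and the random jitter are illustrative assumptions, not part of any OTel SDK.

```python
import random
import time
from typing import Callable


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Delay ceiling before retry `attempt` (1-based): base * 2^(attempt-1), capped at `cap`."""
    # Clamp the exponent so very long outages cannot overflow the float conversion.
    exponent = min(attempt - 1, 30)
    return min(cap, base * 2 ** exponent)


def send_with_backoff(
    send: Callable[[], bool],
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns True and return the number of attempts made.

    Retries never give up: for audit events the caller keeps the record in its
    persistent buffer until this function returns.
    """
    attempt = 0
    while True:
        attempt += 1
        if send():
            return attempt
        # Jitter (an assumption added here) spreads retries from many clients.
        sleep(random.uniform(0.0, backoff_delay(attempt, base, cap)))
```

The `sleep` parameter is injectable so the loop can be exercised in tests without waiting.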

Failure considerations:

  • If the network is down, client-side persistence must retain events until connectivity is restored.
  • If local disk fills, trigger alerts; do not silently discard audit entries.
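
As a rough illustration of client-side persistence, the sketch below spools events to a local SQLite file and deletes them only after the collector has acknowledged them. `AuditSpool` and its methods are hypothetical names; a real client would more likely use the persistence features of its SDK or a node-local agent where available.

```python
import json
import sqlite3
import uuid


class AuditSpool:
    """Disk-backed queue: an event stays on disk until it is acknowledged."""

    def __init__(self, path: str) -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS spool ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
            "event_id TEXT NOT NULL, body TEXT NOT NULL)"
        )
        self._db.commit()

    def append(self, event: dict) -> str:
        """Persist the event before any send attempt and return its event ID."""
        record = dict(event)
        event_id = record.setdefault("event_id", str(uuid.uuid4()))
        # Write errors (for example a full disk) propagate as exceptions so the
        # caller can alert instead of silently discarding the event.
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT INTO spool (event_id, body) VALUES (?, ?)",
                (event_id, json.dumps(record)),
            )
        return event_id

    def pending(self, limit: int = 100) -> list[tuple[int, dict]]:
        """Return the oldest unacknowledged events as (seq, event) pairs."""
        rows = self._db.execute(
            "SELECT seq, body FROM spool ORDER BY seq LIMIT ?", (limit,)
        ).fetchall()
        return [(seq, json.loads(body)) for seq, body in rows]

    def ack(self, seq: int) -> None:
        """Remove an event once the collector has confirmed receipt."""
        with self._db:
            self._db.execute("DELETE FROM spool WHERE seq = ?", (seq,))
```

After a crash, reopening the same file and calling `pending()` yields the events that were never acknowledged, which can then be resent.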

Collector Agent with Local Persistence

When using a node-local OpenTelemetry Collector agent (e.g. run as a sidecar) to buffer audit logs, the agent receives log traffic from applications on the same node and persists it to the node filesystem before forwarding to the central collector or sink. This topology moves the responsibility for short-term durability from each application to a local agent and is operationally convenient in many environments.

Key benefits:

  • Reduced network dependency: App-to-agent traffic stays on the same node, lowering cross-node network failure exposure.
  • Simpler application footprint: Applications avoid implementing per-app persistent queues and complex stateful deployments.

Risks and operational tradeoffs:

  • Node failure exposure: If a node is lost, any locally buffered data becomes unavailable unless additional replication or backup is in place.
  • Vertical scaling limits: Large buffering needs require nodes with large disks; storage does not scale transparently across nodes.
  • Upgrade/maintenance impact: During agent downtime (e.g. rolling upgrades), applications may experience drops unless retry/backoff is carefully tuned.

Operational recommendations:

  • Monitor local queue depth and age; alert when oldest events exceed your SLA.
  • Use node‑local persistent volumes with appropriate capacity planning.
  • Ensure safe decommissioning: flush agent buffers, or wait for them to drain, before removing nodes (e.g. before Gardener node hibernation).
  • Consider the agent topology for environments that can tolerate limited node‑level blast radius; prefer distributed storage if node loss is unacceptable.

2. Collector Tier Guidelines

Run a dedicated OTel Collector instance (or set of instances) for audit logs – do not share with high-volume telemetry.

Principles:

  • Use a persistent sending queue (exporter helper v2) – never the deprecated batch processor, which buffers only in memory, for critical audit paths.
  • Only add batching if load metrics prove necessary (opt-in, not default).
  • Treat the collector storage as transient, not authoritative.

Configuration Example (config.yaml):

extensions:
  # See: https://opentelemetry.io/docs/collector/configuration/#extensions
  file_storage:
    directory: /var/lib/otelcol/storage
    create_directory: true
  health_check:
    endpoint: ${env:MY_POD_IP}:13133

receivers:
  # See: https://opentelemetry.io/docs/collector/configuration/#receivers
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors: {} # See: https://opentelemetry.io/docs/collector/configuration/#processors

exporters:
  # See: https://opentelemetry.io/docs/collector/configuration/#exporters
  otlp:
    endpoint: log-sink:4317
    # See: https://github.com/open-telemetry/opentelemetry-collector/issues/8122
    sending_queue:
      enabled: true
      storage: file_storage
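    retry_on_failure:
      enabled: true
      # The default max_elapsed_time is 5m, after which data is dropped; 0 retries indefinitely.
      max_elapsed_time: 0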

service:
  # See: https://opentelemetry.io/docs/collector/configuration/#service
  extensions: [file_storage, health_check]
  pipelines:
    logs:
      receivers: [otlp]
      processors: []
      exporters: [otlp]

Operational Notes:

  • Monitor queue depth; only decommission a node when its persistent queue is empty.
  • The health check endpoint must be polled regularly; a failing check should trigger remediation.
  • Use a node-local filesystem (a persistent path on the cluster node) to minimize latency; weigh the trade-offs against network-attached volumes.
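
The "decommission only when the queue is empty" rule can be checked mechanically. The sketch below parses Prometheus-format text, such as the Collector's internal metrics output, and sums a queue-size gauge. The metric name `otelcol_exporter_queue_size` is an assumption that should be verified against the Collector version in use, and both function names are illustrative.

```python
def exporter_queue_size(
    metrics_text: str, metric: str = "otelcol_exporter_queue_size"
) -> float:
    """Sum all samples of `metric` found in Prometheus text-format output."""
    total = 0.0
    found = False
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split()[0]
        if name != metric:
            continue
        # The value follows the label set if there is one, else the metric name.
        if "}" in line:
            rest = line[line.rindex("}") + 1:]
        else:
            rest = line[len(name):]
        fields = rest.split()
        if not fields:
            continue  # malformed sample without a value
        total += float(fields[0])
        found = True
    if not found:
        # A missing metric must not be mistaken for an empty queue.
        raise ValueError(f"metric {metric!r} not found")
    return total


def safe_to_decommission(metrics_text: str) -> bool:
    """True only when every exporter queue reports zero pending items."""
    return exporter_queue_size(metrics_text) == 0
```

Raising on a missing metric is deliberate: a scrape failure or a renamed metric should block decommissioning rather than allow it.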

3. Final Storage Sink Tier Guidelines

Examples: SIEM (Security Information and Event Management), Data Lake (e.g. S3/GCS/HDFS), OpenSearch, Elasticsearch. These normally run as StatefulSets or managed services.

Requirements:

  • Persistent Volumes (PV) with regular backups and tested restore workflows.
  • WORM (Write Once Read Many) or immutability features where regulation requires.
  • Indexing strategy permitting multi-year retention (tiered storage, cold archive, glacier-like deep storage).
  • Access controls & audit trails on read operations.
  • Encryption at rest.

Do not rely on the Collector for long-term retention; it is transient.

Reliability & Delivery Semantics

Desired delivery: At least once.

  • In case of unavailability of sinks, prefer duplicates over loss.
  • Duplicate detection (idempotency) can be handled downstream using event IDs.
  • Include a stable unique identifier in each audit log to allow for downstream de-duplication.
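
A minimal sketch of downstream de-duplication, assuming each event carries an `event_id` field that is assigned once at creation and reused unchanged on every retry. The field name and helper functions are illustrative. A production sink would more likely enforce uniqueness through a unique key or index than through an in-memory set.

```python
import uuid
from typing import Iterable, Iterator


def new_audit_event(actor: str, action: str) -> dict:
    """Create an event with a stable ID; retries must resend this same ID."""
    return {"event_id": str(uuid.uuid4()), "actor": actor, "action": action}


def deduplicate(events: Iterable[dict]) -> Iterator[dict]:
    """Yield each event once, keeping the first occurrence of every event_id."""
    seen: set[str] = set()
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate caused by an at-least-once resend
        seen.add(event["event_id"])
        yield event
```

The `seen` set grows without bound in this sketch; bounding it to a time window is a trade-off that depends on how long duplicates can be delayed.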

Loss Prevention Layers:

  1. Client: local persistence (disk queue) – crash resiliency.
  2. Collector: persistent sending queue – network / downstream outage buffering.
  3. Final Sink: durability through redundant storage (e.g. RAID) and backups.

Monitoring & Alerting

Monitoring every component of the delivery stack is critical: it reveals emerging risks of data loss early and can trigger remediation before any loss occurs. In a distributed system, delivery can never be guaranteed 100%, so monitoring is what gets you as close to 100% as possible.

Track and alert on:

  • Client queue size & age (oldest event timestamp).
  • Collector sending_queue depth, retry counts, and failed request counts.
  • Export latency (p50, p95) vs. SLOs.
  • Final sink ingestion lag (difference between event time and indexed time).
  • Storage capacity thresholds and projections (time to full).
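
Two of the signals above, export latency percentiles and ingestion lag, reduce to simple arithmetic. The sketch below assumes latency samples are collected in milliseconds and that event and index times are timezone-aware datetimes; the function names are illustrative.

```python
from datetime import datetime
from statistics import quantiles


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return the p50 and p95 of the given export latencies."""
    # quantiles(n=100) returns 99 cut points; cuts[i] is the (i + 1)th percentile.
    # On most Python versions it needs at least two samples.
    cuts = quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94]}


def ingestion_lag_seconds(event_time: datetime, indexed_time: datetime) -> float:
    """Lag between when an event occurred and when the sink indexed it."""
    return (indexed_time - event_time).total_seconds()
```

In practice these values usually come from the metrics backend rather than application code; the sketch only makes the definitions explicit.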

Set thresholds for proactive scaling:

  • If queue age > defined SLA (e.g. 5 min), investigate network/backpressure.
  • If retry rate spikes, check sink health.
  • If storage usage exceeds its threshold, extend storage and/or investigate network/backpressure.
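
The threshold rules above can be expressed as a small evaluation function. In this sketch only the 5-minute queue age comes from the text; the retry-rate and disk-usage thresholds, the `PipelineSnapshot` fields, and the linear time-to-full projection are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class PipelineSnapshot:
    oldest_event_age_s: float          # age of the oldest queued event
    retries_per_min: float             # exporter retry rate
    disk_used_bytes: int
    disk_total_bytes: int
    disk_growth_bytes_per_hour: float  # recent growth trend


def hours_until_full(s: PipelineSnapshot) -> float:
    """Linear projection of the time until the disk is full."""
    if s.disk_growth_bytes_per_hour <= 0:
        return float("inf")
    return (s.disk_total_bytes - s.disk_used_bytes) / s.disk_growth_bytes_per_hour


def evaluate(
    s: PipelineSnapshot,
    max_age_s: float = 300.0,           # 5 min, as in the example above
    max_retries_per_min: float = 10.0,  # assumed threshold
    max_disk_ratio: float = 0.8,        # assumed threshold
) -> list[str]:
    """Return one alert message per violated threshold."""
    alerts = []
    if s.oldest_event_age_s > max_age_s:
        alerts.append("queue age above SLA: investigate network/backpressure")
    if s.retries_per_min > max_retries_per_min:
        alerts.append("retry rate spike: check sink health")
    if s.disk_used_bytes / s.disk_total_bytes > max_disk_ratio:
        alerts.append(
            f"storage above threshold: about {hours_until_full(s):.1f} h until full"
        )
    return alerts
```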

Scaling Strategy

Horizontal scaling points:

  • Add more Collector instances behind DNS / load balancer for increased ingestion throughput.
  • Partition audit logs by tenant or domain if cardinality grows, and route each partition to tenant- or domain-specific Collectors.
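
One possible way to partition by tenant is a stable hash of the tenant name, sketched below. The function name and the collector addresses in the usage note are hypothetical.

```python
import hashlib


def collector_for(tenant: str, collectors: list[str]) -> str:
    """Map a tenant to one collector endpoint, stably across processes."""
    if not collectors:
        raise ValueError("at least one collector endpoint is required")
    # Use hashlib rather than the built-in hash(), which is salted per process
    # for strings and would route the same tenant differently after a restart.
    digest = hashlib.sha256(tenant.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]
```

For example, `collector_for("tenant-a", ["audit-collector-0:4317", "audit-collector-1:4317"])` always returns the same endpoint for `tenant-a`. Modulo mapping reassigns many tenants when the list changes, so an explicit tenant-to-collector table or consistent hashing may be preferable when collectors are added often.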

The client side remains lightweight because it does no batching – CPU overhead is minimal.

Security & Compliance

  • Encrypt in transit (TLS for OTLP gRPC/HTTP). Mutual TLS for sensitive environments.
  • Restrict Collector endpoints with network policy (only allow known client subnets).
  • Sign or hash audit events (optional) for tamper detection before storage.
  • Maintain immutable backups; test restore quarterly.
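
The optional hashing step can be sketched as a hash chain, in which each record commits to the hash of its predecessor. This is one possible technique, not a method the guidance prescribes, and the record layout and function names are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder predecessor hash for the first record


def chain_hash(prev_hash: str, event: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the event."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def seal(events: list[dict]) -> list[dict]:
    """Wrap each event with its predecessor's hash and its own hash."""
    sealed = []
    prev = GENESIS
    for event in events:
        current = chain_hash(prev, event)
        sealed.append({"event": event, "prev_hash": prev, "hash": current})
        prev = current
    return sealed


def verify(sealed: list[dict]) -> bool:
    """Recompute the chain; any edited, removed, or reordered record breaks it."""
    prev = GENESIS
    for record in sealed:
        if record["prev_hash"] != prev:
            return False
        if record["hash"] != chain_hash(prev, record["event"]):
            return False
        prev = record["hash"]
    return True
```

An unkeyed chain only detects tampering if the latest hash is kept somewhere an attacker cannot rewrite. Keyed HMACs or digital signatures give stronger guarantees.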

PII Handling:

  • Classify fields; avoid storing unnecessary personal data.
  • Apply tokenization or pseudonymization where feasible before export.
  • If necessary, encrypt sensitive fields at the application level.
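
Pseudonymization before export can be sketched with a keyed hash, so that the same input always maps to the same token while the original value is not stored. The field names and function names below are hypothetical, and key management is out of scope for this sketch.

```python
import hashlib
import hmac

# Hypothetical set of fields classified as personal data.
DEFAULT_PII_FIELDS = frozenset({"user_email", "ip_address"})


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a value with a keyed HMAC-SHA256 token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def scrub(event: dict, key: bytes, pii_fields: frozenset = DEFAULT_PII_FIELDS) -> dict:
    """Return a copy of the event with the classified fields pseudonymized."""
    return {
        name: pseudonymize(str(value), key) if name in pii_fields else value
        for name, value in event.items()
    }
```

Because the token is deterministic for a given key, events from the same user can still be correlated downstream. Rotating the key breaks that linkage.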

Failure Modes & Mitigations

| Failure Mode | Mitigation |
|---|---|
| Client crash | Retrieve unsent audit logs from the local disk queue and resend them to the Collector |
| Network outage | Client queue grows; alert on age; Collector persistent queue buffers |
| Collector restart | file_storage + sending_queue preserve state |
| Final sink outage | Collector retry + queue; monitor depth |
| Disk full (client/collector) | Alerts; autoscale storage; pause intake if critical |
| Data corruption | Checksums / hashing; sink replication; backup restore |

Implementation Checklist

Client SDK:

Collector:

Final Sink:

  • Retention policy documented
  • Backups + restore runbook
  • Access control & audit of reads
  • Capacity & cost monitoring

Cross-Cutting:

  • TLS enabled end-to-end
  • Unique event IDs used inside audit log events
  • Monitoring dashboards created (queues, latency, errors)
  • Runbook for each failure mode

Glossary & References

Acronyms / Terms:

Referenced Issues / PRs: