This guidance paper explains recommended practices and an operational architecture for building a highly reliable audit logging pipeline using OpenTelemetry (OTel). Its primary goal is to help architects, platform engineers and application developers design audit log delivery that minimizes data loss, supports long retention, and remains operationally manageable.
Purpose and motivation:
- Provide concise, actionable recommendations for each layer of the pipeline (Client SDK → Collector → Final Storage) so teams can make consistent design choices across environments.
- Emphasize durability and predictable retention over low latency: audit logs often have legal and compliance requirements where losing events is unacceptable and duplicates are preferable to missing data.
- Reduce operational complexity by recommending focused components (dedicated collector pipelines, node‑local buffering where appropriate, and clear monitoring/alerting points) rather than mixing audit logs with high‑volume telemetry.
Audience:
- Platform and SRE teams building or operating OTel collection and processing infrastructure.
- Application and SDK developers who implement audit logging clients and integrations.
- Security, compliance, and data governance teams evaluating retention and immutable storage requirements.
In scope: reliable delivery patterns, buffering strategies (client vs. agent), collector configuration recommendations, monitoring and runbooks, and guidance for final sink durability and compliance controls.
Primary goals:
- No audit event loss ("at least once" delivery – duplicates acceptable, loss is not).
- Clear separation of audit logs from regular application telemetry (logs/traces/metrics) to avoid resource contention.
- Predictable retention and compliance (e.g. 10+ years depending on regulation).
- Operable at scale with simple failure recovery.
Non-goals:
- Vendor‑specific implementation details for every storage backend, full legal retention policy text, and high‑performance telemetry optimizations that trade durability for throughput.
- Ultra low latency for audit logs (latency is secondary to durability).
- Mixing audit and high-volume debug/info logs in the same pipeline.
- Client SDK Tier: Produces audit log records and sends them directly to a dedicated Audit OpenTelemetry (OTel) Collector endpoint.
- Collector Tier: Persists and forwards audit logs using durable queues to the Final Storage Sink. Acts only as a buffering and routing layer.
- Final Storage Sink Tier: Long-term storage (e.g. SIEM, Data Lake, search cluster) with backups and legal retention controls.
Rationale: Separating tiers isolates failure domains (client, transport, storage) and allows independent scaling and policy enforcement.
Use a dedicated logger / pipeline for audit logs distinct from your regular application logging/tracing pipeline.
Recommended components:
- LoggerProvider with a Simple (non-batching) LogRecordProcessor.
- A custom AuditLogRecordProcessor ensuring durability (e.g. persistent local queue, no dropping when queue is full).
- OTLP (OpenTelemetry Protocol) LogRecordExporter configured with retries and a dedicated endpoint (`setEndpoint()`) pointing to the audit collector; see the sketch after the key settings below.
Why no batching on the client side for audit logs?
- Batching trades reliability for throughput and can increase loss risk during crashes.
Key settings:
- Enable persistent local buffering (e.g. file-based) where available.
- Use exponential backoff retries with upper bound.
- Prefer gRPC OTLP export for efficiency; HTTP may be used for constrained environments.
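To make this concrete, below is a minimal sketch of such a client setup using the OpenTelemetry Java SDK. The endpoint, timeout, and retry values are illustrative assumptions, and `setRetryPolicy` requires a recent SDK version; a custom AuditLogRecordProcessor with a persistent local queue (as proposed in the referenced Java PR) would replace the SimpleLogRecordProcessor where crash resiliency is required.

```java
import io.opentelemetry.api.logs.Logger;
import io.opentelemetry.api.logs.Severity;
import io.opentelemetry.exporter.otlp.logs.OtlpGrpcLogRecordExporter;
import io.opentelemetry.sdk.common.export.RetryPolicy;
import io.opentelemetry.sdk.logs.SdkLoggerProvider;
import io.opentelemetry.sdk.logs.export.SimpleLogRecordProcessor;

import java.time.Duration;

public final class AuditLogging {

    public static SdkLoggerProvider buildAuditLoggerProvider() {
        // Dedicated exporter pointing at the audit collector, not the shared telemetry endpoint.
        OtlpGrpcLogRecordExporter exporter = OtlpGrpcLogRecordExporter.builder()
                .setEndpoint("http://audit-otel-collector:4317") // illustrative endpoint
                .setTimeout(Duration.ofSeconds(10))
                // Exponential backoff with an upper bound; available in recent SDK versions.
                .setRetryPolicy(RetryPolicy.builder()
                        .setMaxAttempts(5)
                        .setInitialBackoff(Duration.ofSeconds(1))
                        .setMaxBackoff(Duration.ofSeconds(30))
                        .build())
                .build();

        // Simple (non-batching) processor: each audit record is exported immediately.
        // Swap in a custom AuditLogRecordProcessor backed by a persistent local queue
        // where crash resiliency is required.
        return SdkLoggerProvider.builder()
                .addLogRecordProcessor(SimpleLogRecordProcessor.create(exporter))
                .build();
    }

    public static void emitExample(SdkLoggerProvider auditProvider) {
        Logger auditLogger = auditProvider.get("audit");
        auditLogger.logRecordBuilder()
                .setSeverity(Severity.INFO)
                .setBody("user 42 changed role of user 7 to admin")
                .emit();
    }
}
```

The dedicated SdkLoggerProvider is kept separate from the application's global OpenTelemetry instance, so audit traffic never competes with regular telemetry.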
Failure considerations:
- If the network is down, client-side persistence must retain events until restored.
- If local disk fills, trigger alerts; do not silently discard audit entries.
When using a node-local OpenTelemetry Collector agent (e.g. run as sidecar) to buffer audit logs, the agent receives log traffic from applications on the same node and persists it to the node filesystem before forwarding to the central collector or sink. This topology moves the responsibility for short-term durability from each application to a local agent and is operationally convenient in many environments.
Key benefits:
- Reduced network dependency: App-to-agent traffic stays on the same node, lowering cross-node network failure exposure.
- Simpler application footprint: Applications avoid implementing per-app persistent queues and complex stateful deployments.
Risks and operational tradeoffs:
- Node failure exposure: If a node is lost, any locally buffered data becomes unavailable unless additional replication or backup is in place.
- Vertical scaling limits: Large buffering needs require nodes with large disks; storage does not scale transparently across nodes.
- Upgrade/maintenance impact: During agent downtime (e.g. rolling upgrades), applications may experience drops unless retry/backoff is carefully tuned.
Operational recommendations:
- Monitor local queue depth and age; alert when oldest events exceed your SLA.
- Use node‑local persistent volumes with appropriate capacity planning.
- Ensure safe decommissioning: drain agent buffers (or wait for them to empty) before removing or hibernating nodes (e.g. Gardener node hibernation).
- Consider the agent topology for environments that can tolerate limited node‑level blast radius; prefer distributed storage if node loss is unacceptable.
Run a dedicated OTel Collector instance (or set of instances) for audit logs – do not share with high-volume telemetry.
Principles:
- Use a persistent sending queue (exporter helper v2) – never the batch processor (slated to be superseded by exporter-helper batching) on critical audit paths.
- Only add batching if load metrics prove necessary (opt-in, not default).
- Treat the collector storage as transient, not authoritative.
Configuration Example (config.yaml):
```yaml
extensions:
  # See: https://opentelemetry.io/docs/collector/configuration/#extensions
  file_storage:
    directory: /var/lib/otelcol/storage
    create_directory: true
  health_check:
    endpoint: ${env:MY_POD_IP}:13133

receivers:
  # See: https://opentelemetry.io/docs/collector/configuration/#receivers
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors: {} # See: https://opentelemetry.io/docs/collector/configuration/#processors

exporters:
  # See: https://opentelemetry.io/docs/collector/configuration/#exporters
  otlp:
    endpoint: log-sink:4317
    # See: https://github.com/open-telemetry/opentelemetry-collector/issues/8122
    sending_queue:
      enabled: true
      storage: file_storage
    retry_on_failure:
      enabled: true

service:
  # See: https://opentelemetry.io/docs/collector/configuration/#service
  extensions: [file_storage, health_check]
  pipelines:
    logs:
      receivers: [otlp]
      processors: []
      exporters: [otlp]
```

Operational Notes:
- Monitor queue depth; only decommission a node when its persistent queue is empty.
- Health check endpoint must be scraped; failing health triggers remediation.
- Use node-local filesystem (cluster node persistent path) to minimize latency; weigh trade-offs vs. network-attached volumes.
Examples: SIEM (Security Information and Event Management), Data Lake (e.g. S3/GCS/HDFS), OpenSearch, Elasticsearch. These typically run as StatefulSets or managed services.
Requirements:
- Persistent Volumes (PV) with regular backups and tested restore workflows.
- WORM (Write Once Read Many) or immutability features where regulation requires.
- Indexing strategy permitting multi-year retention (tiered storage, cold archive, glacier-like deep storage).
- Access controls & audit trails on read operations.
- Encryption at rest.
Do not rely on the Collector for long-term retention; it is transient.
Desired delivery: At least once.
- If sinks are unavailable, prefer duplicates over loss.
- Duplicate detection (idempotency) can be handled downstream using event IDs.
- Include a stable unique identifier in each audit log to allow for downstream de-duplication.
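A minimal sketch of this, assuming the same Java SDK as above; the `audit.event.id` attribute name and the UUID scheme are illustrative choices, not a standard:

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.logs.Logger;
import io.opentelemetry.api.logs.Severity;

import java.util.UUID;

public final class AuditEventIds {
    // Hypothetical attribute name; pick one and keep it stable across all producers.
    private static final AttributeKey<String> AUDIT_EVENT_ID = AttributeKey.stringKey("audit.event.id");

    public static void emitAuditEvent(Logger auditLogger, String body) {
        auditLogger.logRecordBuilder()
                .setSeverity(Severity.INFO)
                .setBody(body)
                // Stable unique ID: duplicates created by at-least-once delivery
                // can be collapsed downstream using this key.
                .setAttribute(AUDIT_EVENT_ID, UUID.randomUUID().toString())
                .emit();
    }
}
```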
Loss Prevention Layers:
- Client: local persistence (disk queue) – crash resiliency.
- Collector: persistent sending queue – network / downstream outage buffering.
- Final Sink: durable storage (redundancy such as RAID or replication) plus backups.
Monitoring every component of the delivery stack is critical: it reveals impending data-loss risks early and can trigger remediation before any data is actually lost. In a distributed system, delivery can never be 100% guaranteed; monitoring is what gets you close to 100%.
Track and alert on:
- Client queue size and age (oldest event timestamp); see the instrumentation sketch after this list.
- Collector `sending_queue` depth, retry counts, and failed request counts.
- Export latency (p50, p95) vs. SLOs.
- Final sink ingestion lag (difference between event time and indexed time).
- Storage capacity thresholds and projections (time to full).
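For the client queue metrics in the first bullet, a sketch using the OpenTelemetry Java metrics API; `AuditQueue` is a hypothetical interface standing in for whatever local buffer the client actually uses, and the metric names are assumptions:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.Meter;

import java.time.Instant;

public final class AuditQueueMetrics {

    /** Registers observable gauges for queue depth and age of the oldest buffered event. */
    public static void register(OpenTelemetry otel, AuditQueue queue) {
        Meter meter = otel.getMeter("audit-client");

        meter.gaugeBuilder("audit.client.queue.size")
                .ofLongs()
                .setDescription("Number of audit events waiting in the local queue")
                .buildWithCallback(m -> m.record(queue.size()));

        meter.gaugeBuilder("audit.client.queue.oldest_age")
                .setUnit("s")
                .setDescription("Age of the oldest buffered audit event")
                .buildWithCallback(m -> m.record(
                        (Instant.now().toEpochMilli() - queue.oldestEventTimestampMillis()) / 1000.0));
    }

    /** Hypothetical local queue interface; replace with your actual buffer implementation. */
    public interface AuditQueue {
        long size();
        long oldestEventTimestampMillis();
    }
}
```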
Set thresholds for proactive scaling:
- If queue age > defined SLA (e.g. 5 min), investigate network/backpressure.
- If retry rate spikes, check sink health.
- If storage utilization exceeds its threshold, extend capacity and/or investigate network issues or backpressure.
Horizontal scaling points:
- Add more Collector instances behind DNS / load balancer for increased ingestion throughput.
- Partition audit logs by tenant or domain as cardinality grows, fanning out to tenant- or domain-specific Collectors.
The client side remains lightweight because there is no batching; CPU overhead is minimal.
- Encrypt in transit (TLS for OTLP gRPC/HTTP). Mutual TLS for sensitive environments.
- Restrict Collector endpoints with network policy (only allow known client subnets).
- Sign or hash audit events (optional) for tamper detection before storage; see the sketch after this list.
- Maintain immutable backups; test restore quarterly.
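One way to do this, sketched with standard JDK crypto (the suggested attribute name is an assumption): compute a SHA-256 digest of the event body at emit time and ship it alongside the event.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public final class AuditEventHashing {

    /** Returns a hex-encoded SHA-256 digest of the audit event body. */
    public static String sha256Hex(String eventBody) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(eventBody.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash); // attach e.g. as attribute "audit.event.hash"
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required for audit event hashing", e);
        }
    }
}
```

Note that an unkeyed digest mainly detects accidental corruption; for tamper evidence against an active attacker, sign the digest or store it in a separate, write-protected location.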
PII Handling:
- Classify fields; avoid storing unnecessary personal data.
- Apply tokenization or pseudonymization where feasible before export (see the sketch below).
- If necessary, encrypt sensitive fields at the application level.
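A sketch of field-level pseudonymization, assuming producers have access to a shared secret key (key distribution and rotation are out of scope here); a keyed HMAC keeps tokens stable for correlation while hiding the raw identifier:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public final class PiiPseudonymizer {

    private final SecretKeySpec key;

    public PiiPseudonymizer(byte[] secret) {
        this.key = new SecretKeySpec(secret, "HmacSHA256");
    }

    /** Replaces a PII value (e.g. a user ID or e-mail) with a stable, non-reversible token. */
    public String pseudonymize(String piiValue) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            byte[] tag = mac.doFinal(piiValue.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(tag);
        } catch (Exception e) {
            throw new IllegalStateException("HMAC-SHA256 unavailable", e);
        }
    }
}
```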
| Failure Mode | Mitigation |
|---|---|
| Client crash | Retrieve unsent Audit Logs from local disk queue + resend to collector |
| Network outage | Client queue grows; alert on age; Collector persistent queue buffers |
| Collector restart | file_storage + sending_queue preserves state |
| Final sink outage | Collector retry + queue; monitor depth |
| Disk full (client/collector) | Alerts; autoscale storage; pause intake if critical |
| Data corruption | Checksums / hashing; sink replication; backup restore |
Client SDK:
- Dedicated Audit LoggerProvider
- Simple (non-batching) LogRecordProcessor or a custom AuditLogRecordProcessor
- Persistent local queue enabled
- OTLP exporter with retry + endpoint isolation
Collector:
- Dedicated deployment (not shared with high-volume telemetry)
- `sending_queue` enabled with `file_storage`
- Health check monitored
- Queue depth metric alerts configured
Final Sink:
- Retention policy documented
- Backups + restore runbook
- Access control & audit of reads
- Capacity & cost monitoring
Cross-Cutting:
- TLS enabled end-to-end
- Unique event IDs used inside audit log events
- Monitoring dashboards created (queues, latency, errors)
- Runbook for each failure mode
Acronyms / Terms:
- OTel / OpenTelemetry: Open-source observability framework (https://opentelemetry.io/)
- OTLP: OpenTelemetry Protocol (https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/protocol)
- gRPC: High-performance RPC framework (https://grpc.io/)
- SIEM: Security Information and Event Management.
- PV: Persistent Volume (Kubernetes storage abstraction).
- SLO: Service Level Objective.
- WORM: Write Once Read Many storage model.
- TLS: Transport Layer Security.
- PII: Personally Identifiable Information.
Referenced Issues / PRs:
- Export helper with persistent queue (Batch v2 discussion): #8122
- Proposed AuditLogRecordProcessor (Java PR): AuditLogRecordProcessor
- Collector scaling guidance: Scaling the Collector