@@ -53,6 +53,34 @@ Failure considerations:
53 53 - If the network is down, client-side persistence must retain events until connectivity is restored.
54 54 - If the local disk fills, trigger alerts; do not silently discard audit entries.
55 55
56+ ### Collector Agent with Local Persistence
57+
58+ When a node-local OpenTelemetry Collector agent (e.g. run as a sidecar) buffers audit logs, it receives log traffic from
59+ applications on the same node and persists it to the node filesystem before forwarding it to the central collector or sink. This topology moves
60+ the responsibility for short-term durability from each application to the local agent and is operationally convenient in many environments.
61+
62+ Key benefits:
63+
64+ - [Reduced network dependency](https://kubernetes.io/docs/reference/networking/virtual-ips/#internal-traffic-policy): App-to-agent traffic
65+ stays on the same node, lowering cross-node network failure exposure.
66+ - Simpler application footprint: Applications avoid implementing per-app persistent queues and complex stateful deployments.
67+
68+ Risks and operational tradeoffs:
69+
70+ - Node failure exposure: If a node is lost, any locally buffered data becomes unavailable unless additional replication or backup is in
71+ place.
72+ - Vertical scaling limits: Large buffering needs require nodes with large disks; storage does not scale transparently across nodes.
73+ - Upgrade/maintenance impact: During agent downtime (e.g. rolling upgrades), applications may drop events unless their retry/backoff
74+ settings are carefully tuned.
75+
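The retry/backoff tuning mentioned above can be checked with a small calculation. The following is a minimal sketch, not OpenTelemetry SDK code: `backoff_delays`, `covers_downtime`, and all parameter names and default values are hypothetical and chosen only for illustration. It omits jitter so the schedule stays deterministic; real clients usually add it.

```python
def backoff_delays(initial=1.0, multiplier=2.0, max_delay=30.0, max_elapsed=300.0):
    """Return exponential backoff delays (seconds) whose sum stays within max_elapsed.

    All defaults are illustrative assumptions, not recommended values.
    """
    delays = []
    delay = initial
    elapsed = 0.0
    # Keep scheduling retries while the next wait still fits in the retry window.
    while elapsed + delay <= max_elapsed:
        delays.append(delay)
        elapsed += delay
        # Grow the delay geometrically, capped at max_delay.
        delay = min(delay * multiplier, max_delay)
    return delays


def covers_downtime(delays, agent_downtime_seconds):
    """True if the total retry window is at least as long as the expected agent downtime."""
    return sum(delays) >= agent_downtime_seconds


if __name__ == "__main__":
    schedule = backoff_delays()
    print(schedule)
    print("covers a 120 s rolling upgrade:", covers_downtime(schedule, 120))
```

If the retry window is shorter than the longest expected agent restart, the application needs either a longer window or its own buffer to avoid dropping audit events.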
76+ Operational recommendations:
77+
78+ - Monitor local queue depth and age; alert when the age of the oldest buffered event exceeds your SLA.
79+ - Use node‑local persistent volumes with appropriate capacity planning.
80+ - Ensure safe decommissioning: drain agent buffers, or wait for them to drain, before removing nodes (e.g. Gardener node hibernation).
81+ - Consider the agent topology for environments that can tolerate a limited node‑level blast radius; prefer distributed storage if node loss is
82+ unacceptable.
83+
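The queue monitoring and capacity planning recommendations above can be sketched as two small checks. This is an illustrative example under stated assumptions: `QueueSnapshot`, `alert_reasons`, `required_disk_bytes`, the thresholds, and the `headroom` factor are hypothetical names and values, and how the queue depth and oldest-event timestamp are obtained from the agent is left open.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    depth: int               # number of events currently buffered on the node
    oldest_event_ts: float   # Unix timestamp of the oldest buffered event


def alert_reasons(snapshot, now, max_age_seconds, max_depth):
    """Return a list of alert reasons; an empty list means the buffer is healthy."""
    reasons = []
    # Age check: only meaningful when something is actually buffered.
    if snapshot.depth > 0 and now - snapshot.oldest_event_ts > max_age_seconds:
        reasons.append("oldest event exceeds SLA")
    if snapshot.depth > max_depth:
        reasons.append("queue depth above threshold")
    return reasons


def required_disk_bytes(events_per_second, avg_event_bytes, outage_seconds, headroom=1.5):
    """Estimate the disk needed to buffer events for one outage, with a safety factor."""
    return int(events_per_second * avg_event_bytes * outage_seconds * headroom)


if __name__ == "__main__":
    snap = QueueSnapshot(depth=10, oldest_event_ts=1000.0)
    print(alert_reasons(snap, now=1400.0, max_age_seconds=300, max_depth=100))
    # 200 events/s of 2 KiB each, buffered through a one-hour outage.
    print(required_disk_bytes(200, 2048, 3600))
```

The estimate is per node, which reflects the vertical scaling limit noted earlier: the buffer for each node has to fit on that node's own volume.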
56 84 ## 2. Collector Tier Guidelines
57 85
58 86 Run a dedicated OTel Collector instance (or set of instances) for audit logs – do not share with high-volume telemetry.
@@ -114,7 +142,8 @@ Operational Notes:
114 142
115 143 ## 3. Final Storage Sink Tier Guidelines
116 144
117- Examples: SIEM (Security Information and Event Management), Data Lake (e.g. S3/GCS/HDFS), OpenSearch, Elasticsearch.
145+ Examples: SIEM (Security Information and Event Management), Data Lake (e.g. S3/GCS/HDFS), OpenSearch, Elasticsearch. These normally run as
146+ StatefulSets or managed services.
118 147
119 148 Requirements:
120 149