Commit ce25d50

Collector Agent with Local Persistence
Signed-off-by: Hilmar Falkenberg <hilmar.falkenberg@sap.com>
1 parent 05fbc6a commit ce25d50

1 file changed: +30 -1 lines changed

docs/ideal-setup.md

Lines changed: 30 additions & 1 deletion
@@ -53,6 +53,34 @@ Failure considerations:
 - If the network is down, client-side persistence must retain events until restored.
 - If local disk fills, trigger alerts; do not silently discard audit entries.
 
+### Collector Agent with Local Persistence
+
+When using a node-local OpenTelemetry Collector agent (e.g. run as sidecar) to buffer audit logs, the agent receives log traffic from
+applications on the same node and persists it to the node filesystem before forwarding to the central collector or sink. This topology moves
+the responsibility for short-term durability from each application to a local agent and is operationally convenient in many environments.
+
+Key benefits:
+
+- [Reduced network dependency](https://kubernetes.io/docs/reference/networking/virtual-ips/#internal-traffic-policy): App-to-agent traffic
+stays on the same node, lowering cross-node network failure exposure.
+- Simpler application footprint: Applications avoid implementing per-app persistent queues and complex stateful deployments.
+
+Risks and operational tradeoffs:
+
+- Node failure exposure: If a node is lost, any locally buffered data becomes unavailable unless additional replication or backup is in
+place.
+- Vertical scaling limits: Large buffering needs require nodes with large disks; storage does not scale transparently across nodes.
+- Upgrade/maintenance impact: During agent downtime (e.g. rolling upgrades), applications may experience drops unless retry/backoff is
+carefully tuned.
+
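The retry/backoff tuning mentioned under "Upgrade/maintenance impact" could look roughly like the following application-side sketch. It is not part of the commit; the function names, default delays, and the full-jitter strategy are illustrative assumptions, and `send` stands in for whatever call hands an event to the local agent.

```python
import random
import time
from typing import Callable


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 6) -> list[float]:
    """Return the un-jittered delay (seconds) before each retry, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def send_with_retry(
    send: Callable[[], bool],
    delays: list[float],
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Try `send` once, then once more after each delay; True on first success."""
    if send():
        return True
    for delay in delays:
        # Full jitter: sleep a random fraction of the delay so that many
        # clients do not retry in lockstep when the agent comes back.
        sleep(delay * rng())
        if send():
            return True
    # All attempts failed: the caller still owns the event and must keep it
    # (for example by spilling to disk) rather than dropping it.
    return False
```

With the assumed defaults, the un-jittered delays add up to 31.5 seconds, so an agent restart that takes longer than that would exhaust the retries; the cap and attempt count would need to be sized against the expected rolling-upgrade duration.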
+Operational recommendations:
+
+- Monitor local queue depth and age; alert when oldest events exceed your SLA.
+- Use node‑local persistent volumes with appropriate capacity planning.
+- Ensure safe decommissioning: drain or wait for agent buffers to drain before removing nodes. (e.g. Gardener node hibernation)
+- Consider the agent topology for environments that can tolerate limited node‑level blast radius; prefer distributed storage if node loss is
+unacceptable.
+
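The first two recommendations (alerting on queue depth and age, and capacity planning for the node-local volume) can be illustrated with a small sketch. It is not part of the commit: `QueueStats`, the alert names, and the 1.5x headroom factor are hypothetical, and how the depth and age figures are collected from the agent is left out.

```python
import math
from dataclasses import dataclass


@dataclass
class QueueStats:
    depth: int                 # events currently buffered on the node
    oldest_age_seconds: float  # age of the oldest buffered event


def queue_alerts(stats: QueueStats, max_depth: int, sla_seconds: float) -> list[str]:
    """Return the names of the alerts that should fire for these stats."""
    alerts = []
    if stats.oldest_age_seconds > sla_seconds:
        alerts.append("oldest-event-age-exceeds-sla")
    if stats.depth > max_depth:
        alerts.append("queue-depth-high")
    return alerts


def required_buffer_bytes(events_per_second: float, avg_event_bytes: int,
                          outage_seconds: float, headroom: float = 1.5) -> int:
    """Disk needed to buffer a full outage: rate x size x duration x headroom."""
    return math.ceil(events_per_second * avg_event_bytes * outage_seconds * headroom)
```

For example, under these assumptions a node producing 200 events per second of 1 KiB each would need about 1.1 GB of local volume to ride out a one-hour outage of the central collector.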
 ## 2. Collector Tier Guidelines
 
 Run a dedicated OTel Collector instance (or set of instances) for audit logs – do not share with high-volume telemetry.
@@ -114,7 +142,8 @@ Operational Notes:
 
 ## 3. Final Storage Sink Tier Guidelines
 
-Examples: SIEM (Security Information and Event Management), Data Lake (e.g. S3/GCS/HDFS), OpenSearch, Elasticsearch.
+Examples: SIEM (Security Information and Event Management), Data Lake (e.g. S3/GCS/HDFS), OpenSearch, Elasticsearch. Normally running as
+StatefulSets or managed services.
 
 Requirements:
 