This document summarizes lessons learned from building resilient OpenTelemetry (OTel) log pipelines for audit-grade reliability. It proposes
best practices and incremental improvements for the open-source community.

- ## 1. Problem Statement
+ ## Problem Statement

Organizations with regulatory or forensic needs require an extremely low probability of log loss along the path Application SDK → Collector tiers →
Intermediate durability → Final sink. Today OTel offers building blocks (retry, batching, queues, WAL, message queues) but lacks an
@@ -33,7 +33,7 @@ reliability, particularly the migration towards exporter-native batching as trac
(#8122)][batchv2]**. This PoC and its findings are intended to provide tangible data and a reference implementation for Phase 2 of the Audit
Logging SIG's charter: contributing functional extensions back upstream.

- ## 2. Canonical Delivery Path & Loss Points
+ ## Canonical Delivery Path & Loss Points

| Stage | Component | Loss Modes (Today) | Observability Gaps |
| ----- | ----------------------------------------- | --------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
@@ -75,7 +75,7 @@ flowchart LR
 classDef lossPoint fill:#ffe0e0,stroke:#ff5555,stroke-width:1px
```

- ## 3. Failure Class Taxonomy
+ ## Failure Class Taxonomy

| Class | Examples | Mitigation |
| ----------------- | ----------------------------- | ---------------------------------------- |
@@ -87,7 +87,7 @@ flowchart LR
| Systemic Outage | long backend downtime | Durable MQ + extended retention |
| Data Tampering | on-node modification | Integrity hashing & encryption (future) |

- ## 4. Current Best Practices
+ ## Current Best Practices

### Application / SDK

@@ -128,7 +128,7 @@ flowchart LR
- Hash chaining per batch in persistent queue.
- Pluggable encryption at Client SDK for regulated domains.
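
Per-batch hash chaining, as listed above, can be illustrated with a minimal sketch. It assumes SHA-256, an all-zero genesis value, and batches that are already serialized to bytes. The names `chain_hash`, `build_chain`, and `verify_chain` are hypothetical and not part of any OTel API.

```python
import hashlib
from typing import List, Optional, Tuple

# Assumption: the chain starts from a fixed all-zero 32-byte value.
GENESIS = b"\x00" * 32


def chain_hash(prev_hash: bytes, batch: bytes) -> bytes:
    """Hash a batch together with the hash of the previous batch."""
    return hashlib.sha256(prev_hash + batch).digest()


def build_chain(batches: List[bytes]) -> List[Tuple[bytes, bytes]]:
    """Return (batch, chained_hash) pairs in queue order."""
    entries = []
    prev = GENESIS
    for batch in batches:
        prev = chain_hash(prev, batch)
        entries.append((batch, prev))
    return entries


def verify_chain(entries: List[Tuple[bytes, bytes]]) -> Optional[int]:
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for index, (batch, stored_hash) in enumerate(entries):
        if chain_hash(prev, batch) != stored_hash:
            return index
        prev = stored_hash
    return None
```

Because each hash covers the previous one, modifying a stored batch on the node invalidates that entry and makes the tampering detectable at read time, provided the hashes themselves are protected or anchored elsewhere.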

- ## 5. Lessons Learned
+ ## Lessons Learned

| Area | Insight |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
@@ -144,7 +144,9 @@ flowchart LR
| Failover Visibility | Limited built-in metrics for failover transitions and active priority level. |
| Connector Loss Attribution | Drops inside connectors (routing/aggregation) often misattributed to exporters. |

- ## 6. Improvement Proposals (Candidate OTEPs)
+ ## Improvements
+
+ ### Improvement Proposals (Candidate OTEPs)

| ID | Proposal | Benefit | Effort | Trade-Off | Affected Component(s) |
| --- | -------------------------------------------------------------------------- | ----------------------------------------------------------------------------- | ------- | ----------------------------- | ------------------------ |
@@ -169,20 +171,20 @@ flowchart LR
| P19 | Durability mode annotation metric (`pipeline_durability_mode`) | Auditability | Low | More time series | Collector |
| P20 | Connector backpressure hook (upstream throttle signal) | Unified flow control | High | Cross-component changes | API, Collector (contrib) |

- ## 7. Prioritized Actions
+ ### Prioritized Actions

1. Align timeouts (docs) and ensure retries happen, especially when connection loss or establishment is involved. This might require code
   changes in some dependencies (gRPC/HTTP libraries) or in how they are used.
2. Focus on Client SDK persistence (+ retry).
3. Double-check the relevant metrics around queuing, batching, timeouts, and retries.

- ## 8. Longer-Term Experiments
+ ### Longer-Term Experiments

- Adaptive hybrid queue state machine (NORMAL → DEGRADED → RECOVERY).
- Backpressure signaling spec extension (HTTP header / gRPC status mapping).
- Integrity hashing plugin reference implementation.
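
The adaptive queue state machine named above could be sketched roughly as follows. The source only names the three states; the transition triggers used here (backend health and remaining backlog) are assumptions made for illustration, and `QueueState` and `next_state` are hypothetical names.

```python
from enum import Enum


class QueueState(Enum):
    NORMAL = "normal"      # in-memory queue, backend reachable
    DEGRADED = "degraded"  # backend unhealthy, spill to durable storage
    RECOVERY = "recovery"  # backend healthy again, draining the backlog


def next_state(state: QueueState, backend_healthy: bool, backlog: int) -> QueueState:
    """Return the next state (assumed transition rules, not a spec)."""
    if not backend_healthy:
        # Any export failure moves (or keeps) the queue in DEGRADED.
        return QueueState.DEGRADED
    if state is QueueState.DEGRADED:
        # The backend is back: always pass through RECOVERY first.
        return QueueState.RECOVERY
    if state is QueueState.RECOVERY and backlog > 0:
        # Stay in RECOVERY until the spilled backlog is drained.
        return QueueState.RECOVERY
    return QueueState.NORMAL
```

Routing every DEGRADED exit through RECOVERY keeps the drain phase observable as its own state, which is one possible way to expose it in metrics.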

- ## 9. Risk & Trade-Off Matrix
+ ## Risk & Trade-Off Matrix

| Change | Risk | Mitigation |
| --------------------------------- | --------------------------- | ----------------------------------- |
@@ -195,7 +197,7 @@ flowchart LR
| Failover flapping | Oscillation, burst pressure | Cooldown + hysteresis metrics (P14) |
| Connector telemetry expansion | Metric overhead | Limit enum set; sampling if needed |

- ## 10. Success Metrics
+ ## Success Metrics

| Metric | Target |
| ---------------------------------------------- | ---------------------- |
@@ -208,7 +210,7 @@ flowchart LR
| Mean failover detection time | < 10s |
| Unexplained connector-origin drops | < 5% of total drops |

- ## 11. Gold Pipeline Checklist
+ ## Gold Pipeline Checklist

| Layer | Mandatory | Optional (High Assurance) |
| ------------------- | -------------------------------------------------------------- | ---------------------------- |
@@ -218,22 +220,22 @@ flowchart LR
| Monitoring | Drop reasons dashboard | Automated chaos drills |
| Failover (optional) | Failover connector with telemetry (active level & transitions) | Graceful drain before switch |

- ## 12. Open Questions
+ ## Open Questions

1. Scope definition: Should "Guaranteed" explicitly declare catastrophe boundaries?
2. Introduce a `durability_level` config attribute to drive automatic defaults?
3. Integrity hashing: in-scope for OTel or delegated downstream?
4. Metric design: single counter with reason label vs multiple counters?
5. Adaptive queue: start as experimental extension before spec adoption?

- ## 13. Call to Action
+ ## Call to Action

- Comment on proposal prioritization (P1–P12).
- Volunteer for metric taxonomy (P1) and queue byte limit (P4) implementation.
- Share real incident postmortems for timeout/loss attribution.
- Provide fsync performance benchmark data (interval vs none).

- ## 14. Timeout Chain Interaction Diagram
+ ## Timeout Chain Interaction Diagram

Effective retry window = MIN(ExporterOperationTimeout, BatchProcessorExportTimeout, ContextCancellation). If this window is shorter than
RetryMaxElapsedTime, the configuration contains a “latent” hazard: retries are cut off before the retry budget is used up, so a warning should be emitted.
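
The rule above can be expressed as a small configuration check. This is a minimal sketch, assuming all values are given in seconds and that a missing context deadline is treated as infinite; `effective_retry_window` and `check_timeout_chain` are hypothetical helper names, not existing Collector or SDK functions.

```python
import math
from typing import Optional


def effective_retry_window(
    exporter_operation_timeout: float,
    batch_processor_export_timeout: float,
    context_deadline: float = math.inf,  # assumption: no deadline means unbounded
) -> float:
    """Smallest timeout in the chain bounds how long retries can actually run."""
    return min(exporter_operation_timeout, batch_processor_export_timeout, context_deadline)


def check_timeout_chain(
    retry_max_elapsed_time: float,
    exporter_operation_timeout: float,
    batch_processor_export_timeout: float,
    context_deadline: float = math.inf,
) -> Optional[str]:
    """Return a warning message for a latent hazard, or None if the chain is consistent."""
    window = effective_retry_window(
        exporter_operation_timeout, batch_processor_export_timeout, context_deadline
    )
    if window < retry_max_elapsed_time:
        return (
            f"latent timeout hazard: effective retry window {window}s "
            f"is shorter than retry max elapsed time {retry_max_elapsed_time}s"
        )
    return None
```

For example, a 300 s retry budget combined with a 5 s batch processor export timeout yields a warning, because retries would be cancelled after about 5 s regardless of the configured budget.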
@@ -261,7 +263,7 @@ sequenceDiagram
 App->>App: Drop or Retry higher level (if configured)
```

- ## 15. Draft Drop Reason Enum
+ ## Draft Drop Reason Enum

`queue_full`, `disk_full`, `shutdown_drain_timeout`, `retry_exhausted`, `serialization_error`, `network_unreachable`, `backend_rejected`,
`integrity_failed` (future), `connector_failure`, `routing_unmatched`, `failover_during_switch`, `unknown`.
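
As an illustration of how the draft enum might back a single counter with a reason label, here is a minimal sketch. The enum values come from the list above; the `DropReason` and `DropCounter` classes are hypothetical, and folding unrecognized reasons into `unknown` is an assumption about the intended behavior.

```python
from collections import Counter
from enum import Enum


class DropReason(str, Enum):
    QUEUE_FULL = "queue_full"
    DISK_FULL = "disk_full"
    SHUTDOWN_DRAIN_TIMEOUT = "shutdown_drain_timeout"
    RETRY_EXHAUSTED = "retry_exhausted"
    SERIALIZATION_ERROR = "serialization_error"
    NETWORK_UNREACHABLE = "network_unreachable"
    BACKEND_REJECTED = "backend_rejected"
    INTEGRITY_FAILED = "integrity_failed"  # marked "future" in the draft
    CONNECTOR_FAILURE = "connector_failure"
    ROUTING_UNMATCHED = "routing_unmatched"
    FAILOVER_DURING_SWITCH = "failover_during_switch"
    UNKNOWN = "unknown"


class DropCounter:
    """Single counter keyed by reason label (one of the two designs under discussion)."""

    def __init__(self) -> None:
        self._counts = Counter()

    def record(self, reason, count: int = 1) -> None:
        try:
            key = DropReason(reason)
        except ValueError:
            # Assumption: anything outside the fixed set is folded into "unknown"
            # so that label cardinality stays bounded.
            key = DropReason.UNKNOWN
        self._counts[key] += count

    def snapshot(self) -> dict:
        """Return {reason_label: dropped_count} for reasons seen so far."""
        return {reason.value: count for reason, count in self._counts.items()}
```

Keeping the label set closed is what makes the "limit enum set" mitigation from the risk matrix enforceable in code.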