Commit abd75f1

removed numbering
Signed-off-by: Hilmar Falkenberg <hilmar.falkenberg@sap.com>
1 parent 4b0a6fd commit abd75f1

1 file changed: +17 -15 lines changed

docs/community-proposal.md

Lines changed: 17 additions & 15 deletions
@@ -6,7 +6,7 @@
 This document summarizes lessons learned from building resilient OpenTelemetry (OTel) log pipelines for audit-grade reliability. It proposes
 best practices and incremental improvements for the open-source community.

-## 1. Problem Statement
+## Problem Statement

 Organizations with regulatory or forensic needs require extremely low probability of log loss from Application SDK → Collector tiers →
 Intermediate durability → Final sink. Today OTel offers building blocks (retry, batching, queues, WAL, message queues) but lacks an
@@ -33,7 +33,7 @@ reliability, particularly the migration towards exporter-native batching as trac
 (#8122)][batchv2]**. This PoC and its findings are intended to provide tangible data and a reference implementation for Phase 2 of the Audit
 Logging SIG's charter: contributing functional extensions back upstream.

-## 2. Canonical Delivery Path & Loss Points
+## Canonical Delivery Path & Loss Points

 | Stage | Component | Loss Modes (Today) | Observability Gaps |
 | ----- | ----------------------------------------- | --------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
@@ -75,7 +75,7 @@ flowchart LR
 classDef lossPoint fill:#ffe0e0,stroke:#ff5555,stroke-width:1px
 ```

-## 3. Failure Class Taxonomy
+## Failure Class Taxonomy

 | Class | Examples | Mitigation |
 | ----------------- | ----------------------------- | ---------------------------------------- |
@@ -87,7 +87,7 @@ flowchart LR
 | Systemic Outage | long backend downtime | Durable MQ + extended retention |
 | Data Tampering | on-node modification | Integrity hashing & encryption (future) |

-## 4. Current Best Practices
+## Current Best Practices

 ### Application / SDK

@@ -128,7 +128,7 @@ flowchart LR
 - Hash chaining per batch in persistent queue.
 - Pluggable encryption at Client SDK for regulated domains.

-## 5. Lessons Learned
+## Lessons Learned

 | Area | Insight |
 | -------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
@@ -144,7 +144,9 @@ flowchart LR
 | Failover Visibility | Limited built-in metrics for failover transitions and active priority level. |
 | Connector Loss Attribution | Drops inside connectors (routing/aggregation) often misattributed to exporters. |

-## 6. Improvement Proposals (Candidate OTEPs)
+## Improvements
+
+### Improvement Proposals (Candidate OTEPs)

 | ID | Proposal | Benefit | Effort | Trade-Off | Affected Component(s) |
 | --- | -------------------------------------------------------------------------- | ----------------------------------------------------------------------------- | ------- | ----------------------------- | ------------------------ |
@@ -169,20 +171,20 @@ flowchart LR
 | P19 | Durability mode annotation metric (`pipeline_durability_mode`) | Auditability | Low | More time series | Collector |
 | P20 | Connector backpressure hook (upstream throttle signal) | Unified flow control | High | Cross-component changes | API, Collector (contrib) |

-## 7. Prioritized Actions
+### Prioritized Actions

 1. Align timeouts (docs), ensure retry - especially when connection loss/establishment is involved. Might require code changes in some
 dependencies (gRPC/http libraries) or in their usage.
 2. Focus on Client SDK persistency (+ retry).
 3. Double check relevant metrics around queuing, batching, timeouts and retries etc.

-## 8. Longer-Term Experiments
+### Longer-Term Experiments

 - Adaptive hybrid queue state machine (NORMAL → DEGRADED → RECOVERY).
 - Backpressure signaling spec extension (HTTP header / gRPC status mapping).
 - Integrity hashing plugin reference implementation.

-## 9. Risk & Trade-Off Matrix
+## Risk & Trade-Off Matrix

 | Change | Risk | Mitigation |
 | --------------------------------- | --------------------------- | ----------------------------------- |
@@ -195,7 +197,7 @@ flowchart LR
 | Failover flapping | Oscillation, burst pressure | Cooldown + hysteresis metrics (P14) |
 | Connector telemetry expansion | Metric overhead | Limit enum set; sampling if needed |

-## 10. Success Metrics
+## Success Metrics

 | Metric | Target |
 | ---------------------------------------------- | ---------------------- |
@@ -208,7 +210,7 @@ flowchart LR
 | Mean failover detection time | < 10s |
 | Unexplained connector-origin drops | < 5% of total drops |

-## 11. Gold Pipeline Checklist
+## Gold Pipeline Checklist

 | Layer | Mandatory | Optional (High Assurance) |
 | ------------------- | -------------------------------------------------------------- | ---------------------------- |
@@ -218,22 +220,22 @@ flowchart LR
 | Monitoring | Drop reasons dashboard | Automated chaos drills |
 | Failover (optional) | Failover connector with telemetry (active level & transitions) | Graceful drain before switch |

-## 12. Open Questions
+## Open Questions

 1. Scope definition: Should "Guaranteed" explicitly declare catastrophe boundaries?
 2. Introduce `durability_level` config attribute to drive auto defaults?
 3. Integrity hashing: in-scope for OTel or delegated downstream?
 4. Metric design: single counter with reason label vs multiple counters?
 5. Adaptive queue: start as experimental extension before spec adoption?

-## 13. Call to Action
+## Call to Action

 - Comment on proposal prioritization (P1–P12).
 - Volunteer for metric taxonomy (P1) and queue byte limit (P4) implementation.
 - Share real incident postmortems for timeout/loss attribution.
 - Provide fsync performance benchmark data (interval vs none).

-## 14. Timeout Chain Interaction Diagram
+## Timeout Chain Interaction Diagram

 Effective retry window = MIN(ExporterOperationTimeout, BatchProcessorExportTimeout, ContextCancellation). If it is < RetryMaxElapsedTime,
 this is a “latent” configuration hazard → emit warning.
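
The effective-retry-window rule above can be sketched as a small configuration check. This is a minimal illustration in Python, not Collector or SDK code: the `TimeoutConfig` class, its field names, and the warning text are hypothetical and only mirror the terms used in the formula.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeoutConfig:
    """Hypothetical container for the timeouts named in the formula (seconds)."""
    exporter_operation_timeout: float
    batch_processor_export_timeout: float
    context_cancellation: float
    retry_max_elapsed_time: float


def effective_retry_window(cfg: TimeoutConfig) -> float:
    # The shortest enclosing timeout bounds how long retries can actually run.
    return min(
        cfg.exporter_operation_timeout,
        cfg.batch_processor_export_timeout,
        cfg.context_cancellation,
    )


def latent_hazard_warning(cfg: TimeoutConfig) -> Optional[str]:
    # Return a warning when the retry budget can never be fully used.
    window = effective_retry_window(cfg)
    if window < cfg.retry_max_elapsed_time:
        return (
            f"effective retry window {window}s is shorter than "
            f"RetryMaxElapsedTime {cfg.retry_max_elapsed_time}s"
        )
    return None
```

With a 10 s batch export timeout and a 300 s retry budget, the check reports a hazard, because retries are cut off long before the configured maximum elapsed time.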
@@ -261,7 +263,7 @@ sequenceDiagram
 App->>App: Drop or Retry higher level (if configured)
 ```

-## 15. Draft Drop Reason Enum
+## Draft Drop Reason Enum

 `queue_full`, `disk_full`, `shutdown_drain_timeout`, `retry_exhausted`, `serialization_error`, `network_unreachable`, `backend_rejected`,
 `integrity_failed` (future), `connector_failure`, `routing_unmatched`, `failover_during_switch`, `unknown`.
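
One way to picture the draft enum is as the label set of a single drop counter, which is one of the two options raised under Open Questions. The sketch below is an assumption-laden illustration in plain Python: `DropReason`, `parse_reason`, and `DropCounter` are hypothetical names, not part of any OTel API, and mapping unrecognized reasons to `unknown` is a design choice made only for this example.

```python
from collections import Counter
from enum import Enum


class DropReason(str, Enum):
    """Draft drop reasons, copied from the list above."""
    QUEUE_FULL = "queue_full"
    DISK_FULL = "disk_full"
    SHUTDOWN_DRAIN_TIMEOUT = "shutdown_drain_timeout"
    RETRY_EXHAUSTED = "retry_exhausted"
    SERIALIZATION_ERROR = "serialization_error"
    NETWORK_UNREACHABLE = "network_unreachable"
    BACKEND_REJECTED = "backend_rejected"
    INTEGRITY_FAILED = "integrity_failed"  # marked "future" in the draft
    CONNECTOR_FAILURE = "connector_failure"
    ROUTING_UNMATCHED = "routing_unmatched"
    FAILOVER_DURING_SWITCH = "failover_during_switch"
    UNKNOWN = "unknown"


def parse_reason(raw: str) -> DropReason:
    # Assumption: anything outside the enum is folded into "unknown"
    # so the label set stays bounded.
    try:
        return DropReason(raw)
    except ValueError:
        return DropReason.UNKNOWN


class DropCounter:
    """Hypothetical single counter keyed by a reason label."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, reason: str, count: int = 1) -> None:
        self._counts[parse_reason(reason)] += count

    def snapshot(self) -> dict:
        # Export plain string labels, as a metrics backend would see them.
        return {reason.value: n for reason, n in self._counts.items()}
```

Keeping the reasons in a closed enum bounds the number of time series, at the cost of lumping unforeseen causes under `unknown`.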
