acd.md
17 additions & 23 deletions
@@ -19,26 +19,25 @@
 ## 1. EXECUTIVE SUMMARY

 This Architectural Concept Document (ACD) presents a Proof of Concept (POC) for implementing OpenTelemetry (Otel) SDK’s logging features in
-a distributed architecture. The POC centers on a Recommendation Service generating log data, which traverses several processing layers en
-route to Audit Log Services V3. The objective is to test logging message integrity, identify data loss points, and optimize telemetry flows
+a distributed architecture. The POC centers on a Recommendation Service generating log data, which traverses several processing layers. The objective is to test logging message integrity, identify data loss points, and optimize telemetry flows
 for robust observability.

 ---

 ## 2. INTRODUCTION

 This document details a technical blueprint for leveraging OpenTelemetry’s logging SDK within a cloud-native architecture. The focus is to
-assess potential logging message loss and performance bottlenecks, primarily within the Recommendation Service and its downstream audit log
+assess potential logging message loss and performance bottlenecks, primarily within the Recommendation Service and its downstream log sink
 pipeline.

 ---

 ## 3. BUSINESS CASE

-Ensuring audit logs are reliably captured and transmitted is critical for compliance, troubleshooting, and operational visibility. The
+Ensuring logs are reliably captured and transmitted is critical for compliance, troubleshooting, and operational visibility. The
 adoption of OpenTelemetry promises unified observability but raises questions regarding potential data loss and reliability, particularly
 when logs traverse complex or unreliable network paths. This POC provides a structured method to evaluate, optimize, and ultimately
 Use OpenTelemetry SDK within application code for cross-vendor and standardized telemetry generation. Externalize processing to Otel
-Collector for operational flexibility without code deployment. Employ processors (filtering, batching) for scaling and compliance with
+Collector for operational flexibility. Employ processors (filtering, batching) for scaling and compliance with
 remote API limits. Decouple network transmission from application code, handing over all egress responsibilities to Otel Collector.
-Instrument with checkpoints and monitoring at each component boundary for reliability assessment. Select AuditLog Services V3 Exporter due
-to organizational integration requirements.
+Instrument with checkpoints and monitoring at each component boundary for reliability assessment.

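The checkpoint idea in the design decisions above could be prototyped as follows. This is a minimal sketch using only the Python standard library, not the OpenTelemetry SDK or Collector: the `CheckpointLedger` class, the stage names (`sdk_emit`, `collector_receive`, `collector_export`), and the record IDs are all hypothetical, and the dropped record is simulated. It assumes each log record carries a unique ID that every component boundary can observe.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointLedger:
    """Records which log record IDs were observed at each pipeline boundary."""

    seen: dict[str, set[str]] = field(default_factory=dict)

    def mark(self, stage: str, record_id: str) -> None:
        # Remember that this record ID passed through the given stage.
        self.seen.setdefault(stage, set()).add(record_id)

    def lost_between(self, upstream: str, downstream: str) -> set[str]:
        # IDs observed upstream that never showed up at the downstream stage.
        return self.seen.get(upstream, set()) - self.seen.get(downstream, set())


# Hypothetical boundary names, in pipeline order.
STAGES = ["sdk_emit", "collector_receive", "collector_export"]

ledger = CheckpointLedger()
for i in range(5):
    rid = f"rec-{i}"
    ledger.mark("sdk_emit", rid)
    if i != 3:  # simulate one record dropped before it reaches the collector
        ledger.mark("collector_receive", rid)
        ledger.mark("collector_export", rid)

# Report losses per adjacent pair of boundaries.
for up, down in zip(STAGES, STAGES[1:]):
    print(f"{up} -> {down}: lost {sorted(ledger.lost_between(up, down))}")
```

Comparing adjacent boundaries, rather than only the first and last, is what localizes a loss to a specific hop.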
 ## 6. OPEN POINTS

-Otel SDK & Collector Version Compatibility: Need to validate if all required features and data formats are supported. API Rate Limits &
-Back-pressure: How will surges and API slowdowns/throttling be gracefully handled? Data Privacy & Security: Ensure logging data is
-sanitized/encrypted as required before egress. Collector Failure Modes: What happens to logs if Otel Collector crashes or network partition
-occurs? Lossy Operations in Processors: Need clear bounds on filtering/batching impacts to log completeness.
+Otel SDK & Collector Version Compatibility: Need to validate if all required features and data formats are supported.
+API Rate Limits & Back-pressure: How will surges and API slowdowns/throttling be gracefully handled?
+Data Privacy & Security: Ensure logging data is sanitized/encrypted as required before egress.
+Collector Failure Modes: What happens to logs if Otel Collector crashes or network partition occurs?
+Lossy Operations in Processors: Need clear bounds on filtering/batching impacts to log completeness.

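To make the back-pressure question above concrete, a toy model can show how a bounded buffer loses records when a producer surges past a slow sink. The sketch below is an assumption-laden simulation in plain Python, not the actual queueing behavior of the Otel SDK or Collector: `BoundedLogBuffer`, its capacity, and the emit/drain rates are hypothetical values chosen for illustration.

```python
from collections import deque


class BoundedLogBuffer:
    """Fixed-capacity FIFO that counts records rejected while it is full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue: deque[str] = deque()
        self.accepted = 0
        self.dropped = 0

    def offer(self, record: str) -> bool:
        # Reject (and count) the record when the buffer is already full.
        if len(self.queue) >= self.capacity:
            self.dropped += 1
            return False
        self.queue.append(record)
        self.accepted += 1
        return True

    def drain(self, batch_size: int) -> list[str]:
        # Remove up to batch_size records in FIFO order, as a slow sink would.
        batch = []
        while self.queue and len(batch) < batch_size:
            batch.append(self.queue.popleft())
        return batch


buffer = BoundedLogBuffer(capacity=4)
exported: list[str] = []

# Surge: the producer emits 3 records per tick, the sink accepts only 1.
for tick in range(4):
    for n in range(3):
        buffer.offer(f"t{tick}-r{n}")
    exported.extend(buffer.drain(batch_size=1))

print(
    f"accepted={buffer.accepted} dropped={buffer.dropped} "
    f"exported={len(exported)} queued={len(buffer.queue)}"
)
```

Tracking `dropped` explicitly is the point of the sketch: whatever policy the real pipeline uses under throttling, the POC needs a counter that makes the resulting loss visible.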
 ## 7. CONCLUSION AND NEXT STEPS

-This POC will validate the comprehensive logging flow’s reliability and highlight improvements for audit log delivery. Next steps include:
+This POC will validate the comprehensive logging flow’s reliability and highlight findings on any loss of logs relative to the delivery guarantee.

-Building and deploying test harnesses for each stage. Executing validation and stress tests. Analyzing end-to-end message integrity/loss
-metrics. Tuning collector/processors for optimal throughput and minimal loss. Compiling a findings and recommendations report for broader
-system rollout.
+Next steps include: Building and deploying test harnesses for each stage. Executing validation and stress tests. Analyzing end-to-end message integrity/loss metrics. Tuning collector/processors for optimal throughput and minimal loss. Compiling a findings and recommendations report for broader system rollout.

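The end-to-end integrity/loss analysis named in the next steps could start from a reconciliation like the one below. It is a sketch under stated assumptions: every record has a unique ID, the test harness keeps the bodies it sent, and the sink can return what it received. The `reconcile` function, the sample records, and the use of SHA-256 to compare bodies are illustrative choices, not part of the POC as documented.

```python
import hashlib
from collections import Counter


def digest(body: str) -> str:
    """Content fingerprint used to detect altered log bodies."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def reconcile(sent: dict[str, str], received: list[tuple[str, str]]) -> dict:
    """Compare sent records (id -> body) with (id, body) pairs seen at the sink."""
    counts = Counter(rid for rid, _ in received)
    lost = sorted(set(sent) - set(counts))
    duplicated = sorted(rid for rid, n in counts.items() if n > 1)
    corrupted = sorted(
        {
            rid
            for rid, body in received
            if rid in sent and digest(body) != digest(sent[rid])
        }
    )
    return {
        "lost": lost,
        "duplicated": duplicated,
        "corrupted": corrupted,
        "loss_rate": len(lost) / len(sent) if sent else 0.0,
    }


# Hypothetical test data: "d" never arrives, "b" arrives twice, "c" is altered.
sent = {"a": "login ok", "b": "rec served", "c": "cache miss", "d": "timeout"}
received = [
    ("a", "login ok"),
    ("b", "rec served"),
    ("b", "rec served"),
    ("c", "cache mis"),
]

report = reconcile(sent, received)
print(report)
```

Reporting duplicates separately from losses matters when evaluating a delivery guarantee, since retries can trade one for the other.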
 ## 8. DECISION PROTOCOL

-Decisions Tracked: All key design changes/choices documented in versioned change log. Review Frequency: Weekly checkpoints during POC,
+Decisions Tracked: All key design changes/choices documented in versioned change log.
+Review Frequency: Weekly checkpoints during POC,
 rolling up to steering committee.

-## 9. APPENDIX
-
-References to OpenTelemetry documentation Diagrams (link/attachments) API schemas and configs Example log events Test plans and scripts