|
1 | | -# Alloy + OpenTelemetry + Honeycomb Design |
| 1 | +# OpenTelemetry Operator + Honeycomb Design |
2 | 2 |
|
3 | | -**Date**: 2026-03-22 |
| 3 | +**Date**: 2026-03-22 (updated 2026-03-23) |
4 | 4 | **Status**: Implementing |
5 | 5 |
|
6 | 6 | ## Goal |
7 | 7 |
|
8 | | -Deploy Grafana Alloy as a unified OpenTelemetry collector that dual-ships all telemetry to both local Grafana stack and Honeycomb SaaS. This enables learning OTEL while comparing self-hosted vs SaaS observability. |
| 8 | +Deploy the CNCF OpenTelemetry Operator with a Collector agent and gateway to replace Grafana Alloy. The Collectors dual-ship all telemetry to both the local Grafana stack and Honeycomb SaaS. Auto-instrumentation is enabled for zero-code trace generation. |
9 | 9 |
|
10 | 10 | ## Architecture |
11 | 11 |
|
12 | 12 | ``` |
13 | | - ┌──────────────┐ |
14 | | - │ Honeycomb │ |
15 | | - │ (OTLP HTTP) │ |
16 | | - └──────▲───────┘ |
17 | | - │ |
18 | | -┌───────────────────────────────┼────────────────────────┐ |
19 | | -│ Alloy DaemonSet (ns: alloy) │ │ |
20 | | -│ │ │ |
21 | | -│ ┌──────────────┐ ┌────────┴─────────┐ │ |
22 | | -│ │ Pod log │───▶│ Batch processor │──────┐ │ |
23 | | -│ │ scraping │ │ (5s / 1024 batch)│ │ │ |
24 | | -│ └──────────────┘ └────────┬─────────┘ │ │ |
25 | | -│ │ │ │ |
26 | | -│ ┌──────────────┐ │ │ │ |
27 | | -│ │ OTLP receiver │─────traces─┘ │ │ |
28 | | -│ │ :4317 / :4318 │─────metrics──────────────────┘ │ |
29 | | -│ └──────────────┘ │ |
30 | | -└───────────────────────────────┼────────────────────────┘ |
31 | | - │ |
32 | | - ┌────────────────────┼────────────────────┐ |
33 | | - │ │ │ |
34 | | - ┌──────▼──────┐ ┌────────▼───────┐ ┌────────▼────────┐ |
35 | | - │ Loki Gateway │ │ Tempo :4317 │ │ Prometheus │ |
36 | | - │ (loki-stack) │ │ (monitoring) │ │ remote-write │ |
37 | | - └─────────────┘ └────────────────┘ └─────────────────┘ |
| 13 | +┌─────────────────────────────────────────────────────────────┐ |
| 14 | +│ OTEL Operator (Deployment) │ |
| 15 | +│ - Manages Collector instances via OpenTelemetryCollector CRD│ |
| 16 | +│ - Injects auto-instrumentation via Instrumentation CRD │ |
| 17 | +└─────────────────────────────────────────────────────────────┘ |
| 18 | +
|
| 19 | +┌─────────────────────────────────────────────────────────────┐ |
| 20 | +│ OTEL Collector Agent (DaemonSet) — per node │ |
| 21 | +│ - filelog receiver: scrapes /var/log/pods │ |
| 22 | +│ - otlp receiver: accepts traces/metrics from instrumented │ |
| 23 | +│ apps on :4317/:4318 │ |
| 24 | +│ - Forwards all signals to Gateway via OTLP gRPC │ |
| 25 | +└──────────────────────────┬──────────────────────────────────┘ |
| 26 | + │ OTLP gRPC |
| 27 | +┌──────────────────────────▼──────────────────────────────────┐ |
| 28 | +│ OTEL Collector Gateway (Deployment, 2 replicas) │ |
| 29 | +│ - k8sattributes: enriches with k8s metadata from API │ |
| 30 | +│ - resource: sets service.name, cluster name │ |
| 31 | +│ - batch: 10s / 8192 items │ |
| 32 | +│ - Fan-out to all backends: │ |
| 33 | +│ → Loki via OTLP HTTP (logs) │ |
| 34 | +│ → Tempo via OTLP gRPC (traces) │ |
| 35 | +│ → Prometheus remote-write (metrics) │ |
| 36 | +│ → Honeycomb via OTLP HTTP (everything) │ |
| 37 | +└─────────────────────────────────────────────────────────────┘ |
38 | 38 | ``` |
39 | 39 |
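A minimal sketch of what the agent tier in the diagram might look like as an `OpenTelemetryCollector` resource. The resource name `otel-agent`, the gateway Service name `otel-gateway-collector`, and the plaintext (`insecure`) connection are assumptions for illustration and are not taken from the repository. The `filelog` receiver is only available in Collector distributions that bundle it (for example the contrib or k8s images), so the image configured for the Operator needs to be one of those.

```yaml
# Hypothetical agent definition; names and endpoints are assumptions.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-agent
  namespace: opentelemetry
spec:
  mode: daemonset
  # Mount the node's pod log directory so filelog can read it.
  volumes:
    - name: varlogpods
      hostPath:
        path: /var/log/pods
  volumeMounts:
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true
  config:
    receivers:
      filelog:
        include:
          - /var/log/pods/*/*/*.log
        include_file_path: true
        operators:
          # Parses the container runtime log format.
          - type: container
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      otlp:
        # Assumed Service name for the gateway Collector.
        endpoint: otel-gateway-collector.opentelemetry.svc.cluster.local:4317
        tls:
          insecure: true
    service:
      pipelines:
        logs:
          receivers: [filelog, otlp]
          exporters: [otlp]
        traces:
          receivers: [otlp]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          exporters: [otlp]
```

The agent does no enrichment or batching in this sketch; those steps are left to the gateway, matching the split shown in the diagram.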
|
40 | 40 | ## Data Flow |
41 | 41 |
|
42 | | -| Signal | Source | Local Destination | Honeycomb | |
43 | | -|---------|---------------------|--------------------------------------------|-----------| |
44 | | -| Logs | Pod stdout/stderr | Loki via loki.write | OTLP HTTP | |
45 | | -| Logs | K8s events | Loki via loki.write | OTLP HTTP | |
46 | | -| Traces | Apps → OTLP :4317/8 | Tempo via OTLP gRPC | OTLP HTTP | |
47 | | -| Metrics | Apps → OTLP :4317/8 | Prometheus via remote-write | OTLP HTTP | |
| 42 | +| Signal | Source | Local Destination | Honeycomb | |
| 43 | +|---------|---------------------------------|---------------------------|------------| |
| 44 | +| Logs | Pod stdout/stderr (filelog) | Loki via OTLP HTTP | OTLP HTTP | |
| 45 | +| Traces | Auto-instrumented apps → OTLP | Tempo via OTLP gRPC | OTLP HTTP | |
| 46 | +| Metrics | Auto-instrumented apps → OTLP | Prometheus remote-write | OTLP HTTP | |
48 | 47 |
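The fan-out in the table could be expressed in the gateway's Collector config roughly as follows. This is a sketch under stated assumptions: the backend Service hostnames, the Secret key name `api-key`, and the cluster name `homelab` are all hypothetical placeholders, and the real `collector-gateway.yaml` may differ. Only the batch settings (10s, 8192 items), the replica count, and the protocol used per backend come from the design above.

```yaml
# Hypothetical gateway definition; hostnames and key names are assumptions.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-gateway
  namespace: opentelemetry
spec:
  mode: deployment
  replicas: 2
  env:
    - name: HONEYCOMB_API_KEY
      valueFrom:
        secretKeyRef:
          name: honeycomb-api-key
          key: api-key            # assumed key inside the Secret
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      k8sattributes: {}
      resource:
        attributes:
          - key: k8s.cluster.name
            value: homelab        # placeholder cluster name
            action: upsert
      batch:
        timeout: 10s
        send_batch_size: 8192
    exporters:
      otlphttp/loki:
        endpoint: http://loki-gateway.loki-stack.svc.cluster.local/otlp
      otlp/tempo:
        endpoint: tempo.monitoring.svc.cluster.local:4317
        tls:
          insecure: true
      prometheusremotewrite:
        endpoint: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
      otlphttp/honeycomb:
        endpoint: https://api.honeycomb.io
        headers:
          x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
    service:
      pipelines:
        logs:
          receivers: [otlp]
          processors: [k8sattributes, resource, batch]
          exporters: [otlphttp/loki, otlphttp/honeycomb]
        traces:
          receivers: [otlp]
          processors: [k8sattributes, resource, batch]
          exporters: [otlp/tempo, otlphttp/honeycomb]
        metrics:
          receivers: [otlp]
          processors: [k8sattributes, resource, batch]
          exporters: [prometheusremotewrite, otlphttp/honeycomb]
```

Each pipeline lists two exporters, which is how the Collector dual-ships a signal: every batch is handed to both the local backend and Honeycomb.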
|
49 | 48 | ## Components |
50 | 49 |
|
51 | | -### New: `monitoring/alloy/` |
| 50 | +### New: `infrastructure/controllers/opentelemetry-operator/` |
52 | 51 |
|
53 | | -| File | Purpose | |
54 | | -|---------------------|--------------------------------------------| |
55 | | -| `ns.yaml` | Namespace `alloy` | |
56 | | -| `kustomization.yaml`| Helm chart reference (alloy 1.6.2) | |
57 | | -| `values.yaml` | DaemonSet config + Alloy pipeline | |
58 | | -| `externalsecret.yaml`| Honeycomb API key from 1Password | |
| 52 | +| File | Purpose | |
| 53 | +|------------------------|------------------------------------------------------| |
| 54 | +| `ns.yaml` | Namespace `opentelemetry` | |
| 55 | +| `kustomization.yaml` | Helm chart (opentelemetry-operator 0.105.1) | |
| 56 | +| `values.yaml` | Operator config, cert-manager webhooks | |
| 57 | +| `externalsecret.yaml` | Honeycomb API key from 1Password | |
| 58 | +| `collector-agent.yaml` | OpenTelemetryCollector CRD (DaemonSet mode) | |
| 59 | +| `collector-gateway.yaml`| OpenTelemetryCollector CRD (Deployment mode) | |
| 60 | +| `instrumentation.yaml` | Instrumentation CRD (auto-inject config) | |
59 | 61 |
|
60 | | -### Modified: `monitoring/tempo/values.yaml` |
| 62 | +### Modified: `infrastructure/controllers/argocd/apps/infrastructure-appset.yaml` |
61 | 63 |
|
62 | | -Added OTLP gRPC (:4317) and HTTP (:4318) receivers so Tempo accepts traces from Alloy. |
| 64 | +Added `infrastructure/controllers/opentelemetry-operator` to the explicit path list. |
| 65 | + |
| 66 | +### Modified (earlier): `monitoring/tempo/values.yaml` |
| 67 | + |
| 68 | +Added OTLP gRPC (:4317) and HTTP (:4318) receivers. |
| 69 | + |
| 70 | +### Deleted: `monitoring/alloy/` |
| 71 | + |
| 72 | +Entire directory removed — replaced by OTEL Operator + Collector. |
63 | 73 |
|
64 | 74 | ## Secrets |
65 | 75 |
|
66 | | -| Secret | Namespace | 1Password Key | Property | |
67 | | -|---------------------|-----------|---------------|------------| |
68 | | -| `honeycomb-api-key` | `alloy` | `honeycomb` | `api-key` | |
| 76 | +| Secret | Namespace | 1Password Key | Property | |
| 77 | +|---------------------|----------------|---------------|------------| |
| 78 | +| `honeycomb-api-key` | `opentelemetry`| `honeycomb` | `api-key` | |
69 | 79 |
|
70 | | -## How Apps Send Telemetry |
| 80 | +## Auto-Instrumentation |
71 | 81 |
|
72 | | -Apps instrumented with OTEL SDKs should set their exporter endpoint to: |
| 82 | +Apps opt in by adding an annotation to the pod template of their Deployment (`spec.template.metadata.annotations`): |
73 | 83 |
|
74 | | -``` |
75 | | -OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy.alloy.svc.cluster.local:4317 |
| 84 | +```yaml |
| 85 | +metadata: |
| 86 | + annotations: |
| 87 | + instrumentation.opentelemetry.io/inject-python: "true" |
| 88 | + # or: inject-nodejs, inject-java, inject-go, inject-dotnet |
76 | 89 | ``` |
77 | 90 |
|
78 | | -Alloy handles the fan-out to all backends. |
| 91 | +The Operator's webhook injects an init container that supplies the OTEL auto-instrumentation libraries. The app then generates traces automatically and sends them to the Agent's OTLP endpoint. |
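For the annotation to take effect, an `Instrumentation` resource has to exist for the webhook to read. A minimal sketch of what `instrumentation.yaml` might contain is shown here; the resource name and the agent Service hostname are assumptions, not values from the repository. The Python override is included because the Python auto-instrumentation exports over OTLP HTTP, so it needs the :4318 endpoint rather than :4317.

```yaml
# Hypothetical Instrumentation resource; name and endpoints are assumptions.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: opentelemetry
spec:
  exporter:
    # Assumed Service name for the agent Collector (gRPC port).
    endpoint: http://otel-agent-collector.opentelemetry.svc.cluster.local:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "1"
  python:
    env:
      # Python auto-instrumentation uses OTLP over HTTP, so point it at :4318.
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://otel-agent-collector.opentelemetry.svc.cluster.local:4318
```

An annotation value of `"true"` selects an `Instrumentation` resource in the pod's own namespace. If the resource lives only in `opentelemetry`, as assumed here, workloads in other namespaces would reference it as `"opentelemetry/default-instrumentation"` instead.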
79 | 92 |
|
80 | 93 | ## Deployment |
81 | 94 |
|
82 | | -Auto-discovered by the monitoring AppSet (`monitoring/*` glob) at sync wave 5. No manual Application resource needed. |
| 95 | +Deployed via the infrastructure AppSet at sync wave 4. The Operator needs cert-manager for webhook TLS (cert-manager is already in the infrastructure AppSet). |
| 96 | + |
| 97 | +## RBAC |
| 98 | + |
| 99 | +The Operator creates ServiceAccounts for the Collectors. The gateway's `otel-gateway` SA needs RBAC to list/watch pods for the `k8sattributes` processor. The Operator handles this automatically. |
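If the Operator-managed RBAC ever needs to be checked or recreated by hand, the permissions the `k8sattributes` processor typically requires look roughly like this. It is a sketch only: the ClusterRole and binding names are hypothetical, and the ServiceAccount name and namespace are taken from the text above.

```yaml
# Hypothetical manual RBAC for the k8sattributes processor.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-gateway-k8sattributes   # assumed name
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-gateway-k8sattributes   # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-gateway-k8sattributes
subjects:
  - kind: ServiceAccount
    name: otel-gateway
    namespace: opentelemetry
```

Running `kubectl auth can-i list pods --as=system:serviceaccount:opentelemetry:otel-gateway` is one way to confirm the ServiceAccount has the access the processor needs.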