Commit 684dbd4

try otel and honeycomb and alloy?
1 parent 7deeb42 commit 684dbd4

15 files changed

Lines changed: 497 additions & 1044 deletions

CLAUDE.md

Lines changed: 3 additions & 0 deletions
@@ -149,6 +149,8 @@ Detailed instructions load automatically when working in these directories:
149149
| **Database with CNPG** | `infrastructure/database/cloudnative-pg/immich/` |
150150
| **Database AppSet** | `infrastructure/controllers/argocd/apps/database-appset.yaml` |
151151
| **Gateway API routing** | `infrastructure/networking/gateway/` |
152+
| **OTEL Operator + Collectors** | `infrastructure/controllers/opentelemetry-operator/` |
153+
| **OTEL auto-instrumentation** | `infrastructure/controllers/opentelemetry-operator/instrumentation.yaml` |
152154

153155
## Additional Documentation
154156

@@ -159,3 +161,4 @@ Detailed instructions load automatically when working in these directories:
159161
- **[docs/network-policy.md](docs/network-policy.md)** - Cilium network policies
160162
- **[docs/argocd.md](docs/argocd.md)** - ArgoCD documentation
161163
- **[docs/vpa-resource-optimization.md](docs/vpa-resource-optimization.md)** - VPA auto-scaling
164+
- **[docs/plans/2026-03-22-alloy-otel-honeycomb-design.md](docs/plans/2026-03-22-alloy-otel-honeycomb-design.md)** - OTEL + Honeycomb observability design

docs/plans/2026-03-22-alloy-otel-honeycomb-design.md

Lines changed: 69 additions & 52 deletions
@@ -1,82 +1,99 @@
1-
# Alloy + OpenTelemetry + Honeycomb Design
1+
# OpenTelemetry Operator + Honeycomb Design
22

3-
**Date**: 2026-03-22
3+
**Date**: 2026-03-22 (updated 2026-03-23)
44
**Status**: Implementing
55

66
## Goal
77

8-
Deploy Grafana Alloy as a unified OpenTelemetry collector that dual-ships all telemetry to both local Grafana stack and Honeycomb SaaS. This enables learning OTEL while comparing self-hosted vs SaaS observability.
8+
Deploy the CNCF OpenTelemetry Operator with Collectors (agent + gateway) to replace Grafana Alloy. The Collectors dual-ship all telemetry to both the local Grafana stack and Honeycomb SaaS. Auto-instrumentation is enabled for zero-code trace generation.
99

1010
## Architecture
1111

1212
```
13-
──────────────┐
14-
│ Honeycomb
15-
│ (OTLP HTTP)
16-
└──────▲───────┘
17-
18-
┌───────────────────────────────┼────────────────────────┐
19-
│ Alloy DaemonSet (ns: alloy) │ │
20-
21-
┌──────────────┐ ┌────────┴─────────┐
22-
│ Pod log │───▶│ Batch processor │──────┐
23-
│ scraping │ (5s / 1024 batch)│
24-
└──────────────┘ └────────┬─────────┘ │
25-
│ │ │ │
26-
│ ┌──────────────┐ │ │
27-
│ │ OTLP receiver │─────traces─┘ │ │
28-
│ :4317 / :4318 │─────metrics──────────────────┘
29-
└──────────────┘
30-
└───────────────────────────────┼────────────────────────┘
31-
32-
┌────────────────────┼────────────────────┐
33-
34-
┌──────▼──────┐ ┌────────▼───────┐ ┌────────▼────────┐
35-
│ Loki Gateway │ │ Tempo :4317 │ Prometheus
36-
│ (loki-stack) │ │ (monitoring) │ │ remote-write
37-
└─────────────┘ └────────────────┘ └─────────────────┘
13+
┌─────────────────────────────────────────────────────────────┐
14+
│ OTEL Operator (Deployment)
15+
│ - Manages Collector instances via OpenTelemetryCollector CRD
16+
│ - Injects auto-instrumentation via Instrumentation CRD
17+
└─────────────────────────────────────────────────────────────┘
18+
19+
┌─────────────────────────────────────────────────────────────┐
20+
OTEL Collector Agent (DaemonSet) — per node
21+
- filelog receiver: scrapes /var/log/pods
22+
- otlp receiver: accepts traces/metrics from instrumented
23+
apps on :4317/:4318
24+
- Forwards all signals to Gateway via OTLP gRPC
25+
└──────────────────────────┬──────────────────────────────────┘
26+
│ OTLP gRPC
27+
──────────────────────────▼──────────────────────────────────┐
28+
OTEL Collector Gateway (Deployment, 2 replicas)
29+
- k8sattributes: enriches with k8s metadata from API
30+
│ - resource: sets service.name, cluster name │
31+
│ - batch: 10s / 8192 items
32+
│ - Fan-out to all backends:
33+
→ Loki via OTLP HTTP (logs)
34+
→ Tempo via OTLP gRPC (traces) │
35+
→ Prometheus remote-write (metrics)
36+
→ Honeycomb via OTLP HTTP (everything)
37+
└─────────────────────────────────────────────────────────────┘
3838
```
3939

4040
## Data Flow
4141

42-
| Signal | Source | Local Destination | Honeycomb |
43-
|---------|---------------------|--------------------------------------------|-----------|
44-
| Logs | Pod stdout/stderr | Loki via loki.write | OTLP HTTP |
45-
| Logs | K8s events | Loki via loki.write | OTLP HTTP |
46-
| Traces | Apps → OTLP :4317/8 | Tempo via OTLP gRPC | OTLP HTTP |
47-
| Metrics | Apps → OTLP :4317/8 | Prometheus via remote-write | OTLP HTTP |
42+
| Signal | Source | Local Destination | Honeycomb |
43+
|---------|---------------------------------|---------------------------|------------|
44+
| Logs | Pod stdout/stderr (filelog) | Loki via OTLP HTTP | OTLP HTTP |
45+
| Traces | Auto-instrumented apps → OTLP | Tempo via OTLP gRPC | OTLP HTTP |
46+
| Metrics | Auto-instrumented apps → OTLP | Prometheus remote-write | OTLP HTTP |
4847

4948
## Components
5049

51-
### New: `monitoring/alloy/`
50+
### New: `infrastructure/controllers/opentelemetry-operator/`
5251

53-
| File | Purpose |
54-
|---------------------|--------------------------------------------|
55-
| `ns.yaml` | Namespace `alloy` |
56-
| `kustomization.yaml`| Helm chart reference (alloy 1.6.2) |
57-
| `values.yaml` | DaemonSet config + Alloy pipeline |
58-
| `externalsecret.yaml`| Honeycomb API key from 1Password |
52+
| File | Purpose |
53+
|------------------------|------------------------------------------------------|
54+
| `ns.yaml` | Namespace `opentelemetry` |
55+
| `kustomization.yaml` | Helm chart (opentelemetry-operator 0.105.1) |
56+
| `values.yaml` | Operator config, cert-manager webhooks |
57+
| `externalsecret.yaml` | Honeycomb API key from 1Password |
58+
| `collector-agent.yaml` | OpenTelemetryCollector CRD (DaemonSet mode) |
59+
| `collector-gateway.yaml`| OpenTelemetryCollector CRD (Deployment mode) |
60+
| `instrumentation.yaml` | Instrumentation CRD (auto-inject config) |
5961

60-
### Modified: `monitoring/tempo/values.yaml`
62+
### Modified: `infrastructure/controllers/argocd/apps/infrastructure-appset.yaml`
6163

62-
Added OTLP gRPC (:4317) and HTTP (:4318) receivers so Tempo accepts traces from Alloy.
64+
Added `infrastructure/controllers/opentelemetry-operator` to the explicit path list.
65+
66+
### Modified (earlier): `monitoring/tempo/values.yaml`
67+
68+
Added OTLP gRPC (:4317) and HTTP (:4318) receivers.
69+
70+
### Deleted: `monitoring/alloy/`
71+
72+
Entire directory removed — replaced by OTEL Operator + Collector.
6373

6474
## Secrets
6575

66-
| Secret | Namespace | 1Password Key | Property |
67-
|---------------------|-----------|---------------|------------|
68-
| `honeycomb-api-key` | `alloy` | `honeycomb` | `api-key` |
76+
| Secret | Namespace | 1Password Key | Property |
77+
|---------------------|----------------|---------------|------------|
78+
| `honeycomb-api-key` | `opentelemetry`| `honeycomb` | `api-key` |
6979

70-
## How Apps Send Telemetry
80+
## Auto-Instrumentation
7181

72-
Apps instrumented with OTEL SDKs should set their exporter endpoint to:
82+
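Apps opt in by adding an annotation to the pod template of their Deployment (`spec.template.metadata`), since the Operator's webhook reads annotations from Pods and Namespaces rather than from the Deployment's own metadata: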
7383

74-
```
75-
OTEL_EXPORTER_OTLP_ENDPOINT=http://alloy.alloy.svc.cluster.local:4317
84+
```yaml
85+
metadata:
86+
  annotations:
87+
    instrumentation.opentelemetry.io/inject-python: "true"
88+
    # or: inject-nodejs, inject-java, inject-go, inject-dotnet
7689
```
7790

78-
Alloy handles the fan-out to all backends.
91+
The Operator's webhook injects an init container with the OTEL SDK. The app automatically generates traces sent to the Agent's OTLP endpoint.
7992

8093
## Deployment
8194

82-
Auto-discovered by the monitoring AppSet (`monitoring/*` glob) at sync wave 5. No manual Application resource needed.
95+
Deployed via the infrastructure AppSet at sync wave 4. The Operator needs cert-manager for webhook TLS (cert-manager is already in the infrastructure AppSet).
96+
97+
## RBAC
98+
99+
The Operator creates ServiceAccounts for the Collectors. The gateway's `otel-gateway` SA needs RBAC to list/watch pods for the `k8sattributes` processor. The Operator handles this automatically.

infrastructure/controllers/argocd/apps/infrastructure-appset.yaml

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ spec:
2020
- path: infrastructure/controllers/metrics-server
2121
- path: infrastructure/controllers/reloader
2222
- path: infrastructure/controllers/vertical-pod-autoscaler
23+
- path: infrastructure/controllers/opentelemetry-operator
2324
- path: infrastructure/networking/cloudflared
2425
- path: infrastructure/networking/gateway
2526
- path: infrastructure/storage/container-registry

infrastructure/controllers/opentelemetry-operator/collector-agent.yaml

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
1+
apiVersion: opentelemetry.io/v1beta1
2+
kind: OpenTelemetryCollector
3+
metadata:
4+
  name: otel-agent
5+
  namespace: opentelemetry
6+
spec:
7+
  mode: daemonset
8+
  serviceAccount: otel-agent
9+
  tolerations:
10+
    - operator: Exists
11+
  resources:
12+
    requests:
13+
      cpu: 50m
14+
      memory: 128Mi
15+
    limits:
16+
      cpu: 500m
17+
      memory: 512Mi
18+
  volumeMounts:
19+
    - name: varlogpods
20+
      mountPath: /var/log/pods
21+
      readOnly: true
22+
  volumes:
23+
    - name: varlogpods
24+
      hostPath:
25+
        path: /var/log/pods
26+
  env:
27+
    - name: K8S_NODE_NAME
28+
      valueFrom:
29+
        fieldRef:
30+
          fieldPath: spec.nodeName
31+
  config:
32+
    receivers:
33+
      # Scrape pod logs from filesystem (more reliable than K8s API)
34+
      filelog:
35+
        include:
36+
          - /var/log/pods/*/*/*.log
37+
        exclude:
38+
          - /var/log/pods/opentelemetry_otel-agent*/**
39+
        start_at: end
40+
        include_file_path: true
41+
        include_file_name: false
42+
        operators:
43+
          # Parse CRI container log format
44+
          - type: container
45+
            id: container-parser
46+
            max_log_size: 102400
47+
          # Extract k8s metadata from file path
48+
          - type: regex_parser
49+
            id: extract_metadata_from_filepath
50+
            regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
51+
            parse_from: attributes["log.file.path"]
52+
          - type: move
53+
            from: attributes.namespace
54+
            to: resource["k8s.namespace.name"]
55+
          - type: move
56+
            from: attributes.pod_name
57+
            to: resource["k8s.pod.name"]
58+
          - type: move
59+
            from: attributes.container_name
60+
            to: resource["k8s.container.name"]
61+
          - type: move
62+
            from: attributes.uid
63+
            to: resource["k8s.pod.uid"]
64+
          - type: move
65+
            from: attributes.restart_count
66+
            to: resource["k8s.container.restart_count"]
67+
          - type: add
68+
            field: resource["k8s.node.name"]
69+
            value: EXPR(env("K8S_NODE_NAME"))
70+
71+
      # Accept OTLP from auto-instrumented apps
72+
      otlp:
73+
        protocols:
74+
          grpc:
75+
            endpoint: 0.0.0.0:4317
76+
          http:
77+
            endpoint: 0.0.0.0:4318
78+
79+
    processors:
80+
      batch:
81+
        timeout: 5s
82+
        send_batch_size: 1024
83+
84+
    exporters:
85+
      # Forward everything to the gateway
86+
      otlp/gateway:
87+
        endpoint: otel-gateway-collector.opentelemetry.svc.cluster.local:4317
88+
        tls:
89+
          insecure: true
90+
91+
    service:
92+
      pipelines:
93+
        logs:
94+
          receivers: [filelog]
95+
          processors: [batch]
96+
          exporters: [otlp/gateway]
97+
        traces:
98+
          receivers: [otlp]
99+
          processors: [batch]
100+
          exporters: [otlp/gateway]
101+
        metrics:
102+
          receivers: [otlp]
103+
          processors: [batch]
104+
          exporters: [otlp/gateway]
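
The `regex_parser` operator in the agent config above derives namespace, pod name, pod UID, container name, and restart count from the log file path. A minimal sketch for sanity-checking that pattern offline is shown here. It assumes Python's `re` module behaves the same as the Collector's Go regexp engine for this particular expression, and the sample path, `parse_pod_log_path`, and `POD_LOG_PATH` are hypothetical names made up for illustration, not part of the commit.

```python
import re
from typing import Optional

# Same pattern as the regex_parser operator in collector-agent.yaml.
# Assumption: Python's `re` and Go's regexp agree on this expression.
POD_LOG_PATH = re.compile(
    r'^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]+)'
    r'\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
)


def parse_pod_log_path(path: str) -> Optional[dict]:
    """Return the named groups for a pod log path, or None if it does not match."""
    match = POD_LOG_PATH.match(path)
    return match.groupdict() if match else None


# Hypothetical path following the /var/log/pods/<ns>_<pod>_<uid>/<container>/<n>.log layout.
SAMPLE_PATH = (
    "/var/log/pods/"
    "default_my-app-7d9f8b6c5-x2x4z_0a1b2c3d-1111-2222-3333-444455556666"
    "/app/0.log"
)

fields = parse_pod_log_path(SAMPLE_PATH)
print(fields)
```

Paths that do not follow the `<namespace>_<pod>_<uid>` directory layout return `None` here. In the Collector, the equivalent mismatch is reported as a parse error on that log entry, so it may be worth running a few real paths from a node through a check like this before rollout.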
