feat(observability): Prometheus metrics + OTel spans for git/sops hot path (#63)
cloneAndReadFile is on every reconcile but emitted no metrics or spans, so
git-fetch and decrypt latency were invisible during incidents.
Add an internal/metrics package registering custom collectors with the
controller-runtime registry (exposed on the existing /metrics endpoint):
- provider_kubeconfig_git_fetch_duration_seconds{repo,branch,operation,result}
- provider_kubeconfig_git_cache_total{repo,branch,operation}
- provider_kubeconfig_sops_decrypt_duration_seconds{format,result}
- provider_kubeconfig_reconcile_errors_total{stage} (git|decrypt|secret|downstream)
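
The duration metrics above pair each timing with a `result` label. A minimal sketch of that pattern follows, using only the standard library. The `recorder` type is a hypothetical in-memory stand-in for a Prometheus histogram, and the label values `success` and `error` are assumptions, not taken from the commit.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDecrypt is a hypothetical failure used only to exercise the error path.
var errDecrypt = errors.New("decrypt failed")

// sample is one recorded observation: metric name, result label, duration.
type sample struct {
	Metric  string
	Result  string
	Seconds float64
}

// recorder is a hypothetical in-memory stand-in for a histogram vector.
// A real implementation would observe into a registered collector instead.
type recorder struct{ samples []sample }

// observe runs fn, measures its wall-clock duration, and records the
// duration together with a result label derived from the returned error.
// The label values "success" and "error" are assumptions.
func (r *recorder) observe(metric string, fn func() error) error {
	start := time.Now()
	err := fn()
	result := "success"
	if err != nil {
		result = "error"
	}
	r.samples = append(r.samples, sample{
		Metric:  metric,
		Result:  result,
		Seconds: time.Since(start).Seconds(),
	})
	return err
}

func main() {
	rec := &recorder{}
	_ = rec.observe("provider_kubeconfig_git_fetch_duration_seconds", func() error { return nil })
	_ = rec.observe("provider_kubeconfig_sops_decrypt_duration_seconds", func() error { return errDecrypt })
	for _, s := range rec.samples {
		fmt.Printf("%s result=%s seconds=%.6f\n", s.Metric, s.Result, s.Seconds)
	}
}
```

Recording on both the success and failure paths keeps slow failures (for example, a fetch that times out) visible in the same histogram as slow successes.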
EnsureCloned now returns an Operation (clone|pull|revision) so the cache
counter can distinguish a fresh clone from a cache-hit pull, without the
git package taking an observability dependency. FormatFromPath is exported
for the decrypt-format metric label.
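
As a rough illustration of the paragraph above, the sketch below shows how a returned `Operation` value could feed a counter label while the git package itself stays free of metrics code. The three operation values come from the commit message. The Go identifiers, the `cacheCounter` stand-in, and the extension-to-format mapping in `FormatFromPath` are all assumptions; the real function's signature and mapping are not shown in the source.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Operation names what the cache lookup had to do. The string values are
// from the commit message; the constant names are assumptions.
type Operation string

const (
	OperationClone    Operation = "clone"    // fresh clone
	OperationPull     Operation = "pull"     // cache hit, updated by pull
	OperationRevision Operation = "revision" // semantics not spelled out in the commit message
)

// cacheCounter is a hypothetical in-memory stand-in for the
// provider_kubeconfig_git_cache_total counter, keyed by its label values.
type cacheCounter map[string]int

// inc bumps the counter for one (repo, branch, operation) label tuple.
func (c cacheCounter) inc(repo, branch string, op Operation) {
	c[fmt.Sprintf("%s|%s|%s", repo, branch, op)]++
}

// FormatFromPath guesses a format label from a file extension.
// This mapping is an assumption made for illustration only.
func FormatFromPath(path string) string {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".yaml", ".yml":
		return "yaml"
	case ".json":
		return "json"
	case ".env":
		return "dotenv"
	case ".ini":
		return "ini"
	default:
		return "binary"
	}
}

func main() {
	counts := cacheCounter{}
	// The caller, not the git package, turns the Operation into a label.
	counts.inc("https://example.com/org/repo.git", "main", OperationClone)
	counts.inc("https://example.com/org/repo.git", "main", OperationPull)
	fmt.Println(counts, FormatFromPath("clusters/prod.yaml"))
}
```

Returning a plain value and letting the caller record it is what keeps the observability dependency out of the git package.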
Add OpenTelemetry spans around EnsureCloned, ReadFile and SOPSDecrypt via
a new internal/tracing package. Tracing is off by default and activates
only when a standard OTLP endpoint is configured (OTEL_EXPORTER_OTLP_*),
so behavior is unchanged out of the box; failures to init only log.
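
The opt-in behavior described above might look roughly like the following. This is a sketch under stated assumptions: the function names are hypothetical, `setup` stands in for whatever exporter construction the real `internal/tracing` package performs, and checking exactly these two endpoint variables is a guess at what "a standard OTLP endpoint is configured" means.

```go
package main

import (
	"fmt"
	"os"
)

// otlpEndpointVars are standard OpenTelemetry environment variables that
// name an OTLP endpoint. Checking exactly these two is an assumption.
var otlpEndpointVars = []string{
	"OTEL_EXPORTER_OTLP_ENDPOINT",
	"OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
}

// tracingEnabled reports whether any endpoint variable is set and non-empty.
// getenv is injected so the check can be exercised without touching the
// process environment.
func tracingEnabled(getenv func(string) string) bool {
	for _, name := range otlpEndpointVars {
		if getenv(name) != "" {
			return true
		}
	}
	return false
}

// initTracing (hypothetical) runs setup only when an endpoint is configured.
// A setup failure is logged and otherwise ignored, so the caller keeps
// running without tracing. It returns true only if tracing is active.
func initTracing(getenv func(string) string, setup func() error, logf func(string, ...interface{})) bool {
	if !tracingEnabled(getenv) {
		return false
	}
	if err := setup(); err != nil {
		logf("tracing disabled: %v", err)
		return false
	}
	return true
}

func main() {
	logf := func(format string, args ...interface{}) { fmt.Printf(format+"\n", args...) }
	setup := func() error { return nil } // placeholder for real exporter setup
	fmt.Println("tracing active:", initTracing(os.Getenv, setup, logf))
}
```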
Also enrich the wrapped errors on this path to carry repo URL and file
path, and document metrics + tracing in the README.
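
One common way to carry the repo URL and file path in a wrapped error is `fmt.Errorf` with `%w`, which keeps the original cause available to `errors.Is`. The sketch below assumes that approach; `readFile`, `cloneAndRead`, and `errNotFound` are hypothetical names, and the message layout is illustrative rather than the one the commit uses.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical low-level failure.
var errNotFound = errors.New("file not found")

// readFile is a hypothetical stand-in that always fails, so the wrapping
// below can be demonstrated without a real repository.
func readFile(repoURL, path string) ([]byte, error) {
	return nil, errNotFound
}

// cloneAndRead wraps the underlying error with the repo URL and file path.
// %w preserves the cause for errors.Is and errors.As.
func cloneAndRead(repoURL, path string) ([]byte, error) {
	data, err := readFile(repoURL, path)
	if err != nil {
		return nil, fmt.Errorf("read %q from repo %q: %w", path, repoURL, err)
	}
	return data, nil
}

// isNotFound reports whether err wraps errNotFound anywhere in its chain.
func isNotFound(err error) bool {
	return errors.Is(err, errNotFound)
}

func main() {
	_, err := cloneAndRead("https://example.com/org/repo.git", "clusters/prod.yaml")
	fmt.Println(err)
	fmt.Println(isNotFound(err))
}
```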
Closes #63
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
README.md (17 additions, 0 deletions)
@@ -338,6 +338,23 @@ Git sources are cloned into a per-repo cache directory. The cache root is create
| `PROVIDER_KUBECONFIG_CACHE_DIR` | `$XDG_CACHE_HOME/provider-kubeconfig` (else `$TMPDIR/provider-kubeconfig`) | Cache root. Point at a dedicated writable volume (e.g. an `emptyDir`) to keep clones off shared `/tmp`. |
| `PROVIDER_KUBECONFIG_CACHE_MAX_ENTRIES` | `32` | Max cached repo directories retained before LRU eviction. |
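
The default described in the table for `PROVIDER_KUBECONFIG_CACHE_DIR` could be resolved along these lines. This is a sketch, not the provider's code: `cacheRoot` is a hypothetical name, and the final fallback to `/tmp` when `TMPDIR` is unset is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// cacheRoot resolves the cache root in the order the table describes:
// explicit override, then $XDG_CACHE_HOME, then $TMPDIR.
// getenv is injected so the lookup order can be tested in isolation.
func cacheRoot(getenv func(string) string) string {
	if dir := getenv("PROVIDER_KUBECONFIG_CACHE_DIR"); dir != "" {
		return dir
	}
	if xdg := getenv("XDG_CACHE_HOME"); xdg != "" {
		return filepath.Join(xdg, "provider-kubeconfig")
	}
	tmp := getenv("TMPDIR")
	if tmp == "" {
		tmp = "/tmp" // assumption: fallback when TMPDIR is unset
	}
	return filepath.Join(tmp, "provider-kubeconfig")
}

func main() {
	fmt.Println(cacheRoot(os.Getenv))
}
```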
## Observability

### Metrics

Custom Prometheus metrics are exposed on the manager's existing `/metrics` endpoint alongside the standard controller-runtime and Crossplane metrics:
The reconcile hot path emits OpenTelemetry spans (`git.EnsureCloned`, `git.ReadFile`, `sops.Decrypt`) so traces show which phase dominates. Tracing is **off by default** and activates when a standard OTLP endpoint is configured — e.g. set `OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317`. Standard `OTEL_*` env vars (headers, TLS, sampling) are honored.