feat(observability): metrics + OTel spans for git/sops hot path (#63)#81

Merged: patrick-hermann-sva merged 1 commit into main from feat/63-observability on Jun 14, 2026
@patrick-hermann-sva

Closes #63.

cloneAndReadFile runs on every reconcile but emitted no metrics or spans, so git-fetch and decrypt latency were a blind spot during incidents.

Metrics

New internal/metrics package registers custom collectors with the controller-runtime registry, so they're exposed on the manager's existing /metrics endpoint (no new server):

| Metric | Type | Labels |
| --- | --- | --- |
| provider_kubeconfig_git_fetch_duration_seconds | histogram | repo, branch, operation, result |
| provider_kubeconfig_git_cache_total | counter | repo, branch, operation |
| provider_kubeconfig_sops_decrypt_duration_seconds | histogram | format, result |
| provider_kubeconfig_reconcile_errors_total | counter | stage (one of git, decrypt, secret, downstream) |
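
As a rough illustration of how the duration histograms and their `result` label could be fed, here is a minimal stdlib-only sketch. Everything in it is an assumption rather than the code in this PR: `observeFunc` is a hypothetical stand-in for a histogram observation (in real code this would likely be a Prometheus `HistogramVec` via `WithLabelValues(...).Observe(...)`), `timeFetch` and `resultLabel` are made-up helper names, and the label values `success`/`error` are guesses, since the description does not list them.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// observeFunc is a hypothetical stand-in for recording one histogram sample.
type observeFunc func(labels map[string]string, seconds float64)

// resultLabel maps an error to a low-cardinality "result" label value.
// The values "success" and "error" are assumptions, not taken from the PR.
func resultLabel(err error) string {
	if err != nil {
		return "error"
	}
	return "success"
}

// timeFetch runs fetch, then reports the elapsed seconds together with the
// repo, branch, operation, and result labels. fetch returns the operation
// it performed (for example "clone" or "pull").
func timeFetch(repo, branch string, observe observeFunc, fetch func() (string, error)) error {
	start := time.Now()
	op, err := fetch()
	observe(map[string]string{
		"repo":      repo,
		"branch":    branch,
		"operation": op,
		"result":    resultLabel(err),
	}, time.Since(start).Seconds())
	return err
}

func main() {
	observe := func(labels map[string]string, seconds float64) {
		fmt.Printf("operation=%s result=%s\n", labels["operation"], labels["result"])
	}
	repo := "https://example.invalid/repo.git" // placeholder URL

	_ = timeFetch(repo, "main", observe, func() (string, error) {
		return "pull", nil
	})
	_ = timeFetch(repo, "main", observe, func() (string, error) {
		return "clone", errors.New("network unreachable")
	})
}
```

Recording the sample after the call returns, regardless of outcome, is what lets a single histogram cover both successful and failed fetches.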

EnsureCloned now returns an Operation (clone|pull|revision) so the cache counter can distinguish a fresh clone from a cache-hit pull, without the git package taking an observability dependency (which keeps it reusable). FormatFromPath is exported for the decrypt-format label.
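
The shape of that return value might look roughly like the sketch below. This is not the PR's code: the constant names, the `repoCache` type, and the `EnsureCloned` signature are hypothetical, no git commands are run, and treating `revision` as "a pinned revision was requested on an already-cloned repo" is an assumption about what that value means.

```go
package main

import "fmt"

// Operation describes what EnsureCloned did. The three values come from the
// PR description; the Go constant names are hypothetical.
type Operation string

const (
	OperationClone    Operation = "clone"
	OperationPull     Operation = "pull"
	OperationRevision Operation = "revision"
)

// repoCache is a hypothetical in-memory stand-in for the on-disk clone cache.
type repoCache struct {
	cloned map[string]bool
}

// EnsureCloned reports which operation it performed. It knows nothing about
// metrics; the caller decides what to count.
func (c *repoCache) EnsureCloned(repo, revision string) Operation {
	switch {
	case !c.cloned[repo]:
		c.cloned[repo] = true
		return OperationClone
	case revision != "":
		return OperationRevision
	default:
		return OperationPull
	}
}

func main() {
	cache := &repoCache{cloned: map[string]bool{}}
	counts := map[Operation]int{} // caller-side stand-in for the cache counter

	for _, rev := range []string{"", "", "abc123"} {
		op := cache.EnsureCloned("https://example.invalid/repo.git", rev)
		counts[op]++
	}
	fmt.Println(counts[OperationClone], counts[OperationPull], counts[OperationRevision]) // prints "1 1 1"
}
```

Returning a plain value keeps the dependency arrow pointing one way: the reconciler imports both the git and metrics packages, while the git package imports neither.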

Tracing

New internal/tracing package adds OTel spans around EnsureCloned, ReadFile, and SOPSDecrypt. Tracing is off by default and activates only when a standard OTLP endpoint is set (OTEL_EXPORTER_OTLP_ENDPOINT), so out-of-the-box behavior is unchanged and a bad endpoint is only logged, never fatal.

Also

  • Enriched the wrapped errors on this path to carry repo URL + file path (issue's error-wrapping ask).
  • README documents both metrics and tracing.
  • Tests: metrics recording (testutil), tracing disabled-by-default, and EnsureCloned operation assertion.

go build, go test -race, go vet, and golangci-lint (pinned v2.11.4) are all clean. The new direct deps (prometheus/client_golang, otel, otel/sdk, otel/trace, otlptracegrpc) were already present transitively; go mod tidy just promoted them.

🤖 Generated with Claude Code

… path (#63)

cloneAndReadFile is on every reconcile but emitted no metrics or spans, so
git-fetch and decrypt latency were invisible during incidents.

Add an internal/metrics package registering custom collectors with the
controller-runtime registry (exposed on the existing /metrics endpoint):
- provider_kubeconfig_git_fetch_duration_seconds{repo,branch,operation,result}
- provider_kubeconfig_git_cache_total{repo,branch,operation}
- provider_kubeconfig_sops_decrypt_duration_seconds{format,result}
- provider_kubeconfig_reconcile_errors_total{stage}  (git|decrypt|secret|downstream)

EnsureCloned now returns an Operation (clone|pull|revision) so the cache
counter can distinguish a fresh clone from a cache-hit pull, without the
git package taking an observability dependency. FormatFromPath is exported
for the decrypt-format metric label.

Add OpenTelemetry spans around EnsureCloned, ReadFile and SOPSDecrypt via
a new internal/tracing package. Tracing is off by default and activates
only when a standard OTLP endpoint is configured (OTEL_EXPORTER_OTLP_*),
so behavior is unchanged out of the box; failures to init only log.

Also enrich the wrapped errors on this path to carry repo URL and file
path, and document metrics + tracing in the README.

Closes #63

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@patrick-hermann-sva patrick-hermann-sva merged commit 965954e into main Jun 14, 2026
2 checks passed
@patrick-hermann-sva patrick-hermann-sva deleted the feat/63-observability branch June 14, 2026 04:54
Development

Successfully merging this pull request may close these issues.

Observability: add metrics and traces for git fetch and sops decrypt
