[LFXV2-1046] Add OTEL infrastructure and build-time variables#29
[LFXV2-1046] Add OTEL infrastructure and build-time variables#29
Conversation
- Create pkg/utils/otel.go with full OTEL SDK setup - Add Jaeger propagator support - Add Version, BuildTime, GitCommit build-time variables - Update OTEL shutdown to use context with timeout - Add ldflags to ko-build workflows for version injection - Add slog-otel handler for trace correlation in logs - Add comprehensive logging and OTEL tests 🤖 Generated with [Claude Code](https://claude.com/claude-code) Issue: LFXV2-1046 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
WalkthroughAdds OpenTelemetry integration (config, SDK setup, exporters, propagators), exposes build-time metadata via ldflags/CI (Version, BuildTime, GitCommit), instruments HTTP and logging, updates Helm values/templates and CI build steps, and adds tests and dependency upgrades for OTel and slog-otel. Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application (main)
participant Env as Environment/CI
participant OTelUtil as utils.OTel
participant SDK as OpenTelemetry SDK
participant Exporter as OTLP Exporter
App->>Env: read OTEL_*, VERSION, BUILD_TIME, GIT_COMMIT
App->>OTelUtil: OTelConfigFromEnv() / build OTelConfig
OTelUtil-->>App: OTelConfig
App->>OTelUtil: SetupOTelSDKWithConfig(ctx, cfg)
OTelUtil->>SDK: create Resource & Composite Propagator
alt traces enabled
OTelUtil->>Exporter: create trace exporter (HTTP/gRPC)
Exporter-->>OTelUtil: trace exporter ready
OTelUtil->>SDK: register TracerProvider
end
alt metrics enabled
OTelUtil->>Exporter: create metrics exporter
Exporter-->>OTelUtil: metrics exporter ready
OTelUtil->>SDK: register MeterProvider
end
alt logs enabled
OTelUtil->>Exporter: create logs exporter
Exporter-->>OTelUtil: logs exporter ready
OTelUtil->>SDK: register LoggerProvider
end
OTelUtil-->>App: return shutdown func
App->>App: defer shutdown on exit -> SDK shutdown (flush & close)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
No actionable comments were generated in the recent review. 🎉 🧹 Recent nitpick comments
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @.github/workflows/ko-build-tag.yaml:
- Line 69: Remove the unnecessary trailing backslash after the --sbom spdx flag
in the ko build command; locate the line containing "--sbom spdx \" and delete
the trailing backslash so the flag reads "--sbom spdx" (no continuation) as it
is the last argument to the ko build invocation.
In `@go.mod`:
- Around line 35-48: Update the indirect module entry for
github.com/nats-io/nkeys in go.mod from v0.4.5 to v0.4.6 (or later) to remediate
CVE-2023-46129, then refresh module dependencies so the parent NATS dependency
picks up the fixed version (e.g., run your usual Go module update/tidy workflow
to update go.sum and vendor as needed); ensure the go.mod entry for
github.com/nats-io/nkeys reflects v0.4.6+ and that builds/tests pass.
In `@pkg/log/log_test.go`:
- Around line 151-261: Replace manual environment save/restore logic with
t.Setenv to avoid state leakage: in TestInitStructureLogConfig_DefaultLevel call
t.Setenv("LOG_LEVEL", "") (or t.Setenv to original when needed) instead of
os.Getenv/os.Unsetenv/defer; in TestInitStructureLogConfig_WithLogLevel, inside
each subtest use t.Setenv("LOG_LEVEL", tc.logLevel) rather than the outer manual
save/restore; in TestInitStructureLogConfig_WithAddSource, inside each subtest
use t.Setenv("LOG_ADD_SOURCE", tc.addSource) instead of the current os package
manipulation; keep assertions that InitStructureLogConfig() returns non-nil.
Ensure you remove the corresponding os.Getenv/os.Setenv/os.Unsetenv defer
blocks.
In `@pkg/utils/otel_test.go`:
- Around line 15-188: The tests mutate global env vars manually; replace manual
os.Setenv + defer os.Unsetenv with t.Setenv for isolation in
TestOTelConfigFromEnv_CustomValues, TestOTelConfigFromEnv_TracesSampleRatio
(subtests), and TestOTelConfigFromEnv_InsecureFlag so each subtest uses
t.Setenv("VAR", "value") and you can remove the defer cleanup; keep
TestOTelConfigFromEnv_Defaults behavior that requires variables to be completely
unset by ensuring you call os.Unsetenv for the list at the top of that test (or
use a helper that unsets then verifies OTelConfigFromEnv), and update references
to cfg usage in those tests to rely on t.Setenv-scoped values (e.g.,
OTEL_SERVICE_NAME, OTEL_TRACES_SAMPLE_RATIO, OTEL_EXPORTER_OTLP_INSECURE) so
tests are isolated and no longer use manual defer cleanup.
🧹 Nitpick comments (7)
pkg/log/log_test.go (3)
25-32: Test should actually exercise the nil‑parent branch.The comment says “nil parent context,” but the test passes
context.TODO(). Consider passingnilto validate that code path.🔧 Suggested tweak
- ctx := AppendCtx(context.TODO(), attr) + ctx := AppendCtx(nil, attr)
116-149: Consider asserting that context attributes were actually injected.Right now the test captures the record but does not validate that
contextHandleradded the context attrs to the record.
263-353: Restore the previous slog default after this test.
InitStructureLogConfig()changes global slog state; restoring it avoids cross‑test interference.🔧 Suggested cleanup
- // Initialize logging (this sets up the slog-otel handler) - InitStructureLogConfig() + // Initialize logging (this sets up the slog-otel handler) + prevLogger := slog.Default() + InitStructureLogConfig() + defer slog.SetDefault(prevLogger)cmd/mailing-list-api/main.go (1)
74-80: Consider usingshutdownCtxfor error logging.On line 78,
slog.ErrorContext(ctx, ...)uses the originalctxfrom line 59, which may be canceled by the time the deferred shutdown runs. UsingshutdownCtxwould be more consistent.Suggested fix
defer func() { shutdownCtx, cancel := context.WithTimeout(context.Background(), gracefulShutdownSeconds*time.Second) defer cancel() if shutdownErr := otelShutdown(shutdownCtx); shutdownErr != nil { - slog.ErrorContext(ctx, "error shutting down OpenTelemetry SDK", "error", shutdownErr) + slog.ErrorContext(shutdownCtx, "error shutting down OpenTelemetry SDK", "error", shutdownErr) } }()pkg/log/log.go (1)
99-114: Return value inconsistency: returns base handler, not wrapped handler.The function returns
h(the baseJSONHandlerfrom line 99) rather than the fully wrappedlogger(which includesOtelHandlerandcontextHandler). If callers use the returned handler directly, they won't get trace correlation or context-based attributes.If the intent is to expose the base handler for testing/extension, consider documenting this clearly. Otherwise, consider returning the fully wrapped handler.
Option: Return the wrapped handler
// Wrap with contextHandler to support context-based attributes logger := contextHandler{otelHandler} slog.SetDefault(slog.New(logger)) slog.Info("log config", "logLevel", logOptions.Level, "addSource", logOptions.AddSource, ) - return h + return logger }pkg/utils/otel.go (2)
92-94: Minor: Inconsistent boolean parsing forOTEL_EXPORTER_OTLP_INSECURE.This only accepts
"true"whileLOG_ADD_SOURCEinpkg/log/log.goaccepts"true","t", or"1". For security-sensitive flags like insecure connections, stricter parsing is reasonable, but documenting this expectation would help avoid user confusion.
279-286: UseParentBasedsampler to ensure consistent end-to-end trace sampling.
TraceIDRatioBasedignores the parent span's sampling decision, which can result in child spans being sampled independently—leading to incomplete or fragmented traces. Wrapping withParentBasedensures that children inherit the parent's sampling decision, maintaining trace integrity across the entire call chain.Suggested fix
traceProvider := trace.NewTracerProvider( trace.WithResource(res), - trace.WithSampler(trace.TraceIDRatioBased(cfg.TracesSampleRatio)), + trace.WithSampler(trace.ParentBased(trace.TraceIDRatioBased(cfg.TracesSampleRatio))), trace.WithBatcher(exporter, trace.WithBatchTimeout(time.Second), ), )
There was a problem hiding this comment.
Pull request overview
Adds OpenTelemetry (OTel) plumbing and build metadata injection to improve observability (traces/metrics/log context) and traceability of deployed binaries.
Changes:
- Introduces OTel SDK setup/config helpers (env-driven) plus unit tests for config parsing and SDK initialization.
- Updates structured logging to include trace/span IDs (via
slog-otel) and improves context attribute handling. - Extends Helm + GitHub Actions to configure OTel via env vars and inject build-time metadata via
ldflags.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
cmd/mailing-list-api/main.go |
Initializes OTel SDK on startup and logs build metadata (version/build time/commit). |
pkg/utils/otel.go |
Adds env-driven OTel config + SDK bootstrap for traces/metrics/logs exporters. |
pkg/utils/otel_test.go |
Adds tests for env parsing, constants, resource/propagator, and shutdown behavior. |
pkg/log/log.go |
Wraps slog handler with slog-otel, returns handler, and fixes context attr race by copying slices. |
pkg/log/log_test.go |
Adds tests around ctx attrs, log init behavior, and trace/span IDs in JSON logs. |
charts/lfx-v2-mailing-list-service/values.yaml |
Adds Helm values to configure OTel-related environment variables. |
charts/lfx-v2-mailing-list-service/templates/deployment.yaml |
Injects OTel env vars into the Deployment template from Helm values. |
.github/workflows/ko-build-main.yaml |
Adds build-time ldflags injection for VERSION/BUILD_TIME/GIT_COMMIT. |
.github/workflows/ko-build-branch.yaml |
Adds build-time ldflags injection for PR builds. |
.github/workflows/ko-build-tag.yaml |
Adds build-time ldflags injection for tagged releases. |
go.mod |
Adds OTel SDK/exporter deps + slog-otel, bumps/refreshes dependencies. |
go.sum |
Records updated dependency checksums for the new/updated modules. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| protocol := os.Getenv("OTEL_EXPORTER_OTLP_PROTOCOL") | ||
| if protocol == "" { | ||
| protocol = OTelProtocolGRPC | ||
| } | ||
|
|
||
| endpoint := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") | ||
|
|
||
| insecure := os.Getenv("OTEL_EXPORTER_OTLP_INSECURE") == "true" | ||
|
|
||
| tracesExporter := os.Getenv("OTEL_TRACES_EXPORTER") | ||
| if tracesExporter == "" { | ||
| tracesExporter = OTelExporterNone | ||
| } | ||
|
|
||
| metricsExporter := os.Getenv("OTEL_METRICS_EXPORTER") | ||
| if metricsExporter == "" { | ||
| metricsExporter = OTelExporterNone | ||
| } | ||
|
|
||
| logsExporter := os.Getenv("OTEL_LOGS_EXPORTER") | ||
| if logsExporter == "" { | ||
| logsExporter = OTelExporterNone | ||
| } |
There was a problem hiding this comment.
OTEL_EXPORTER_OTLP_PROTOCOL and the *_EXPORTER env vars aren’t validated: any non-empty protocol other than "http" silently falls back to gRPC, and any exporter value other than "none" will enable OTLP. This can lead to surprising behavior when users set an unsupported value (e.g., typos). Validate allowed values ("grpc"/"http" and "otlp"/"none"), warn, and default to safe values.
| // The record should have been modified to include context attributes | ||
| // Note: This is a basic test - in a real implementation, you'd need to | ||
| // check that the attributes were actually added to the record | ||
| } |
There was a problem hiding this comment.
TestContextHandler_Handle doesn’t assert that context attributes were actually added to the handled record (the test currently always passes as long as no error occurs). Add an assertion that capturedRecord contains both the original record attrs and the context-provided attrs (e.g., by iterating over capturedRecord.Attrs).
| // Set up trace provider if enabled. | ||
| if cfg.TracesExporter != OTelExporterNone { | ||
| var tracerProvider *trace.TracerProvider | ||
| tracerProvider, err = newTraceProvider(ctx, cfg, res) | ||
| if err != nil { | ||
| handleErr(err) | ||
| return | ||
| } | ||
| shutdownFuncs = append(shutdownFuncs, tracerProvider.Shutdown) | ||
| otel.SetTracerProvider(tracerProvider) | ||
| } | ||
|
|
||
| // Set up metrics provider if enabled. | ||
| if cfg.MetricsExporter != OTelExporterNone { | ||
| var metricsProvider *metric.MeterProvider |
There was a problem hiding this comment.
Because exporters are enabled via cfg.*Exporter != "none", a zero-value OTelConfig{} (empty strings) will attempt to initialize OTLP exporters and connect to the default collector endpoint. Consider normalizing empty exporter fields to OTelExporterNone (or adding explicit validation) so the zero value is safe and doesn’t unexpectedly start network activity.
There was a problem hiding this comment.
Is a zero-value OTelConfig being passed somewhere?
| } | ||
| }() | ||
|
|
||
| // Initialize logging (this sets up the slog-otel handler) |
There was a problem hiding this comment.
This test’s output depends on external LOG_LEVEL (if it’s set to warn/error, the InfoContext log won’t be emitted and the test can fail). Set LOG_LEVEL explicitly within the test (or use t.Setenv). Also consider restoring the previous default slog logger after the test (since InitStructureLogConfig globally reconfigures slog to write to the pipe writer that gets closed).
| // Initialize logging (this sets up the slog-otel handler) | |
| // Initialize logging (this sets up the slog-otel handler) | |
| // Ensure the test does not depend on external LOG_LEVEL and restore the previous logger after the test. | |
| prevLogger := slog.Default() | |
| defer slog.SetDefault(prevLogger) | |
| t.Setenv("LOG_LEVEL", "info") |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/ko-build-main.yaml:
- Around line 30-42: The GOFLAGS export uses multiple separate -ldflags entries
so only the last (-X=main.GitCommit) takes effect; update the GOFLAGS assignment
(the GOFLAGS variable that currently repeats -ldflags) to combine all -X
settings into one -ldflags value (so include -X=main.Version, -X=main.BuildTime
and -X=main.GitCommit together) and apply the same fix to the other workflow
files that set GOFLAGS the same way; look for the GOFLAGS export lines and
replace the repeated -ldflags occurrences with a single -ldflags containing all
three -X flags (main.Version, main.BuildTime, main.GitCommit).
GOFLAGS parses values as space-separated flags. Use separate -ldflags entries with = syntax for each -X flag. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
db181bb to
23ae0af
Compare
Declare and assign separately to avoid masking return values. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Issue: LFXV2-1046 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Issue: LFXV2-1046 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
- Add otelhttp middleware to trace HTTP requests - Fix zero-value OTelConfig edge case (empty string = disabled) - Fix lint errors in otel_test.go (handle os.Setenv errors) - Update semconv to v1.39.0 to match SDK version - Upgrade nkeys to v0.4.15 - Upgrade OTEL packages to v1.40.0 🤖 Generated with [Claude Code](https://claude.com/claude-code) Issue: LFXV2-1046 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
- Remove unused ErrKey constant and its test - Remove unused return value from InitStructureLogConfig 🤖 Generated with [Claude Code](https://claude.com/claude-code) Issue: LFXV2-1046 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
Parse OTEL_PROPAGATORS as a comma-separated list of propagator names (tracecontext, baggage, jaeger) instead of hardcoding all three. Returns an error for unsupported propagator names. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Issue: LFXV2-1046 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
Replace manual os.Setenv/os.Unsetenv with t.Setenv which auto-restores environment variables after each test. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Issue: LFXV2-1046 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: Trevor Bramwell <tbramwell@linuxfoundation.org>
This pull request introduces comprehensive OpenTelemetry (OTel) support to the mailing list API service, enabling distributed tracing, metrics, and logging. It also adds build-time versioning information to the application, improves logging with trace context, and enhances deployment configurability through Helm and CI pipeline updates.
The most important changes are:
OpenTelemetry Instrumentation and Configuration:
otelhttp.NewHandler, ensuring all requests are traced (cmd/mailing-list-api/http.go) [1] [2].cmd/mailing-list-api/main.go,pkg/utils) [1] [2].otelsection in Helm values for fine-grained OTel configuration (service name, version, exporters, endpoints, protocols, etc.), and updates the deployment manifest to inject these as environment variables (charts/lfx-v2-mailing-list-service/values.yaml,charts/lfx-v2-mailing-list-service/templates/deployment.yaml) [1] [2].Build-time Metadata and CI/CD Integration:
.ko.yamlconfig to inject build-time variables (Version,BuildTime,GitCommit) via linker flags; updates all relevant GitHub Actions workflows to pass these variables during image builds (.ko.yaml,.github/workflows/ko-build-*.yaml) [1] [2] [3] [4].cmd/mailing-list-api/main.go).Logging Improvements:
slog-otelhandler, and refactors log initialization for clarity and flexibility (pkg/log/log.go) [1] [2].AppendCtxby copying context attributes before appending (pkg/log/log.go).Dependency Updates:
go.mod.These changes collectively enable robust observability for the service, making it easier to monitor, trace, and debug in production environments.