diff --git a/lfx_one/README.md b/lfx_one/README.md
index 301bf2d..328c998 100644
--- a/lfx_one/README.md
+++ b/lfx_one/README.md
@@ -12,6 +12,8 @@ release strategies, and deployment workflows.
staging, and production environments
- **[Secrets Management](./secrets-management.md)** - Complete guide for managing secrets using
1Password and AWS Secrets Manager across environments
+- **[Distributed Tracing](./tracing.md)** - OpenTelemetry tracing setup with
+ Datadog for Go services
## Architecture Reference
diff --git a/lfx_one/tracing.md b/lfx_one/tracing.md
new file mode 100644
index 0000000..1593220
--- /dev/null
+++ b/lfx_one/tracing.md
@@ -0,0 +1,632 @@
+# Distributed Tracing
+
+LFX uses [OpenTelemetry](https://opentelemetry.io/) (OTEL) for distributed
+tracing. Traces are collected by the
+[Datadog Agent](https://docs.datadoghq.com/opentelemetry/interoperability/otlp_ingest_in_the_agent/)
+running on each Kubernetes node and forwarded to Datadog for visualization
+and analysis.
+
+Only **traces** are collected via OTEL. Metrics are collected directly by the
+Datadog Agent, and logs use a separate pipeline. Trace and span IDs must be
+injected into log entries to correlate logs with traces in Datadog.
+
+## Architecture
+
+```mermaid
+flowchart LR
+ subgraph cluster["Kubernetes Cluster"]
+ subgraph node["Kubernetes Node"]
+ subgraph pod["Application Pod"]
+ app["Go Service<br/>(OTEL SDK)"]
+ end
+ agent["Datadog Node Agent<br/>:4317 gRPC"]
+ end
+ cluster_agent["Datadog Cluster Agent"]
+ end
+ dd["Datadog"]
+
+ app -- "OTLP/gRPC" --> agent
+ agent -- "APM traces" --> cluster_agent
+ cluster_agent -- "Datadog API" --> dd
+```
+
+The LFX Kubernetes platform runs two Datadog components:
+
+- **Datadog Node Agent** — runs as a DaemonSet on every node. Receives OTLP
+ traces from pods on the same node via gRPC on port `4317`, and collects
+ host-level metrics and logs.
+- **Datadog Cluster Agent** — runs as a Deployment (one per cluster).
+ Aggregates data from all node agents and forwards it to the Datadog backend.
+ Also provides cluster-level metadata enrichment for traces and metrics.
+
+Application pods export traces to the node-local agent at the node IP, which
+the Kubernetes downward API exposes to the pod as `HOST_IP`. This keeps trace
+traffic within the node and off the cluster network.
+
+## Go SDK Setup
+
+LFX Go services use the
+[OpenTelemetry Go SDK](https://opentelemetry.io/docs/languages/go/) to
+instrument traces. The exporter, resource, and sampler are configured through
+environment variables, so application code needs no Datadog-specific libraries.
+
+### Required Environment Variables
+
+Set the following environment variables in your Kubernetes deployment:
+
+```yaml
+env:
+  # HOST_IP must be declared before the variables that reference $(HOST_IP).
+  - name: HOST_IP
+    valueFrom:
+      fieldRef:
+        fieldPath: status.hostIP
+  - name: OTEL_SERVICE_NAME
+    value: "my-service"
+  - name: OTEL_SERVICE_VERSION
+    value: "1.0.0"
+  - name: OTEL_EXPORTER_OTLP_ENDPOINT
+    value: "http://$(HOST_IP):4317"
+  - name: OTEL_EXPORTER_OTLP_PROTOCOL
+    value: "grpc"
+  - name: OTEL_PROPAGATORS
+    value: "tracecontext,baggage,jaeger"
+  - name: OTEL_TRACES_SAMPLER
+    value: "parentbased_traceidratio"
+  - name: OTEL_TRACES_SAMPLER_ARG
+    value: "0.5"
+```
+
+### Variable Reference
+
+| Variable | Description | Example |
+| ----------------------------- | ---------------------------------- | ----------------------------- |
+| `OTEL_SERVICE_NAME` | Identifies the service in Datadog | `query-service` |
+| `OTEL_SERVICE_VERSION` | Service version shown in traces | `1.2.0` |
+| `OTEL_EXPORTER_OTLP_ENDPOINT` | Datadog Agent OTLP endpoint | `http://$(HOST_IP):4317` |
+| `OTEL_EXPORTER_OTLP_PROTOCOL` | Transport protocol | `grpc` |
+| `OTEL_PROPAGATORS` | Context propagation formats | `tracecontext,baggage,jaeger` |
+| `OTEL_TRACES_SAMPLER` | Sampling strategy | `parentbased_traceidratio` |
+| `OTEL_TRACES_SAMPLER_ARG` | Sampling ratio (0.0 - 1.0) | `0.5` |
+
+### Context Propagation
+
+The `OTEL_PROPAGATORS` variable configures which context propagation formats
+are used when traces cross service boundaries. LFX services use:
+
+- **tracecontext** - W3C Trace Context (primary standard)
+- **baggage** - W3C Baggage for cross-service key-value pairs
+- **jaeger** - Jaeger propagation for compatibility
+
+All services must use the same propagator configuration to maintain trace
+continuity across service calls.
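
When debugging propagation it can help to inspect the `traceparent` header a service actually received. The sketch below is a standalone illustration, not part of the LFX codebase: `TraceParent` and `parseTraceParent` are hypothetical names, and the parser only checks the field layout of a version `00` W3C header (it does not enforce lowercase hex or reject all-zero IDs).

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the four fields of a W3C traceparent header.
// Hypothetical helper type used only for this illustration.
type TraceParent struct {
	Version string
	TraceID string
	SpanID  string
	Sampled bool
}

// parseTraceParent splits a traceparent value into its fields and checks
// that each field is hex of the expected length (2, 32, 16, 2 characters).
func parseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(strings.TrimSpace(h), "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	wantLen := []int{2, 32, 16, 2}
	for i, p := range parts {
		if len(p) != wantLen[i] {
			return TraceParent{}, fmt.Errorf("traceparent: field %d has length %d, want %d", i, len(p), wantLen[i])
		}
		if _, err := hex.DecodeString(p); err != nil {
			return TraceParent{}, fmt.Errorf("traceparent: field %d is not hex: %w", i, err)
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return TraceParent{
		Version: parts[0],
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 1, // lowest flag bit is the sampled flag
	}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp.TraceID, tp.SpanID, tp.Sampled, err)
}
```

If the trace ID logged by the downstream service differs from the one in the upstream header, the two services are most likely not sharing a propagator format.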
+
+## Sampling Configuration
+
+Sampling controls what percentage of traces are recorded. LFX uses
+`parentbased_traceidratio` which respects the sampling decision of parent
+spans and applies ratio-based sampling to root spans.
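
The decision logic described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the SDK's implementation: `shouldSample` is a hypothetical function, and the threshold arithmetic only resembles what ratio-based samplers typically do (compare part of the trace ID against a bound derived from the ratio).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample sketches a parent-based, ratio-based decision.
// hasParent and parentSampled describe the incoming span context, if any.
func shouldSample(traceID [16]byte, hasParent, parentSampled bool, ratio float64) bool {
	// A child span always follows its parent's decision.
	if hasParent {
		return parentSampled
	}
	// Root spans: decide from the ratio.
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Map the ratio onto a 63-bit range and compare it with the low
	// 8 bytes of the trace ID, so the same trace ID always gets the
	// same decision.
	bound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < bound
}

func main() {
	var id [16]byte
	id[15] = 0x2a
	fmt.Println("root span, ratio 0.5:", shouldSample(id, false, false, 0.5))
	fmt.Println("child of unsampled parent:", shouldSample(id, true, false, 1.0))
}
```

Because the decision for a root span depends only on the trace ID and the ratio, every service that starts a trace with the same ID reaches the same verdict, and every child span inherits it.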
+
+### Recommended Ratios Per Environment
+
+```mermaid
+graph LR
+ subgraph environments["Sampling by Environment"]
+ dev["Dev<br/>50%"]
+ staging["Staging<br/>50%"]
+ prod["Production<br/>20%"]
+ end
+
+ style dev fill:#4a9,stroke:#333,color:#fff
+ style staging fill:#49a,stroke:#333,color:#fff
+ style prod fill:#a94,stroke:#333,color:#fff
+```
+
+| Environment | `OTEL_TRACES_SAMPLER_ARG` | Rationale |
+| ----------- | ------------------------- | -------------------------------------------- |
+| Development | `0.5` | High visibility for debugging |
+| Staging | `0.5` | Match dev for pre-release validation |
+| Production | `0.2` | Balance observability with cost and overhead |
+
+Set the `OTEL_TRACES_SAMPLER_ARG` value per environment in your Helm values
+or Kustomize overlays.
+
+## Minimal Go Example
+
+The following example shows the minimum setup for a Go service using
+environment-variable-driven configuration:
+
+```go
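+package main
+
+import (
+	"context"
+	"log"
+	"os"
+
+	"go.opentelemetry.io/contrib/propagators/jaeger"
+	"go.opentelemetry.io/otel"
+	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
+	"go.opentelemetry.io/otel/propagation"
+	"go.opentelemetry.io/otel/sdk/resource"
+	sdktrace "go.opentelemetry.io/otel/sdk/trace"
+	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
+)
+
+func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
+	// The exporter reads OTEL_EXPORTER_OTLP_ENDPOINT from the environment.
+	exporter, err := otlptracegrpc.New(ctx)
+	if err != nil {
+		return nil, err
+	}
+
+	// resource.WithFromEnv reads OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES.
+	// OTEL_SERVICE_VERSION is not a standard SDK variable, so it is applied
+	// explicitly when set.
+	opts := []resource.Option{
+		resource.WithFromEnv(),
+		resource.WithTelemetrySDK(),
+	}
+	if v := os.Getenv("OTEL_SERVICE_VERSION"); v != "" {
+		opts = append(opts, resource.WithAttributes(semconv.ServiceVersion(v)))
+	}
+	res, err := resource.New(ctx, opts...)
+	if err != nil {
+		return nil, err
+	}
+
+	// The sampler is picked up from OTEL_TRACES_SAMPLER and
+	// OTEL_TRACES_SAMPLER_ARG because no sampler option is passed here.
+	tp := sdktrace.NewTracerProvider(
+		sdktrace.WithBatcher(exporter),
+		sdktrace.WithResource(res),
+	)
+
+	otel.SetTracerProvider(tp)
+	// The core Go SDK does not apply OTEL_PROPAGATORS, so the propagators are
+	// set explicitly to match that variable.
+	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
+		propagation.TraceContext{},
+		propagation.Baggage{},
+		jaeger.Jaeger{},
+	))
+
+	return tp, nil
+}
+
+func main() {
+	ctx := context.Background()
+	tp, err := initTracer(ctx)
+	if err != nil {
+		log.Fatal(err)
+	}
+	// Flush buffered spans before the process exits.
+	defer func() {
+		if err := tp.Shutdown(context.Background()); err != nil {
+			log.Printf("tracer shutdown: %v", err)
+		}
+	}()
+
+	// Application code here; pass ctx to downstream calls so child spans
+	// attach to this span.
+	tracer := otel.Tracer("my-service")
+	ctx, span := tracer.Start(ctx, "operation-name")
+	defer span.End()
+}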
+```
+
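+The OTEL SDK reads `OTEL_EXPORTER_OTLP_ENDPOINT` in `otlptracegrpc.New(ctx)`,
+`OTEL_SERVICE_NAME` and `OTEL_RESOURCE_ATTRIBUTES` in `resource.WithFromEnv()`,
+and `OTEL_TRACES_SAMPLER`/`OTEL_TRACES_SAMPLER_ARG` in
+`sdktrace.NewTracerProvider` when no sampler option is passed. The core Go SDK
+does not read `OTEL_PROPAGATORS`, which is why the example sets the propagators
+explicitly. `OTEL_SERVICE_VERSION` is not a standard OTEL variable: set
+`service.version` through `OTEL_RESOURCE_ATTRIBUTES` or apply it in code.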
+
+## Trace/Log Correlation
+
+Trace and span IDs must be injected into log entries so that logs can be
+linked to their corresponding trace in the observability backend.
+
+### Recommended: slog-otel Handler
+
+The recommended approach is to wrap your `slog` handler with
+[`slog-otel`](https://github.com/remychantenay/slog-otel), which automatically
+extracts the active trace and span ID from the context and adds them as
+standard OTEL attributes to every log record.
+
+```go
+import (
+	"log/slog"
+	"os"
+
+	slogotel "github.com/remychantenay/slog-otel"
+)
+
+func initLogger() {
+	base := slog.NewJSONHandler(os.Stdout, nil)
+	logger := slog.New(slogotel.OtelHandler{Next: base})
+	slog.SetDefault(logger)
+}
+```
+
+With this handler in place, any log call that receives a context containing an
+active span automatically includes the trace and span IDs:
+
+```go
+// trace_id and span_id are injected automatically
+slog.InfoContext(ctx, "processing started", slog.String("user_id", userID))
+```
+
+### Manual Injection
+
+If `slog-otel` is not available, extract the IDs directly from the active span:
+
+```go
+import (
+	"context"
+	"log/slog"
+
+	"go.opentelemetry.io/otel/trace"
+)
+
+func logWithTrace(ctx context.Context, msg string, args ...any) {
+	// Only attach IDs when the context carries a valid span; otherwise
+	// they would be logged as all zeros.
+	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
+		args = append([]any{
+			slog.String("trace_id", sc.TraceID().String()),
+			slog.String("span_id", sc.SpanID().String()),
+		}, args...)
+	}
+	slog.InfoContext(ctx, msg, args...)
+}
+```
+
+Ensure your log pipeline forwards `trace_id` and `span_id` fields without
+modification so the observability backend can correlate them with traces.
+
+## Tracing External Requests
+
+### HTTP Client
+
+Use the
+[`otelhttp`](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp)
+contrib package to automatically create outbound spans and propagate trace
+context in HTTP request headers.
+
+```go
+import (
+	"context"
+	"net/http"
+
+	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
+)
+
+var httpClient = &http.Client{
+	Transport: otelhttp.NewTransport(http.DefaultTransport),
+}
+
+func callDownstream(ctx context.Context, url string) (*http.Response, error) {
+	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
+	if err != nil {
+		return nil, err
+	}
+	// otelhttp injects traceparent/tracestate headers automatically
+	return httpClient.Do(req)
+}
+```
+
+Wrap incoming HTTP handlers the same way to automatically create server-side
+spans:
+
+```go
+mux := http.NewServeMux()
+mux.HandleFunc("/api/resource", handleResource)
+
+handler := otelhttp.NewHandler(mux, "http-server",
+	otelhttp.WithMessageEvents(otelhttp.ReadEvents, otelhttp.WriteEvents),
+)
+// ListenAndServe only returns on failure, so surface the error.
+log.Fatal(http.ListenAndServe(":8080", handler))
+```
+
+### HTTP Trace Flow
+
+```mermaid
+sequenceDiagram
+ participant Client as HTTP Client
+ participant OTel as otelhttp Transport
+ participant Server as Downstream Service
+
+ Client->>OTel: http.Client.Do(req)
+ OTel->>OTel: Create outbound span
+ OTel->>Server: GET /resource<br/>traceparent: 00-{traceID}-{spanID}-01
+ Server-->>OTel: 200 OK
+ OTel->>OTel: End span, record status
+ OTel-->>Client: *http.Response
+```
+
+### Database (SQL)
+
+Use the
+[`otelsql`](https://pkg.go.dev/github.com/XSAM/otelsql)
+package to wrap a standard `database/sql` driver and automatically trace every
+query as a child span.
+
+```go
+import (
+	"database/sql"
+
+	"github.com/XSAM/otelsql"
+	_ "github.com/lib/pq"
+	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
+)
+
+func openDB(dsn string) (*sql.DB, error) {
+	db, err := otelsql.Open("postgres", dsn,
+		otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
+	)
+	if err != nil {
+		return nil, err
+	}
+	// Register connection pool statistics as OTEL metric instruments.
+	// These are metrics, not span events, and are only exported if a
+	// meter provider is configured.
+	otelsql.RegisterDBStatsMetrics(db,
+		otelsql.WithAttributes(semconv.DBSystemPostgreSQL),
+	)
+	return db, nil
+}
+```
+
+Pass the request context to every query so spans are attached to the active
+trace:
+
+```go
+func getUser(ctx context.Context, db *sql.DB, id string) (*User, error) {
+	row := db.QueryRowContext(ctx, "SELECT id, name FROM users WHERE id = $1", id)
+	var u User
+	if err := row.Scan(&u.ID, &u.Name); err != nil {
+		return nil, err
+	}
+	return &u, nil
+}
+```
+
+Spans are automatically created for each query and include the SQL statement,
+database system, and connection details as span attributes.
+
+### Database Trace Flow
+
+```mermaid
+sequenceDiagram
+ participant Handler as HTTP Handler
+ participant otelsql as otelsql Wrapper
+ participant DB as PostgreSQL
+
+ Handler->>otelsql: db.QueryRowContext(ctx, sql)
+ otelsql->>otelsql: Create child span<br/>db.statement = "SELECT ..."
+ otelsql->>DB: Execute query
+ DB-->>otelsql: Rows
+ otelsql->>otelsql: End span
+ otelsql-->>Handler: *sql.Row
+```
+
+## Naming Conventions
+
+Consistent naming makes traces easy to find and understand across services.
+
+### Span Names
+
+Span names should describe the operation, not the implementation. Use the
+format `{component}.{verb}_{noun}` in lower snake_case:
+
+| Context | Good | Avoid |
+| -------------- | --------------------------- | --------------------- |
+| HTTP handler | `http.get_user` | `GET /users/{id}` |
+| Database query | `db.query_users` | `SELECT * FROM users` |
+| Outbound HTTP | `http.post_notification` | `http.post` |
+| Business logic | `membership.calculate_fee` | `calculateFee` |
+
+For HTTP servers, `otelhttp` sets the span name to the route pattern
+automatically (e.g. `GET /api/users/{id}`). Override with
+`otelhttp.WithSpanNameFormatter` if the default is not descriptive enough.
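
One way to keep names in the shape shown in the table above (for example `membership.calculate_fee`) is to build them through a small helper instead of hand-writing strings. The sketch below is illustrative only: `spanName` and `toSnake` are hypothetical helpers, and the camelCase conversion is deliberately naive (it splits on every upper-case letter, so acronyms such as `ID` are not handled well).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// toSnake converts camelCase or PascalCase to lower snake_case by
// inserting an underscore before every upper-case letter except the first.
func toSnake(s string) string {
	var b strings.Builder
	for i, r := range s {
		if unicode.IsUpper(r) {
			if i > 0 {
				b.WriteByte('_')
			}
			b.WriteRune(unicode.ToLower(r))
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

// spanName joins a component and an operation into "component.operation".
func spanName(component, operation string) string {
	return strings.ToLower(component) + "." + toSnake(operation)
}

func main() {
	fmt.Println(spanName("membership", "calculateFee"))
	fmt.Println(spanName("HTTP", "GetUser"))
}
```

The result can be passed directly to `tracer.Start(ctx, name)`.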
+
+### Span Attributes
+
+Add attributes to spans to make them queryable. Always prefer
+[OpenTelemetry semantic conventions](https://opentelemetry.io/docs/specs/semconv/)
+over raw string keys when a matching convention exists. Use the `lfx.` prefix
+only for LFX-specific attributes that have no semconv equivalent.
+
+The `semconv` package provides typed constructors and key constants for all
+standard attributes. Using them ensures correct attribute names and avoids
+typos:
+
+```go
+import (
+	"go.opentelemetry.io/otel/attribute"
+	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
+)
+
+span.SetAttributes(
+	// Use semconv constructors where available
+	semconv.HTTPRequestMethodKey.String(r.Method),
+	semconv.URLFull(r.URL.String()),
+	semconv.UserAgentOriginal(r.Header.Get("User-Agent")),
+	semconv.ServerAddress(host),
+	semconv.ServerPort(port),
+
+	// Use semconv key constants when a typed constructor is not available
+	attribute.String(string(semconv.PeerServiceKey), "downstream-service"),
+
+	// Use lfx. prefix only when no semconv equivalent exists
+	attribute.String("lfx.project_id", projectID),
+	attribute.String("lfx.org_id", orgID),
+	attribute.String("lfx.user_id", userID),
+)
+```
+
+Common semconv attributes by domain:
+
+| Domain | Semconv Key / Constructor | Example value |
+| ----------- | --------------------------------------- | -------------------------- |
+| HTTP | `semconv.HTTPRequestMethodKey` | `GET` |
+| HTTP | `semconv.URLFull()` | `https://api.lfx.dev/v1/` |
+| HTTP | `semconv.HTTPResponseStatusCodeKey` | `200` |
+| RPC | `semconv.RPCSystemKey` | `grpc` |
+| RPC | `semconv.RPCServiceKey` | `ProjectService` |
+| RPC | `semconv.RPCMethodKey` | `GetProject` |
+| Database | `semconv.DBSystemKey` | `postgresql` |
+| Database | `semconv.DBNameKey` | `lfx` |
+| Database | `semconv.DBOperationNameKey` | `SELECT` |
+| Messaging | `semconv.MessagingSystemKey` | `aws_sqs` |
+| Peer | `semconv.PeerServiceKey` | `downstream-service` |
+| Network | `semconv.ServerAddress()` | `db.internal` |
+
+Required attributes for all root spans:
+
+| Attribute | Source | Example |
+| ------------------------- | --------------------------- | --------------- |
+| `service.name` | `OTEL_SERVICE_NAME` | `query-service` |
+| `service.version` | `OTEL_SERVICE_VERSION` | `1.2.0` |
+| `deployment.environment` | `OTEL_RESOURCE_ATTRIBUTES` | `production` |
+
+### Tracer Names
+
+Name the tracer after the package or component that owns it, using the Go
+import path convention:
+
+```go
+// Use the package path as the tracer name
+tracer := otel.Tracer("github.com/linuxfoundation/lfx-v2-query-service/handlers")
+```
+
+## Accessing Traces
+
+### Local Development (Jaeger)
+
+In local development, traces are sent to a Jaeger instance running in
+OrbStack. Access the Jaeger UI by port-forwarding the Jaeger service:
+
+```bash
+kubectl port-forward svc/jaeger-query 16686:16686 -n observability
+```
+
+Then open `http://localhost:16686` in your browser.
+
+In the Jaeger UI:
+
+1. Select your service from the **Service** dropdown
+2. Optionally filter by **Operation** name
+3. Set a time range and click **Find Traces**
+4. Click a trace to expand the span waterfall
+
+To find a specific trace by ID (e.g. from a log entry):
+
+1. Paste the trace ID into the **Trace ID** field at the top of the page
+2. Press Enter
+
+### Cloud (Datadog APM)
+
+In staging and production, traces are available in
+[Datadog APM](https://app.datadoghq.com/apm/traces).
+
+To find traces for a specific service:
+
+1. Navigate to **APM → Traces**
+2. Use the **Service** facet on the left to filter by service name
+3. Use the **Env** facet to select the environment (`staging`, `production`)
+4. Click any trace row to open the flame graph view
+
+To jump from a log entry to its trace:
+
+1. Find the log entry in **Logs**
+2. Click the **View Trace** button in the log detail panel (requires
+ `trace_id` to be present in the log)
+
+## Common Query Patterns
+
+### Jaeger
+
+| Goal | Query |
+| ------------------------ | ----------------------------------------------- |
+| All traces for a service | Service: `query-service`, Operation: `all` |
+| Slow requests | Service: `query-service`, Min Duration: `500ms` |
+| Failed traces | Tags: `error=true` |
+| Traces for a user | Tags: `lfx.user_id=<user-id>` |
+| Traces for a project | Tags: `lfx.project_id=<project-id>` |
+
+### Datadog APM
+
+| Goal | Query |
+| ------------------------ | ------------------------------------------------------ |
+| All errors for a service | `service:query-service status:error` |
+| Slow database spans | `service:query-service span.type:sql @duration:>1s` |
+| Traces by user | `service:query-service @lfx.user_id:<user-id>` |
+| Traces by project | `service:query-service @lfx.project_id:<project-id>` |
+| High latency endpoints | `service:query-service @http.method:GET @duration:>2s` |
+
+Datadog query syntax uses `@` to prefix span attributes. Saved queries can
+be stored as **Monitors** or **Dashboards** for ongoing visibility.
+
+## Troubleshooting
+
+### Spans Not Appearing
+
+**Check the exporter endpoint.**
+
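+Verify `OTEL_EXPORTER_OTLP_ENDPOINT` is set correctly and that `HOST_IP` is
+resolving. Run a debug pod, ideally scheduled on the same node, and test
+connectivity, replacing `<HOST_IP>` with the node's `status.hostIP` value:
+
+```bash
+kubectl run -it --rm debug --image=alpine --restart=Never -- \
+  wget -qO- http://<HOST_IP>:4317
+```
+
+The Datadog Agent OTLP gRPC port should be reachable. Because the port speaks
+gRPC rather than plain HTTP, `wget` will not print a useful response; what
+matters is whether the connection opens. A "connection refused" error usually
+means OTLP ingestion is not enabled on the agent or the port is not exposed on
+the host.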
+
+**Check that the tracer provider is initialized before use.**
+
+If `initTracer` is not called before the first span is created, the OTEL
+global provider is a no-op and spans are silently dropped.
+
+**Verify the Datadog Agent has OTLP enabled.**
+
+The node agent must have the following in its configuration:
+
+```yaml
+otlp_config:
+  receiver:
+    protocols:
+      grpc:
+        endpoint: 0.0.0.0:4317
+```
+
+### Traces Are Incomplete or Missing Spans
+
+**Missing context propagation.** If a downstream service does not receive the
+`traceparent` header, it starts a new root trace instead of continuing the
+existing one. Ensure:
+
+- All HTTP clients use `otelhttp.NewTransport`
+- All HTTP servers use `otelhttp.NewHandler`
+- gRPC clients/servers use the `otelgrpc` interceptors
+
+**Sampling mismatch.** If a downstream service makes its own ratio decision
+instead of honoring the parent's (for example, it uses `traceidratio` with a
+lower rate), some child spans are dropped and traces appear truncated. The
+`parentbased_traceidratio` sampler respects the parent's sampling decision, so
+ensure all services use it.
+
+### Sampling Not Taking Effect
+
+Confirm `OTEL_TRACES_SAMPLER=parentbased_traceidratio` and
+`OTEL_TRACES_SAMPLER_ARG` are both set. If `OTEL_TRACES_SAMPLER_ARG` is
+missing, the SDK defaults to `1.0` (100% sampling).
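
A startup check can catch a mistyped ratio before it reaches production. The sketch below is an assumption-laden illustration: `samplerRatio` is a hypothetical helper that mirrors the documented default of `1.0` for a missing value and rejects anything outside 0.0 to 1.0. How the SDK itself treats an invalid value is not covered here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// samplerRatio returns the effective sampling ratio for a raw
// OTEL_TRACES_SAMPLER_ARG value. An empty value yields 1.0.
func samplerRatio(raw string) (float64, error) {
	if raw == "" {
		return 1.0, nil
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, fmt.Errorf("OTEL_TRACES_SAMPLER_ARG %q is not a number: %w", raw, err)
	}
	// The negated form also rejects NaN.
	if !(v >= 0 && v <= 1) {
		return 0, fmt.Errorf("OTEL_TRACES_SAMPLER_ARG %q is outside 0.0 - 1.0", raw)
	}
	return v, nil
}

func main() {
	ratio, err := samplerRatio(os.Getenv("OTEL_TRACES_SAMPLER_ARG"))
	if err != nil {
		fmt.Println("invalid sampler configuration:", err)
		return
	}
	fmt.Printf("sampler=%s ratio=%.2f\n", os.Getenv("OTEL_TRACES_SAMPLER"), ratio)
}
```

A common mistake this catches is writing the ratio as a percentage, such as `20` instead of `0.2`.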
+
+### High Trace Volume in Production
+
+If trace volume is unexpectedly high, check that `OTEL_TRACES_SAMPLER_ARG`
+is set to `0.2` (20%) in the production environment. Use the Datadog APM
+ingestion control page to review per-service rates and apply an ingestion
+filter if needed.
+
+### Trace/Log Correlation Not Working
+
+Datadog requires the log source to be identified correctly before it can
+correlate logs with traces. Set the following pod annotation to declare the
+log source for your service:
+
+```yaml
+annotations:
+  ad.datadoghq.com/<container-name>.logs: |
+    [{"source": "go", "service": "my-service"}]
+```
+
+Replace `<container-name>` with the name of the container in the pod spec,
+and `my-service` with the value of `OTEL_SERVICE_NAME`. Without this
+annotation, Datadog may not match log entries to the correct service in APM.
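
If correlation still fails, the ID format is worth checking. Some Datadog log pipelines expect decimal 64-bit IDs (commonly logged as `dd.trace_id` and `dd.span_id`) rather than the hex strings OTEL produces; whether the LFX pipeline needs this is an assumption to verify against the pipeline configuration. The sketch below shows the conversion; `toDatadogID` is a hypothetical helper that keeps the low 64 bits of the ID.

```go
package main

import (
	"fmt"
	"strconv"
)

// toDatadogID converts a 16- or 32-character hex ID into the decimal
// representation of its low 64 bits.
func toDatadogID(hexID string) (string, error) {
	if len(hexID) != 16 && len(hexID) != 32 {
		return "", fmt.Errorf("unexpected ID length %d", len(hexID))
	}
	low := hexID[len(hexID)-16:]
	v, err := strconv.ParseUint(low, 16, 64)
	if err != nil {
		return "", fmt.Errorf("ID is not hex: %w", err)
	}
	return strconv.FormatUint(v, 10), nil
}

func main() {
	id, err := toDatadogID("4bf92f3577b34da6a3ce929d0e0e4736")
	fmt.Println(id, err)
}
```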
+
+## References
+
+- [OpenTelemetry Go SDK](https://opentelemetry.io/docs/languages/go/)
+- [OTEL Environment Variable Spec](https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/)
+- [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/) - standard span attribute names
+- [Datadog OTLP Ingestion](https://docs.datadoghq.com/opentelemetry/interoperability/otlp_ingest_in_the_agent/)
+- [Datadog Cluster Agent](https://docs.datadoghq.com/containers/cluster_agent/) - cluster-level Datadog component
+- [Datadog APM](https://app.datadoghq.com/apm/traces) - cloud trace explorer
+- [W3C Trace Context](https://www.w3.org/TR/trace-context/)
+- [slog-otel](https://github.com/remychantenay/slog-otel) - slog handler for OTEL trace/log correlation
+- [otelhttp](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp) - HTTP client/server instrumentation
+- [otelsql](https://pkg.go.dev/github.com/XSAM/otelsql) - database/sql instrumentation