fix(tracing): honor OTEL resource env config by RitwijParmar · Pull Request #7061 · Eventual-Inc/Daft

RitwijParmar · 2026-06-03T17:37:45Z

Changes Made

This improves Daft's OpenTelemetry configuration path for production deployments:

build the OTEL Resource from OTEL_RESOURCE_ATTRIBUTES while preserving Daft's existing service.name=daft default
allow OTEL_SERVICE_NAME to override the default service name, so distributed workers / drivers can be tagged separately by deployment
stop passing a hardcoded 10s timeout into the OTLP exporters, which lets the upstream OTLP exporter honor OTEL_EXPORTER_OTLP_TIMEOUT and per-signal timeout env vars
add unit coverage for default service name, OTEL_SERVICE_NAME, and resource attributes

The upstream opentelemetry-otlp builder already handles OTLP headers, compression, endpoints, and timeout env fallbacks. This keeps Daft from accidentally overriding those env-driven settings while still keeping the Daft-specific default resource name.

Related Issues

Addresses part of #6144.

Validation:

cargo fmt -p common-tracing
cargo test -p common-tracing

greptile-apps · 2026-06-03T17:41:22Z

Greptile Summary

This PR improves Daft's OpenTelemetry configuration by reading OTEL_RESOURCE_ATTRIBUTES and OTEL_SERVICE_NAME from the environment to build the SDK Resource, and removes four hardcoded 10-second OTLP exporter timeouts so the upstream SDK can honour its own env-var-driven timeout settings.

config.rs: Adds Config::resource() which runs EnvResourceDetector for resource attributes, respects OTEL_SERVICE_NAME with a fallback to "daft", and includes three unit tests covering the default, override, and attribute-injection cases.
lib.rs: Replaces the hardcoded Resource::builder().with_service_name("daft") call with config.resource(), and drops .with_timeout(Duration::from_secs(10)) from all four OTLP exporter builders.

Confidence Score: 4/5

The changes are well-scoped and the common cases work correctly; two minor issues in config.rs are worth addressing before merging.

The resource-building logic works for all tested scenarios but relies on undocumented SDK insertion-order semantics when both OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES carry a service.name. The test helper also leaves env vars unrestored on panic.

src/common/tracing/src/config.rs — the resource() method and the with_env_vars test helper.

Important Files Changed

Filename	Overview
src/common/tracing/src/config.rs	Adds `resource()` method that builds an OTel Resource from env vars with a minor panic-safety gap in the test helper.
src/common/tracing/src/lib.rs	Delegates resource construction to config.resource() and removes hardcoded OTLP timeouts.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Config::resource()"] --> B["Build env_resource via EnvResourceDetector"]
    B --> C{"OTEL_SERVICE_NAME set?"}
    C -- Yes --> D["service_name = OTEL_SERVICE_NAME"]
    C -- No --> E{"service.name in OTEL_RESOURCE_ATTRIBUTES?"}
    E -- Yes --> F["service_name = value from env_resource"]
    E -- No --> G["service_name = 'daft'"]
    D --> H["Resource::builder_empty().with_attributes().with_service_name().build()"]
    F --> H
    G --> H
    H --> I["Shared across Tracer / Meter / Logger providers"]
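
The decision flow above can be sketched without the SDK as a small pure function. This is only an illustration of the precedence order, not the code in config.rs: `parse_resource_attributes` and `resolve_service_name` are hypothetical helper names, and the parser assumes the simple `key=value,key=value` form of OTEL_RESOURCE_ATTRIBUTES with no percent-decoding.

```rust
/// Parse the `key1=value1,key2=value2` form into key/value pairs.
/// Entries without an `=` or with an empty key are skipped.
fn parse_resource_attributes(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| pair.split_once('='))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .filter(|(key, _)| !key.is_empty())
        .collect()
}

/// Precedence: OTEL_SERVICE_NAME, then `service.name` from the resource
/// attributes, then the "daft" default.
fn resolve_service_name(
    otel_service_name: Option<&str>,
    resource_attributes: Option<&str>,
) -> String {
    if let Some(name) = otel_service_name {
        return name.to_string();
    }
    resource_attributes
        .map(parse_resource_attributes)
        .and_then(|attrs| {
            attrs
                .into_iter()
                .find(|(key, _)| key == "service.name")
                .map(|(_, value)| value)
        })
        .unwrap_or_else(|| "daft".to_string())
}
```

Taking the two values as arguments instead of reading the environment inside the function keeps the precedence testable without mutating process-wide state.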

greptile-apps · 2026-06-03T17:41:28Z

+        Resource::builder_empty()
+            .with_attributes(
+                env_resource
+                    .iter()
+                    .map(|(key, value)| KeyValue::new(key.clone(), value.clone())),
+            )
+            .with_service_name(service_name)
+            .build()


If OTEL_RESOURCE_ATTRIBUTES contains a service.name key, that attribute is included in the with_attributes call AND then also set by with_service_name. Whether the later with_service_name wins over the earlier attribute depends on the ResourceBuilder's internal deduplication order (the SDK uses a HashMap, so last-write-wins is likely, but the contract isn't documented). Filtering service.name out of the attributes iteration makes the priority explicit and removes the dependency on insertion order.

Suggested change

-        Resource::builder_empty()
-            .with_attributes(
-                env_resource
-                    .iter()
-                    .map(|(key, value)| KeyValue::new(key.clone(), value.clone())),
-            )
-            .with_service_name(service_name)
-            .build()
+        Resource::builder_empty()
+            .with_attributes(
+                env_resource
+                    .iter()
+                    .filter(|(key, _)| key.as_str() != "service.name")
+                    .map(|(key, value)| KeyValue::new(key.clone(), value.clone())),
+            )
+            .with_service_name(service_name)
+            .build()
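
To see why the filter makes the priority explicit, here is a minimal SDK-free sketch of the same idea on plain string pairs. `build_attributes` is a hypothetical stand-in for the builder chain; it does not model how ResourceBuilder deduplicates keys internally.

```rust
/// Copy every environment-derived attribute except `service.name`, then
/// append the already-resolved service name. The result holds exactly one
/// `service.name` entry regardless of what the input contained.
fn build_attributes(env_attrs: &[(String, String)], service_name: &str) -> Vec<(String, String)> {
    let mut attrs: Vec<(String, String)> = env_attrs
        .iter()
        .filter(|(key, _)| key != "service.name")
        .cloned()
        .collect();
    attrs.push(("service.name".to_string(), service_name.to_string()));
    attrs
}
```

Because the duplicate key never enters the collection, the outcome no longer depends on whether the underlying container is first-write-wins or last-write-wins.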

greptile-apps · 2026-06-03T17:41:29Z

+    fn with_env_vars<R>(vars: &[(&str, Option<&str>)], f: impl FnOnce() -> R) -> R {
+        let _guard = ENV_LOCK.lock().unwrap();
+        let previous_values: Vec<_> = vars
+            .iter()
+            .map(|(name, _)| (*name, std::env::var(name).ok()))
+            .collect();
+
+        for (name, value) in vars {
+            match value {
+                Some(value) => unsafe { std::env::set_var(name, value) },
+                None => unsafe { std::env::remove_var(name) },
+            }
+        }
+
+        let result = f();
+
+        for (name, value) in previous_values {
+            match value {
+                Some(value) => unsafe { std::env::set_var(name, value) },
+                None => unsafe { std::env::remove_var(name) },
+            }
+        }
+
+        result
+    }


Test env-var cleanup is not panic-safe

If the closure f() panics (e.g. an assertion fires), the for loop that restores the previous values never executes, leaving the modified env vars behind for every subsequent test in the process. The ENV_LOCK mutex will also be poisoned by the panic, causing all later with_env_vars calls to propagate the poison via .unwrap(). A Drop-based restore guard would fix both problems.
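
A Drop-based guard along these lines might look like the sketch below. It uses only the standard library; `EnvVarGuard` is a hypothetical name, and the sketch assumes the same `ENV_LOCK` mutex and `with_env_vars` signature as the helper in the diff. Note that dropping a MutexGuard during unwinding still poisons the mutex, so the sketch also recovers the lock with `into_inner()` instead of calling `.unwrap()`.

```rust
use std::env;
use std::sync::{Mutex, MutexGuard};

static ENV_LOCK: Mutex<()> = Mutex::new(());

/// Holds the env lock and the previous values; restores them on drop,
/// including when the test body panics and the stack unwinds.
struct EnvVarGuard {
    previous: Vec<(String, Option<String>)>,
    // Declared last so the lock is released only after `drop` has restored the values.
    _lock: MutexGuard<'static, ()>,
}

impl EnvVarGuard {
    #[allow(unused_unsafe)] // `set_var`/`remove_var` are unsafe only in edition 2024
    fn set(vars: &[(&str, Option<&str>)]) -> Self {
        // A panic in an earlier test poisons the mutex; the protected data is `()`,
        // so it is safe to take the lock anyway.
        let lock = ENV_LOCK.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
        let previous = vars
            .iter()
            .map(|(name, _)| (name.to_string(), env::var(name).ok()))
            .collect();
        for (name, value) in vars {
            match value {
                Some(value) => unsafe { env::set_var(name, value) },
                None => unsafe { env::remove_var(name) },
            }
        }
        Self { previous, _lock: lock }
    }
}

impl Drop for EnvVarGuard {
    #[allow(unused_unsafe)]
    fn drop(&mut self) {
        for (name, value) in self.previous.drain(..) {
            match value {
                Some(value) => unsafe { env::set_var(&name, value) },
                None => unsafe { env::remove_var(&name) },
            }
        }
    }
}

fn with_env_vars<R>(vars: &[(&str, Option<&str>)], f: impl FnOnce() -> R) -> R {
    let _guard = EnvVarGuard::set(vars);
    f()
}
```

With this shape, a failing assertion inside `f` still restores the environment, and later tests can acquire the lock even though it was poisoned.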

cckellogg · 2026-06-09T21:55:16Z

    let exporter = opentelemetry_otlp::SpanExporter::builder()
        .with_tonic()
        .with_endpoint(otlp_endpoint)
-        .with_timeout(Duration::from_secs(10))


Since this is now being read from the env, what is the default if it's not set?

It looks like the defaults were already 10s: https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/#timeout-configuration
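
For reference, the fallback chain described on that page can be sketched as follows. This is not the opentelemetry-otlp implementation: `resolve_otlp_timeout` is a hypothetical helper, the values are assumed to be milliseconds, and skipping an unparseable per-signal value in favor of the general variable is a choice made for this sketch.

```rust
use std::time::Duration;

/// Default used when neither variable yields a usable value.
const DEFAULT_OTLP_TIMEOUT: Duration = Duration::from_secs(10);

/// Resolve a timeout from a per-signal variable (for example
/// OTEL_EXPORTER_OTLP_TRACES_TIMEOUT), then OTEL_EXPORTER_OTLP_TIMEOUT,
/// then the 10s default. `lookup` stands in for `std::env::var(..).ok()`.
fn resolve_otlp_timeout(signal_var: &str, lookup: impl Fn(&str) -> Option<String>) -> Duration {
    [signal_var, "OTEL_EXPORTER_OTLP_TIMEOUT"]
        .iter()
        .filter_map(|name| lookup(*name))
        .filter_map(|raw| raw.trim().parse::<u64>().ok())
        .map(Duration::from_millis)
        .next()
        .unwrap_or(DEFAULT_OTLP_TIMEOUT)
}
```

In production code the lookup closure would be `|name| std::env::var(name).ok()`; passing it in keeps the sketch testable without touching the process environment.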

cckellogg · 2026-06-09T21:59:37Z

+        let env_resource = Resource::builder_empty()
+            .with_detector(Box::new(EnvResourceDetector::new()))
+            .build();
+        let service_name = std::env::var("OTEL_SERVICE_NAME")


Is there a constant in the opentelemetry_otlp crate we can use?

srilman

Just had 1 point but otherwise LGTM!

srilman · 2026-06-15T23:42:32Z

    }
+
+    pub fn resource(&self) -> Resource {
+        let env_resource = Resource::builder_empty()


I think we should use Resource::builder() instead, because it's supposed to perform additional detection from other sources. Here's the docstring:

    /// Creates a [ResourceBuilder] that allows you to configure multiple aspects of the Resource.
    ///
    /// This [ResourceBuilder] will include the following [ResourceDetector]s:
    /// - [SdkProvidedResourceDetector]
    /// - [TelemetryResourceDetector]
    /// - [EnvResourceDetector]
    /// If you'd like to start from an empty resource, use [Resource::builder_empty].

RitwijParmar · 2026-06-16T05:54:00Z

Updated this.

The final resource now uses Resource::builder(), so the SDK and telemetry detectors still run. The explicit Daft service name fallback stays in place.

Focused test passes.

cargo test -p common-tracing resource_

RitwijParmar · 2026-06-16T17:47:47Z

Could someone rerun the failed jobs when you get a chance?

The tracing change looks clean from the logs. The native IO job hit external data access issues. Hugging Face returned HTTP 429s and S3 image reads timed out.

The aggregate unit and integration jobs failed because those downstream jobs failed. The other unit failure happened after tests during coverage merge with llvm-profdata saying no profile can be merged.

Local focused check still passes.

cargo test -p common-tracing resource_

srilman · 2026-06-18T22:29:49Z

@RitwijParmar can you merge with main? It should help with the flaky tests.

RitwijParmar · 2026-06-18T23:09:16Z

Merged origin/main into this branch and pushed the update.

Focused tracing test still passes locally:

cargo test -p common-tracing resource_

Result: 3 passed.

srilman

quick question

srilman · 2026-06-19T19:39:03Z

+
+    pub fn resource(&self) -> Resource {
+        let env_resource = Resource::builder_empty()
+            .with_detector(Box::new(EnvResourceDetector::new()))


Any reason to only specify the EnvResourceDetector instead of the 3 defaults?

RitwijParmar

Good catch. I changed this to derive from Resource::builder() so the SDK-provided, telemetry, and env detectors are all included.

I still filter the SDK unknown_service placeholder so Daft keeps its explicit daft fallback when no service name is configured. Focused test passes: cargo test -p common-tracing resource_.
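
The placeholder filtering described here could look roughly like the following. This is a sketch under assumptions, not the code pushed to the branch: `effective_service_name` is a hypothetical helper, and it assumes the SDK-provided detector reports a name beginning with `unknown_service` when nothing is configured.

```rust
const DEFAULT_SERVICE_NAME: &str = "daft";
/// Assumed prefix of the SDK's placeholder service name.
const SDK_PLACEHOLDER_PREFIX: &str = "unknown_service";

/// Pick the service name to report: an explicit OTEL_SERVICE_NAME wins,
/// then any detected name that is not the SDK placeholder, then "daft".
fn effective_service_name(env_override: Option<&str>, detected: Option<&str>) -> String {
    if let Some(name) = env_override.filter(|name| !name.is_empty()) {
        return name.to_string();
    }
    match detected {
        Some(name) if !name.is_empty() && !name.starts_with(SDK_PLACEHOLDER_PREFIX) => {
            name.to_string()
        }
        _ => DEFAULT_SERVICE_NAME.to_string(),
    }
}
```

Matching on the prefix rather than the exact string is deliberate in this sketch, so that a placeholder with a process-name suffix would also fall back to `daft`.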

fix(tracing): honor OTEL resource env config

7012fe7

github-actions Bot added the fix label Jun 3, 2026

greptile-apps Bot reviewed Jun 3, 2026

madvart requested a review from cckellogg June 9, 2026 20:44

cckellogg reviewed Jun 9, 2026

srilman reviewed Jun 15, 2026

fix(tracing): preserve default resource detectors

c452c4e

Merge remote-tracking branch 'origin/main' into codex/daft-otel-env-config

60e65ad

srilman reviewed Jun 19, 2026

fix(tracing): preserve default resource detectors

0b4c16b


3 participants