fix(tracing): honor OTEL resource env config #7061

Open
RitwijParmar wants to merge 4 commits into Eventual-Inc:main from RitwijParmar:codex/daft-otel-env-config

@RitwijParmar

Contributor

Changes Made

This improves Daft's OpenTelemetry configuration path for production deployments:

  • build the OTEL Resource from OTEL_RESOURCE_ATTRIBUTES while preserving Daft's existing service.name=daft default
  • allow OTEL_SERVICE_NAME to override the default service name, so distributed workers / drivers can be tagged separately by deployment
  • stop passing a hardcoded 10s timeout into the OTLP exporters, which lets the upstream OTLP exporter honor OTEL_EXPORTER_OTLP_TIMEOUT and per-signal timeout env vars
  • add unit coverage for default service name, OTEL_SERVICE_NAME, and resource attributes

The upstream opentelemetry-otlp builder already handles OTLP headers, compression, endpoints, and timeout env fallbacks. This keeps Daft from accidentally overriding those env-driven settings while still keeping the Daft-specific default resource name.
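The precedence described above can be sketched without the SDK at all. This is a minimal, std-only illustration, not the code in this PR: `parse_resource_attributes`, `resolve_service_name`, and `build_resource_attributes` are hypothetical helper names, the parser skips percent-decoding, and treating an empty OTEL_SERVICE_NAME as unset is an assumption.

```rust
/// Daft's default service name, as stated in the PR description.
pub const DEFAULT_SERVICE_NAME: &str = "daft";

/// Parses the `key1=value1,key2=value2` form used by OTEL_RESOURCE_ATTRIBUTES.
/// Simplified for illustration: no percent-decoding, malformed pairs are skipped.
pub fn parse_resource_attributes(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| {
            let (key, value) = pair.split_once('=')?;
            let (key, value) = (key.trim(), value.trim());
            if key.is_empty() {
                None
            } else {
                Some((key.to_string(), value.to_string()))
            }
        })
        .collect()
}

/// Precedence: OTEL_SERVICE_NAME, then `service.name` from the attributes,
/// then the "daft" default. Assumption: an empty OTEL_SERVICE_NAME counts as unset.
pub fn resolve_service_name(
    otel_service_name: Option<&str>,
    attributes: &[(String, String)],
) -> String {
    otel_service_name
        .map(|name| name.trim())
        .filter(|name| !name.is_empty())
        .map(|name| name.to_string())
        .or_else(|| {
            attributes
                .iter()
                .find(|(key, _)| key == "service.name")
                .map(|(_, value)| value.clone())
        })
        .unwrap_or_else(|| DEFAULT_SERVICE_NAME.to_string())
}

/// Builds the final attribute list with exactly one `service.name` entry,
/// so the result does not depend on how duplicate keys would be merged.
pub fn build_resource_attributes(
    otel_service_name: Option<&str>,
    raw_attributes: Option<&str>,
) -> Vec<(String, String)> {
    let parsed = parse_resource_attributes(raw_attributes.unwrap_or(""));
    let service_name = resolve_service_name(otel_service_name, &parsed);
    let mut attributes: Vec<(String, String)> = parsed
        .into_iter()
        .filter(|(key, _)| key != "service.name")
        .collect();
    attributes.push(("service.name".to_string(), service_name));
    attributes
}
```

With OTEL_SERVICE_NAME=driver and OTEL_RESOURCE_ATTRIBUTES=service.name=ignored,deployment.environment=prod, this sketch yields deployment.environment=prod plus service.name=driver.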

Related Issues

Addresses part of #6144.

Validation:

  • cargo fmt -p common-tracing
  • cargo test -p common-tracing

github-actions Bot added the fix label on Jun 3, 2026
@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Contributor

Greptile Summary

This PR improves Daft's OpenTelemetry configuration by reading OTEL_RESOURCE_ATTRIBUTES and OTEL_SERVICE_NAME from the environment to build the SDK Resource, and removes four hardcoded 10-second OTLP exporter timeouts so the upstream SDK can honour its own env-var-driven timeout settings.

  • config.rs: Adds Config::resource(), which runs EnvResourceDetector for resource attributes, respects OTEL_SERVICE_NAME with a fallback to "daft", and includes three unit tests covering the default, override, and attribute-injection cases.
  • lib.rs: Replaces the hardcoded Resource::builder().with_service_name("daft") call with config.resource(), and drops .with_timeout(Duration::from_secs(10)) from all four OTLP exporter builders.

Confidence Score: 4/5

The changes are well-scoped and the common cases work correctly; two minor issues in config.rs are worth addressing before merging.

The resource-building logic works for all tested scenarios but relies on undocumented SDK insertion-order semantics when both OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES carry a service.name. The test helper also leaves env vars unrestored on panic.

src/common/tracing/src/config.rs — the resource() method and the with_env_vars test helper.

Important Files Changed

Filename | Overview
src/common/tracing/src/config.rs | Adds resource() method that builds an OTel Resource from env vars with a minor panic-safety gap in the test helper.
src/common/tracing/src/lib.rs | Delegates resource construction to config.resource() and removes hardcoded OTLP timeouts.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Config::resource()"] --> B["Build env_resource via EnvResourceDetector"]
    B --> C{"OTEL_SERVICE_NAME set?"}
    C -- Yes --> D["service_name = OTEL_SERVICE_NAME"]
    C -- No --> E{"service.name in OTEL_RESOURCE_ATTRIBUTES?"}
    E -- Yes --> F["service_name = value from env_resource"]
    E -- No --> G["service_name = 'daft'"]
    D --> H["Resource::builder_empty().with_attributes().with_service_name().build()"]
    F --> H
    G --> H
    H --> I["Shared across Tracer / Meter / Logger providers"]

Reviews (1): Last reviewed commit: "fix(tracing): honor OTEL resource env co..."

Comment thread src/common/tracing/src/config.rs Outdated
Comment on lines +127 to +134
Resource::builder_empty()
    .with_attributes(
        env_resource
            .iter()
            .map(|(key, value)| KeyValue::new(key.clone(), value.clone())),
    )
    .with_service_name(service_name)
    .build()

Contributor

P2 If OTEL_RESOURCE_ATTRIBUTES contains a service.name key, that attribute is included in the with_attributes call AND then also set by with_service_name. Whether the later with_service_name wins over the earlier attribute depends on the ResourceBuilder's internal deduplication order (the SDK uses a HashMap, so last-write-wins is likely, but the contract isn't documented). Filtering service.name out of the attributes iteration makes the priority explicit and removes the dependency on insertion order.

Suggested change
- Resource::builder_empty()
-     .with_attributes(
-         env_resource
-             .iter()
-             .map(|(key, value)| KeyValue::new(key.clone(), value.clone())),
-     )
-     .with_service_name(service_name)
-     .build()
+ Resource::builder_empty()
+     .with_attributes(
+         env_resource
+             .iter()
+             .filter(|(key, _)| key.as_str() != "service.name")
+             .map(|(key, value)| KeyValue::new(key.clone(), value.clone())),
+     )
+     .with_service_name(service_name)
+     .build()

Comment on lines +148 to +172
fn with_env_vars<R>(vars: &[(&str, Option<&str>)], f: impl FnOnce() -> R) -> R {
    let _guard = ENV_LOCK.lock().unwrap();
    let previous_values: Vec<_> = vars
        .iter()
        .map(|(name, _)| (*name, std::env::var(name).ok()))
        .collect();

    for (name, value) in vars {
        match value {
            Some(value) => unsafe { std::env::set_var(name, value) },
            None => unsafe { std::env::remove_var(name) },
        }
    }

    let result = f();

    for (name, value) in previous_values {
        match value {
            Some(value) => unsafe { std::env::set_var(name, value) },
            None => unsafe { std::env::remove_var(name) },
        }
    }

    result
}

Contributor

P2 Test env-var cleanup is not panic-safe

If the closure f() panics (e.g. an assertion fires), the for loop that restores the previous values never executes, leaving the modified env vars behind for every subsequent test in the process. The ENV_LOCK mutex will also be poisoned by the panic, causing all later with_env_vars calls to propagate the poison via .unwrap(). A Drop-based restore guard would fix both problems.
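One way the Drop-based restore guard could look is sketched below. This is an illustration under stated assumptions, not the change that landed in the PR: `EnvRestoreGuard` is a hypothetical name, `ENV_LOCK` mirrors the static used by the quoted helper, and recovering a poisoned lock with `into_inner` is one possible policy among several.

```rust
use std::env;
use std::ffi::OsString;
use std::sync::Mutex;

/// Serializes tests that touch the process environment.
static ENV_LOCK: Mutex<()> = Mutex::new(());

/// Hypothetical guard: restores the captured variables when dropped,
/// which also happens while unwinding from a panic.
pub struct EnvRestoreGuard {
    previous: Vec<(String, Option<OsString>)>,
}

impl EnvRestoreGuard {
    /// Records the current values of `vars`, then applies the overrides.
    /// `None` means "unset this variable while the guard is alive".
    pub fn apply(vars: &[(&str, Option<&str>)]) -> Self {
        let previous = vars
            .iter()
            .map(|(name, _)| (name.to_string(), env::var_os(name)))
            .collect();
        for (name, value) in vars {
            // Callers are expected to hold ENV_LOCK while mutating the environment.
            match value {
                Some(value) => unsafe { env::set_var(name, value) },
                None => unsafe { env::remove_var(name) },
            }
        }
        Self { previous }
    }
}

impl Drop for EnvRestoreGuard {
    fn drop(&mut self) {
        for (name, value) in self.previous.drain(..) {
            match value {
                Some(value) => unsafe { env::set_var(&name, value) },
                None => unsafe { env::remove_var(&name) },
            }
        }
    }
}

pub fn with_env_vars<R>(vars: &[(&str, Option<&str>)], f: impl FnOnce() -> R) -> R {
    // Recover a poisoned lock so one failing test does not cascade into the rest.
    let _lock = ENV_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    // Declared after the lock, so it is dropped first: the environment is
    // restored while the lock is still held.
    let _restore = EnvRestoreGuard::apply(vars);
    f()
}
```

Because locals are dropped in reverse declaration order, the restore runs before the lock is released, both on normal return and on panic.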

madvart requested a review from cckellogg on June 9, 2026, 20:44

let exporter = opentelemetry_otlp::SpanExporter::builder()
    .with_tonic()
    .with_endpoint(otlp_endpoint)
    .with_timeout(Duration::from_secs(10))

Contributor

Since this is now read from the env, what is the default if it's not set?
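For context on the question above: the OpenTelemetry specification documents OTEL_EXPORTER_OTLP_TIMEOUT as a value in milliseconds with a 10-second default, and allows per-signal variables such as OTEL_EXPORTER_OTLP_TRACES_TIMEOUT to take precedence. The std-only sketch below illustrates that fallback chain. `resolve_otlp_timeout` is a hypothetical helper, not the opentelemetry-otlp implementation, and falling through on an unparsable value is an assumption. Whether the crate version pinned by Daft behaves identically would need to be checked against its source.

```rust
use std::time::Duration;

/// Default documented by the OpenTelemetry specification for OTLP exporters.
pub const DEFAULT_OTLP_TIMEOUT: Duration = Duration::from_secs(10);

/// Parses a millisecond value such as "2500"; returns None if absent or invalid.
fn parse_millis(raw: Option<&str>) -> Option<Duration> {
    raw.and_then(|value| value.trim().parse::<u64>().ok())
        .map(Duration::from_millis)
}

/// Resolution order: per-signal variable, then the general variable, then the default.
/// Assumption: an unparsable value is ignored and the next source is tried.
pub fn resolve_otlp_timeout(per_signal_ms: Option<&str>, general_ms: Option<&str>) -> Duration {
    parse_millis(per_signal_ms)
        .or_else(|| parse_millis(general_ms))
        .unwrap_or(DEFAULT_OTLP_TIMEOUT)
}
```

Under these assumptions, leaving both variables unset keeps the same 10-second behavior that the removed hardcoded value provided.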

let env_resource = Resource::builder_empty()
    .with_detector(Box::new(EnvResourceDetector::new()))
    .build();
let service_name = std::env::var("OTEL_SERVICE_NAME")

Contributor

Is there a constant in the opentelemetry_otlp crate we can use?

@srilman left a comment

Contributor

Just had 1 point but otherwise LGTM!

Comment thread src/common/tracing/src/config.rs Outdated
}

pub fn resource(&self) -> Resource {
    let env_resource = Resource::builder_empty()

Contributor

I think we should use Resource::builder() instead, because it's supposed to perform additional detection from other sources. Here's the docstring:

    /// Creates a [ResourceBuilder] that allows you to configure multiple aspects of the Resource.
    ///
    /// This [ResourceBuilder] will include the following [ResourceDetector]s:
    /// - [SdkProvidedResourceDetector]
    /// - [TelemetryResourceDetector]
    /// - [EnvResourceDetector]
    ///   If you'd like to start from an empty resource, use [Resource::builder_empty].

@RitwijParmar

Contributor Author

Updated this.

The final resource now uses Resource::builder(), so the SDK and telemetry detectors still run. The explicit Daft service name fallback stays in place.

Focused test passes.

cargo test -p common-tracing resource_

@RitwijParmar

Contributor Author

Could someone rerun the failed jobs when you get a chance?

The tracing change looks clean from the logs. The native IO job hit external data access issues: Hugging Face returned HTTP 429s and S3 image reads timed out.

The aggregate unit and integration jobs failed because those downstream jobs failed. The other unit failure happened after the tests ran, during the coverage merge, with llvm-profdata reporting that no profile can be merged.

Local focused check still passes.

cargo test -p common-tracing resource_

@srilman

srilman commented Jun 18, 2026

Contributor

@RitwijParmar can you merge with main? It should help with the flaky tests.

@RitwijParmar

Contributor Author

Merged origin/main into this branch and pushed the update.

Focused tracing test still passes locally:

cargo test -p common-tracing resource_

Result: 3 passed.

@srilman left a comment

Contributor

quick question

Comment thread src/common/tracing/src/config.rs Outdated

pub fn resource(&self) -> Resource {
    let env_resource = Resource::builder_empty()
        .with_detector(Box::new(EnvResourceDetector::new()))

Contributor

Any reason to only specify the EnvResourceDetector instead of the 3 defaults?

Contributor Author

Good catch. I changed this to derive from Resource::builder() so the SDK-provided, telemetry, and env detectors are all included.

I still filter the SDK unknown_service placeholder so Daft keeps its explicit daft fallback when no service name is configured. Focused test passes: cargo test -p common-tracing resource_.
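The placeholder filtering mentioned here can be illustrated with a small std-only sketch. It is not the code in this PR: `effective_service_name` and `is_sdk_placeholder` are hypothetical names, and matching both `unknown_service` and the `unknown_service:<process>` form described in the OpenTelemetry specification is an assumption about what the detectors may emit.

```rust
/// Daft's explicit fallback when no service name is configured.
pub const DAFT_DEFAULT_SERVICE_NAME: &str = "daft";

/// Returns true for the placeholder a default detector reports when nothing
/// is configured. Assumption: either the bare `unknown_service` value or the
/// `unknown_service:<process name>` form may appear.
fn is_sdk_placeholder(name: &str) -> bool {
    name == "unknown_service" || name.starts_with("unknown_service:")
}

/// Keeps a user-configured service name and replaces a missing, empty, or
/// placeholder value with the Daft default.
pub fn effective_service_name(detected: Option<&str>) -> String {
    match detected {
        Some(name) if !name.is_empty() && !is_sdk_placeholder(name) => name.to_string(),
        _ => DAFT_DEFAULT_SERVICE_NAME.to_string(),
    }
}
```

The point of the check is that running all three default detectors should not silently replace the "daft" fallback with the SDK's generic placeholder.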

3 participants