fix(tracing): honor OTEL resource env config#7061
Conversation
Greptile SummaryThis PR improves Daft's OpenTelemetry configuration by reading
Confidence Score: 4/5The changes are well-scoped and the common cases work correctly; two minor issues in config.rs are worth addressing before merging. The resource-building logic works for all tested scenarios but relies on undocumented SDK insertion-order semantics when both OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES carry a service.name. The test helper also leaves env vars unrestored on panic. src/common/tracing/src/config.rs — the resource() method and the with_env_vars test helper. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["Config::resource()"] --> B["Build env_resource via EnvResourceDetector"]
B --> C{"OTEL_SERVICE_NAME set?"}
C -- Yes --> D["service_name = OTEL_SERVICE_NAME"]
C -- No --> E{"service.name in OTEL_RESOURCE_ATTRIBUTES?"}
E -- Yes --> F["service_name = value from env_resource"]
E -- No --> G["service_name = 'daft'"]
D --> H["Resource::builder_empty().with_attributes().with_service_name().build()"]
F --> H
G --> H
H --> I["Shared across Tracer / Meter / Logger providers"]
Reviews (1): Last reviewed commit: "fix(tracing): honor OTEL resource env co..." | Re-trigger Greptile |
| Resource::builder_empty() | ||
| .with_attributes( | ||
| env_resource | ||
| .iter() | ||
| .map(|(key, value)| KeyValue::new(key.clone(), value.clone())), | ||
| ) | ||
| .with_service_name(service_name) | ||
| .build() |
There was a problem hiding this comment.
If
OTEL_RESOURCE_ATTRIBUTES contains a service.name key, that attribute is included in the with_attributes call AND then also set by with_service_name. Whether the later with_service_name wins over the earlier attribute depends on the ResourceBuilder's internal deduplication order (the SDK uses a HashMap, so last-write-wins is likely, but the contract isn't documented). Filtering service.name out of the attributes iteration makes the priority explicit and removes the dependency on insertion order.
| Resource::builder_empty() | |
| .with_attributes( | |
| env_resource | |
| .iter() | |
| .map(|(key, value)| KeyValue::new(key.clone(), value.clone())), | |
| ) | |
| .with_service_name(service_name) | |
| .build() | |
| Resource::builder_empty() | |
| .with_attributes( | |
| env_resource | |
| .iter() | |
| .filter(|(key, _)| key.as_str() != "service.name") | |
| .map(|(key, value)| KeyValue::new(key.clone(), value.clone())), | |
| ) | |
| .with_service_name(service_name) | |
| .build() |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| fn with_env_vars<R>(vars: &[(&str, Option<&str>)], f: impl FnOnce() -> R) -> R { | ||
| let _guard = ENV_LOCK.lock().unwrap(); | ||
| let previous_values: Vec<_> = vars | ||
| .iter() | ||
| .map(|(name, _)| (*name, std::env::var(name).ok())) | ||
| .collect(); | ||
|
|
||
| for (name, value) in vars { | ||
| match value { | ||
| Some(value) => unsafe { std::env::set_var(name, value) }, | ||
| None => unsafe { std::env::remove_var(name) }, | ||
| } | ||
| } | ||
|
|
||
| let result = f(); | ||
|
|
||
| for (name, value) in previous_values { | ||
| match value { | ||
| Some(value) => unsafe { std::env::set_var(name, value) }, | ||
| None => unsafe { std::env::remove_var(name) }, | ||
| } | ||
| } | ||
|
|
||
| result | ||
| } |
There was a problem hiding this comment.
Test env-var cleanup is not panic-safe
If the closure f() panics (e.g. an assertion fires), the for loop that restores the previous values never executes, leaving the modified env vars behind for every subsequent test in the process. The ENV_LOCK mutex will also be poisoned by the panic, causing all later with_env_vars calls to propagate the poison via .unwrap(). A Drop-based restore guard would fix both problems.
| let exporter = opentelemetry_otlp::SpanExporter::builder() | ||
| .with_tonic() | ||
| .with_endpoint(otlp_endpoint) | ||
| .with_timeout(Duration::from_secs(10)) |
There was a problem hiding this comment.
Since this is now being read from the env what is the default if it's not set?
There was a problem hiding this comment.
It looks like the defaults were already 10s: https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/#timeout-configuration
| let env_resource = Resource::builder_empty() | ||
| .with_detector(Box::new(EnvResourceDetector::new())) | ||
| .build(); | ||
| let service_name = std::env::var("OTEL_SERVICE_NAME") |
There was a problem hiding this comment.
Is there a constant in the opentelemetry_otlp create we can use?
srilman
left a comment
There was a problem hiding this comment.
Just had 1 point but otherwise LGTM!
| } | ||
|
|
||
| pub fn resource(&self) -> Resource { | ||
| let env_resource = Resource::builder_empty() |
There was a problem hiding this comment.
I think we should use Resource::builder() instead, because its supposed to perform additional detection from other sources. Here's the docstring:
/// Creates a [ResourceBuilder] that allows you to configure multiple aspects of the Resource.
///
/// This [ResourceBuilder] will include the following [ResourceDetector]s:
/// - [SdkProvidedResourceDetector]
/// - [TelemetryResourceDetector]
/// - [EnvResourceDetector]
/// If you'd like to start from an empty resource, use [Resource::builder_empty].
|
Updated this. The final resource now uses Resource builder so SDK and telemetry detectors still run. The explicit Daft service name fallback stays in place. Focused test passes. cargo test -p common-tracing resource_ |
|
Could someone rerun the failed jobs when you get a chance? The tracing change looks clean from the logs. The native IO job hit external data access issues. Hugging Face returned HTTP 429s and S3 image reads timed out. The aggregate unit and integration jobs failed because those downstream jobs failed. The other unit failure happened after tests during coverage merge with llvm-profdata saying no profile can be merged. Local focused check still passes. cargo test -p common-tracing resource_ |
|
@RitwijParmar can you merge with main, it should help with the flaky tests |
|
Merged Focused tracing test still passes locally: Result: 3 passed. |
|
|
||
| pub fn resource(&self) -> Resource { | ||
| let env_resource = Resource::builder_empty() | ||
| .with_detector(Box::new(EnvResourceDetector::new())) |
There was a problem hiding this comment.
Any reason to only specify the EnvResourceDetector instead of the 3 defaults?
There was a problem hiding this comment.
Good catch. I changed this to derive from Resource::builder() so the SDK-provided, telemetry, and env detectors are all included.
I still filter the SDK unknown_service placeholder so Daft keeps its explicit daft fallback when no service name is configured. Focused test passes: cargo test -p common-tracing resource_.
Changes Made
This improves Daft's OpenTelemetry configuration path for production deployments:
ResourcefromOTEL_RESOURCE_ATTRIBUTESwhile preserving Daft's existingservice.name=daftdefaultOTEL_SERVICE_NAMEto override the default service name, so distributed workers / drivers can be tagged separately by deploymentOTEL_EXPORTER_OTLP_TIMEOUTand per-signal timeout env varsOTEL_SERVICE_NAME, and resource attributesThe upstream
opentelemetry-otlpbuilder already handles OTLP headers, compression, endpoints, and timeout env fallbacks. This keeps Daft from accidentally overriding those env-driven settings while still keeping the Daft-specific default resource name.Related Issues
Addresses part of #6144.
Validation:
cargo fmt -p common-tracingcargo test -p common-tracing