
@fvaleye
Collaborator

@fvaleye fvaleye commented Sep 30, 2025

Description

To better understand performance in the delta-rs crate, I added tracing to capture more detailed, debug-level performance information.

Python now uses OpenTelemetry to collect tracing data emitted from Rust.
With this change, we gain true end-to-end visibility: Python spans can serve as parents of Rust spans (and vice versa), ensuring a continuous trace across both runtimes.
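How a Python span becomes the parent of a Rust span depends on how trace context crosses the language boundary. A common mechanism is the W3C Trace Context `traceparent` string. The sketch below assumes that format purely for illustration; the PR may instead pass context in-process through the OpenTelemetry APIs. It shows the invariant that keeps a trace continuous: a child keeps its parent's trace-id and gets a new span-id. `SpanContext`, `parse_traceparent`, `child_of`, and `to_traceparent` are hypothetical names.

```rust
// Sketch only: W3C `traceparent` handling (std only), assumed for illustration.
// Format for version 00: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>".

#[derive(Debug, Clone, PartialEq)]
pub struct SpanContext {
    pub trace_id: String, // shared by every span in the trace
    pub span_id: String,  // unique per span
    pub sampled: bool,    // bit 0 of the trace flags
}

fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| b.is_ascii_digit() || (b'a'..=b'f').contains(&b))
}

/// Returns None for malformed headers. Only version 00 is accepted here,
/// which is a simplification of the spec's forward-compatibility rules.
pub fn parse_traceparent(header: &str) -> Option<SpanContext> {
    let parts: Vec<&str> = header.split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    let (trace_id, span_id, flags) = (parts[1], parts[2], parts[3]);
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(span_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // All-zero ids are invalid.
    if trace_id.bytes().all(|b| b == b'0') || span_id.bytes().all(|b| b == b'0') {
        return None;
    }
    let flags = u8::from_str_radix(flags, 16).ok()?;
    Some(SpanContext {
        trace_id: trace_id.to_string(),
        span_id: span_id.to_string(),
        sampled: flags & 0x01 == 0x01,
    })
}

/// A child (e.g. a Rust span under a Python span) inherits the trace_id,
/// so both runtimes report into the same trace; the caller supplies the new span_id.
pub fn child_of(parent: &SpanContext, new_span_id: &str) -> SpanContext {
    SpanContext {
        trace_id: parent.trace_id.clone(),
        span_id: new_span_id.to_string(),
        sampled: parent.sampled,
    }
}

pub fn to_traceparent(ctx: &SpanContext) -> String {
    format!("00-{}-{}-{:02x}", ctx.trace_id, ctx.span_id, if ctx.sampled { 1 } else { 0 })
}
```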

Related Issue(s)

Documentation

@fvaleye fvaleye added enhancement New feature or request binding/rust Issues for the Rust crate labels Sep 30, 2025
@github-actions

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@fvaleye fvaleye force-pushed the feature/add-tracing-for-performance-analysis branch 2 times, most recently from 3f02b2c to 2641ccd on September 30, 2025 12:52
@fvaleye fvaleye changed the title feat(tracing): Add tracing spans to all I/O sections feat(tracing): add tracing spans to all I/O sections Sep 30, 2025
@codecov

codecov bot commented Sep 30, 2025

Codecov Report

❌ Patch coverage is 73.60775% with 109 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.98%. Comparing base (acd75d6) to head (610b4a6).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| python/src/tracing_otlp.rs | 0.00% | 37 Missing ⚠️ |
| crates/core/src/operations/write/mod.rs | 87.22% | 6 Missing and 23 partials ⚠️ |
| crates/core/src/kernel/transaction/mod.rs | 76.85% | 26 Missing and 2 partials ⚠️ |
| python/src/lib.rs | 0.00% | 10 Missing ⚠️ |
| crates/core/src/operations/filesystem_check.rs | 71.42% | 2 Missing ⚠️ |
| crates/core/src/operations/vacuum.rs | 71.42% | 2 Missing ⚠️ |
| crates/core/src/logstore/mod.rs | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3795      +/-   ##
==========================================
- Coverage   74.32%   73.98%   -0.35%     
==========================================
  Files         147      148       +1     
  Lines       39722    38882     -840     
  Branches    39722    38882     -840     
==========================================
- Hits        29523    28765     -758     
- Misses       8808     8852      +44     
+ Partials     1391     1265     -126     


@fvaleye fvaleye force-pushed the feature/add-tracing-for-performance-analysis branch from 2641ccd to e5812b1 on September 30, 2025 13:05
rtyler previously approved these changes Sep 30, 2025
Member

@rtyler rtyler left a comment


Generally very excited to have more tracing in place. Once tests pass I'm comfortable merging this

let commit_or_bytes = this.commit_or_bytes;

if this.table_data.is_none() {
tracing::debug!("committing initial table version 0");
Member

👋 In a lot of places we actually just import with use tracing::*;, which means this module prefix isn't necessary.

IMHO it's reasonable for us to pull tracing::* into most modules 😄

Collaborator Author

Okay! It was a way to distinguish the log crate's macros from tracing's.

Collaborator Author

Now imported with use tracing::*;, if you find that OK for readability.
I didn't want to mix the log and tracing crates, but it seems reasonable for now! We can tighten the imports or add explicit bridging later if the project grows.

Collaborator

@ion-elgreco ion-elgreco Oct 8, 2025

Can we tighten the imports now to be more explicit, or is tracing a very small library?

@fvaleye fvaleye force-pushed the feature/add-tracing-for-performance-analysis branch 3 times, most recently from 37d883c to ebed3e9 on September 30, 2025 13:24
@fvaleye
Collaborator Author

fvaleye commented Sep 30, 2025

Generally very excited to have more tracing in place. Once tests pass I'm comfortable merging this

Same!

I’m re-requesting a review based on your feedback. I’m still waiting for the code coverage results; once they are in, we can proceed with a merge!

@fvaleye fvaleye force-pushed the feature/add-tracing-for-performance-analysis branch from ebed3e9 to fdd6946 on September 30, 2025 13:50
@fvaleye fvaleye requested a review from rtyler September 30, 2025 13:50
@fvaleye fvaleye enabled auto-merge (rebase) September 30, 2025 13:55
@ion-elgreco
Collaborator

This is stdout/err tracing only, right?

@fvaleye
Collaborator Author

fvaleye commented Sep 30, 2025

This is stdout/err tracing only, right?

I used the warn, info, and debug levels together with tracing, adding spans with custom fields to track performance.

tracing isn’t limited to stdout or stderr. It’s a subscriber-agnostic instrumentation framework that emits structured events with typed fields, not just plain text logs.

Where these traces go is entirely up to the user: with tracing-subscriber and the appropriate layers (such as tracing-opentelemetry), traces can be sent to distributed tracing backends like Jaeger, Zipkin, Tempo, Honeycomb, or Datadog, to OpenTelemetry collectors, to custom backends, or nowhere at all.

Example:

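// Option 1: Console output (assumes tracing-subscriber with its "env-filter" feature enabled)
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let subscriber = tracing_subscriber::fmt()
        .with_env_filter("deltalake=debug")
        .finish();
    tracing::subscriber::set_global_default(subscriber)?;
    Ok(())
}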
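To illustrate the subscriber-agnostic design described above, here is a toy model in plain Rust (std only; it is not tracing's actual Subscriber trait). Instrumented code emits structured events with typed fields, and whichever subscriber is plugged in decides the destination. StderrSubscriber, BufferingSubscriber, and commit_version are hypothetical names.

```rust
// Toy model (std only), not tracing's real Subscriber trait.
use std::sync::Mutex;

/// Field values keep their types instead of being pre-formatted into text.
#[derive(Debug, Clone, PartialEq)]
pub enum FieldValue {
    Str(String),
    U64(u64),
    Bool(bool),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub level: &'static str,
    pub message: &'static str,
    pub fields: Vec<(&'static str, FieldValue)>,
}

/// The only thing instrumented code depends on.
pub trait Subscriber {
    fn event(&self, event: &Event);
}

/// Console destination: formats each event as a line on stderr.
pub struct StderrSubscriber;
impl Subscriber for StderrSubscriber {
    fn event(&self, e: &Event) {
        eprintln!("{} {} {:?}", e.level, e.message, e.fields);
    }
}

/// Stand-in for an exporter that would batch events and ship them elsewhere
/// (for example to an OpenTelemetry collector); here it just keeps them.
#[derive(Default)]
pub struct BufferingSubscriber {
    pub events: Mutex<Vec<Event>>,
}
impl Subscriber for BufferingSubscriber {
    fn event(&self, e: &Event) {
        self.events.lock().unwrap().push(e.clone());
    }
}

/// Instrumented code: emits typed fields, unaware of where they end up.
pub fn commit_version(sub: &dyn Subscriber, table_path: &str, version: u64) {
    sub.event(&Event {
        level: "DEBUG",
        message: "committed table version",
        fields: vec![
            ("table.path", FieldValue::Str(table_path.to_string())),
            ("version", FieldValue::U64(version)),
            ("initial", FieldValue::Bool(version == 0)),
        ],
    });
}
```

Swapping the destination means passing a different subscriber; commit_version itself never changes.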

Side note: if no subscriber is configured, the overhead is nearly zero.
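The near-zero overhead comes from checking whether anything is listening before doing any work. Below is a simplified sketch of that idea, assuming a single global max-level hint; tracing also caches per-callsite interest, which this omits. emit, install_subscriber, and the level constants are hypothetical.

```rust
// Simplified model of cheap disabled instrumentation (std only).
use std::sync::atomic::{AtomicU8, Ordering};

pub const OFF: u8 = 0;
pub const ERROR: u8 = 1;
pub const WARN: u8 = 2;
pub const INFO: u8 = 3;
pub const DEBUG: u8 = 4;

// Nothing is enabled until a subscriber is installed.
static MAX_LEVEL: AtomicU8 = AtomicU8::new(OFF);

pub fn install_subscriber(max_level: u8) {
    MAX_LEVEL.store(max_level, Ordering::Relaxed);
}

/// Records an event if `level` is enabled and returns whether it did.
/// `build` (formatting, collecting fields) only runs when enabled.
pub fn emit(level: u8, build: impl FnOnce() -> String) -> bool {
    if level > MAX_LEVEL.load(Ordering::Relaxed) {
        // Disabled path: one atomic load and a comparison, nothing allocated.
        return false;
    }
    let message = build();
    eprintln!("[level {level}] {message}");
    true
}
```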

@ion-elgreco
Collaborator

ion-elgreco commented Oct 1, 2025

I get that, my point is rather: how would one configure it to push to an OTEL endpoint from Python ^^

@fvaleye
Collaborator Author

fvaleye commented Oct 1, 2025

I get that, my point is rather: how would one configure it to push to an OTEL endpoint from Python ^^

Ah, sorry! 😀
Indeed, we might need to wire up tracing-opentelemetry + opentelemetry-otlp so spans are pushed to an OTLP endpoint (e.g. http://localhost:4317). On the Python side, we just point the OpenTelemetry SDK at the same collector. Both Rust and Python traces then land in the same backend.

In Python, we might need to add a small API like deltalake.init_tracing("otel_endpoint") that sets up the exporter inside the Rust runtime, WDYT?

I can take the time to add it, since we have many cases where performance is evaluated from Python!
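One possible shape for the Rust side of the proposed deltalake.init_tracing("otel_endpoint"), sketched with std only. The exporter wiring (tracing-opentelemetry + opentelemetry-otlp) is deliberately left out. The sketch only shows the one-shot initialization a global subscriber needs, since tracing's set_global_default returns an error if called a second time. The init_tracing signature, InitError, and the endpoint check are assumptions, not an existing delta-rs API.

```rust
// Hypothetical one-shot init for a Python-facing init_tracing (std only).
use std::sync::OnceLock;

#[derive(Debug, PartialEq)]
pub enum InitError {
    InvalidEndpoint(String),
    AlreadyInitialized { endpoint: String },
}

// Remembers the endpoint from the first successful call.
static TRACING_ENDPOINT: OnceLock<String> = OnceLock::new();

pub fn init_tracing(endpoint: &str) -> Result<(), InitError> {
    // Deliberately shallow check, for illustration only.
    if !(endpoint.starts_with("http://") || endpoint.starts_with("https://")) {
        return Err(InitError::InvalidEndpoint(endpoint.to_string()));
    }
    if TRACING_ENDPOINT.set(endpoint.to_string()).is_err() {
        // A second global subscriber cannot be installed; report the existing one.
        let existing = TRACING_ENDPOINT.get().cloned().unwrap_or_default();
        return Err(InitError::AlreadyInitialized { endpoint: existing });
    }
    // Exporter and subscriber setup would go here (not shown in this sketch).
    Ok(())
}
```

Surfacing a second call as an explicit error, rather than silently ignoring it, makes it visible from Python when two libraries both try to own the global subscriber.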

@rtyler
Member

rtyler commented Oct 1, 2025

I'm marking this as a draft, please change that when you're ready for this to be merged @fvaleye

@rtyler rtyler marked this pull request as draft October 1, 2025 12:55
auto-merge was automatically disabled October 1, 2025 12:55

Pull request was converted to draft

@github-actions github-actions bot added the binding/python Issues for the Python package label Oct 1, 2025
@fvaleye fvaleye force-pushed the feature/add-tracing-for-performance-analysis branch 2 times, most recently from d810d7d to a6738b0 on October 1, 2025 14:39
@fvaleye fvaleye force-pushed the feature/add-tracing-for-performance-analysis branch 4 times, most recently from 69785aa to b6040eb on October 7, 2025 17:11
Resolved conflicts:
- crates/core/src/delta_datafusion/mod.rs: Kept find_files implementation with tracing
- crates/core/src/operations/optimize.rs: Merged logging with object_store initialization
- crates/core/src/operations/write/mod.rs: Used DeltaPlanner::new() from main, maintained tracing
- crates/core/src/protocol/checkpoints.rs: Used specific tracing imports from main
@fvaleye fvaleye force-pushed the feature/add-tracing-for-performance-analysis branch from b6040eb to ada34d4 on October 7, 2025 17:13
@ion-elgreco
Collaborator

@fvaleye is this in a state it can be reviewed again?

@fvaleye
Collaborator Author

fvaleye commented Oct 8, 2025

@fvaleye is this in a state it can be reviewed again?

Yes!

latest_version - steps,
(latest_version - steps) + 1,
)
async move {
Collaborator

The additional async move doesn't seem necessary, why this change?

Collaborator Author

It's mainly for attaching .instrument(commit_span).await to this part.
This ensures the span is attached to the future and entered on each poll, so it stays correct across await points.
But I can extract the retry logic into a separate function and put the #[instrument] macro on it, which would be cleaner.
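Why .instrument(span) on the future rather than entering the span inside it: holding a span's enter() guard across an .await can leave the span marked as current while the task is suspended and other work runs on the thread, which tracing's documentation warns against. The toy re-implementation below (std only, not tracing's code) enters the span at the start of every poll and exits when poll returns. InstrumentExt, Instrumented, YieldOnce, and block_on are hypothetical names.

```rust
// Toy model of tracing's `.instrument(span)`; not tracing's actual code.
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

thread_local! {
    // Names of the spans currently entered on this thread (innermost last).
    static CURRENT: RefCell<Vec<&'static str>> = RefCell::new(Vec::new());
}

pub fn current_span() -> Option<&'static str> {
    CURRENT.with(|c| c.borrow().last().copied())
}

/// Wraps a future so `span` is current only while the future is being polled.
pub struct Instrumented<F> {
    inner: Pin<Box<F>>,
    span: &'static str,
}

pub trait InstrumentExt: Future + Sized {
    fn instrument(self, span: &'static str) -> Instrumented<Self> {
        Instrumented { inner: Box::pin(self), span }
    }
}
impl<F: Future> InstrumentExt for F {}

impl<F: Future> Future for Instrumented<F> {
    type Output = F::Output;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        let span = self.span;
        CURRENT.with(|c| c.borrow_mut().push(span)); // enter
        let result = self.inner.as_mut().poll(cx);
        CURRENT.with(|c| c.borrow_mut().pop()); // exit, whether Ready or Pending
        result
    }
}

/// Returns Pending once before completing, like an I/O await point.
pub struct YieldOnce(pub bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal single-threaded executor: re-polls until Ready.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

The span is current on both sides of the await and gone once the future completes; extracting the retry logic into an #[instrument]-annotated async fn gives the same per-poll behavior with less code.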

Collaborator

Ah ok, yeah the diff was so big I didn't see that

Collaborator Author

Understandable, with all these whitespace changes 😄


#[async_trait::async_trait]
impl<T: ObjectStore + Clone> ObjectStore for DeltaIOStorageBackend<T> {
#[instrument(skip(self, bytes), fields(path = %location, size = bytes.content_length()))]
Collaborator

This wrapper is almost never hit at this point; maybe we need a simple wrapper, applied during storage creation, that wraps and instruments these methods.

I also wonder whether the object store already has some native tracing.

Collaborator Author

Good point!
I don't know, but we could open an issue to track this.

Collaborator

@fvaleye yeah, sounds good
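A minimal sketch of the "wrap at storage creation" idea: a decorator that implements the same trait as the inner store, so every call is timed no matter which code path makes it. Store, MemoryStore, Traced, and build_store are hypothetical stand-ins, not object_store's ObjectStore trait or delta-rs's DeltaIOStorageBackend. With tracing, each method would carry an #[instrument(...)] attribute instead of pushing CallRecords.

```rust
// Decorator sketch (std only); all type names are hypothetical.
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

pub trait Store {
    fn put(&self, path: &str, bytes: Vec<u8>);
    fn get(&self, path: &str) -> Option<Vec<u8>>;
}

#[derive(Default)]
pub struct MemoryStore {
    data: Mutex<HashMap<String, Vec<u8>>>,
}
impl Store for MemoryStore {
    fn put(&self, path: &str, bytes: Vec<u8>) {
        self.data.lock().unwrap().insert(path.to_string(), bytes);
    }
    fn get(&self, path: &str) -> Option<Vec<u8>> {
        self.data.lock().unwrap().get(path).cloned()
    }
}

/// One record per call: operation, path, payload size, elapsed time.
#[derive(Debug, Clone)]
pub struct CallRecord {
    pub op: &'static str,
    pub path: String,
    pub size: usize,
    pub elapsed: Duration,
}

pub struct Traced<S> {
    inner: S,
    pub records: Mutex<Vec<CallRecord>>,
}

impl<S> Traced<S> {
    fn record(&self, op: &'static str, path: &str, size: usize, start: Instant) {
        let rec = CallRecord { op, path: path.to_string(), size, elapsed: start.elapsed() };
        self.records.lock().unwrap().push(rec);
    }
}

impl<S: Store> Store for Traced<S> {
    fn put(&self, path: &str, bytes: Vec<u8>) {
        let (start, size) = (Instant::now(), bytes.len());
        self.inner.put(path, bytes);
        self.record("put", path, size, start);
    }
    fn get(&self, path: &str) -> Option<Vec<u8>> {
        let start = Instant::now();
        let out = self.inner.get(path);
        self.record("get", path, out.as_ref().map_or(0, Vec::len), start);
        out
    }
}

/// Wrapping once, at creation, means callers can only obtain the instrumented store.
pub fn build_store() -> Traced<MemoryStore> {
    Traced { inner: MemoryStore::default(), records: Mutex::new(Vec::new()) }
}
```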

Collaborator

Why are changes being made here? This looks like an incorrect merge-conflict resolution.

Collaborator Author

@fvaleye fvaleye Oct 8, 2025

Right, let me take a look!

Collaborator Author

Changed it, but the diff is still significant because of the async traced block.

Comment on lines +63 to +67
with tracer.start_as_current_span("delta_write_operation") as span:
span.set_attribute("table.path", table_path)
span.set_attribute("operation.mode", "overwrite")

write_deltalake(table_path, sample_data, mode="overwrite")
Collaborator

I guess a follow-up PR could do this for users directly?

Collaborator Author

Yes, good idea! 👍

Collaborator

@roeap roeap left a comment

Just flushing some first-pass comments. Awesome stuff!!

For everyone reviewing this, I recommend disabling whitespace changes.

.map_err(|err| -> TransactionError {
match err {
ObjectStoreError::AlreadyExists { .. } => {
warn!("commit entry already exists");
Collaborator

Does this need to be a warning? Conflicting commits might be considered something we expect in the "normal" flow of things.

Collaborator Author

Good question!
I wanted to inform the user that an AlreadyExists error was encountered, which to me falls somewhere between info and warn.
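Some context for the log-level question: Delta commits are written with put-if-absent semantics on _delta_log/<version>.json, so a concurrent writer winning the race surfaces as AlreadyExists and the losing writer retries at a later version. The sketch below logs each conflict at debug and only warns once retries are exhausted. That is one possible answer, not the PR's behavior. Log, commit, and CommitOutcome are hypothetical (std only), and the conflict checks a real commit runs before retrying are omitted.

```rust
// Optimistic-concurrency sketch (std only); names are hypothetical.
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum PutError {
    AlreadyExists(u64),
}

#[derive(Default)]
pub struct Log {
    versions: HashSet<u64>,
}

impl Log {
    /// Put-if-absent for a single version file: fails if another writer got there first.
    pub fn put_if_absent(&mut self, version: u64) -> Result<(), PutError> {
        if self.versions.insert(version) { Ok(()) } else { Err(PutError::AlreadyExists(version)) }
    }
}

#[derive(Debug, PartialEq)]
pub enum CommitOutcome {
    Committed { version: u64, conflicts: u32 },
    GaveUp { attempts: u32 },
}

pub fn commit(log: &mut Log, start_version: u64, max_retries: u32) -> CommitOutcome {
    let mut version = start_version;
    for attempt in 0..=max_retries {
        match log.put_if_absent(version) {
            Ok(()) => return CommitOutcome::Committed { version, conflicts: attempt },
            Err(PutError::AlreadyExists(v)) => {
                // Expected under concurrency, so a debug-level event seems enough here.
                eprintln!("debug: version {v} already exists, retrying at {}", v + 1);
                version = v + 1;
            }
        }
    }
    // Running out of retries is the unusual case worth a warning.
    eprintln!("warn: gave up after {} attempts", max_retries + 1);
    CommitOutcome::GaveUp { attempts: max_retries + 1 }
}
```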

object store communication. Please file a GitHub issue to request a critical
openssl upgrade.

## Tracing and Observability
Collaborator

I do agree that it is high time to update our README, but is this critical enough for people coming in, or should this be in the docs (or both?)

Collaborator Author

Not critical, but I wanted to create a section to improve our issue resolution by adding instructions on how to enable tracing. This would provide additional context when investigating issues.
I can definitely move this somewhere else.

@fvaleye fvaleye force-pushed the feature/add-tracing-for-performance-analysis branch from 9402e59 to 78bd04f on October 8, 2025 17:34
@fvaleye fvaleye force-pushed the feature/add-tracing-for-performance-analysis branch from 78bd04f to 4e669e1 on October 9, 2025 09:29
@fvaleye fvaleye force-pushed the feature/add-tracing-for-performance-analysis branch from 4e669e1 to e04ffd1 on October 9, 2025 09:37
@ion-elgreco
Collaborator

Just flushing some first-pass comments. Awesome stuff!!

For everyone reviewing this, I recommend disabling whitespace changes.

Yeah, that definitely helps; should have enabled that earlier :P

Resolved conflicts in write operation by:
- Using session refactoring from main (instead of state)
- Keeping MetricObserver tracing from feature branch
- Maintaining instrumentation spans for performance analysis
@ion-elgreco ion-elgreco enabled auto-merge (squash) October 13, 2025 08:05
@ion-elgreco ion-elgreco merged commit 4acc60b into delta-io:main Oct 13, 2025
28 of 29 checks passed

Labels

binding/python Issues for the Python package binding/rust Issues for the Rust crate enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Introduce tracing spans to all I/O sections

5 participants