Skip to content

Conversation

@roeap
Copy link
Collaborator

@roeap roeap commented Oct 9, 2025

Description

This is the first direct step toward kernelized scans. The core of this will be a new TableProvider. This new engine directly integrates with the datafusion session (more specifically, TaskContext). I thought about making a more generic engine that we could also use for non-datafusion code-paths. However, we would like to exchange the "inner" implementations to also leverage datafusion abstractions, at which point these would likely diverge. Thus, hoping to integrate this deeper first, before seeing if and how we can share between implementations.

Why do we need a dedicated engine? URL support and moving away from object store paths.

I have this integrated with a table provider locally, but I am trying to keep things reviewable.

@roeap roeap requested review from hntd187 and rtyler as code owners October 9, 2025 15:36
@roeap roeap requested review from fvaleye and ion-elgreco October 9, 2025 15:36
@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Oct 9, 2025
Comment on lines +87 to +97
let handler: Arc<dyn JsonHandler> = match self.handle.runtime_flavor() {
RuntimeFlavor::MultiThread => Arc::new(DefaultJsonHandler::new(
store,
Arc::new(TokioMultiThreadExecutor::new(self.handle.clone())),
)),
RuntimeFlavor::CurrentThread => Arc::new(DefaultJsonHandler::new(
store,
Arc::new(TokioBackgroundExecutor::new()),
)),
_ => panic!("unsupported runtime flavor"),
};
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We replicate this logic inside functions to avoid the generic parameter spilling out into our implementations. This tends to bubble up quite quickly since TaskExecutor is not dyn.

Once we move to datafusion specific implementations, these will not be generic over the runtime since datafusion and we are tied to tokio.

Comment on lines +112 to +127
Ok(Box::new(
grouped_files
.into_iter()
.map(|(url, files)| {
self.get_or_create_pq(url)?.read_parquet_files(
&files.to_vec(),
physical_schema.clone(),
predicate.clone(),
)
})
// TODO: this should not do any blocking operations, since this should
// happen when the iterators are polled and we are just creating a vec of iterators.
// Is this correct?
.try_collect::<_, Vec<_>, _>()?
.into_iter()
.flatten(),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would love some feedback on this comment if I made the right assumptions here. We can also handle this in the map, but this gets a bit messy...

Comment on lines +138 to +145
pub(crate) fn group_by_store<T: IntoIterator<Item = impl AsObjectStoreUrl>>(
files: T,
) -> std::collections::HashMap<ObjectStoreUrl, Vec<T::Item>> {
files
.into_iter()
.map(|item| (item.as_object_store_url(), item))
.into_group_map()
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the core piece for routing things to individual sub-instances scoped to an individual store.

There is a caveat in there in that we could be changing the order of inputs, which, of course, hurts log replay. But will look closer into this. The good news is that logs reside in a single store; only data files are potentially distributed across stores.

@roeap roeap force-pushed the feat/kernel-df-engine branch from ba9df85 to 86e7655 Compare October 9, 2025 16:19
@codecov
Copy link

codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 36.73469% with 155 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.74%. Comparing base (9f67f9f) to head (34e9bcd).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...s/core/src/delta_datafusion/engine/file_formats.rs 0.00% 115 Missing ⚠️
crates/core/src/delta_datafusion/engine/mod.rs 0.00% 20 Missing ⚠️
crates/core/src/delta_datafusion/engine/storage.rs 81.72% 12 Missing and 5 partials ⚠️
crates/core/src/kernel/snapshot/mod.rs 81.25% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3831      +/-   ##
==========================================
- Coverage   73.99%   73.74%   -0.25%     
==========================================
  Files         148      151       +3     
  Lines       38904    39117     +213     
  Branches    38904    39117     +213     
==========================================
+ Hits        28788    28848      +60     
- Misses       8850     8999     +149     
- Partials     1266     1270       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@roeap roeap force-pushed the feat/kernel-df-engine branch from 86e7655 to 7eeb6e6 Compare October 9, 2025 16:43
@roeap roeap marked this pull request as draft October 11, 2025 09:15
@rtyler rtyler changed the title feat: datafusiuon based kernel engine feat: datafusion based kernel engine Oct 14, 2025
@rtyler rtyler marked this pull request as ready for review October 14, 2025 14:28
@rtyler rtyler enabled auto-merge (rebase) October 14, 2025 14:28
Copy link
Member

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this code is a no-op and not currently used, I think it's safe to merge in order to advance more collaboration in this area

@rtyler rtyler merged commit d950bf2 into delta-io:main Oct 14, 2025
25 of 26 checks passed
@roeap roeap deleted the feat/kernel-df-engine branch October 14, 2025 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

binding/rust Issues for the Rust crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants