From b8b5aa2bbcb4a0e1b2463b3aa0d39c187062b0f0 Mon Sep 17 00:00:00 2001
From: Eddie A Tejeda
Date: Tue, 16 Dec 2025 08:33:24 -0800
Subject: [PATCH 1/2] Refine README content and add roadmap section

Updated README.md for clarity and added roadmap details.
---
 README.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 69 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 8f74fdd..26a4c44 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,13 @@
 # DataFusion-DuckLake
 
-**This is an early pre-release, that is very much so a work in progress.**
+**This is an early pre-release and very much a work in progress.**
 
 A DataFusion extension for querying [DuckLake](https://ducklake.select). DuckLake is an integrated data lake and catalog format that stores metadata in SQL databases and data as Parquet files on disk or object storage.
 
+The goal of this project is to make DuckLake a first-class, Arrow-native lakehouse format inside DataFusion.
+
+---
+
 ## Currently Supported
 
 - Read-only queries against DuckLake catalogs
@@ -11,36 +15,81 @@ A DataFusion extension for querying [DuckLake](https://ducklake.select). DuckLak
 - Local filesystem and S3-compatible object stores (MinIO, S3)
 - Snapshot-based consistency
 - Basic and decimal types
-- Hierarchical path resolution (data_path, schema, table, file)
-- Delete files for row-level deletion (MOR - Merge-On-Read)
+- Hierarchical path resolution (`data_path`, `schema`, `table`, `file`)
+- Delete files for row-level deletion (MOR – Merge-On-Read)
 - Parquet footer size hints for optimized I/O
 - Filter pushdown to Parquet for row group pruning and page-level filtering
 - Dynamic metadata lookup (no upfront catalog caching)
 
+---
+
 ## Known Limitations
 
 - Complex types (nested lists, structs, maps) have minimal support
 - No write operations
-- No filter-based file pruning (partition pruning not yet implemented)
+- No partition-based file pruning
 - Single metadata provider implementation (DuckDB only)
+- No time travel support
+
+---
+
+## Roadmap
+
+This project is under active development. The roadmap below reflects major areas of work currently underway or planned next. For the most up-to-date view, see the open issues and pull requests in this repository.
+
+### Metadata & Catalog Improvements
+
+- Metadata caching to reduce repeated catalog lookups
+- Pluggable metadata providers beyond DuckDB:
+  - PostgreSQL
+  - SQLite
+  - MySQL
+- Clear abstraction boundaries between catalog, metadata provider, and execution
+
+### Query Planning & Performance
+
+- Partition-aware file pruning
+- Improved predicate pushdown
+- Smarter Parquet I/O planning
+- Reduced metadata round-trips during planning
+- Better alignment with DataFusion optimizer rules
+
+### Write Support
+
+- Initial write support for DuckLake tables
+- Append-only writes
+- Foundations for future upsert and delete workflows
+- Proper snapshot and commit handling
+
+### Time Travel & Versioning
 
-## TODO
-- [ ] Support caching metadata
-- [ ] Support alternative metadata databases
-  - [ ] postgres
-  - [ ] sqlite
-  - [ ] mysql
-- [ ] Writes
-- [ ] Timetravel
+- Querying historical snapshots
+- Explicit snapshot selection
+
+### Type System Expansion
+
+- Improved support for complex and nested types
+- Better alignment with DuckDB and DataFusion type semantics
+
+### Stability & Ergonomics
+
+- Expanded test coverage
+- Improved error messages and diagnostics
+- Cleaner APIs for embedding in other DataFusion-based systems
+- Additional documentation and examples
+
+---
 
 ## Usage
+
 ### Example
+
 ```bash
 cargo run --example basic_query --
+
 ```
 
 ### Integration
-
 ```rust
 use datafusion::execution::runtime_env::RuntimeEnv;
 use datafusion::prelude::*;
@@ -79,4 +128,11 @@ ctx.register_catalog("ducklake", Arc::new(catalog));
 // Query
 let df = ctx.sql("SELECT * FROM ducklake.main.my_table").await?;
 df.show().await?;
+
+
 ```
+### Project Status
+
+This project is evolving alongside DataFusion and DuckLake. APIs may change as core abstractions are refined.
+
+Feedback, issues, and contributions are welcome.
From 1033ac8cde7f800c3fc8f11e94ba3a0e3e25785f Mon Sep 17 00:00:00 2001
From: Eddie A Tejeda
Date: Wed, 17 Dec 2025 10:44:55 -0800
Subject: [PATCH 2/2] Update README.md

Co-authored-by: Zac Farrell
---
 README.md | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/README.md b/README.md
index 26a4c44..8f52b02 100644
--- a/README.md
+++ b/README.md
@@ -57,9 +57,6 @@ This project is under active development. The roadmap below reflects major areas
 ### Write Support
 
 - Initial write support for DuckLake tables
-- Append-only writes
-- Foundations for future upsert and delete workflows
-- Proper snapshot and commit handling
 
 ### Time Travel & Versioning
 