diff --git a/README.md b/README.md
index 8f74fdd..8f52b02 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,13 @@
 # DataFusion-DuckLake
 
-**This is an early pre-release, that is very much so a work in progress.**
+**This is an early pre-release and very much a work in progress.**
 
 A DataFusion extension for querying [DuckLake](https://ducklake.select). DuckLake is an integrated data lake and catalog format that stores metadata in SQL databases and data as Parquet files on disk or object storage.
 
+The goal of this project is to make DuckLake a first-class, Arrow-native lakehouse format inside DataFusion.
+
+---
+
 ## Currently Supported
 
 - Read-only queries against DuckLake catalogs
@@ -11,36 +15,78 @@ A DataFusion extension for querying [DuckLake](https://ducklake.select). DuckLak
 - Local filesystem and S3-compatible object stores (MinIO, S3)
 - Snapshot-based consistency
 - Basic and decimal types
-- Hierarchical path resolution (data_path, schema, table, file)
-- Delete files for row-level deletion (MOR - Merge-On-Read)
+- Hierarchical path resolution (`data_path`, `schema`, `table`, `file`)
+- Delete files for row-level deletion (MOR – Merge-On-Read)
 - Parquet footer size hints for optimized I/O
 - Filter pushdown to Parquet for row group pruning and page-level filtering
 - Dynamic metadata lookup (no upfront catalog caching)
 
+---
+
 ## Known Limitations
 
 - Complex types (nested lists, structs, maps) have minimal support
 - No write operations
-- No filter-based file pruning (partition pruning not yet implemented)
+- No partition-based file pruning
 - Single metadata provider implementation (DuckDB only)
+- No time travel support
+
+---
+
+## Roadmap
+
+This project is under active development. The roadmap below reflects major areas of work currently underway or planned next. For the most up-to-date view, see the open issues and pull requests in this repository.
+
+### Metadata & Catalog Improvements
+
+- Metadata caching to reduce repeated catalog lookups
+- Pluggable metadata providers beyond DuckDB:
+  - PostgreSQL
+  - SQLite
+  - MySQL
+- Clear abstraction boundaries between catalog, metadata provider, and execution
+
+### Query Planning & Performance
+
+- Partition-aware file pruning
+- Improved predicate pushdown
+- Smarter Parquet I/O planning
+- Reduced metadata round-trips during planning
+- Better alignment with DataFusion optimizer rules
+
+### Write Support
+
+- Initial write support for DuckLake tables
+
+### Time Travel & Versioning
 
-## TODO
-- [ ] Support caching metadata
-- [ ] Support alternative metadata databases
-  - [ ] postgres
-  - [ ] sqlite
-  - [ ] mysql
-- [ ] Writes
-- [ ] Timetravel
+- Querying historical snapshots
+- Explicit snapshot selection
+
+### Type System Expansion
+
+- Improved support for complex and nested types
+- Better alignment with DuckDB and DataFusion type semantics
+
+### Stability & Ergonomics
+
+- Expanded test coverage
+- Improved error messages and diagnostics
+- Cleaner APIs for embedding in other DataFusion-based systems
+- Additional documentation and examples
+
+---
 
 ## Usage
+
 ### Example
+
 ```bash
 cargo run --example basic_query --
+
 ```
 
 ### Integration
-
 ```rust
 use datafusion::execution::runtime_env::RuntimeEnv;
 use datafusion::prelude::*;
@@ -79,4 +125,11 @@ ctx.register_catalog("ducklake", Arc::new(catalog));
 // Query
 let df = ctx.sql("SELECT * FROM ducklake.main.my_table").await?;
 df.show().await?;
+
+
 ```
+### Project Status
+
+This project is evolving alongside DataFusion and DuckLake. APIs may change as core abstractions are refined.
+
+Feedback, issues, and contributions are welcome.