79 changes: 66 additions & 13 deletions README.md
@@ -1,46 +1,92 @@
# DataFusion-DuckLake

**This is an early pre-release and very much a work in progress.**

A DataFusion extension for querying [DuckLake](https://ducklake.select). DuckLake is an integrated data lake and catalog format that stores metadata in SQL databases and data as Parquet files on disk or object storage.

The goal of this project is to make DuckLake a first-class, Arrow-native lakehouse format inside DataFusion.

---

## Currently Supported

- Read-only queries against DuckLake catalogs
- DuckDB catalog backend
- Local filesystem and S3-compatible object stores (MinIO, S3)
- Snapshot-based consistency
- Basic and decimal types
- Hierarchical path resolution (`data_path`, `schema`, `table`, `file`)
- Delete files for row-level deletion (MOR: Merge-On-Read)
- Parquet footer size hints for optimized I/O
- Filter pushdown to Parquet for row group pruning and page-level filtering
- Dynamic metadata lookup (no upfront catalog caching)
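
As an illustration of the hierarchical path resolution item above, here is a minimal, standard-library-only sketch. It assumes each level (`schema`, `table`, `file`) carries a path plus a flag saying whether that path is relative to the level above it, with `data_path` as the root. The `PathPart` type and `resolve_path` function are hypothetical names invented for this example and are not part of this crate's API.

```rust
/// One level of the hierarchy (schema, table, or file).
/// Hypothetical type for illustration only; not this crate's API.
struct PathPart<'a> {
    path: &'a str,
    /// Assumption: the catalog records whether the path is relative
    /// to the level above it.
    is_relative: bool,
}

/// Resolve a location by walking data_path -> schema -> table -> file.
/// A relative part is appended to what has been resolved so far;
/// an absolute part replaces it entirely.
fn resolve_path(data_path: &str, parts: &[PathPart<'_>]) -> String {
    let mut resolved = data_path.to_string();
    for part in parts {
        if part.is_relative {
            // Insert exactly one separator between the two segments.
            if !resolved.is_empty() && !resolved.ends_with('/') {
                resolved.push('/');
            }
            resolved.push_str(part.path.trim_start_matches('/'));
        } else {
            resolved = part.path.to_string();
        }
    }
    resolved
}
```

Under these assumptions, calling `resolve_path("s3://bucket/lake/", ...)` with relative parts `main/`, `my_table/`, and `data_0.parquet` yields `s3://bucket/lake/main/my_table/data_0.parquet`, while an absolute part at any level discards everything resolved before it.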

---

## Known Limitations

- Complex types (nested lists, structs, maps) have minimal support
- No write operations
- No partition-based file pruning
- Single metadata provider implementation (DuckDB only)
- No time travel support

---

## Roadmap

This project is under active development. The roadmap below reflects major areas of work currently underway or planned next. For the most up-to-date view, see the open issues and pull requests in this repository.

### Metadata & Catalog Improvements

- Metadata caching to reduce repeated catalog lookups
- Pluggable metadata providers beyond DuckDB:
  - PostgreSQL
  - SQLite
  - MySQL
- Clear abstraction boundaries between catalog, metadata provider, and execution

### Query Planning & Performance

- Partition-aware file pruning
- Improved predicate pushdown
- Smarter Parquet I/O planning
- Reduced metadata round-trips during planning
- Better alignment with DataFusion optimizer rules

### Write Support

- Initial write support for DuckLake tables

### Time Travel & Versioning

- Querying historical snapshots
- Explicit snapshot selection
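
To make snapshot selection concrete, the sketch below shows one way a reader could decide which data files belong to a given snapshot. It assumes each file record carries a `begin_snapshot` and an optional `end_snapshot`, and that a file is visible when `begin_snapshot <= id` and either `end_snapshot` is absent or `id < end_snapshot`. The `DataFile` struct and `files_at_snapshot` function are hypothetical names for illustration and do not describe how this crate implements or will implement time travel.

```rust
/// Simplified, hypothetical view of a data-file record in the catalog.
#[derive(Debug, Clone, PartialEq)]
struct DataFile {
    path: String,
    /// First snapshot in which the file is part of the table.
    begin_snapshot: u64,
    /// Snapshot in which the file stopped being part of the table;
    /// `None` means the file is still live.
    end_snapshot: Option<u64>,
}

/// Return the files visible at `snapshot_id` (assumed visibility rule:
/// begin <= id, and id < end when an end is recorded).
fn files_at_snapshot(files: &[DataFile], snapshot_id: u64) -> Vec<&DataFile> {
    files
        .iter()
        .filter(|f| {
            f.begin_snapshot <= snapshot_id
                && f.end_snapshot.map_or(true, |end| snapshot_id < end)
        })
        .collect()
}
```

With this rule, querying a historical snapshot only changes which `snapshot_id` is passed in; the rest of the read path can stay the same.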

### Type System Expansion

- Improved support for complex and nested types
- Better alignment with DuckDB and DataFusion type semantics

### Stability & Ergonomics

- Expanded test coverage
- Improved error messages and diagnostics
- Cleaner APIs for embedding in other DataFusion-based systems
- Additional documentation and examples

---

## Usage

### Example

```bash
cargo run --example basic_query -- <catalog.db> <sql>
```

### Integration

```rust
use datafusion::execution::runtime_env::RuntimeEnv;
use datafusion::prelude::*;
// ... (lines collapsed in the diff view) ...
ctx.register_catalog("ducklake", Arc::new(catalog));
// Query
let df = ctx.sql("SELECT * FROM ducklake.main.my_table").await?;
df.show().await?;
```

### Project Status

This project is evolving alongside DataFusion and DuckLake. APIs may change as core abstractions are refined.

Feedback, issues, and contributions are welcome.