Commit b8b5aa2

Refine README content and add roadmap section
Updated README.md for clarity and added roadmap details.
1 parent 304ed9b commit b8b5aa2

1 file changed

Lines changed: 69 additions & 13 deletions

README.md

@@ -1,46 +1,95 @@
 # DataFusion-DuckLake

-**This is an early pre-release, that is very much so a work in progress.**
+**This is an early pre-release and very much a work in progress.**

 A DataFusion extension for querying [DuckLake](https://ducklake.select). DuckLake is an integrated data lake and catalog format that stores metadata in SQL databases and data as Parquet files on disk or object storage.

+The goal of this project is to make DuckLake a first-class, Arrow-native lakehouse format inside DataFusion.
+
+---
+
 ## Currently Supported

 - Read-only queries against DuckLake catalogs
 - DuckDB catalog backend
 - Local filesystem and S3-compatible object stores (MinIO, S3)
 - Snapshot-based consistency
 - Basic and decimal types
-- Hierarchical path resolution (data_path, schema, table, file)
-- Delete files for row-level deletion (MOR - Merge-On-Read)
+- Hierarchical path resolution (`data_path`, `schema`, `table`, `file`)
+- Delete files for row-level deletion (MOR Merge-On-Read)
 - Parquet footer size hints for optimized I/O
 - Filter pushdown to Parquet for row group pruning and page-level filtering
 - Dynamic metadata lookup (no upfront catalog caching)
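
The hierarchical path resolution item in the list above (`data_path`, `schema`, `table`, `file`) can be illustrated with a small standalone sketch. It assumes each level stores a path plus a flag saying whether that path is relative to its parent, and that an absolute path at any level replaces everything resolved so far. `PathPart`, `join`, and `resolve_file_path` are hypothetical names for illustration only and are not taken from this crate.

```rust
/// One level of the hierarchy (schema, table, or file).
/// Hypothetical type: a path plus a flag saying whether it is relative to its parent.
#[derive(Debug, Clone)]
pub struct PathPart {
    pub path: String,
    pub is_relative: bool,
}

impl PathPart {
    pub fn relative(path: &str) -> Self {
        Self { path: path.to_string(), is_relative: true }
    }

    pub fn absolute(path: &str) -> Self {
        Self { path: path.to_string(), is_relative: false }
    }
}

/// Join `child` onto `base` with exactly one `/` between them.
fn join(base: &str, child: &str) -> String {
    format!("{}/{}", base.trim_end_matches('/'), child.trim_start_matches('/'))
}

/// Resolve data_path -> schema -> table -> file.
/// A relative part is appended to what has been resolved so far;
/// an absolute part replaces it (an assumption of this sketch).
pub fn resolve_file_path(
    data_path: &str,
    schema: &PathPart,
    table: &PathPart,
    file: &PathPart,
) -> String {
    let mut resolved = data_path.to_string();
    for part in [schema, table, file] {
        resolved = if part.is_relative {
            join(&resolved, &part.path)
        } else {
            part.path.clone()
        };
    }
    resolved
}
```

With all three parts relative, the result is the catalog's `data_path` followed by the schema, table, and file segments. If the table part is absolute, the `data_path` and schema segments are ignored and only the file segment is appended to the table path.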

+---
+
 ## Known Limitations

 - Complex types (nested lists, structs, maps) have minimal support
 - No write operations
-- No filter-based file pruning (partition pruning not yet implemented)
+- No partition-based file pruning
 - Single metadata provider implementation (DuckDB only)
+- No time travel support
+
+---
+
+## Roadmap
+
+This project is under active development. The roadmap below reflects major areas of work currently underway or planned next. For the most up-to-date view, see the open issues and pull requests in this repository.
+
+### Metadata & Catalog Improvements
+
+- Metadata caching to reduce repeated catalog lookups
+- Pluggable metadata providers beyond DuckDB:
+  - PostgreSQL
+  - SQLite
+  - MySQL
+- Clear abstraction boundaries between catalog, metadata provider, and execution
+
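
As a rough illustration of the metadata caching item above, the sketch below keys cached entries by snapshot id and table name. It assumes that metadata for a given snapshot id never changes, so entries need no invalidation. `TableMetadata` and `SnapshotCache` are hypothetical stand-ins and do not reflect this crate's actual types or design.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for whatever a metadata provider returns for a table.
#[derive(Debug, Clone, PartialEq)]
pub struct TableMetadata {
    pub files: Vec<String>,
}

/// Cache keyed by (snapshot id, table name).
pub struct SnapshotCache {
    entries: HashMap<(u64, String), TableMetadata>,
    /// Number of times the loader actually ran (cache misses).
    pub lookups: usize,
}

impl SnapshotCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new(), lookups: 0 }
    }

    /// Return the cached metadata, calling `load` only on a miss.
    pub fn get_or_load<F>(&mut self, snapshot_id: u64, table: &str, load: F) -> &TableMetadata
    where
        F: FnOnce() -> TableMetadata,
    {
        // Borrow the counter separately so the closure does not need `self`.
        let lookups = &mut self.lookups;
        self.entries
            .entry((snapshot_id, table.to_string()))
            .or_insert_with(|| {
                *lookups += 1;
                load()
            })
    }
}
```

In this sketch a second request for the same snapshot and table is served from the map without touching the catalog. A query against a newer snapshot uses a different key and therefore triggers a fresh lookup.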
+### Query Planning & Performance
+
+- Partition-aware file pruning
+- Improved predicate pushdown
+- Smarter Parquet I/O planning
+- Reduced metadata round-trips during planning
+- Better alignment with DataFusion optimizer rules
+
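
To make the partition-aware file pruning item above more concrete, here is a minimal sketch of the idea for a single equality predicate. It assumes each data file records its partition values as strings. `DataFile` and `prune_by_partition` are hypothetical and are not part of this crate or of DataFusion.

```rust
use std::collections::HashMap;

/// Hypothetical description of one data file and its partition values.
#[derive(Debug)]
pub struct DataFile {
    pub path: String,
    /// Partition column name -> partition value.
    pub partition_values: HashMap<String, String>,
}

impl DataFile {
    pub fn new(path: &str, partitions: &[(&str, &str)]) -> Self {
        Self {
            path: path.to_string(),
            partition_values: partitions
                .iter()
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .collect(),
        }
    }
}

/// Keep only files that could contain rows matching `column = value`.
/// Files with no recorded value for the column are kept, because they
/// cannot be ruled out from metadata alone.
pub fn prune_by_partition<'a>(
    files: &'a [DataFile],
    column: &str,
    value: &str,
) -> Vec<&'a DataFile> {
    files
        .iter()
        .filter(|f| match f.partition_values.get(column) {
            Some(v) => v.as_str() == value,
            None => true,
        })
        .collect()
}
```

Pruning at this level happens before any Parquet file is opened, which is what distinguishes it from the row group and page-level filtering already listed under Currently Supported.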
+### Write Support
+
+- Initial write support for DuckLake tables
+- Append-only writes
+- Foundations for future upsert and delete workflows
+- Proper snapshot and commit handling
+
+### Time Travel & Versioning

-## TODO
-- [ ] Support caching metadata
-- [ ] Support alternative metadata databases
-  - [ ] postgres
-  - [ ] sqlite
-  - [ ] mysql
-- [ ] Writes
-- [ ] Timetravel
+- Querying historical snapshots
+- Explicit snapshot selection
+
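
One way to picture explicit snapshot selection is as a small selector resolved against the list of known snapshots. The sketch below assumes each snapshot has a numeric id and a commit timestamp in seconds. `Snapshot`, `SnapshotSelector`, and `select_snapshot` are hypothetical names, and the roadmap does not state how the feature will actually be exposed.

```rust
/// Hypothetical snapshot record: an id plus a commit time (seconds since epoch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Snapshot {
    pub id: u64,
    pub committed_at: u64,
}

/// How a query picks the snapshot it reads from.
pub enum SnapshotSelector {
    /// The newest snapshot (highest id).
    Latest,
    /// An exact snapshot id.
    Id(u64),
    /// The newest snapshot committed at or before the given time.
    AsOf(u64),
}

/// Resolve a selector to a snapshot, or `None` if nothing matches.
pub fn select_snapshot(snapshots: &[Snapshot], selector: &SnapshotSelector) -> Option<Snapshot> {
    match selector {
        SnapshotSelector::Latest => snapshots.iter().copied().max_by_key(|s| s.id),
        SnapshotSelector::Id(id) => snapshots.iter().copied().find(|s| s.id == *id),
        SnapshotSelector::AsOf(ts) => snapshots
            .iter()
            .copied()
            .filter(|s| s.committed_at <= *ts)
            .max_by_key(|s| s.committed_at),
    }
}
```

Returning `None` for an unknown id or for a timestamp earlier than the first snapshot leaves it to the caller to decide whether that is an error or an empty result.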
+### Type System Expansion
+
+- Improved support for complex and nested types
+- Better alignment with DuckDB and DataFusion type semantics
+
+### Stability & Ergonomics
+
+- Expanded test coverage
+- Improved error messages and diagnostics
+- Cleaner APIs for embedding in other DataFusion-based systems
+- Additional documentation and examples
+
+---

 ## Usage
+
 ### Example
+
 ```bash
 cargo run --example basic_query -- <catalog.db> <sql>
+
 ```

 ### Integration
-
 ```rust
 use datafusion::execution::runtime_env::RuntimeEnv;
 use datafusion::prelude::*;
@@ -79,4 +128,11 @@ ctx.register_catalog("ducklake", Arc::new(catalog));
 // Query
 let df = ctx.sql("SELECT * FROM ducklake.main.my_table").await?;
 df.show().await?;
+
+
 ```
+### Project Status
+
+This project is evolving alongside DataFusion and DuckLake. APIs may change as core abstractions are refined.
+
+Feedback, issues, and contributions are welcome.
