feat(table): implement TableProvider::statistics() with byte-size aggregate by anoop-narang · Pull Request #112 · datafusion-contrib/datafusion-ducklake

anoop-narang · 2026-05-05T07:40:35Z

Summary

DuckLakeTable previously inherited DataFusion's default TableProvider::statistics() -> None, so every query against a DuckLake table was planned without any cost-based input: the optimizer had no size information, so it could pick the wrong join order or a weak aggregation strategy. This PR implements the override using per-file metadata already cached on the table:

fn statistics(&self) -> Option<Statistics> {
    let data_bytes: i64 = self.table_files.iter()
        .map(|f| f.file.file_size_bytes).sum();
    let delete_bytes: i64 = self.table_files.iter()
        .filter_map(|f| f.delete_file.as_ref())
        .map(|df| df.file_size_bytes).sum();
    let net_bytes = (data_bytes - delete_bytes).max(0) as usize;

    let mut stats = Statistics::new_unknown(&self.schema);
    stats.total_byte_size = Precision::Inexact(net_bytes);
    Some(stats)
}

The math mirrors DuckLake's own ducklake_table_info aggregate exactly:

total_byte_size == SUM(data_file.file_size_bytes)
                   - SUM(delete_file.file_size_bytes)

Both sides read the same ducklake_data_file rows from the catalog. So for any (schema, table, snapshot) the value our statistics() returns is byte-identical to what ducklake_table_info reports.
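To make the arithmetic concrete without pulling in DataFusion, here is a minimal, self-contained sketch of the same aggregate. `FileEntry`, `TableFile`, and `net_byte_size` are hypothetical stand-ins that only borrow the field names from the snippet above; they are not the crate's real types.

```rust
/// Hypothetical stand-in for one catalog file entry; only the size matters here.
#[derive(Debug, Clone)]
pub struct FileEntry {
    pub file_size_bytes: i64,
}

/// Hypothetical stand-in for a data file and its optional delete file.
#[derive(Debug, Clone)]
pub struct TableFile {
    pub file: FileEntry,
    pub delete_file: Option<FileEntry>,
}

/// SUM(data file sizes) - SUM(delete file sizes), clamped at zero so that
/// an inconsistent catalog can never produce a negative size.
pub fn net_byte_size(files: &[TableFile]) -> usize {
    let data_bytes: i64 = files.iter().map(|f| f.file.file_size_bytes).sum();
    let delete_bytes: i64 = files
        .iter()
        .filter_map(|f| f.delete_file.as_ref())
        .map(|d| d.file_size_bytes)
        .sum();
    (data_bytes - delete_bytes).max(0) as usize
}
```

With one 1,000-byte data file carrying a 100-byte delete file and one 1,225-byte data file without deletes, the function returns 2,125; an empty slice returns 0.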

The work is purely in-memory iteration over already-cached metadata — sub-microsecond, zero I/O, zero new SQL. By the time statistics() is callable, DuckLakeTable::new has already loaded every file's metadata via MetadataProvider::get_table_files_for_select (see table.rs:154-156).

Why `Precision::Inexact`

DataFusion documents total_byte_size as the "uncompressed Arrow output size", while the catalog tracks compressed Parquet bytes. For wide types (List(Float64) embeddings) the two are nearly identical; for narrow scalar schemas the on-disk size is 3-5× smaller than the Arrow output. Reporting compressed bytes as Inexact gives consumers a useful estimate that typically acts as a lower bound, without misleading the optimizer into trusting it as the exact Arrow size.

When record_count is plumbed into DuckLakeFileData (see "Future enrichment" below), a follow-up can populate num_rows and use Statistics::calculate_total_byte_size(&schema) to derive an Arrow-side estimate.
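As a rough illustration of what an Arrow-side estimate means for a narrow scalar schema, the sketch below multiplies a row count by the sum of fixed column widths. It does not call DataFusion; `FixedWidth` and `arrow_size_estimate` are hypothetical names, and the sketch assumes fixed-width columns only, ignoring validity bitmaps and variable-width types.

```rust
/// Hypothetical subset of fixed-width column kinds, for illustration only.
#[derive(Debug, Clone, Copy)]
pub enum FixedWidth {
    Int32,
    Int64,
    Float64,
    Date32,
}

impl FixedWidth {
    /// Bytes per value in an Arrow primitive array of this kind.
    pub fn byte_width(self) -> usize {
        match self {
            FixedWidth::Int32 | FixedWidth::Date32 => 4,
            FixedWidth::Int64 | FixedWidth::Float64 => 8,
        }
    }
}

/// Estimated in-memory size: rows * sum of per-column widths.
/// Assumption: no nulls bitmap and no variable-width columns are counted.
pub fn arrow_size_estimate(num_rows: usize, columns: &[FixedWidth]) -> usize {
    let row_width: usize = columns.iter().map(|c| c.byte_width()).sum();
    num_rows.saturating_mul(row_width)
}
```

For 1,000 rows of (Int64, Float64, Date32) the estimate is 20,000 bytes, which is the kind of number that can sit well above the compressed on-disk size for such schemas.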

Validation against `ducklake_table_info`

The committed integration test (test_statistics_total_byte_size_matches_catalog_aggregate in tests/table_tests.rs) creates a populated DuckLake-backed catalog and asserts statistics().total_byte_size == SUM(file_size_bytes) - SUM(delete_file_size_bytes) byte-for-byte using the same MetadataProvider API the impl reads from.

For external validation, the same property holds against a real ducklake catalog. Below: ducklake_table_info output for every table in a Postgres-backed Tigris-storage ducklake catalog (42 tables across 6 schemas, sizes spanning 1 KB to 8 GB). For each row, data_bytes - del_bytes is exactly what DuckLakeTable::statistics().total_byte_size will report after this PR, by construction.

schema	table	files	data_bytes	canonical (== `total_byte_size`)
main	demo_users	2	2,225	2,225
main	food101	1	207,987,774	207,987,774
main	internet_pages	2	587,035,741	587,035,741
main	internet_pages_small	1	46,286	46,286
sphere_vector_1m	_queries	7	60,309	60,309
sphere_vector_1m	vectors	17	8,044,745,778	8,044,745,778
tpch_sf0001	_queries	22	43,953	43,953
tpch_sf0001	customer	1	15,237	15,237
tpch_sf0001	lineitem	1	204,681	204,681
tpch_sf0001	nation	1	2,324	2,324
tpch_sf0001	orders	1	60,355	60,355
tpch_sf0001	part	1	11,327	11,327
tpch_sf0001	partsupp	1	44,378	44,378
tpch_sf0001	region	1	1,072	1,072
tpch_sf0001	supplier	1	2,247	2,247
tpch_sf001	_queries	22	43,953	43,953
tpch_sf001	customer	1	125,840	125,840
tpch_sf001	lineitem	1	1,822,420	1,822,420
tpch_sf001	nation	1	2,324	2,324
tpch_sf001	orders	1	537,447	537,447
tpch_sf001	part	1	69,368	69,368
tpch_sf001	partsupp	1	428,280	428,280
tpch_sf001	region	1	1,072	1,072
tpch_sf001	supplier	1	10,461	10,461
tpch_sf01	_queries	22	43,953	43,953
tpch_sf01	customer	1	1,238,688	1,238,688
tpch_sf01	lineitem	1	18,951,829	18,951,829
tpch_sf01	nation	1	2,324	2,324
tpch_sf01	orders	1	5,309,094	5,309,094
tpch_sf01	part	1	637,250	637,250
tpch_sf01	partsupp	1	4,180,238	4,180,238
tpch_sf01	region	1	1,072	1,072
tpch_sf01	supplier	1	82,133	82,133
tpch_sf1	_queries	22	40,890	40,890
tpch_sf1	customer	1	12,322,113	12,322,113
tpch_sf1	lineitem	1	207,129,996	207,129,996
tpch_sf1	nation	1	2,324	2,324
tpch_sf1	orders	1	56,093,865	56,093,865
tpch_sf1	part	1	6,363,508	6,363,508
tpch_sf1	partsupp	1	42,643,725	42,643,725
tpch_sf1	region	1	1,072	1,072
tpch_sf1	supplier	1	793,635	793,635

Captured via:

ATTACH 'ducklake:postgres:...' AS lake (DATA_PATH 's3://.../');
SELECT s.schema_name, t.table_name, t.file_count AS files,
       t.file_size_bytes AS data_bytes, t.delete_file_count AS del_files,
       t.delete_file_size_bytes AS del_bytes,
       t.file_size_bytes - t.delete_file_size_bytes AS canonical
FROM ducklake_table_info('lake') t
JOIN __ducklake_metadata_lake.ducklake_schema s ON s.schema_id = t.schema_id
ORDER BY s.schema_name, t.table_name;

Future enrichment (additive, no API change)

The Statistics shell this PR populates supports far more than total_byte_size. The following extensions can land as separate PRs without changing statistics()'s return type or any caller:

num_rows — extend per-backend SELECT in metadata_provider_*.rs to project record_count (already a column in ducklake_data_file), add it to DuckLakeFileData, sum on read. Marked Precision::Exact once available; falls back to Absent if unsupported. Once known, Statistics::calculate_total_byte_size(&schema) can replace the compressed-bytes estimate with an Arrow-side one.
column_statistics[i].{min, max, null_count} — DuckLake's catalog tracks per-column stats per data file; plumbing each unlocks predicate pushdown / partition pruning for free in DataFusion's planner. Precision::Absent columns upgrade to Exact non-disruptively.
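The num_rows item above could be sketched roughly as follows. This is an assumption-laden illustration, not the planned implementation: `Precision` here is a local stand-in that only mirrors the variant names used in this PR, `total_rows` is a hypothetical helper, and each file's record_count is modeled as an `Option<u64>` that is `None` when a backend does not provide it.

```rust
/// Local stand-in mirroring the variant names used in this PR;
/// not DataFusion's actual `Precision` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Sum per-file record counts. If any file lacks a count, report `Absent`
/// rather than a partial sum. Rows removed by delete files are not
/// subtracted in this sketch.
pub fn total_rows(record_counts: &[Option<u64>]) -> Precision {
    let mut total: u64 = 0;
    for count in record_counts {
        match count {
            Some(n) => total = total.saturating_add(*n),
            None => return Precision::Absent,
        }
    }
    Precision::Exact(total as usize)
}
```

An empty table yields `Precision::Exact(0)`, matching the choice made for byte sizes in this PR, while a single missing count downgrades the whole result to `Precision::Absent`.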

Test plan

cargo build --all-features: clean
cargo fmt --check: clean
cargo clippy --all-features --all-targets -- -D warnings: clean
cargo test --all-features: all tests pass except test_read_pme_encrypted_parquet, a pre-existing failure on main (Binder Error: parent_column not found in FROM clause) that is unrelated to this change. Verified by reproducing on main without these commits.
New test test_statistics_total_byte_size_matches_catalog_aggregate asserts byte-for-byte equality between statistics() and the canonical catalog aggregate.
New test test_statistics_zero_for_empty_table covers the no-data-file edge case.
External validation against a real Postgres-backed ducklake catalog: 42 tables, all size buckets from KB to GB, captured above.

…regate

DataFusion's `TableProvider::statistics()` is the canonical hook the optimizer calls to ask "what does this table look like", used for join ordering, cardinality estimation, partition pruning decisions, etc. `DuckLakeTable` previously inherited the trait's default `None`, so the optimizer planned every DuckLake query without any cost-based input.

Override the method to populate `Statistics.total_byte_size` from the per-file byte sizes already cached on `DuckLakeTable.table_files`. The math mirrors DuckLake's own `ducklake_table_info` aggregate exactly:

total_byte_size == SUM(data_file.file_size_bytes) - SUM(delete_file.file_size_bytes)

Both sides read the same `ducklake_data_file` rows from the catalog, just expressed via different APIs, so for any (schema, table, snapshot) the value `statistics()` returns is byte-identical to what `ducklake_table_info` reports.

Marked `Precision::Inexact`. DataFusion documents `total_byte_size` as the "uncompressed Arrow output" size, while what the catalog tracks is *compressed parquet bytes*. For wide types (e.g. `List(Float64)` embedding columns) the two are nearly identical; for narrow scalar schemas the on-disk number is 3-5× smaller than Arrow output. Reporting compressed bytes Inexact gives consumers a useful lower-bound estimate without misleading the optimizer into trusting it as exact Arrow size.

The implementation is purely in-memory iteration over already-cached metadata: sub-microsecond, zero I/O, zero new SQL. By the time `statistics()` is callable, `DuckLakeTable::new` has already loaded every file's metadata via `MetadataProvider::get_table_files_for_select` (see table.rs:154-156).

Future enrichment paths the same `Statistics` shell supports without any further API surface change:
- `num_rows`: extend per-backend SELECTs to project `record_count` (already a column in `ducklake_data_file`), add it to `DuckLakeFileData`, sum on read.
- `column_statistics[i].{min,max,null_count}`: DuckLake's catalog tracks per-column stats; plumbing each unlocks predicate pushdown. `Precision::Absent` cleanly upgrades to `Exact`/`Inexact` with no breakage.

Tests:
- `test_statistics_total_byte_size_matches_catalog_aggregate`: creates a populated DuckLake-backed table, asserts `statistics().total_byte_size` equals `SUM(file_size_bytes) - SUM(delete_file_size_bytes)` from the same catalog, byte-for-byte.
- `test_statistics_zero_for_empty_table`: empty tables report 0 bytes rather than `Absent`.

anoop-narang force-pushed the feat/duckake-table-statistics branch from 1a7d803 to f8b37ad on May 5, 2026 07:48

anoop-narang merged commit c74a717 into main May 5, 2026
3 checks passed

anoop-narang deleted the feat/duckake-table-statistics branch May 5, 2026 08:26

anoop-narang mentioned this pull request May 5, 2026

chore(release): prepare v0.2.1 #113

Merged

zfarrell mentioned this pull request May 8, 2026

[epic] Full feature parity with DuckLake v1.0 #114

Open

anoop-narang merged 1 commit into main from feat/duckake-table-statistics
