feat(table): implement TableProvider::statistics() with byte-size aggregate#112

Merged
anoop-narang merged 1 commit into
mainfrom
feat/duckake-table-statistics
May 5, 2026
@anoop-narang
Collaborator

Summary

DuckLakeTable previously inherited DataFusion's default TableProvider::statistics(), which returns None, so every query against a DuckLake table was planned without any cost-based input: the optimizer had no idea how big tables were and could pick a poor join order or a weak aggregation strategy. This PR implements the override using per-file metadata already cached on the table:

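```rust
fn statistics(&self) -> Option<Statistics> {
    // Total on-disk size of all data files in the snapshot.
    let data_bytes: i64 = self.table_files.iter()
        .map(|f| f.file.file_size_bytes).sum();
    // Total on-disk size of the delete files attached to those data files.
    let delete_bytes: i64 = self.table_files.iter()
        .filter_map(|f| f.delete_file.as_ref())
        .map(|df| df.file_size_bytes).sum();
    // Clamp at zero so the cast to usize cannot wrap on a negative difference.
    let net_bytes = (data_bytes - delete_bytes).max(0) as usize;

    // Every other field stays unknown; only total_byte_size is populated.
    let mut stats = Statistics::new_unknown(&self.schema);
    stats.total_byte_size = Precision::Inexact(net_bytes);
    Some(stats)
}
```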

The math mirrors DuckLake's own ducklake_table_info aggregate exactly:

total_byte_size == SUM(data_file.file_size_bytes)
                   - SUM(delete_file.file_size_bytes)

Both sides read the same ducklake_data_file rows from the catalog. So for any (schema, table, snapshot) the value our statistics() returns is byte-identical to what ducklake_table_info reports.
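To make the arithmetic concrete, here is a minimal standalone sketch of the same aggregate. It does not depend on DataFusion or on this crate: `FileMeta`, `TableFile`, and `net_byte_size` are hypothetical stand-ins for illustration, not the types used in table.rs.

```rust
/// Hypothetical stand-in for one catalog file entry (data or delete file).
#[derive(Debug, Clone)]
pub struct FileMeta {
    pub file_size_bytes: i64,
}

/// Hypothetical stand-in for a data file plus its optional delete file.
#[derive(Debug, Clone)]
pub struct TableFile {
    pub file: FileMeta,
    pub delete_file: Option<FileMeta>,
}

/// SUM(data bytes) - SUM(delete bytes), clamped at zero before the cast to usize.
pub fn net_byte_size(files: &[TableFile]) -> usize {
    let data_bytes: i64 = files.iter().map(|f| f.file.file_size_bytes).sum();
    let delete_bytes: i64 = files
        .iter()
        .filter_map(|f| f.delete_file.as_ref())
        .map(|d| d.file_size_bytes)
        .sum();
    (data_bytes - delete_bytes).max(0) as usize
}
```

With two data files of 1,000 and 1,225 bytes and one 200-byte delete file, the function returns 2,025. An empty slice yields 0, which matches the behavior the empty-table test checks.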

The work is purely in-memory iteration over already-cached metadata — sub-microsecond, zero I/O, zero new SQL. By the time statistics() is callable, DuckLakeTable::new has already loaded every file's metadata via MetadataProvider::get_table_files_for_select (see table.rs:154-156).

Why Precision::Inexact

DataFusion documents total_byte_size as the "uncompressed Arrow output size", while what the catalog tracks is compressed parquet bytes. For wide types (List(Float64) embeddings) the two are nearly identical; for narrow scalar schemas the on-disk size is 3-5× smaller than the Arrow output. Reporting compressed bytes as Inexact gives consumers a useful lower-bound estimate without misleading the optimizer into trusting it as the exact Arrow size.

When record_count is plumbed into DuckLakeFileData (see "Future enrichment" below), a follow-up can populate num_rows and use Statistics::calculate_total_byte_size(&schema) to derive an Arrow-side estimate.
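As a rough illustration of what an Arrow-side estimate looks like once a row count is known, the sketch below multiplies the row count by the summed widths of fixed-width columns. It is not DataFusion's calculate_total_byte_size: `FixedWidth` and `arrow_size_estimate` are hypothetical names, and the sketch ignores validity bitmaps, buffer padding, and variable-width types.

```rust
/// Hypothetical fixed-width column kinds; variable-width types are out of scope here.
#[derive(Debug, Clone, Copy)]
pub enum FixedWidth {
    Int32,
    Int64,
    Float64,
    Date32,
}

impl FixedWidth {
    /// Bytes occupied by one value of this kind in an Arrow values buffer.
    pub fn bytes(self) -> usize {
        match self {
            FixedWidth::Int32 | FixedWidth::Date32 => 4,
            FixedWidth::Int64 | FixedWidth::Float64 => 8,
        }
    }
}

/// rows × sum of per-value widths, ignoring validity bitmaps and buffer padding.
pub fn arrow_size_estimate(num_rows: usize, columns: &[FixedWidth]) -> usize {
    let row_width: usize = columns.iter().map(|c| c.bytes()).sum();
    num_rows.saturating_mul(row_width)
}
```

For 1,000 rows over (Int64, Int32, Float64, Date32) the estimate is 24,000 bytes. A compressed parquet file holding the same rows would typically be smaller, which is the gap that Precision::Inexact signals.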

Validation against ducklake_table_info

The committed integration test (test_statistics_total_byte_size_matches_catalog_aggregate in tests/table_tests.rs) creates a populated DuckLake-backed catalog and asserts statistics().total_byte_size == SUM(file_size_bytes) - SUM(delete_file_size_bytes) byte-for-byte using the same MetadataProvider API the impl reads from.

For external validation, the same property holds against a real DuckLake catalog. The table below shows ducklake_table_info output for every table in a Postgres-backed, Tigris-storage DuckLake catalog (42 tables across 6 schemas, sizes spanning 1 KB to 8 GB). For each row, data_bytes - del_bytes is exactly what DuckLakeTable::statistics().total_byte_size will report after this PR, by construction.

| schema | table | files | data_bytes | del_files | del_bytes | canonical (== total_byte_size) |
|---|---|---|---|---|---|---|
| main | demo_users | 2 | 2,225 | 0 | 0 | 2,225 |
| main | food101 | 1 | 207,987,774 | 0 | 0 | 207,987,774 |
| main | internet_pages | 2 | 587,035,741 | 0 | 0 | 587,035,741 |
| main | internet_pages_small | 1 | 46,286 | 0 | 0 | 46,286 |
| sphere_vector_1m | _queries | 7 | 60,309 | 0 | 0 | 60,309 |
| sphere_vector_1m | vectors | 17 | 8,044,745,778 | 0 | 0 | 8,044,745,778 |
| tpch_sf0001 | _queries | 22 | 43,953 | 0 | 0 | 43,953 |
| tpch_sf0001 | customer | 1 | 15,237 | 0 | 0 | 15,237 |
| tpch_sf0001 | lineitem | 1 | 204,681 | 0 | 0 | 204,681 |
| tpch_sf0001 | nation | 1 | 2,324 | 0 | 0 | 2,324 |
| tpch_sf0001 | orders | 1 | 60,355 | 0 | 0 | 60,355 |
| tpch_sf0001 | part | 1 | 11,327 | 0 | 0 | 11,327 |
| tpch_sf0001 | partsupp | 1 | 44,378 | 0 | 0 | 44,378 |
| tpch_sf0001 | region | 1 | 1,072 | 0 | 0 | 1,072 |
| tpch_sf0001 | supplier | 1 | 2,247 | 0 | 0 | 2,247 |
| tpch_sf001 | _queries | 22 | 43,953 | 0 | 0 | 43,953 |
| tpch_sf001 | customer | 1 | 125,840 | 0 | 0 | 125,840 |
| tpch_sf001 | lineitem | 1 | 1,822,420 | 0 | 0 | 1,822,420 |
| tpch_sf001 | nation | 1 | 2,324 | 0 | 0 | 2,324 |
| tpch_sf001 | orders | 1 | 537,447 | 0 | 0 | 537,447 |
| tpch_sf001 | part | 1 | 69,368 | 0 | 0 | 69,368 |
| tpch_sf001 | partsupp | 1 | 428,280 | 0 | 0 | 428,280 |
| tpch_sf001 | region | 1 | 1,072 | 0 | 0 | 1,072 |
| tpch_sf001 | supplier | 1 | 10,461 | 0 | 0 | 10,461 |
| tpch_sf01 | _queries | 22 | 43,953 | 0 | 0 | 43,953 |
| tpch_sf01 | customer | 1 | 1,238,688 | 0 | 0 | 1,238,688 |
| tpch_sf01 | lineitem | 1 | 18,951,829 | 0 | 0 | 18,951,829 |
| tpch_sf01 | nation | 1 | 2,324 | 0 | 0 | 2,324 |
| tpch_sf01 | orders | 1 | 5,309,094 | 0 | 0 | 5,309,094 |
| tpch_sf01 | part | 1 | 637,250 | 0 | 0 | 637,250 |
| tpch_sf01 | partsupp | 1 | 4,180,238 | 0 | 0 | 4,180,238 |
| tpch_sf01 | region | 1 | 1,072 | 0 | 0 | 1,072 |
| tpch_sf01 | supplier | 1 | 82,133 | 0 | 0 | 82,133 |
| tpch_sf1 | _queries | 22 | 40,890 | 0 | 0 | 40,890 |
| tpch_sf1 | customer | 1 | 12,322,113 | 0 | 0 | 12,322,113 |
| tpch_sf1 | lineitem | 1 | 207,129,996 | 0 | 0 | 207,129,996 |
| tpch_sf1 | nation | 1 | 2,324 | 0 | 0 | 2,324 |
| tpch_sf1 | orders | 1 | 56,093,865 | 0 | 0 | 56,093,865 |
| tpch_sf1 | part | 1 | 6,363,508 | 0 | 0 | 6,363,508 |
| tpch_sf1 | partsupp | 1 | 42,643,725 | 0 | 0 | 42,643,725 |
| tpch_sf1 | region | 1 | 1,072 | 0 | 0 | 1,072 |
| tpch_sf1 | supplier | 1 | 793,635 | 0 | 0 | 793,635 |

Captured via:

```sql
ATTACH 'ducklake:postgres:...' AS lake (DATA_PATH 's3://.../');
SELECT s.schema_name, t.table_name, t.file_count AS files,
       t.file_size_bytes AS data_bytes, t.delete_file_count AS del_files,
       t.delete_file_size_bytes AS del_bytes,
       t.file_size_bytes - t.delete_file_size_bytes AS canonical
FROM ducklake_table_info('lake') t
JOIN __ducklake_metadata_lake.ducklake_schema s ON s.schema_id = t.schema_id
ORDER BY s.schema_name, t.table_name;
```

Future enrichment (additive, no API change)

The Statistics shell this PR populates supports far more than total_byte_size. The following extensions can land as separate PRs without changing statistics()'s return type or any caller:

  1. num_rows — extend per-backend SELECT in metadata_provider_*.rs to project record_count (already a column in ducklake_data_file), add it to DuckLakeFileData, sum on read. Marked Precision::Exact once available; falls back to Absent if unsupported. Once known, Statistics::calculate_total_byte_size(&schema) can replace the compressed-bytes estimate with an Arrow-side one.
  2. column_statistics[i].{min, max, null_count} — DuckLake's catalog tracks per-column stats per data file; plumbing each unlocks predicate pushdown / partition pruning for free in DataFusion's planner. Precision::Absent columns upgrade to Exact non-disruptively.
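The num_rows item above could be sketched as follows. This is only an illustration under stated assumptions: `Precision` here is a local enum mirroring the three states named in the text rather than DataFusion's own type, `FileRowInfo` and `total_rows` are hypothetical, and downgrading to Inexact when a delete file is present is this sketch's own choice, since summed record counts would then overcount live rows.

```rust
/// Local stand-in mirroring the three states named in the text; not DataFusion's type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Hypothetical per-file metadata carrying an optional record_count.
pub struct FileRowInfo {
    pub record_count: Option<i64>,
    pub has_delete_file: bool,
}

/// Sum record_count across files.
/// - Absent if any file lacks a usable count (for example, an unsupported backend).
/// - Inexact if any file has a delete file, because deleted rows are still counted.
/// - Exact otherwise.
pub fn total_rows(files: &[FileRowInfo]) -> Precision {
    let mut total: usize = 0;
    let mut any_deletes = false;
    for f in files {
        match f.record_count {
            Some(n) if n >= 0 => total = total.saturating_add(n as usize),
            _ => return Precision::Absent,
        }
        any_deletes |= f.has_delete_file;
    }
    if any_deletes {
        Precision::Inexact(total)
    } else {
        Precision::Exact(total)
    }
}
```

Returning Absent as soon as one file lacks a count keeps the result honest: a partial sum would look like a real row count while silently undercounting.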

Test plan

  • cargo build --all-features clean
  • cargo fmt --check clean
  • cargo clippy --all-features --all-targets -- -D warnings clean
  • cargo test --all-features — all tests pass except test_read_pme_encrypted_parquet which is a pre-existing failure on main (Binder Error: parent_column not found in FROM clause) unrelated to this change. Verified by reproducing on main without these commits.
  • New test test_statistics_total_byte_size_matches_catalog_aggregate asserts byte-for-byte equality between statistics() and the canonical catalog aggregate.
  • New test test_statistics_zero_for_empty_table covers the no-data-file edge case.
  • External validation against a real Postgres-backed ducklake catalog: 42 tables, all size buckets from KB to GB, captured above.

…regate

DataFusion's `TableProvider::statistics()` is the canonical hook the
optimizer calls to ask "what does this table look like" — used for join
ordering, cardinality estimation, partition pruning decisions, etc.
`DuckLakeTable` previously inherited the trait's default `None`, so the
optimizer planned every DuckLake query without any cost-based input.

Override the method to populate `Statistics.total_byte_size` from the
per-file byte sizes already cached on `DuckLakeTable.table_files`. The
math mirrors DuckLake's own `ducklake_table_info` aggregate exactly:

    total_byte_size == SUM(data_file.file_size_bytes)
                       - SUM(delete_file.file_size_bytes)

Both sides read the same `ducklake_data_file` rows from the catalog,
just expressed via different APIs — so for any (schema, table, snapshot)
the value `statistics()` returns is byte-identical to what
`ducklake_table_info` reports.

Marked `Precision::Inexact`. DataFusion documents `total_byte_size` as
the "uncompressed Arrow output" size, while what the catalog tracks is
*compressed parquet bytes*. For wide types (e.g. `List(Float64)`
embedding columns) the two are nearly identical; for narrow scalar
schemas the on-disk number is 3-5× smaller than Arrow output. Reporting
compressed bytes Inexact gives consumers a useful lower-bound estimate
without misleading the optimizer into trusting it as exact Arrow size.

The implementation is purely in-memory iteration over already-cached
metadata — sub-microsecond, zero I/O, zero new SQL. By the time
`statistics()` is callable, `DuckLakeTable::new` has already loaded
every file's metadata via `MetadataProvider::get_table_files_for_select`
(see table.rs:154-156).

Future enrichment paths the same `Statistics` shell supports without
any further API surface change:

- `num_rows` — extend per-backend SELECTs to project `record_count`
  (already a column in `ducklake_data_file`), add it to
  `DuckLakeFileData`, sum on read.
- `column_statistics[i].{min,max,null_count}` — DuckLake's catalog
  tracks per-column stats; plumbing each unlocks predicate pushdown.

`Precision::Absent` cleanly upgrades to `Exact`/`Inexact` with no
breakage.

Tests:

- `test_statistics_total_byte_size_matches_catalog_aggregate` — creates
  a populated DuckLake-backed table, asserts `statistics().total_byte_size`
  equals `SUM(file_size_bytes) - SUM(delete_file_size_bytes)` from the
  same catalog, byte-for-byte.
- `test_statistics_zero_for_empty_table` — empty tables report 0 bytes
  rather than `Absent`.
@anoop-narang anoop-narang force-pushed the feat/duckake-table-statistics branch from 1a7d803 to f8b37ad Compare May 5, 2026 07:48
@anoop-narang anoop-narang merged commit c74a717 into main May 5, 2026
3 checks passed
@anoop-narang anoop-narang deleted the feat/duckake-table-statistics branch May 5, 2026 08:26