feat(table): implement TableProvider::statistics() with byte-size aggregate #112
Merged
DataFusion's `TableProvider::statistics()` is the canonical hook the
optimizer calls to ask "what does this table look like" — used for join
ordering, cardinality estimation, partition pruning decisions, etc.
`DuckLakeTable` previously inherited the trait's default `None`, so the
optimizer planned every DuckLake query without any cost-based input.
Override the method to populate `Statistics.total_byte_size` from the
per-file byte sizes already cached on `DuckLakeTable.table_files`. The
math mirrors DuckLake's own `ducklake_table_info` aggregate exactly:
    total_byte_size == SUM(data_file.file_size_bytes)
                     - SUM(delete_file.file_size_bytes)
Both sides read the same `ducklake_data_file` rows from the catalog,
just expressed via different APIs — so for any (schema, table, snapshot)
the value `statistics()` returns is byte-identical to what
`ducklake_table_info` reports.
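A minimal sketch of the aggregate described above, assuming a simplified per-file record. `FileEntry` and its field names are hypothetical stand-ins for the entries cached on `DuckLakeTable.table_files`, not the crate's actual types, and the saturating subtraction is an illustrative choice rather than something the commit states.

```rust
/// Hypothetical stand-in for one cached entry of `DuckLakeTable.table_files`.
/// Field names are illustrative, not the crate's actual struct layout.
#[derive(Debug, Clone)]
pub struct FileEntry {
    /// Compressed Parquet size of the data file, in bytes.
    pub data_file_size_bytes: u64,
    /// Size of the attached delete file, if the data file has one.
    pub delete_file_size_bytes: Option<u64>,
}

/// SUM(data_file.file_size_bytes) - SUM(delete_file.file_size_bytes),
/// computed in one pass over already-cached metadata (no I/O).
pub fn total_byte_size(files: &[FileEntry]) -> u64 {
    let data: u64 = files.iter().map(|f| f.data_file_size_bytes).sum();
    let deletes: u64 = files
        .iter()
        .filter_map(|f| f.delete_file_size_bytes)
        .sum();
    // Assumption: saturate rather than underflow if delete files ever
    // outweigh data files.
    data.saturating_sub(deletes)
}
```

An empty slice sums to 0, which matches the empty-table behaviour the tests describe.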
The value is marked `Precision::Inexact`. DataFusion documents
`total_byte_size` as the "uncompressed Arrow output" size, while the
catalog tracks *compressed parquet bytes*. For wide types (e.g.
`List(Float64)` embedding columns) the two are nearly identical; for
narrow scalar schemas the on-disk number is 3-5× smaller than the Arrow
output. Reporting compressed bytes as `Inexact` gives consumers a useful
lower-bound estimate without misleading the optimizer into treating it
as the exact Arrow size.
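To illustrate what the `Inexact` marking buys a consumer, here is a rough sketch using a local enum that mirrors only the three variants of DataFusion's `Precision<T>`. The helper names (`value`, `exact_value`, `byte_size_statistic`, `left_is_smaller`) are hypothetical and are not DataFusion's API.

```rust
/// Simplified stand-in for DataFusion's `Precision<T>`; only the three
/// variants are mirrored, and the helper methods below are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T> Precision<T> {
    /// The value, whether it is exact or only an estimate.
    pub fn value(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(v),
            Precision::Absent => None,
        }
    }

    /// The value only when it is guaranteed to be exact.
    pub fn exact_value(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) => Some(v),
            _ => None,
        }
    }
}

/// Wraps the catalog's compressed byte total as an estimate, never as a
/// guaranteed Arrow size.
pub fn byte_size_statistic(compressed_bytes: usize) -> Precision<usize> {
    Precision::Inexact(compressed_bytes)
}

/// Toy cost-based decision: is the left input the smaller one?
/// Returns `None` when either side has no size information at all.
pub fn left_is_smaller(
    left: &Precision<usize>,
    right: &Precision<usize>,
) -> Option<bool> {
    Some(left.value()? < right.value()?)
}
```

An estimate is enough to compare two inputs, while code that needs a guaranteed size can check `exact_value()` and will correctly see nothing for the compressed-bytes figure.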
The implementation is purely in-memory iteration over already-cached
metadata — sub-microsecond, zero I/O, zero new SQL. By the time
`statistics()` is callable, `DuckLakeTable::new` has already loaded
every file's metadata via `MetadataProvider::get_table_files_for_select`
(see table.rs:154-156).
Future enrichment paths the same `Statistics` shell supports without
any further API surface change:
- `num_rows` — extend per-backend SELECTs to project `record_count`
(already a column in `ducklake_data_file`), add it to
`DuckLakeFileData`, sum on read.
- `column_statistics[i].{min,max,null_count}` — DuckLake's catalog
tracks per-column stats; plumbing each unlocks predicate pushdown.
`Precision::Absent` cleanly upgrades to `Exact`/`Inexact` with no
breakage.
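The `num_rows` path could be sketched as below, assuming `record_count` arrives as an optional per-file field. `FileEntry`, `num_rows`, and `arrow_bytes_estimate` are hypothetical names, `Precision` is again a local stand-in for DataFusion's enum, and the fixed-width estimate is a rough illustration rather than DataFusion's own calculation.

```rust
/// Local stand-in mirroring the variants of DataFusion's `Precision<T>`.
#[derive(Debug, Clone, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical per-file metadata once `record_count` is projected.
pub struct FileEntry {
    /// `None` models a backend that does not supply the count.
    pub record_count: Option<u64>,
}

/// Sums per-file record counts; reports `Absent` if any file lacks one.
pub fn num_rows(files: &[FileEntry]) -> Precision<u64> {
    // Summing `Option<u64>` yields `None` as soon as one count is missing.
    match files.iter().map(|f| f.record_count).sum::<Option<u64>>() {
        Some(total) => Precision::Exact(total),
        None => Precision::Absent,
    }
}

/// Rough Arrow-side size for a schema of fixed-width columns:
/// rows × bytes per row. Always reported as `Inexact` here because
/// validity bitmaps and variable-width columns are ignored.
pub fn arrow_bytes_estimate(
    rows: &Precision<u64>,
    fixed_row_width_bytes: u64,
) -> Precision<u64> {
    match rows {
        Precision::Exact(n) | Precision::Inexact(n) => {
            Precision::Inexact(n.saturating_mul(fixed_row_width_bytes))
        }
        Precision::Absent => Precision::Absent,
    }
}
```

Falling back to `Absent` when any count is missing keeps the statistic honest: a partial sum would look like a row count but undercount by an unknown amount.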
Tests:
- `test_statistics_total_byte_size_matches_catalog_aggregate` — creates
a populated DuckLake-backed table, asserts `statistics().total_byte_size`
equals `SUM(file_size_bytes) - SUM(delete_file_size_bytes)` from the
same catalog, byte-for-byte.
- `test_statistics_zero_for_empty_table` — empty tables report 0 bytes
rather than `Absent`.
Summary
`DuckLakeTable` previously inherited DataFusion's default `TableProvider::statistics() -> None`, so every query against a DuckLake table was planned without any cost-based input: the optimizer had no idea how big tables were, joined the wrong way around, picked weak aggregation strategies, etc. This PR implements the override using per-file metadata already cached on the table.

The math mirrors DuckLake's own `ducklake_table_info` aggregate exactly: `total_byte_size == SUM(data_file.file_size_bytes) - SUM(delete_file.file_size_bytes)`.

Both sides read the same `ducklake_data_file` rows from the catalog, so for any `(schema, table, snapshot)` the value our `statistics()` returns is byte-identical to what `ducklake_table_info` reports.

The work is purely in-memory iteration over already-cached metadata: sub-microsecond, zero I/O, zero new SQL. By the time `statistics()` is callable, `DuckLakeTable::new` has already loaded every file's metadata via `MetadataProvider::get_table_files_for_select` (see `table.rs:154-156`).

Why `Precision::Inexact`

DataFusion documents `total_byte_size` as the "uncompressed Arrow output size", while what the catalog tracks is compressed parquet bytes. For wide types (`List(Float64)` embeddings) the two are nearly identical; for narrow scalar schemas the on-disk size is 3-5× smaller than the Arrow output. Reporting compressed bytes as `Inexact` gives consumers a useful lower-bound estimate without misleading the optimizer into trusting it as exact Arrow size.

When `record_count` is plumbed into `DuckLakeFileData` (see "Future enrichment" below), a follow-up can populate `num_rows` and use `Statistics::calculate_total_byte_size(&schema)` to derive an Arrow-side estimate.

Validation against `ducklake_table_info`

The committed integration test (`test_statistics_total_byte_size_matches_catalog_aggregate` in `tests/table_tests.rs`) creates a populated DuckLake-backed catalog and asserts `statistics().total_byte_size == SUM(file_size_bytes) - SUM(delete_file_size_bytes)` byte-for-byte using the same `MetadataProvider` API the impl reads from.

For external validation, the same property holds against a real ducklake catalog. In the `ducklake_table_info` output for every table in a Postgres-backed Tigris-storage ducklake catalog (42 tables across 6 schemas, sizes spanning 1 KB to 8 GB), `data_bytes - del_bytes` for each row is exactly what `DuckLakeTable::statistics().total_byte_size` will report after this PR, by construction.
Future enrichment (additive, no API change)
The `Statistics` shell this PR populates supports far more than `total_byte_size`. The following extensions can land as separate PRs without changing `statistics()`'s return type or any caller:

- `num_rows`: extend the per-backend `SELECT` in `metadata_provider_*.rs` to project `record_count` (already a column in `ducklake_data_file`), add it to `DuckLakeFileData`, sum on read. Marked `Precision::Exact` once available; falls back to `Absent` if unsupported. Once known, `Statistics::calculate_total_byte_size(&schema)` can replace the compressed-bytes estimate with an Arrow-side one.
- `column_statistics[i].{min, max, null_count}`: DuckLake's catalog tracks per-column stats per data file; plumbing each unlocks predicate pushdown / partition pruning for free in DataFusion's planner. `Precision::Absent` columns upgrade to `Exact` non-disruptively.

Test plan

- `cargo build --all-features` clean
- `cargo fmt --check` clean
- `cargo clippy --all-features --all-targets -- -D warnings` clean
- `cargo test --all-features`: all tests pass except `test_read_pme_encrypted_parquet`, which is a pre-existing failure on `main` (`Binder Error: parent_column not found in FROM clause`) unrelated to this change. Verified by reproducing on `main` without these commits.
- `test_statistics_total_byte_size_matches_catalog_aggregate` asserts byte-for-byte equality between `statistics()` and the canonical catalog aggregate.
- `test_statistics_zero_for_empty_table` covers the no-data-file edge case.