Summary
When tpchgen-cli output is registered into an Iceberg table via PyIceberg's add_files() (a common pattern for fast TPC-H benchmark setup against object stores), the resulting tables are unreadable by Polars' native Iceberg scanner. The fix on the user side is to rewrite every Parquet file post-generation to inject PARQUET:field_id into the schema metadata, which negates much of the speed advantage of using tpchgen-cli in the first place.
Background
The Iceberg spec allows Parquet files without embedded field IDs as long as the table carries a schema.name-mapping.default property. PyIceberg's add_files() honours this, and the PyIceberg and DuckDB readers resolve columns by name when IDs are absent. Polars' native (Rust) Iceberg reader does not: it reads field IDs directly from the Parquet thrift footer and fails with SchemaFieldNotFoundError: failed to load 'PARQUET:field_id' ... metadata was None. The bug is tracked in pola-rs/polars#24915 and remains open; the documented workaround is reader_override="pyiceberg", which is significantly slower and isn't intended for production use.
For users benchmarking TPC-H on Iceberg with Polars as one of the engines, this means tpchgen-cli output currently can't be used directly — every file must be rewritten through PyArrow or DuckDB's COPY ... (FIELD_IDS {...}) before upload, which doubles generation time and IO.
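For reference, the PyArrow rewrite mentioned above looks roughly like the sketch below. It relies on PyArrow writing a field's `PARQUET:field_id` metadata entry as the Parquet field ID. The function names are hypothetical, and numbering the columns 1..N in file order is an assumption: the IDs must match whatever the Iceberg table schema actually uses.

```python
def field_id_metadata(field_id: int) -> dict[bytes, bytes]:
    """Arrow field metadata that PyArrow's Parquet writer stores as the field ID."""
    return {b"PARQUET:field_id": str(field_id).encode()}


def rewrite_with_field_ids(src: str, dst: str) -> None:
    """Copy a flat Parquet file, tagging columns with IDs 1..N (assumed numbering)."""
    # Imported here so the module still loads where pyarrow is not installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pq.read_table(src)  # loads the whole file into memory
    fields = [
        field.with_metadata({**(field.metadata or {}), **field_id_metadata(i)})
        for i, field in enumerate(table.schema, start=1)
    ]
    schema = pa.schema(fields, metadata=table.schema.metadata)
    # Rebuild the table around the same column data with the annotated schema.
    pq.write_table(pa.Table.from_arrays(table.columns, schema=schema), dst)
```

Every file is read and written a second time, which is the extra time and IO described above.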
Proposed feature
A flag, e.g. --iceberg-field-ids, that writes Iceberg-compatible field IDs into the Parquet schema metadata. The TPC-H schema is fixed, so the field-ID assignment for each table is deterministic and can be hardcoded — no user configuration needed. Field IDs should be embedded both in the Parquet thrift schema (for native readers like Polars) and ideally surfaced in the Arrow schema metadata as PARQUET:field_id (for any consumer reading via Arrow).
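To illustrate how small the hardcoded mapping would be, here is a sketch for two of the eight tables. The `TPCH_COLUMNS` and `field_ids` names are hypothetical, and the lowercase column names and the 1..N numbering in column order are assumptions. The real assignment would need to match the IDs in the Iceberg table schema the files are registered against.

```python
# Column order follows the TPC-H specification; only two tables are shown.
TPCH_COLUMNS = {
    "nation": ["n_nationkey", "n_name", "n_regionkey", "n_comment"],
    "region": ["r_regionkey", "r_name", "r_comment"],
}


def field_ids(table: str) -> dict[str, int]:
    """Assumed convention: IDs 1..N per table, in column order."""
    return {name: i for i, name in enumerate(TPCH_COLUMNS[table], start=1)}


print(field_ids("nation"))
# {'n_nationkey': 1, 'n_name': 2, 'n_regionkey': 3, 'n_comment': 4}
```

Because the mapping depends only on the table name, every run and every output file for a table gets the same IDs.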
DuckDB's FIELD_IDS COPY option is a useful reference for the on-disk format.
Why this belongs in tpchgen-cli
It does not :) I don't think Polars will fix it, though.