What happened + What you expected to happen
On a ray.data.read_iceberg dataset, calling .count() after .select_columns([...])
returns 0 even though the table is fully populated. It is a count()-only
bug; the data is correct (take/take_all/iter_batches/to_arrow_refs/materialize
all return the full rows). Only the lazy count() is wrong.
Likely Root cause
Dataset.count() optimizes by projecting to zero columns:
Count(Project(dag, exprs=[])) (ray/data/dataset.py), to avoid carrying data.
ProjectionPushdown fuses that empty projection into the iceberg Read, setting
the IcebergDatasource's selected_fields=().
- pyiceberg's table.scan(selected_fields=()) returns 0 rows for an empty field
  selection (not "N rows × 0 columns").
- This path is only reached because Project.infer_metadata() returns
  num_rows=None (it doesn't propagate the input row count, even though a column
  projection has can_modify_num_rows=False). So Dataset._meta_count() can't
  short-circuit and falls into the zero-column count path above.
A plain read_iceberg(...).count() (no select_columns) is correct, because
_meta_count() returns Read.infer_metadata().num_rows (the iceberg manifest
record_count) and never builds the zero-column count.
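
To make the mechanism above concrete, here is a minimal toy model of the two count paths. It does not use Ray or pyiceberg: ToySource, lazy_count and materialized_count are hypothetical names, and the "empty selection returns 0 rows" behaviour is assumed from the description above, not taken from pyiceberg's source.

```python
# Toy model of the count path described above. All names here are
# hypothetical stand-ins, not Ray or pyiceberg APIs.
from typing import Optional, Sequence


class ToySource:
    """Mimics a source that returns no rows for an empty field selection."""

    def __init__(self, rows: list[dict]):
        self._rows = rows

    def scan(self, selected_fields: Optional[Sequence[str]] = None) -> list[dict]:
        if selected_fields is None:
            # No projection: return every row with every column.
            return [dict(r) for r in self._rows]
        if len(selected_fields) == 0:
            # Assumed behaviour from the report: empty selection -> 0 rows,
            # rather than "N rows x 0 columns".
            return []
        return [{k: r[k] for k in selected_fields} for r in self._rows]


def lazy_count(source: ToySource) -> int:
    # Models the zero-column projection being pushed into the source.
    return len(source.scan(selected_fields=()))


def materialized_count(source: ToySource, columns: Sequence[str]) -> int:
    # Models reading the selected columns first, then counting the rows.
    return len(source.scan(selected_fields=columns))


rows = [{"id": i, "country": "US"} for i in range(100)]
src = ToySource(rows)
print(lazy_count(src))                       # 0
print(materialized_count(src, ["country"]))  # 100
```

The two results mirror the 0 vs. 100 seen in the reproduction script: the data path is fine, and only the zero-column shortcut loses the rows.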
Suggested fix (either is sufficient)
- Make Project.infer_metadata() propagate num_rows from its input when
  can_modify_num_rows is False. Then _meta_count() short-circuits and the
  zero-column count is never reached.
- And/or don't push an empty projection into sources where empty
  selected_fields means "no rows" (or have IcebergDatasource treat empty
  selected_fields as "all rows, no columns").
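
A rough sketch of the first option, using toy classes rather than Ray's real logical operators. ToyMetadata, ToyRead, ToyProject and meta_count are hypothetical names chosen for illustration; only the idea (pass num_rows through when the operator cannot change the row count) comes from the suggestion above.

```python
# Sketch of "propagate num_rows through a row-preserving projection".
# These classes are illustrative stand-ins, not Ray's implementation.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMetadata:
    num_rows: Optional[int]


class ToyRead:
    def __init__(self, record_count: int):
        self._record_count = record_count

    def infer_metadata(self) -> ToyMetadata:
        # Stands in for the row count known from table metadata.
        return ToyMetadata(num_rows=self._record_count)


class ToyProject:
    def __init__(self, input_op, can_modify_num_rows: bool = False):
        self.input_op = input_op
        self.can_modify_num_rows = can_modify_num_rows

    def infer_metadata(self) -> ToyMetadata:
        if not self.can_modify_num_rows:
            # A pure column projection keeps the row count, so pass it on.
            return ToyMetadata(num_rows=self.input_op.infer_metadata().num_rows)
        # Otherwise the row count is unknown until execution.
        return ToyMetadata(num_rows=None)


def meta_count(op) -> Optional[int]:
    # Returns a count from metadata alone, or None if execution is needed.
    return op.infer_metadata().num_rows


plan = ToyProject(ToyRead(record_count=100))
print(meta_count(plan))  # 100
```

With the count available from metadata, a caller in this model never needs to fall back to the zero-column read at all.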
Workaround
Count a projected dataset via ds.select_columns(...).materialize().count()
(or any operation that actually reads the data, e.g. len(take_all())),
not the lazy .count().
Versions / Dependencies
- ray[data] 2.55.1
- pyiceberg 0.11.102
- Python 3.10
Reproduction script
import tempfile, pathlib
import pyarrow as pa
import ray
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, LongType, StringType
tmp = tempfile.mkdtemp()
wh = pathlib.Path(tmp) / "wh"; wh.mkdir()
cat = SqlCatalog("local", uri=f"sqlite:///{tmp}/c.db", warehouse=f"file://{wh}")
cat.create_namespace("db")
tbl = cat.create_table("db.t", schema=Schema(
    NestedField(1, "id", LongType(), required=False),
    NestedField(2, "country", StringType(), required=False),
))
tbl.append(pa.table({"id": pa.array(range(100), pa.int64()),
                     "country": pa.array(["US"] * 100, pa.string())}))
ck = {"name": "local", "uri": f"sqlite:///{tmp}/c.db", "warehouse": f"file://{wh}"}
ds = ray.data.read_iceberg(table_identifier="db.t", catalog_kwargs=ck)
print(ds.count()) # 100 (correct)
print(ds.select_columns(["country"]).count()) # 0 <-- BUG
print(ds.select_columns(["country"]).materialize().count()) # 100 (correct)
print(len(ds.select_columns(["country"]).take_all())) # 100 (data is fine)
Expected: select_columns(["country"]).count() == 100
Actual: 0
Issue Severity
Medium: It is a significant difficulty but I can work around it.