read_iceberg(...).select_columns([...]).count() returns 0 #63913

@dgshep

What happened + What you expected to happen

On a ray.data.read_iceberg dataset, calling .count() after .select_columns([...])
returns 0 even though the table is fully populated. Only the lazy count() is
affected; the data itself is correct (take/take_all/iter_batches/to_arrow_refs/
materialize all return the full rows).

Likely root cause

  1. Dataset.count() optimizes by projecting to zero columns:
    Count(Project(dag, exprs=[])) (ray/data/dataset.py), to avoid carrying data.
  2. ProjectionPushdown fuses that empty projection into the iceberg Read, setting
    the IcebergDatasource's selected_fields=().
  3. pyiceberg's table.scan(selected_fields=()) returns 0 rows for an empty field
    selection (not "N rows × 0 columns").
  4. This path is only reached because Project.infer_metadata() returns
    num_rows=None: it doesn't propagate the input row count, even though a
    column projection has can_modify_num_rows=False. So Dataset._meta_count()
    can't short-circuit and falls into the zero-column count path above.

A plain read_iceberg(...).count() (no select_columns) is correct, because
_meta_count() returns Read.infer_metadata().num_rows (the iceberg manifest
record_count) and never builds the zero-column count.
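
To make the chain above concrete, here is a toy model in plain Python. It does not call Ray or pyiceberg; toy_scan and toy_count are hypothetical stand-ins I made up for illustration, and the only behavior they encode is what steps 1 to 4 describe (an empty field selection yields zero rows, and the count falls back to a zero-column read when the row count is unknown).

```python
from typing import Optional, Sequence

# Toy table: 100 rows, two columns (mirrors the reproduction script).
ROWS = [{"id": i, "country": "US"} for i in range(100)]


def toy_scan(selected_fields: Optional[Sequence[str]] = None) -> list:
    """Hypothetical source that, like the scan described in step 3,
    yields no rows at all when the field selection is empty."""
    if selected_fields is None:
        return [dict(r) for r in ROWS]
    if len(selected_fields) == 0:
        return []  # "0 rows", not "N rows x 0 columns"
    return [{k: r[k] for k in selected_fields} for r in ROWS]


def toy_count(known_num_rows: Optional[int]) -> int:
    """Hypothetical count path: use metadata when the row count is known,
    otherwise count the rows of a zero-column read."""
    if known_num_rows is not None:
        return known_num_rows  # metadata short-circuit
    # Steps 1-2: the empty projection ends up as selected_fields=().
    return len(toy_scan(selected_fields=()))


# Plain read: the source knows its row count, so the short-circuit applies.
plain_read = toy_count(known_num_rows=len(ROWS))
# After a projection that reports num_rows=None (step 4), the fallback runs.
after_select = toy_count(known_num_rows=None)
print(plain_read, after_select)
```

In this model the data path (toy_scan(["country"])) still returns every row, which matches the observation that only count() is wrong.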

Suggested fix (either is sufficient)

  • Make Project.infer_metadata() propagate num_rows from its input when
    can_modify_num_rows is False. Then _meta_count() short-circuits and the
    zero-column count is never reached.
  • And/or don't push an empty projection into sources where empty selected_fields
    means "no rows" (or have IcebergDatasource treat empty selected_fields as
    "all rows, no columns").
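
A rough sketch of both ideas, again as a toy model rather than a patch against Ray. ToyRead, ToyProject, infer_num_rows and should_push_projection are hypothetical names chosen for illustration; the real operators and optimizer rule have different interfaces.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ToyRead:
    """Hypothetical source operator whose row count is known from metadata."""
    num_rows: Optional[int]

    def infer_num_rows(self) -> Optional[int]:
        return self.num_rows


@dataclass
class ToyProject:
    """Hypothetical projection operator over a single input."""
    input_op: object
    can_modify_num_rows: bool = False

    def infer_num_rows(self) -> Optional[int]:
        # First suggested fix: a projection that cannot change the row
        # count forwards whatever its input already knows.
        if not self.can_modify_num_rows:
            return self.input_op.infer_num_rows()
        return None


def should_push_projection(columns: Sequence[str]) -> bool:
    """Second suggested fix, sketched: never push an empty column list
    into a source that would read it as "no rows"."""
    return len(columns) > 0


projected = ToyProject(ToyRead(num_rows=100))
print(projected.infer_num_rows())
print(should_push_projection([]), should_push_projection(["country"]))
```

With the first change alone, the row count is available from metadata and the zero-column read is never issued; the second change guards the source for any other caller that builds an empty projection.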

Workaround

Count a projected dataset via ds.select_columns(...).materialize().count()
(or any data op), not lazy .count().
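
The workaround can be wrapped in a small helper. count_projected is a hypothetical name; it only uses the select_columns, materialize and count calls already shown in this report, and it assumes the projected dataset is small enough to materialize.

```python
def count_projected(ds, columns):
    """Count rows of a column-projected dataset without the lazy count() path.

    `ds` is expected to be a Ray Dataset (for example, the result of
    ray.data.read_iceberg); `columns` is the list passed to select_columns.
    """
    projected = ds.select_columns(columns)
    # materialize() executes the read first, so count() runs on real blocks
    # instead of on a zero-column projection pushed into the source.
    return projected.materialize().count()
```

For the reproduction script below, count_projected(ds, ["country"]) should agree with ds.count().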

Versions / Dependencies

  • ray[data] 2.55.1
  • pyiceberg 0.11.102
  • Python 3.10

Reproduction script

import tempfile, pathlib
import pyarrow as pa
import ray
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, LongType, StringType

tmp = tempfile.mkdtemp()
wh = pathlib.Path(tmp) / "wh"; wh.mkdir()
cat = SqlCatalog("local", uri=f"sqlite:///{tmp}/c.db", warehouse=f"file://{wh}")
cat.create_namespace("db")
tbl = cat.create_table("db.t", schema=Schema(
    NestedField(1, "id", LongType(), required=False),
    NestedField(2, "country", StringType(), required=False),
))
tbl.append(pa.table({"id": pa.array(range(100), pa.int64()),
                     "country": pa.array(["US"] * 100, pa.string())}))

ck = {"name": "local", "uri": f"sqlite:///{tmp}/c.db", "warehouse": f"file://{wh}"}
ds = ray.data.read_iceberg(table_identifier="db.t", catalog_kwargs=ck)

print(ds.count())                                            # 100  (correct)
print(ds.select_columns(["country"]).count())                # 0    <-- BUG
print(ds.select_columns(["country"]).materialize().count())  # 100  (correct)
print(len(ds.select_columns(["country"]).take_all()))        # 100  (data is fine)

Expected: select_columns(["country"]).count() == 100
Actual: 0

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

  • bug: Something that is supposed to be working; but isn't
  • community-backlog
  • data: Ray Data-related issues
  • stability
  • triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Project status

In progress