Description
Describe the bug
Filtering a Daft DataFrame read from a Lance dataset and then calling show() with a limit can return no rows even though a matching row exists. It appears that the limit is being applied before the filter, which is not the intended behavior.
Notably, this issue does NOT occur with other data formats such as Arrow; it occurs only with Lance.
To Reproduce
  import tempfile
  import os
  import pyarrow as pa
  import lance
  import daft
  
  TABLE_NAME = "my_table"
  data = {
      "vector": [[1.1, 1.2], [0.2, 1.8]], 
      "lat": [45.5, 40.1], 
      "long": [-122.7, -74.1], 
      "id": [1, 2]
  }
  
  with tempfile.TemporaryDirectory() as tmp_dir:
      lance_path = os.path.join(tmp_dir, TABLE_NAME)
      
      arrow_table = pa.Table.from_pydict(data)
      lance.write_dataset(arrow_table, lance_path)
      daft_df = daft.read_lance(lance_path)
      
      # This works correctly
      daft_df.filter("id = 1").show(1)
      
      # This should show 1 row but shows none
      daft_df.filter("id = 2").show(1)
      
      # This works correctly when limit is larger than result count
      daft_df.filter("id = 2").show(2)
╭───────────────┬─────────┬─────────┬───────╮
│ vector        ┆ lat     ┆ long    ┆ id    │
│ ---           ┆ ---     ┆ ---     ┆ ---   │
│ List[Float64] ┆ Float64 ┆ Float64 ┆ Int64 │
╞═══════════════╪═════════╪═════════╪═══════╡
│ [1.1, 1.2]    ┆ 45.5    ┆ -122.7  ┆ 1     │
╰───────────────┴─────────┴─────────┴───────╯
(Showing first 1 rows)
╭───────────────┬─────────┬─────────┬───────╮
│ vector        ┆ lat     ┆ long    ┆ id    │
│ ---           ┆ ---     ┆ ---     ┆ ---   │
│ List[Float64] ┆ Float64 ┆ Float64 ┆ Int64 │
╞═══════════════╪═════════╪═════════╪═══════╡
╰───────────────┴─────────┴─────────┴───────╯
(No data to display: Materialized dataframe has no rows)
╭───────────────┬─────────┬─────────┬───────╮
│ vector        ┆ lat     ┆ long    ┆ id    │
│ ---           ┆ ---     ┆ ---     ┆ ---   │
│ List[Float64] ┆ Float64 ┆ Float64 ┆ Int64 │
╞═══════════════╪═════════╪═════════╪═══════╡
│ [0.2, 1.8]    ┆ 40.1    ┆ -74.1   ┆ 2     │
╰───────────────┴─────────┴─────────┴───────╯
Expected behavior
Expected output:
- All three filter operations should return the matching row when id=1 or id=2
- The second show(1) should display the row with id=2
 
Actual output:
- The first filter (id=1) with show(1) works correctly
- The second filter (id=2) with show(1) shows "No data to display"
- The third filter (id=2) with show(2) works correctly
 
Component(s)
Expressions
Additional context
This behavior only occurs when reading from Lance datasets. When using other data formats with Daft, the filtering and show() behavior works as expected, applying the filter first before limiting results in show().
Environment:
daft: latest version
lance: latest version
pyarrow: compatible version
It seems the limit parameter in show() is being applied before the filter when working with Lance datasets, which is the opposite of the expected behavior.
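To illustrate why the ordering hypothesis fits all three outputs, here is a small pure-Python model of the two possible orders. It is only a sketch of the suspected behavior, not Daft's or Lance's actual implementation; the helper names (`filter_then_limit`, `limit_then_filter`, `ids`) are hypothetical, and the rows mirror the reproduction data with the `vector` column omitted.

```python
# Toy model of the two evaluation orders (not Daft or Lance internals).
ROWS = [
    {"id": 1, "lat": 45.5, "long": -122.7},
    {"id": 2, "lat": 40.1, "long": -74.1},
]


def filter_then_limit(rows, predicate, n):
    """Expected order: keep matching rows, then take the first n of them."""
    return [row for row in rows if predicate(row)][:n]


def limit_then_filter(rows, predicate, n):
    """Suspected order: take the first n rows, then filter what is left."""
    return [row for row in rows[:n] if predicate(row)]


def ids(rows):
    """Return just the id values, for compact comparison."""
    return [row["id"] for row in rows]


if __name__ == "__main__":
    # Same three cases as the reproduction: (id to match, limit passed to show()).
    for target, n in [(1, 1), (2, 1), (2, 2)]:
        predicate = lambda row, t=target: row["id"] == t
        expected = ids(filter_then_limit(ROWS, predicate, n))
        suspected = ids(limit_then_filter(ROWS, predicate, n))
        print(f"id = {target}, limit {n}: expected {expected}, suspected order gives {suspected}")
```

In this model the suspected order differs from the expected one only in the second case (id = 2 with a limit of 1), which is the same case that prints "No data to display" above.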