Skip to content

bug: DuckDB backend returns dropped window column after drop(), causing Arrow schema mismatch #11801

@glilienthal

Description

@glilienthal

What happened?

On the DuckDB backend, using a window function followed by filter() and drop() can produce SQL that still returns the dropped column via SELECT <alias>.*.

DuckDB then returns more columns than Ibis expects, and execution fails with:

ValueError: schema names don't match input data columns

Minimal Reproducible Example

import ibis
from ibis import _
import datetime as dt

con = ibis.duckdb.connect()

# base table
con.create_table(
    "contrib",
    ibis.memtable(
        [
            {"contact_id": 1, "contribution_id": "c1", "dt": dt.date(2024, 1, 3), "part_of_mandate": False},
            {"contact_id": 1, "contribution_id": "c2", "dt": dt.date(2024, 1, 2), "part_of_mandate": False},
            {"contact_id": 2, "contribution_id": "c3", "dt": dt.date(2024, 1, 2), "part_of_mandate": False},
        ],
        schema={
            "contact_id": "int64",
            "contribution_id": "string",
            "dt": "date",
            "part_of_mandate": "boolean",
        },
    ),
    overwrite=True,
)

# joined table
con.create_table(
    "comm",
    ibis.memtable(
        [
            {"contact_id": 1, "sent": dt.datetime(2024, 1, 1, 0, 0), "campaign_code": "X"},
            {"contact_id": 2, "sent": dt.datetime(2024, 1, 1, 0, 0), "campaign_code": "X"},
        ],
        schema={
            "contact_id": "int64",
            "sent": "timestamp",
            "campaign_code": "string",
        },
    ),
    overwrite=True,
)

# pipeline
t = (
    con.table("contrib")
    .mutate(uid=ibis.row_number().over(order_by=[_.contribution_id]) - 1)
    .filter(~_.part_of_mandate)
)

t = (
    t.join(con.table("comm"), t.contact_id == _.contact_id)
     .filter(_.sent <= _.dt)
)

q = (
    t.mutate(
        rn=ibis.row_number().over(
            group_by=[_.uid],
            order_by=[_.sent.desc()],
        ) - 1
    )
    .filter(_.rn == 0)
    .drop(_.rn)
)

q.execute()  # ❌ ValueError: schema names don't match input data columns

Expected behavior

q.execute() should succeed and return all columns except the temporary window column rn.


Actual behavior

Execution fails with:

ValueError: schema names don't match input data columns

Debugging shows:

  • q.schema() does not include rn
  • DESCRIBE SELECT * FROM (<compiled query>) does include rn
  • the compiled SQL ends with SELECT <alias>.*, so the dropped column is still returned

Debug evidence

print(q.schema())
print(con.raw_sql(f"DESCRIBE SELECT * FROM ({q.compile()}) q").fetchall())

Environment

ibis-framework == 11.0.0
duckdb         == 1.4.3
pyarrow        == 22.0.0
pyarrow-hotfix == 0.7
python         == 3.x

Workaround

Force an explicit final projection instead of relying on drop():

cols = [c for c in q.columns if c != "rn"]
q = q.select([_[c] for c in cols])

Notes

  • This did not fail with older versions of the packages.
  • The issue appears specific to window functions combined with drop() on the DuckDB backend.
  • DuckDB behaves correctly given the generated SQL; this appears to be an Ibis backend codegen issue.

What version of ibis are you using?

├── ibis-framework[duckdb, mysql, polars] v11.0.0
│ ├── duckdb v1.4.3 (extra: duckdb)
│ ├── numpy v2.3.5 (extra: duckdb)
│ ├── packaging v25.0 (extra: duckdb)
│ ├── pandas v2.3.3 (extra: duckdb)
│ ├── pyarrow v22.0.0 (extra: duckdb)
│ ├── pyarrow-hotfix v0.7 (extra: duckdb)
│ ├── rich v14.2.0 (extra: duckdb)

What backend(s) are you using, if any?

duckdb

Relevant log output

>>> %run -- "/home/oliver/git/ibis_testing/ibis_MWE.py"
ValueError: schema names don't match input data columns
File ~/git/ibis_testing/ibis_MWE.py:71
     64 q = (
     65     t.mutate(rn=ibis.row_number().over(group_by=[_.uid], order_by=[_.sent.desc()]))
     66     .filter(_.rn == 0)
     67     .drop(_.rn)
     68 )
     70 q.compile()
---> 71 q.execute()  # ❌ ValueError: schema names don't match input data columns (rn returned despite drop)
Show Traceback



>>> q.compile()
'SELECT "t9".* FROM (SELECT * FROM (SELECT "t7"."contact_id", "t7"."contribution_id", "t7"."dt", "t7"."part_of_mandate", "t7"."uid", "t7"."contact_id_right", "t7"."sent", "t7"."campaign_code", ROW_NUMBER() OVER (PARTITION BY "t7"."uid" ORDER BY "t7"."sent" DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) - 1 AS "rn" FROM (SELECT * FROM (SELECT "t5"."contact_id", "t5"."contribution_id", "t5"."dt", "t5"."part_of_mandate", "t5"."uid", "t2"."contact_id" AS "contact_id_right", "t2"."sent", "t2"."campaign_code" FROM (SELECT * FROM (SELECT "t1"."contact_id", "t1"."contribution_id", "t1"."dt", "t1"."part_of_mandate", ROW_NUMBER() OVER (ORDER BY "t1"."contribution_id" ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) - 1 AS "uid" FROM "contrib" AS "t1") AS "t3" WHERE NOT ("t3"."part_of_mandate")) AS "t5" INNER JOIN "comm" AS "t2" ON "t5"."contact_id" = "t5"."contact_id") AS "t6" WHERE "t6"."sent" <= "t6"."dt") AS "t7") AS "t8" WHERE "t8"."rn" = 0) AS "t9"'

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugIncorrect behavior inside of ibis

    Type

    No type

    Projects

    Status

    backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions