-
Notifications
You must be signed in to change notification settings - Fork 686
Description
What happened?
On the DuckDB backend, using a window function followed by filter() and drop() can produce SQL that still returns the dropped column via SELECT <alias>.*.
DuckDB then returns more columns than Ibis expects, and execution fails with:
ValueError: schema names don't match input data columns
Minimal Reproducible Example
import ibis
from ibis import _
import datetime as dt
con = ibis.duckdb.connect()
# base table
con.create_table(
"contrib",
ibis.memtable(
[
{"contact_id": 1, "contribution_id": "c1", "dt": dt.date(2024, 1, 3), "part_of_mandate": False},
{"contact_id": 1, "contribution_id": "c2", "dt": dt.date(2024, 1, 2), "part_of_mandate": False},
{"contact_id": 2, "contribution_id": "c3", "dt": dt.date(2024, 1, 2), "part_of_mandate": False},
],
schema={
"contact_id": "int64",
"contribution_id": "string",
"dt": "date",
"part_of_mandate": "boolean",
},
),
overwrite=True,
)
# joined table
con.create_table(
"comm",
ibis.memtable(
[
{"contact_id": 1, "sent": dt.datetime(2024, 1, 1, 0, 0), "campaign_code": "X"},
{"contact_id": 2, "sent": dt.datetime(2024, 1, 1, 0, 0), "campaign_code": "X"},
],
schema={
"contact_id": "int64",
"sent": "timestamp",
"campaign_code": "string",
},
),
overwrite=True,
)
# pipeline
t = (
con.table("contrib")
.mutate(uid=ibis.row_number().over(order_by=[_.contribution_id]) - 1)
.filter(~_.part_of_mandate)
)
t = (
t.join(con.table("comm"), t.contact_id == _.contact_id)
.filter(_.sent <= _.dt)
)
q = (
t.mutate(
rn=ibis.row_number().over(
group_by=[_.uid],
order_by=[_.sent.desc()],
) - 1
)
.filter(_.rn == 0)
.drop(_.rn)
)
q.execute() # ❌ ValueError: schema names don't match input data columnsExpected behavior
q.execute() should succeed and return all columns except the temporary window column rn.
Actual behavior
Execution fails with:
ValueError: schema names don't match input data columns
Debugging shows:
q.schema()does not includernDESCRIBE SELECT * FROM (<compiled query>)does includern- the compiled SQL ends with
SELECT <alias>.*, so the dropped column is still returned
Debug evidence
print(q.schema())
print(con.raw_sql(f"DESCRIBE SELECT * FROM ({q.compile()}) q").fetchall())Environment
ibis-framework == 11.0.0
duckdb == 1.4.3
pyarrow == 22.0.0
pyarrow-hotfix == 0.7
python == 3.x
Workaround
Force an explicit final projection instead of relying on drop():
cols = [c for c in q.columns if c != "rn"]
q = q.select([_[c] for c in cols])Notes
- This did not fail with older versions of the packages.
- The issue appears specific to window functions combined with
drop()on the DuckDB backend. - DuckDB behaves correctly given the generated SQL; this appears to be an Ibis backend codegen issue.
What version of ibis are you using?
├── ibis-framework[duckdb, mysql, polars] v11.0.0
│ ├── duckdb v1.4.3 (extra: duckdb)
│ ├── numpy v2.3.5 (extra: duckdb)
│ ├── packaging v25.0 (extra: duckdb)
│ ├── pandas v2.3.3 (extra: duckdb)
│ ├── pyarrow v22.0.0 (extra: duckdb)
│ ├── pyarrow-hotfix v0.7 (extra: duckdb)
│ ├── rich v14.2.0 (extra: duckdb)
What backend(s) are you using, if any?
duckdb
Relevant log output
>>> %run -- "/home/oliver/git/ibis_testing/ibis_MWE.py"
ValueError: schema names don't match input data columns
File ~/git/ibis_testing/ibis_MWE.py:71
64 q = (
65 t.mutate(rn=ibis.row_number().over(group_by=[_.uid], order_by=[_.sent.desc()]))
66 .filter(_.rn == 0)
67 .drop(_.rn)
68 )
70 q.compile()
---> 71 q.execute() # ❌ ValueError: schema names don't match input data columns (rn returned despite drop)
Show Traceback
>>> q.compile()
'SELECT "t9".* FROM (SELECT * FROM (SELECT "t7"."contact_id", "t7"."contribution_id", "t7"."dt", "t7"."part_of_mandate", "t7"."uid", "t7"."contact_id_right", "t7"."sent", "t7"."campaign_code", ROW_NUMBER() OVER (PARTITION BY "t7"."uid" ORDER BY "t7"."sent" DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) - 1 AS "rn" FROM (SELECT * FROM (SELECT "t5"."contact_id", "t5"."contribution_id", "t5"."dt", "t5"."part_of_mandate", "t5"."uid", "t2"."contact_id" AS "contact_id_right", "t2"."sent", "t2"."campaign_code" FROM (SELECT * FROM (SELECT "t1"."contact_id", "t1"."contribution_id", "t1"."dt", "t1"."part_of_mandate", ROW_NUMBER() OVER (ORDER BY "t1"."contribution_id" ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) - 1 AS "uid" FROM "contrib" AS "t1") AS "t3" WHERE NOT ("t3"."part_of_mandate")) AS "t5" INNER JOIN "comm" AS "t2" ON "t5"."contact_id" = "t5"."contact_id") AS "t6" WHERE "t6"."sent" <= "t6"."dt") AS "t7") AS "t8" WHERE "t8"."rn" = 0) AS "t9"'Code of Conduct
- I agree to follow this project's Code of Conduct
Metadata
Metadata
Assignees
Labels
Type
Projects
Status