fix: unify ColumnNotFound for duckdb and pyspark #2493
Conversation
I think I can do some more clean-up of repetitive code. I'll try tomorrow morning.
I made a follow-up PR #2495 with the cleanup :)
thanks for working on this! just got a comment on the `.columns` usage
narwhals/_duckdb/expr.py (Outdated)

```diff
@@ -186,7 +187,14 @@ def from_column_names(
         context: _FullContext,
     ) -> Self:
         def func(df: DuckDBLazyFrame) -> list[duckdb.Expression]:
-            return [col(name) for name in evaluate_column_names(df)]
+            col_names = evaluate_column_names(df)
+            missing_columns = [c for c in col_names if c not in df.columns]
```
`df.columns` comes with overhead unfortunately; I think we should avoid calling it where possible. How much overhead depends on the operation.
I was hoping we could do something like we do for Polars. That is to say, when we do `select` / `with_columns`, we wrap them in `try`/`except`, and in the `except` block we intercept the error message to give a more useful / unified one.
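A minimal sketch of what that interception could look like (a loose approximation, not the actual Polars backend code):

```python
import polars as pl

from narwhals.exceptions import ColumnNotFoundError

def select(self, *exprs):
    # Delegate to the native Polars select; on failure, re-raise a unified
    # Narwhals error, keeping the original exception as the cause.
    try:
        return self._with_native(self.native.select(*exprs))
    except pl.exceptions.ColumnNotFoundError as e:
        raise ColumnNotFoundError(str(e)) from e
```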
Ah interesting, I was not aware.
What is happening in the background in duckdb that causes this overhead? Do you have a link to the docs? (Just want to learn more.)
Also, is it a specific caveat of duckdb? I don't think we should worry about that in spark-like, but I might be wrong.
I will update the code tonight anyway (but of course feel free to add commits to this branch if you need it for today's release).
> `df.columns` comes with overhead unfortunately; I think we should avoid calling it where possible. How much overhead depends on the operation.
@MarcoGorelli could we add that to (#805) and put more of a focus towards it?
I don't think it's documented, but evaluating `.columns` may sometimes require doing a full scan. Example:
```
In [48]: df = pl.DataFrame({'a': rng.integers(0, 10_000, 100_000_000), 'b': rng.integers(0, 10_000, 100_000_000)})

In [49]: rel = duckdb.table('df')
100% ▕████████████████████████████████████████████████████████████▏

In [50]: rel1 = duckdb.sql("""pivot rel on a""")

In [51]: %timeit rel.columns
385 ns ± 7.62 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [52]: %timeit rel1.columns
585 μs ± 3.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
Granted, we don't have `pivot` in the Narwhals lazy API, but a pivot may appear in the history of the relation which someone passes to `nw.from_native`, and the output schema of `pivot` is value-dependent. The same consideration should apply to spark-like.
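To see why, here is a tiny illustration (my own example, not from the thread): the pivoted column names come from the data itself, so duckdb cannot know the schema without scanning.

```python
import duckdb

duckdb.sql("CREATE TABLE t (a VARCHAR, b INTEGER)")
duckdb.sql("INSERT INTO t VALUES ('x', 1), ('y', 2)")
# The output columns are the distinct *values* of `a`, so the schema
# can only be known after reading the data.
print(duckdb.sql("PIVOT t ON a USING sum(b)").columns)  # ['x', 'y']
```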
How do those timings compare to other operations/metadata lookups on the same tables?
`.alias`, for example, is completely non-value-dependent, so that stays fast:

```
In [60]: %timeit rel.alias
342 ns ± 2.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [61]: %timeit rel1.alias
393 ns ± 2.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
narwhals/_spark_like/dataframe.py (Outdated)

```python
try:
    return self._with_native(self.native.select(*new_columns_list))
except AnalysisException as e:
    msg = f"Selected columns not found in the DataFrame.\n\nHint: Did you mean one of these columns: {self.columns}?"
```
Not 100% sure about this error message. I don't think we can access the missing column names at this level, am I missing something?
I think what you've written is great - even though we can't access them, we can still try to be helpful.
I split the test into lazy and eager to simplify the if-else statements a bit. I hope it is a bit more readable?
```python
    return df


if constructor_id == "polars[lazy]":
    msg = r"^e|\"(e|f)\""
```
Before it was `msg = "e|f"`. Now it is a bit stricter.
```python
with pytest.raises(ColumnNotFoundError, match=msg):
    df.select(nw.col("fdfa"))
```
Before, this was not tested for Polars.
```python
    constructor_lazy: ConstructorLazy, request: pytest.FixtureRequest
) -> None:
    constructor_id = str(request.node.callspec.id)
    if any(id_ == constructor_id for id_ in ("sqlframe", "pyspark[connect]")):
```
`sqlframe` and `pyspark.connect` raise errors at collect.
I need to double-check `pyspark.connect`; I currently cannot set it up locally... working on it.
Do you have an idea on how to deal with these?
```python
try:
    return self._with_native(self.native.select(*selection))
except duckdb.BinderException as e:
    raise ColumnNotFoundError.from_available_column_names(self.columns) from e
```
Do we risk catching other errors with this? `BinderException` might be a bit broad; shall we also match on `str(e)` before raising `ColumnNotFoundError`? We should probably also do this for `with_columns` / `simple_select` / `filter`?
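Something like this, perhaps (a sketch only; the exact duckdb message fragment to match is my assumption):

```python
try:
    return self._with_native(self.native.select(*selection))
except duckdb.BinderException as e:
    # Only translate genuine missing-column errors; anything else
    # (e.g. a type error in an expression) should propagate unchanged.
    if "Referenced column" not in str(e):
        raise
    raise ColumnNotFoundError.from_available_column_names(self.columns) from e
```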
@MarcoGorelli this has reminded me about an issue with `SparkLikeLazyFrame`:
narwhals/narwhals/_spark_like/dataframe.py, lines 152 to 164 in b7001e4
```python
def _to_arrow_schema(self) -> pa.Schema:  # pragma: no cover
    import pyarrow as pa  # ignore-banned-import

    from narwhals._arrow.utils import narwhals_to_native_dtype

    schema: list[tuple[str, pa.DataType]] = []
    nw_schema = self.collect_schema()
    native_schema = self.native.schema
    for key, value in nw_schema.items():
        try:
            native_dtype = narwhals_to_native_dtype(value, self._version)
        except Exception as exc:  # noqa: BLE001,PERF203
            native_spark_dtype = native_schema[key].dataType  # type: ignore[index]
```
This one is a bigger problem because it captures `CTRL+C`, so you can't easily stop the test suite while it's running.
Good point @MarcoGorelli!
Regarding `simple_select`, we are already catching a blind `Exception`:
narwhals/narwhals/dataframe.py, lines 168 to 180 in 2a16835
```python
if flat_exprs and all(isinstance(x, str) for x in flat_exprs) and not named_exprs:
    # fast path!
    try:
        return self._with_compliant(
            self._compliant_frame.simple_select(*flat_exprs)
        )
    except Exception as e:
        # Column not found is the only thing that can realistically be raised here.
        available_columns = self.columns
        missing_columns = [x for x in flat_exprs if x not in available_columns]
        raise ColumnNotFoundError.from_missing_and_available_column_names(
            missing_columns, available_columns
        ) from e
```
And it should already be caught by the first test:
narwhals/tests/frame/select_test.py, lines 104 to 109 in 2a16835
```python
if isinstance(df, nw.LazyFrame):
    with pytest.raises(ColumnNotFoundError, match=msg):
        df.select(selected_columns).collect()
else:
    with pytest.raises(ColumnNotFoundError, match=msg):
        df.select(selected_columns)
```
Maybe we should have a `_missing_column_exception` property in `BaseFrame`, specified for each backend, so that here we could catch the exact exception. What do you think? I could do that in a separate PR to see how it looks.
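For illustration, a rough sketch of that idea; the class names and the demo method are assumptions, just to show the shape:

```python
import duckdb

from narwhals.exceptions import ColumnNotFoundError

class BaseFrame:
    # Each backend overrides this with its native "column not found" error type.
    _missing_column_exception: type[Exception] = Exception

class DuckDBFrame(BaseFrame):
    _missing_column_exception = duckdb.BinderException

    def __init__(self, rel: duckdb.DuckDBPyRelation) -> None:
        self.native = rel

    def simple_select(self, *names: str) -> duckdb.DuckDBPyRelation:
        try:
            return self.native.select(*names)
        except self._missing_column_exception as e:
            # Generic code can now catch exactly the right backend exception.
            raise ColumnNotFoundError(
                f"Columns not found; available: {self.native.columns}"
            ) from e
```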
Regarding `filter` and `with_columns`, I think we are not yet testing what happens if we use non-existing columns in these methods, or am I missing something? I can make another small PR for just this.
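Something along these lines, perhaps; the test name is hypothetical and `constructor` is the usual test fixture:

```python
import pytest

import narwhals as nw
from narwhals.exceptions import ColumnNotFoundError

def test_filter_missing_column(constructor) -> None:
    df = nw.from_native(constructor({"a": [1, 2, 3]}))
    with pytest.raises(ColumnNotFoundError):
        result = df.filter(nw.col("b") > 0)
        if isinstance(result, nw.LazyFrame):
            # Lazy backends may only raise once the query is executed.
            result.collect()
```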