fix: unify ColumnNotFound for duckdb and pyspark #2493

Open: wants to merge 17 commits into main

Conversation

@EdAbati (Collaborator) commented May 4, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@EdAbati (Collaborator, Author) commented May 4, 2025

I think I can do some more clean-up of repetitive code. I'll try tomorrow morning.

@EdAbati marked this pull request as ready for review on May 5, 2025, 07:04
@EdAbati (Collaborator, Author) commented May 5, 2025

I made a follow-up PR #2495 with the cleanup :)

@MarcoGorelli (Member) left a comment


thanks for working on this! just one comment on the .columns usage

@@ -186,7 +187,14 @@ def from_column_names(
         context: _FullContext,
     ) -> Self:
         def func(df: DuckDBLazyFrame) -> list[duckdb.Expression]:
-            return [col(name) for name in evaluate_column_names(df)]
+            col_names = evaluate_column_names(df)
+            missing_columns = [c for c in col_names if c not in df.columns]

Member commented:

df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation

I was hoping we could do something like we do for Polars. That is to say, when we do select / with_columns, we wrap them in try/except, and in the except block we intercept the error message to give a more useful / unified one
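
For reference, a rough sketch of that Polars-style interception, reusing the from_available_column_names helper that appears later in this diff; the function below is illustrative, not the actual narwhals internals:

import polars as pl

from narwhals.exceptions import ColumnNotFoundError


def select_with_unified_error(df: pl.DataFrame, *exprs: pl.Expr) -> pl.DataFrame:
    # Let Polars attempt the selection; if it complains about a missing column,
    # translate its native error into the unified narwhals one, keeping the
    # original exception as the cause.
    try:
        return df.select(*exprs)
    except pl.exceptions.ColumnNotFoundError as e:
        raise ColumnNotFoundError.from_available_column_names(df.columns) from e

The point is that no .columns lookup happens on the happy path; the potentially expensive call only runs once an error has already been raised.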

@EdAbati (Collaborator, Author) commented May 5, 2025:

Ah interesting, I was not aware 😕

What is happening in the background in duckdb that causes this overhead? Do you have a link to the docs? (Just want to learn more)

Also, is it a specific caveat of duckdb? I don't think we should worry about that in spark-like but I might be wrong

I will update the code tonight anyway (but of course feel free to add commits to this branch if you need it for today's release)

Member commented:

> df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation

@MarcoGorelli could we add that to (#805) and put more of a focus towards it? 🙏

Member commented:

I don't think it's documented, but evaluating .columns may sometimes require doing a full scan. Example:

In [48]: df = pl.DataFrame({'a': rng.integers(0, 10_000, 100_000_000), 'b': rng.integers(0, 10_000, 100_000_000)})

In [49]: rel = duckdb.table('df')
100% ▕████████████████████████████████████████████████████████▏

In [50]: rel1 = duckdb.sql("""pivot rel on a""")

In [51]: %timeit rel.columns
385 ns Β± 7.62 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

In [52]: %timeit rel1.columns
585 ΞΌs Β± 3.8 ΞΌs per loop (mean Β± std. dev. of 7 runs, 1,000 loops each)

Granted, we don't have pivot in the Narwhals lazy API, but a pivot may appear in the history of the relation which someone passes to nw.from_native, and the output schema of pivot is value-dependent (😩 )

The same consideration should apply to spark-like

Member commented:

How do those timings compare to other operations/metadata lookups on the same tables?

Member commented:

.alias for example is completely non-value-dependent, so that stays fast

In [60]: %timeit rel.alias
342 ns Β± 2.3 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

In [61]: %timeit rel1.alias
393 ns Β± 2.6 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

@EdAbati added the pyspark, pyspark-connect, and error reporting labels on May 6, 2025
try:
    return self._with_native(self.native.select(*new_columns_list))
except AnalysisException as e:
    msg = f"Selected columns not found in the DataFrame.\n\nHint: Did you mean one of these columns: {self.columns}?"

@EdAbati (Collaborator, Author) commented:

Not 100% sure about this error message. I don't think we can access the missing column names at this level; am I missing something?

Member commented:

I think what you've written is great - even though we can't access them, we can still try to be helpful

@EdAbati (Collaborator, Author) commented:

I split the test into lazy and eager to simplify the if-else statements a bit. I hope it is a bit more readable?

return df

if constructor_id == "polars[lazy]":
    msg = r"^e|\"(e|f)\""

@EdAbati (Collaborator, Author) commented May 9, 2025:

Before, it was msg = "e|f". Now it is a bit stricter.
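
For illustration, a quick check of how the stricter pattern behaves; the error strings here are made up, not actual Polars messages:

import re

msg = r"^e|\"(e|f)\""  # the pattern from the diff above

assert re.search(msg, "e")                       # message starting with the column name
assert re.search(msg, 'found columns "e", "f"')  # quoted column names anywhere in the message
assert not re.search(msg, "unrelated error")     # a bare "e" in the text no longer matches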

Comment on lines +105 to +106
with pytest.raises(ColumnNotFoundError, match=msg):
    df.select(nw.col("fdfa"))

@EdAbati (Collaborator, Author) commented:

Before, this was not tested for Polars.

    constructor_lazy: ConstructorLazy, request: pytest.FixtureRequest
) -> None:
    constructor_id = str(request.node.callspec.id)
    if any(id_ == constructor_id for id_ in ("sqlframe", "pyspark[connect]")):

@EdAbati (Collaborator, Author) commented May 9, 2025:

sqlframe and pyspark.connect raise errors at collect. 😕

I need to double-check pyspark.connect. Currently I cannot set it up locally... Working on it ⏳

Do you have an idea on how to deal with these?

@EdAbati changed the title from "fix: unify ColumnNotFound for duckdb and pyspark/sqlframe" to "fix: unify ColumnNotFound for duckdb and pyspark" on May 9, 2025
Comment on lines +187 to +190
try:
    return self._with_native(self.native.select(*selection))
except duckdb.BinderException as e:
    raise ColumnNotFoundError.from_available_column_names(self.columns) from e

Member commented:

do we risk catching other errors with this? BinderException might be a bit broad, shall we also match on str(e) before raising ColumnNotFoundError?

we should probably also do this for:

  • with_columns
  • simple_select
  • filter

?
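
A minimal sketch of what that narrower match could look like, assuming duckdb's binder message for a missing column contains the text "not found" (the exact wording would need verifying against real duckdb errors):

try:
    return self._with_native(self.native.select(*selection))
except duckdb.BinderException as e:
    # Only translate binder errors that actually look like a missing column;
    # anything else is re-raised unchanged.
    if "not found" in str(e):
        raise ColumnNotFoundError.from_available_column_names(self.columns) from e
    raise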

Member commented:

@MarcoGorelli this has reminded me about an issue with SparkLikeLazyFrame

def _to_arrow_schema(self) -> pa.Schema:  # pragma: no cover
    import pyarrow as pa  # ignore-banned-import

    from narwhals._arrow.utils import narwhals_to_native_dtype

    schema: list[tuple[str, pa.DataType]] = []
    nw_schema = self.collect_schema()
    native_schema = self.native.schema
    for key, value in nw_schema.items():
        try:
            native_dtype = narwhals_to_native_dtype(value, self._version)
        except Exception as exc:  # noqa: BLE001,PERF203
            native_spark_dtype = native_schema[key].dataType  # type: ignore[index]

This one is a bigger problem because it captures CTRL+C, so you can't easily stop the test suite while it's running

@EdAbati (Collaborator, Author) commented:

Good point @MarcoGorelli!

Regarding simple_select, we are already catching a blind Exception:

if flat_exprs and all(isinstance(x, str) for x in flat_exprs) and not named_exprs:
    # fast path!
    try:
        return self._with_compliant(
            self._compliant_frame.simple_select(*flat_exprs)
        )
    except Exception as e:
        # Column not found is the only thing that can realistically be raised here.
        available_columns = self.columns
        missing_columns = [x for x in flat_exprs if x not in available_columns]
        raise ColumnNotFoundError.from_missing_and_available_column_names(
            missing_columns, available_columns
        ) from e

And should already be caught by the first test:

if isinstance(df, nw.LazyFrame):
    with pytest.raises(ColumnNotFoundError, match=msg):
        df.select(selected_columns).collect()
else:
    with pytest.raises(ColumnNotFoundError, match=msg):
        df.select(selected_columns)

Maybe we should have a _missing_column_exception property in BaseFrame that is specified for each backend, so we can catch the exact exception here.
What do you think? I could do that in a separate PR to see how it looks.
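
A hypothetical sketch of that idea; the class and method names below are illustrative only, not existing narwhals internals:

from __future__ import annotations

import duckdb

from narwhals.exceptions import ColumnNotFoundError


class BaseFrameSketch:
    # Each backend declares the native exception its engine raises for a
    # missing column, so higher-level code can catch exactly that type.
    _missing_column_exception: type[Exception] = Exception
    columns: list[str]

    def _native_select(self, *names: str):  # placeholder for the real backend call
        raise NotImplementedError


class DuckDBFrameSketch(BaseFrameSketch):
    _missing_column_exception = duckdb.BinderException


def simple_select(frame: BaseFrameSketch, *names: str):
    try:
        return frame._native_select(*names)
    except frame._missing_column_exception as e:
        missing = [n for n in names if n not in frame.columns]
        raise ColumnNotFoundError.from_missing_and_available_column_names(
            missing, frame.columns
        ) from e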

Regarding filter and with_columns, I think we are not yet testing what happens if we use non-existent columns in these methods, or am I missing something? I can make another small PR just for this.

Labels: error reporting, pyspark, pyspark-connect
Projects: none yet

Development

Successfully merging this pull request may close these issues.

error reporting: unify "column not found" error message for DuckDB / spark-like
3 participants