fix: unify ColumnNotFound for duckdb and pyspark #2493

Open: wants to merge 17 commits into main

Conversation

@EdAbati (Collaborator) commented May 4, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@EdAbati (Collaborator, Author) commented May 4, 2025

I think I can do some more clean-up of repetitive code. I'll try tomorrow morning.

@EdAbati marked this pull request as ready for review on May 5, 2025, 07:04
@EdAbati (Collaborator, Author) commented May 5, 2025

I made a follow-up PR #2495 with the cleanup :)

@MarcoGorelli (Member) left a comment


thanks for working on this! just one comment on the .columns usage

@@ -186,7 +187,14 @@ def from_column_names(
         context: _FullContext,
     ) -> Self:
         def func(df: DuckDBLazyFrame) -> list[duckdb.Expression]:
-            return [col(name) for name in evaluate_column_names(df)]
+            col_names = evaluate_column_names(df)
+            missing_columns = [c for c in col_names if c not in df.columns]

Member commented:

df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation

I was hoping we could do something like we do for Polars. That is to say, when we do select / with_columns, we wrap them in try/except, and in the except block we intercept the error message to give a more useful / unified one
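
For reference, a rough sketch of that Polars-style interception, reusing the from_available_column_names helper that appears later in this diff; the function below is illustrative, not the actual narwhals internals:

import polars as pl

from narwhals.exceptions import ColumnNotFoundError


def select_with_unified_error(df: pl.DataFrame, *exprs: pl.Expr) -> pl.DataFrame:
    # Let Polars attempt the selection; if it complains about a missing column,
    # translate its native error into the unified narwhals one, keeping the
    # original exception as the cause.
    try:
        return df.select(*exprs)
    except pl.exceptions.ColumnNotFoundError as e:
        raise ColumnNotFoundError.from_available_column_names(df.columns) from e

The point is that no .columns lookup happens on the happy path; the potentially expensive call only runs once an error has already been raised.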

@EdAbati (Collaborator, Author) commented May 5, 2025:

Ah interesting, I was not aware 😕

What is happening in the background in duckdb that causes this overhead? Do you have a link to the docs? (Just want to learn more)

Also, is it a specific caveat of duckdb? I don't think we should worry about that in spark-like but I might be wrong

I will update the code tonight anyway (but of course feel free to add commits to this branch if you need it for today's release)

Member commented:

> df.columns comes with overhead unfortunately, I think we should avoid calling it where possible. How much overhead depends on the operation

@MarcoGorelli could we add that to (#805) and put more of a focus towards it? 🙏

Member commented:

I don't think it's documented, but evaluating .columns may sometimes require doing a full scan. Example:

In [48]: df = pl.DataFrame({'a': rng.integers(0, 10_000, 100_000_000), 'b': rng.integers(0, 10_000, 100_000_000)})

In [49]: rel = duckdb.table('df')
100% ▕████████████████████████████████████████████████████████▏

In [50]: rel1 = duckdb.sql("""pivot rel on a""")

In [51]: %timeit rel.columns
385 ns Β± 7.62 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

In [52]: %timeit rel1.columns
585 ΞΌs Β± 3.8 ΞΌs per loop (mean Β± std. dev. of 7 runs, 1,000 loops each)

Granted, we don't have pivot in the Narwhals lazy API, but a pivot may appear in the history of the relation which someone passes to nw.from_native, and the output schema of pivot is value-dependent (😩 )

The same consideration should apply to spark-like

Member commented:

How do those timings compare to other operations/metadata lookups on the same tables?

Member commented:

.alias for example is completely non-value-dependent, so that stays fast

In [60]: %timeit rel.alias
342 ns Β± 2.3 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

In [61]: %timeit rel1.alias
393 ns Β± 2.6 ns per loop (mean Β± std. dev. of 7 runs, 1,000,000 loops each)

@EdAbati added the pyspark, pyspark-connect, and error reporting labels on May 6, 2025
try:
    return self._with_native(self.native.select(*new_columns_list))
except AnalysisException as e:
    msg = f"Selected columns not found in the DataFrame.\n\nHint: Did you mean one of these columns: {self.columns}?"

@EdAbati (Collaborator, Author) commented:

Not 100% sure about this error message. I don't think we can access the missing column names at this level; am I missing something?

Member commented:

I think what you've written is great - even though we can't access them, we can still try to be helpful

@EdAbati (Collaborator, Author) commented:

I split the test into lazy and eager to simplify the if-else statements a bit. I hope it is a bit more readable?

return df

if constructor_id == "polars[lazy]":
    msg = r"^e|\"(e|f)\""

@EdAbati (Collaborator, Author) commented May 9, 2025:

Before, it was msg = "e|f". Now it is a bit stricter.
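
For illustration, a quick check of how the stricter pattern behaves; the error strings here are made up, not actual Polars messages:

import re

msg = r"^e|\"(e|f)\""  # the pattern from the diff above

assert re.search(msg, "e")                       # message starting with the column name
assert re.search(msg, 'found columns "e", "f"')  # quoted column names anywhere in the message
assert not re.search(msg, "unrelated error")     # a bare "e" in the text no longer matches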

Comment on lines +105 to +106
with pytest.raises(ColumnNotFoundError, match=msg):
    df.select(nw.col("fdfa"))

@EdAbati (Collaborator, Author) commented:

Before, this was not tested for Polars.

    constructor_lazy: ConstructorLazy, request: pytest.FixtureRequest
) -> None:
    constructor_id = str(request.node.callspec.id)
    if any(id_ == constructor_id for id_ in ("sqlframe", "pyspark[connect]")):

@EdAbati (Collaborator, Author) commented May 9, 2025:

sqlframe and pyspark.connect raise errors at collect. 😕

I need to double-check pyspark.connect. Currently I cannot set it up locally... Working on it ⏳

Do you have an idea on how to deal with these?

@EdAbati changed the title from "fix: unify ColumnNotFound for duckdb and pyspark/sqlframe" to "fix: unify ColumnNotFound for duckdb and pyspark" on May 9, 2025
Comment on lines +187 to +190
try:
    return self._with_native(self.native.select(*selection))
except duckdb.BinderException as e:
    raise ColumnNotFoundError.from_available_column_names(self.columns) from e

Member commented:

do we risk catching other errors with this? BinderException might be a bit broad, shall we also match on str(e) before raising ColumnNotFoundError?

we should probably also do this for:

  • with_columns
  • simple_select
  • filter

?
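
A minimal sketch of what that narrower match could look like, assuming duckdb's binder message for a missing column contains the text "not found" (the exact wording would need verifying against real duckdb errors):

try:
    return self._with_native(self.native.select(*selection))
except duckdb.BinderException as e:
    # Only translate binder errors that actually look like a missing column;
    # anything else is re-raised unchanged.
    if "not found" in str(e):
        raise ColumnNotFoundError.from_available_column_names(self.columns) from e
    raise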

Member commented:

@MarcoGorelli this has reminded me about an issue with SparkLikeLazyFrame

def _to_arrow_schema(self) -> pa.Schema:  # pragma: no cover
    import pyarrow as pa  # ignore-banned-import

    from narwhals._arrow.utils import narwhals_to_native_dtype

    schema: list[tuple[str, pa.DataType]] = []
    nw_schema = self.collect_schema()
    native_schema = self.native.schema
    for key, value in nw_schema.items():
        try:
            native_dtype = narwhals_to_native_dtype(value, self._version)
        except Exception as exc:  # noqa: BLE001,PERF203
            native_spark_dtype = native_schema[key].dataType  # type: ignore[index]

This one is a bigger problem because it captures CTRL+C, so you can't easily stop the test suite while it's running

@EdAbati (Collaborator, Author) commented:

Good point @MarcoGorelli!

Regarding simple_select, we are already catching a blind Exception:

if flat_exprs and all(isinstance(x, str) for x in flat_exprs) and not named_exprs:
    # fast path!
    try:
        return self._with_compliant(
            self._compliant_frame.simple_select(*flat_exprs)
        )
    except Exception as e:
        # Column not found is the only thing that can realistically be raised here.
        available_columns = self.columns
        missing_columns = [x for x in flat_exprs if x not in available_columns]
        raise ColumnNotFoundError.from_missing_and_available_column_names(
            missing_columns, available_columns
        ) from e

And should already be caught by the first test:

if isinstance(df, nw.LazyFrame):
    with pytest.raises(ColumnNotFoundError, match=msg):
        df.select(selected_columns).collect()
else:
    with pytest.raises(ColumnNotFoundError, match=msg):
        df.select(selected_columns)

Maybe we should have a _missing_column_exception property in BaseFrame that is specified for each backend, so we can catch the exact exception here.
What do you think? I could do that in a separate PR to see how it looks.
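
A hypothetical sketch of that idea; the class and method names below are illustrative only, not existing narwhals internals:

from __future__ import annotations

import duckdb

from narwhals.exceptions import ColumnNotFoundError


class BaseFrameSketch:
    # Each backend declares the native exception its engine raises for a
    # missing column, so higher-level code can catch exactly that type.
    _missing_column_exception: type[Exception] = Exception
    columns: list[str]

    def _native_select(self, *names: str):  # placeholder for the real backend call
        raise NotImplementedError


class DuckDBFrameSketch(BaseFrameSketch):
    _missing_column_exception = duckdb.BinderException


def simple_select(frame: BaseFrameSketch, *names: str):
    try:
        return frame._native_select(*names)
    except frame._missing_column_exception as e:
        missing = [n for n in names if n not in frame.columns]
        raise ColumnNotFoundError.from_missing_and_available_column_names(
            missing, frame.columns
        ) from e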

Regarding filter and with_columns, I think we are not yet testing what happens if we use non-existent columns in these methods, or am I missing something? I can make another small PR just for this.

Labels: error reporting, pyspark, pyspark-connect
Projects: none yet

Development

Successfully merging this pull request may close these issues.

error reporting: unify "column not found" error message for DuckDB / spark-like
3 participants