Skip to content

refactor: Simplify Implementation.to_native_namespace #2620

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 43 commits into
base: main
Choose a base branch
from

Conversation

FBruzzesi
Copy link
Member

@FBruzzesi FBruzzesi commented May 26, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Not sure how likable this is (see discord)

@dangotbanned
Copy link
Member

Not sure how likable this is

The phrasing of this tickled me 😂

@FBruzzesi what if we solved this and closed a related issue in one go?

We'd use what you have as lambda, but define them as import_pandas, import_polars, etc.
And then wrap with a nice error message?

We'd also be able to get typing working (at least for the individual cases)

@FBruzzesi
Copy link
Member Author

The phrasing of this tickled me 😂

I personally don't mind but who knows you guys don't have strong opinions on these hacks 😂

@FBruzzesi what if we solved this and closed a related issue in one go?

Sure, the polars implementation seems easy to adapt as well. What I am not too keen on is that MIN_VERSIONS is defined based on implementation, so that feels a bit circular, but definitly fixable 😂

We'd also be able to get typing working (at least for the individual cases)

To my knowledge, ModuleType is not a generic, so how would that work? i.e. what should the following simplified version return?

@overload
def import_optional(module_name: Literal["pandas"]) -> ?
    return __import__(module_name)

@dangotbanned
Copy link
Member

dangotbanned commented May 26, 2025

The phrasing of this tickled me 😂

I personally don't mind but who knows you guys don't have strong opinions on these hacks 😂

@FBruzzesi what if we solved this and closed a related issue in one go?

Sure, the polars implementation seems easy to adapt as well. What I am not too keen on is that MIN_VERSIONS is defined based on implementation, so that feels a bit circular, but definitly fixable 😂

We'd also be able to get typing working (at least for the individual cases)

To my knowledge, ModuleType is not a generic, so how would that work? i.e. what should the following simplified version return?

@overload
def import_optional(module_name: Literal["pandas"]) -> ?
    return __import__(module_name)

At the end of (#1940 (comment)) ...

So you leave each import_* function return un-annotated, and let the Module(...) part get inferred 😎

Image

Then you can have a mapping like:

# or replace `Implementation` with the literal you resolve in `Implementation.value`
_IMPORT_FUNCTIONS: Mapping[Implementation, Callable[..., ModuleType]] = {
    Implementation.PANDAS: import_pandas,
    Implementation.POLARS: import_polars,
    ...: ...,
}

#2620 (comment)

We'd also be able to get typing working (at least for the individual cases)

So in isolation we get the typing (import_polars()), but not when we do the redirect through a mapping (sadly 😢)

@dangotbanned dangotbanned marked this pull request as ready for review May 31, 2025 16:26
@dangotbanned
Copy link
Member

The only way I've found to fix this was by removing the sys.modules monkeypatching.
But I don't really like that as it means find_spec isn't being tested against the case where the module hasn't been imported yet 🤔

https://github.com/narwhals-dev/narwhals/actions/runs/15491689078/job/43618476234?pr=2620

@FBruzzesi
Copy link
Member Author

@MarcoGorelli @dangotbanned I tried to implement the idea I had mentioned in discord in 96540c3, yet apparently everything breaks apart with ibis, so I am happy to revert unless you guys have some other ideas.

The reason to use from importlib.metadata import version was mostly because sqlframe does not have __version__ that we can access (that's what polars does).
However,

  • import_module("ibis") -> ok
  • import_module("ibis-framework") -> errors
  • version("ibis") -> errors
  • version("ibis-framework") -> ok

@dangotbanned
Copy link
Member

@FBruzzesi here's a platter of thoughts 😄

This PR

importlib

There's a very new PEP 794 – Import Name Metadata that seems relevant to the issue you had here

  • version("ibis") -> errors
  • import_module("ibis-framework") -> errors

ibis-framework is the name of the distribution, whereas ibis is the name of the package.
They're usually the same, but sadly not in this case 😞

https://github.com/python/cpython/blob/24069fbca861a5904ee7718469919e84828f22e7/Lib/importlib/metadata/__init__.py#L980-L987
https://github.com/python/cpython/blob/24069fbca861a5904ee7718469919e84828f22e7/Lib/importlib/__init__.py#L71-L88

__version__

The reason to use from importlib.metadata import version was mostly because sqlframe does not have __version__ that we can access (that's what polars does).

So this detail is already handled in here, you might notice it starts with a call to to_native_namespace

Implementation._backend_version

narwhals/narwhals/utils.py

Lines 617 to 635 in deaaf76

def _backend_version(self) -> tuple[int, ...]:
native = self.to_native_namespace()
into_version: Any
if self not in {
Implementation.PYSPARK,
Implementation.PYSPARK_CONNECT,
Implementation.DASK,
Implementation.SQLFRAME,
}:
into_version = native
elif self in {Implementation.PYSPARK, Implementation.PYSPARK_CONNECT}:
into_version = get_pyspark() # pragma: no cover
elif self is Implementation.DASK:
into_version = get_dask()
else:
import sqlframe._version
into_version = sqlframe._version
return parse_version(into_version)

Minimum versions (generally)

I totally forgot to follow up on (https://discord.com/channels/1235257048170762310/1276640597284884490/1379470754051002389)

I'll continue this over on that other issue so you've got something to come back to

We have two other methods for enforcing version constraints:

If there is another use-case these don't cover, I think it would make the most sense to improve those tools.
This PR would be adding a third way of handling package versions - leading to 3 different error messages for the same issue 😔

@FBruzzesi
Copy link
Member Author

Thanks @dangotbanned, I am aware of all the code parts you mentioned.

I think import_optional_dependency is a much more generic function, for a much more generic use case - at this point we should restrict to be able to import and validate the backends we support, at least for now to address to_native_namespace.

utils.validate_backend_version

Yes, but it requires and implementation & the version already

So this detail is already handled in here, you might notice it starts with a call to to_native_namespace

We can abstract the body of this


I guess we can proceed with one of the two following options:

  • Revert to backend specific import functions, but add version validation
  • Restrict import_optional_dependency to the backends we have, to deal with case by case if needed

in this way we could be able to re-use the above mentioned functions/methods

@FBruzzesi
Copy link
Member Author

@dangotbanned I am quite happy with the state of this now. Let me know your thoughts whenever you have the time 🤞🏼

@FBruzzesi FBruzzesi requested a review from dangotbanned June 8, 2025 16:42
Copy link
Member

@dangotbanned dangotbanned left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @FBruzzesi sorry this one took a little longer than usual

Comment on lines 577 to 580
from importlib.metadata import version

into_version = sqlframe._version
return parse_version(into_version)
distribution_name = _IMPLEMENTATION_TO_DISTRIBUTION_NAME.get(self, self.value)
return parse_version(version=version(distribution_name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might end up triggering this downstream issue again with importlib.metadata.version:

It does have a solution, but I guess we'd need to keep the metadata requirements in sync somehow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can revert to use __version__ attribute where applicable, _version for sqlframe, and from importlib.metadata import version only for ibis (I couldn't spot an attribute storing the library version)

As we discussed in the past, __version__ is not a standard, so not guaranteed

Copy link
Member

@dangotbanned dangotbanned Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed in the past, __version__ is not a standard, so not guaranteed

Yeah that's a good point, personally I'd want use to be using importlib.metadata.version for that reason.
I wonder if the pyinstaller folks could collect metadata for all dependencies in our pyproject.toml?

It does seem a bit off to me that we'd need to do things in a non-standardized way for what I can only assume is a downstream perf optimization 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main "issue" is that it would require to import the module in the first place, which is somewhat a looping with to_native_namespace 🤔

An idea would be to move the following block into _backend_version:

if self is Implementation.UNKNOWN:
    msg = "Cannot return native namespace from UNKNOWN Implementation"
    raise AssertionError(msg)

from importlib import import_module

module_name = _IMPLEMENTATION_TO_MODULE_NAME.get(self, self.value)
module = import_module(module_name)

and cache the module once loaded to be reused in to_native_namespace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dangotbanned is from importlib import import_module ok to use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and cache the module once loaded to be reused in to_native_namespace

I'm definitely in favor of adding some caching into the mix somewhere (#1657 (reply in thread)), we're checking backend_version waaaaaaay too many times for something that doesn't change.
But I think we can look at that in another issue.

and from importlib.metadata import version only for ibis (I couldn't spot an attribute storing the library version)

ibis seems to have always had a __version__?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dangotbanned is from importlib import import_module ok to use?

Yeah should be - you'll get the same ModuleNotFoundError if it isn't installed

>>> import ibis2
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[32], line 1
----> 1 import ibis2

ModuleNotFoundError: No module named 'ibis2'
from importlib import import_module

>>> import_module("ibis2")
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[31], line 3
      1 from importlib import import_module
----> 3 import_module("ibis2")

File ~\AppData\Roaming\uv\python\cpython-3.13.2-windows-x86_64-none\Lib\importlib\__init__.py:88, in import_module(name, package)
     86             break
     87         level += 1
---> 88 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1387, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1360, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1324, in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 'ibis2'

Comment on lines 29 to 60
@pytest.mark.parametrize(
("impl", "frame_constructor", "frame_type", "kwargs"),
[
(Implementation.CUDF, "DataFrame", "DataFrame", {}),
(Implementation.DASK, "from_dict", "DataFrame", {"npartitions": 1}),
(Implementation.MODIN, "DataFrame", "DataFrame", {}),
(Implementation.PANDAS, "DataFrame", "DataFrame", {}),
(Implementation.POLARS, "DataFrame", "DataFrame", {}),
(Implementation.PYARROW, "table", "Table", {}),
],
)
def test_round_trip(
monkeypatch: pytest.MonkeyPatch,
impl: Implementation,
frame_constructor: str,
frame_type: str,
kwargs: dict[str, Any],
) -> None:
module_name = impl.value
if not find_spec(module_name):
reason = f"{module_name} not installed"
pytest.skip(reason=reason)

if impl.is_pyarrow() or (sys.version_info >= (3, 10) and impl.is_pandas()):
# NOTE: Can't do `monkeypatch.delitem` safely for other implementations
monkeypatch.delitem(sys.modules, module_name)

module = impl.to_native_namespace()
df = getattr(module, frame_constructor)(data, **kwargs)
result = _roundtrip_query(df)

assert isinstance(result, getattr(module, frame_type))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the previous tests, I forgot to add the step that I was eyballing in my IDE - but was actually the motivation for them 🤦‍♂️

Previous tests

def test_import_polars(data: dict[str, Any]) -> None:
# NOTE: Can't do `monkeypatch.delitem` safely
# ImportError: PyO3 modules compiled for CPython 3.8 or older may only be initialized once per interpreter process
pytest.importorskip("polars")
df = dependencies.import_polars().DataFrame(data)
result = _roundtrip_query(df)
import polars as pl
assert isinstance(result, pl.DataFrame)
# NOTE: Don't understand this one
@pytest.mark.xfail(
reason="'dict' object has no attribute '_meta'.\n"
"dask/dataframe/dask_expr/_collection.py:609",
raises=AttributeError,
)
def test_import_dask(data: dict[str, Any]) -> None:
pytest.importorskip("dask")
df = dependencies.import_dask().DataFrame(data)
result = _roundtrip_query(df)
import dask.dataframe as dd
assert isinstance(result, dd.DataFrame)
def test_import_pandas(monkeypatch: pytest.MonkeyPatch, data: dict[str, Any]) -> None:
pytest.importorskip("pandas")
monkeypatch.delitem(sys.modules, "pandas")
df = dependencies.import_pandas().DataFrame(data)
result = _roundtrip_query(df)
import pandas as pd
assert isinstance(result, pd.DataFrame)
def test_import_pyarrow(monkeypatch: pytest.MonkeyPatch, data: dict[str, Any]) -> None:
pytest.importorskip("pyarrow")
monkeypatch.delitem(sys.modules, "pyarrow")
df = dependencies.import_pyarrow().table(data)
result = _roundtrip_query(df)
import pyarrow as pa
assert isinstance(result, pa.Table)

If we used polars as an example, I was trying to make sure the typing survived a roundtrip.

I could see that out on hover, but the test should've been:

def test_import_polars(data: dict[str, Any]) -> None: 
    pytest.importorskip("polars") 
    df = dependencies.import_polars().DataFrame(data) 
    result = _roundtrip_query(df) 
    import polars as pl

    if TYPE_CHECKING:
        from typing_extensions import assert_type
        
        assert_type(result, pl.DataFrame)
    assert isinstance(result, pl.DataFrame) 

That style is what I wrote in the ...

Namespace tests

def test_namespace_from_numpy_polars() -> None:
pytest.importorskip("polars")
pytest.importorskip("numpy")
import numpy as np
import polars as pl
arr: _2DArray = cast("_2DArray", np.array([[5, 2, 0, 1], [1, 4, 7, 8], [1, 2, 3, 9]]))
columns = "a", "b", "c"
frame = Namespace.from_backend("polars").compliant.from_numpy(arr, columns).native
if TYPE_CHECKING:
assert_type(frame, pl.DataFrame)
assert isinstance(frame, pl.DataFrame)

Since the PR doesn't have typing anymore, I'm not sure the test makes much sense as we're effectively just checking:

  • Does find_spec work?
  • Can we construct a dataframe via import_module?
  • Once we have a dataframe, can we use from_native and to_native?

I think the other tests you've added give us the coverage for the new behavior

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you would you be ok with just leave the tests test_to_native_namespace and test_to_native_namespace_unknown?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah as long as they're all we need for coverage

@FBruzzesi
Copy link
Member Author

What's up with pyarrow nightly 🥲

@dangotbanned
Copy link
Member

What's up with pyarrow nightly 🥲

Same!

I'm seeing that recently in #2572 as well

@dangotbanned dangotbanned changed the title refactor: Simplify Implementation.to_native_namspace refactor: Simplify Implementation.to_native_namespace Jun 12, 2025
Comment on lines 573 to 584

As a biproduct of loading the native namespace, we also store it as an attribute
under the `_native_namespace` name.
"""
if self is Implementation.UNKNOWN: # pragma: no cover
msg = "Cannot return backend version from UNKNOWN Implementation"
raise AssertionError(msg)

from importlib import import_module

module_name = _IMPLEMENTATION_TO_MODULE_NAME.get(self, self.value)
self._native_namespace = import_module(module_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to store modules on enums (or any state for that matter)

To be super clear, my comment in (#2620 (comment)) was meant to mean let's defer caching to another issue.

I was just about to merge and was pretty surprised to see this added in a commit titled rollback _backend_verson.

If we need to add caching of module loading, I would much prefer to just lift the _LazyModule stuff from polars - since that has typing support built in (https://github.com/pola-rs/polars/blob/7cb263a6dc4e3a1893b6ad4af963733c83268b60/py-polars/polars/dependencies.py)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to avoid repeating the exact same imports to check the backend version and return the module.

I am a bit in a rush but I just made a commit to factor out and cache in a helper function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants