feat: add DataFrameLike parameter for cross-backend dataframe inputs#1144
ghostiee-11 wants to merge 5 commits into
`param.DataFrame` is restricted to pandas. `DataFrameLike` accepts any object Narwhals recognises (pandas, Polars, PyArrow, cuDF, Modin) and passes it through unchanged, so existing pandas-only code is unaffected (`param.DataFrame` is not touched).

* New `DataFrameLike(ClassSelector)` validating via `narwhals.from_native(eager_only=not allow_lazy, pass_through=False)`. Narwhals is an optional dependency, deferred like pandas is for `DataFrame`.
* Same `rows` / `columns` / `ordered` slots as `DataFrame`, driven through the Narwhals wrapper so they work on every backend. Column names are read via `collect_schema().names()` so lazy frames are not implicitly collected.
* `allow_lazy=True` opts into lazy frames (Polars LazyFrame, Dask, DuckDB); row-count validation is skipped for lazy frames.
* Backend-neutral `serialize` (list of records via Narwhals); `deserialize` reuses `DataFrame.deserialize` since JSON carries no backend information.
* `_length_bounds_check` extracted to a module-level helper shared by `DataFrame` and `DataFrameLike` (behaviour-preserving; testpandas unchanged).
* tests/testdataframelike.py covering pandas / Polars / PyArrow / lazy / serialization; narwhals + polars added to test-only dependencies.
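To make the serialization point concrete, here is a minimal sketch of a records-style round trip. It uses only the standard library: a plain dict of columns stands in for a dataframe, and `to_records`, `serialize` and `deserialize` are hypothetical names, not the methods added in this PR. The point is that the JSON payload carries rows but nothing about which backend produced them.

```python
import json


def to_records(columns):
    """Turn a column-oriented mapping into a list of row dicts (records)."""
    names = list(columns)
    lengths = {len(columns[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    n_rows = lengths.pop() if lengths else 0
    return [{name: columns[name][i] for name in names} for i in range(n_rows)]


def serialize(columns):
    """Serialize to JSON records; no backend name is written to the payload."""
    return json.dumps(to_records(columns))


def deserialize(payload):
    """Rebuild plain records; the caller must choose which backend to build."""
    return json.loads(payload)


payload = serialize({"a": [1, 2], "b": ["x", "y"]})
records = deserialize(payload)
```

Because the payload is just records, a deserializer has to pick one backend to rebuild into, which is why the commit message describes reusing the pandas-based `DataFrame.deserialize`.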
* Raise a clear ImportError naming the install command when the optional narwhals package is missing, instead of a bare ModuleNotFoundError (declaration-time fail-fast, matching how DataFrame fails on missing pandas).
* Document the serialization asymmetry (backend-neutral records out, pandas in) and that cuDF/Modin are Narwhals-supported but not run in CI (cuDF is GPU-only, Modin's pinned deps conflict with the test environment).
* Annotate the inherited in-place ordered defaulting as deliberate DataFrame parity.
* Add a skip-guarded Modin test and add narwhals + polars to the type-check environment so pyright validates the Narwhals API rather than skipping an unresolved import.
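The fail-fast import described in the first bullet can be sketched as follows. This is an illustration only: `get_optional_module` and its `install_hint` argument are hypothetical, and the body of the PR's own `_get_narwhals` helper is not shown in this thread.

```python
import importlib


def get_optional_module(name, install_hint):
    """Import an optional dependency, or raise an ImportError naming the install command."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError as exc:
        # Re-raise as a plain ImportError whose message tells the user what to run.
        raise ImportError(
            f"{name} is required for this parameter type but is not installed; "
            f"install it with `{install_hint}`."
        ) from exc
```

A caller would use it as `get_optional_module("narwhals", "pip install narwhals")`, so a missing package surfaces when the parameter is declared rather than when a value is first validated.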
Remove cuDF/Modin name-drops from the docstring, error message and tests. They are reachable through Narwhals like any other backend but are not exercised here (no GPU; Modin's pinned deps conflict), so naming them as features overclaims. The validation path is described generically as "any Narwhals-supported backend" with pandas, Polars and PyArrow as the tested set. Drops the permanently-skipped Modin test.
Hey @philippjfr !! Looking for your views on this. Thanks!!
hoxbro left a comment
Left some comments.
I'm not entirely sure if we should keep the underlying DataFrame or convert it to a narwhals DataFrame/LazyFrame.
A user would need to do the conversion in every Parameterized class method that uses it, as the API is very different across the DataFrame libraries.
Also how much AI have you used for this PR?
    return narwhals


class DataFrameLike(ClassSelector[t.Any]):
Not sure we should call this DataFrameLike. The name makes sense for something like ArrayLike, since most packages conform to the NumPy syntax. Narwhals is not a global standard, and the name may confuse users who are not well versed in the DataFrame ecosystem.
Yeah, okay. Andrew's GSoC goal-planning doc names the param DataFrameLike, but I can rename it; will change this to ArrayLike.
My point about ArrayLike was related to your other PR. I don't want this to be called ArrayLike.
narwhals = _get_narwhals()
try:
    return narwhals.from_native(
        val, eager_only=not self.allow_lazy, pass_through=False
Maybe we should just call it eager_only? Though not sure about this.
I picked allow_lazy=False because the default is "reject lazy", and a positive-named flag reads better at the call site (allow_lazy=True is easier to parse than eager_only=False). But eager_only does mirror the narwhals from_native(eager_only=...) kwarg we delegate to, which is a real consistency win. Will flip it to eager_only=True (default) unless you'd rather keep the positive-named form.
# ``collect_schema().names()`` is the Narwhals-recommended way to read
# column names uniformly across eager and lazy frames; ``.columns`` on
# a lazy frame triggers a backend schema-resolution warning.
Don't see the need for this comment.
Unlike :class:`DataFrame`, which is restricted to ``pandas.DataFrame``,
``DataFrameLike`` accepts any object supported by
`Narwhals <https://narwhals-dev.github.io>`_. pandas, Polars and PyArrow
are exercised in this project's test suite; any other Narwhals-supported
backend uses the identical code path. The value is passed through
unchanged, so reading the parameter returns the original native object
(no Narwhals wrapper). Authors who want a backend-agnostic API can call
``narwhals.from_native`` on the value themselves.

Narwhals is an optional dependency, imported on instantiation; a clear
``ImportError`` with the install command is raised if it is missing. The
structure of the frame can be constrained by the rows and columns
arguments:
I think this is heavily influenced by an LLM and should be rewritten to be more compact.
For example, the link is wrong.
Will rewrite. The correct URL is https://narwhals-dev.github.io/narwhals/. Compacting the whole docstring to roughly the same length as the existing DataFrame one, dropping the prose.
# Row count requires materialising a lazy frame; skip for lazy so
# the frame is never implicitly collected.
if self.rows is not None and not is_lazy:
    _length_bounds_check(self, self.rows, nwframe.shape[0], 'row')
Thanks!! will look into this and fix it.
if os.getenv('PARAM_TEST_NARWHALS', '0') == '1':
    raise ImportError("PARAM_TEST_NARWHALS=1 but narwhals not available.")
I think you should keep this simpler. No need to have an undocumented environment variable.
Agreed, dropping the env var.
@skip_no_pandas
@skip_no_polars
I think we can rewrite this to be polars only.
Will do. TestDataFrameLikeAllowLazy only exercises lazy semantics, and Polars LazyFrame is the only backend that lets us assert "no implicit collect" deterministically. Dropping the pandas skip from the class.
    rebuild from the records form.
    """

    __slots__ = ['rows', 'columns', 'ordered', 'allow_lazy']
Maybe we should have something like stable_version?
Yes, will switch the import to import narwhals.stable.v2 as nw and use that throughout.
skip_no_pyarrow = pytest.mark.skipif(pa is None, reason="pyarrow not available")


class TestDataFrameLikeDefaults(unittest.TestCase):
I would rewrite these to not inherit from unittest.TestCase and to use assert instead.
Yeah, true!! Will change these to use plain assert only.
    return numpy.asarray(value)


def _length_bounds_check(parameter, bounds, length, name):
You could also make a _DataFrameMixin class that holds the functionality shared between them.
Yeah, thanks for this feedback, will try this!!
Will take this to a follow-up PR!!
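For readers following this thread, a module-level bounds helper with the signature quoted above could look roughly like the sketch below. The body is an assumption (it is not the PR's implementation), the function is renamed `length_bounds_check` to mark it as illustrative, and the error prefix built from the parameter's class name is a stand-in for whatever prefix the library really uses.

```python
def length_bounds_check(parameter, bounds, length, name):
    """Raise ValueError if `length` violates `bounds`.

    `bounds` is either an exact length or a (lower, upper) tuple in which
    None means "unbounded" on that side.
    """
    prefix = type(parameter).__name__  # hypothetical error prefix
    message = f"{name} length {length} does not match declared bounds of {bounds}"
    if not isinstance(bounds, tuple):
        # A bare number means the length must match exactly.
        if bounds != length:
            raise ValueError(f"{prefix}: {message}")
        return
    lower, upper = bounds
    too_short = lower is not None and length < lower
    too_long = upper is not None and length > upper
    if too_short or too_long:
        raise ValueError(f"{prefix}: {message}")
```

Taking the parameter as an explicit first argument is what lets two unrelated classes call the same function; a mixin, as suggested in the review, would instead expose it as a method.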
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1144      +/-   ##
==========================================
+ Coverage   86.75%   86.80%   +0.04%
==========================================
  Files           9        9
  Lines        5302     5380      +78
==========================================
+ Hits         4600     4670      +70
- Misses        702      710       +8
    return pandas.DataFrame(value)


def _get_narwhals():
Probably want to move this into _utils.py
# ``collect_schema().names()`` is the Narwhals-recommended way to read
# column names uniformly across eager and lazy frames; ``.columns`` on
# a lazy frame triggers a backend schema-resolution warning.
cols = list(nwframe.collect_schema().names())
Let's avoid collecting the cols unless needed.
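One way to read this suggestion is to pass the validator a callable and invoke it only when a column constraint is actually declared. The sketch below is a simplified, library-free illustration: `validate_columns` and `get_column_names` are hypothetical names, only name-based constraints (a list or a set) are handled, and the rule that a list implies ordered matching while a set does not is an assumption modelled on the "ordered defaulting" mentioned in the commit messages.

```python
def validate_columns(get_column_names, columns=None, ordered=None):
    """Check column-name constraints, reading the schema only when needed.

    `get_column_names` is a zero-argument callable (for a Narwhals frame it
    could wrap the schema lookup) that is never called if `columns` is None.
    """
    if columns is None:
        return None  # no constraint declared, so the schema is never touched
    if isinstance(columns, set) and ordered:
        raise ValueError("columns cannot be ordered when specified as a set")
    names = [str(n) for n in get_column_names()]
    missing = set(columns) - set(names)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    if ordered is None:
        # Assumed default: a list implies order matters, a set does not.
        ordered = isinstance(columns, list)
    if ordered and names != list(columns):
        raise ValueError(f"column order {names} does not match {list(columns)}")
    return names
```

With this shape, an unconstrained parameter never triggers schema resolution, which matters most for lazy frames.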
Thanks for the review.
On native vs. narwhals return: pass-through is intentional. Returning a narwhals.DataFrame would break any consumer that calls
On AI usage: drafting and refactoring assistant. Design is mine and matches what I argued for in #975. Will trim the LLM-flavoured prose in the docstring this round.
* Move _get_narwhals to _utils and use narwhals.stable.v2
* Rename allow_lazy to eager_only (default True)
* Validate row count on LazyFrame via narwhals .count() instead of skipping
* Skip collect_schema() unless columns/ordered or a lazy row check needs it
* Compact docstring, fix narwhals URL, drop noisy inline comment
* Rewrite tests as plain pytest functions, drop PARAM_TEST_NARWHALS env var, scope lazy tests to polars
…eckers

* tests/testdefaults.py: append DataFrameLike to skip list when narwhals is unavailable, matching the existing pandas/numpy pattern
* param/parameters.py: restructure DataFrameLike._validate so cols and schema have non-Optional types in the branches that use them, fixing pyrefly/pyright/ty errors flagged on CI

`param.DataFrame` only accepts `pandas.DataFrame`, so there is no way to declare a parameter that holds tabular data when the value might be Polars, PyArrow, or another backend. This adds a new `DataFrameLike` parameter that validates anything the Narwhals protocol recognises and passes the native object through unchanged. `param.DataFrame` is deliberately left untouched, so existing pandas-only code keeps its guarantee. This is the separate-class direction discussed in #975; serialization backend-preservation is intentionally left as an open question there.

Same `rows` / `columns` / `ordered` slots as `DataFrame` (driven through Narwhals so they work on every backend), plus `allow_lazy=True` for Polars `LazyFrame` / Dask / DuckDB with no implicit collect. Narwhals is an optional dependency, deferred like pandas is for `DataFrame`, with a clear install message if missing.

Before


After
Tested with pandas, Polars (eager + lazy), and PyArrow; full suite 1550 passed, `testpandas.py` unchanged. Validation goes entirely through Narwhals, so any other Narwhals-supported backend uses the identical code path.