feat: add DataFrameLike parameter for cross-backend dataframe inputs #1144

Open
ghostiee-11 wants to merge 5 commits into holoviz:main from ghostiee-11:feat/dataframelike-param

@ghostiee-11 ghostiee-11 commented May 17, 2026

param.DataFrame only accepts pandas.DataFrame, so there is no way to declare a parameter that holds tabular data when the value might be Polars, PyArrow, or another backend. This adds a new DataFrameLike parameter that validates anything the Narwhals protocol recognises and passes the native object through unchanged. param.DataFrame is deliberately left untouched, so existing pandas-only code keeps its guarantee. This is the separate-class direction discussed in #975; serialization backend-preservation is intentionally left as an open question there.

Same rows / columns / ordered slots as DataFrame (driven through Narwhals so they work on every backend), plus allow_lazy=True for Polars LazyFrame / Dask / DuckDB with no implicit collect. Narwhals is an optional dependency, deferred like pandas is for DataFrame, with a clear install message if missing.
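
The deferred optional import described above can be sketched roughly as follows. This is a minimal illustration and not the PR's code: the helper name `import_optional` and the wording of the message are assumptions. Only the idea comes from the description, namely importing on first use and raising an `ImportError` that names the install command.

```python
import importlib


def import_optional(module_name, install_hint):
    """Import ``module_name`` on demand, with an actionable error if missing.

    Hypothetical helper for illustration; not part of param.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Re-raise with the install command so the failure is self-explanatory.
        raise ImportError(
            f"{module_name!r} is required for this parameter; "
            f"install it with `{install_hint}`."
        ) from exc


# A module that exists is returned as usual ...
json_module = import_optional("json", "(bundled with Python)")

# ... while a missing one produces the descriptive message.
try:
    import_optional("surely_not_installed_pkg", "pip install surely_not_installed_pkg")
    message = None
except ImportError as err:
    message = str(err)
```

Calling such a helper when the parameter is declared keeps the failure at declaration time, which is the fail-fast behaviour the description asks for.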

Before: [screenshot: dflike-before]
After: [screenshot: dflike-after]

Tested with pandas, Polars (eager + lazy), and PyArrow; full suite 1550 passed, testpandas.py unchanged. Validation goes entirely through Narwhals, so any other Narwhals-supported backend uses the identical code path.

`param.DataFrame` is restricted to pandas. `DataFrameLike` accepts any
object Narwhals recognises (pandas, Polars, PyArrow, cuDF, Modin) and
passes it through unchanged, so existing pandas-only code is unaffected
(`param.DataFrame` is not touched).

* New `DataFrameLike(ClassSelector)` validating via
  `narwhals.from_native(eager_only=not allow_lazy, pass_through=False)`.
  Narwhals is an optional dependency, deferred like pandas is for
  `DataFrame`.
* Same `rows` / `columns` / `ordered` slots as `DataFrame`, driven
  through the Narwhals wrapper so they work on every backend. Column
  names read via `collect_schema().names()` so lazy frames are not
  implicitly collected.
* `allow_lazy=True` opts into lazy frames (Polars LazyFrame, Dask,
  DuckDB); row-count validation is skipped for lazy frames.
* Backend-neutral `serialize` (list of records via Narwhals);
  `deserialize` reuses `DataFrame.deserialize` since JSON carries no
  backend information.
* `_length_bounds_check` extracted to a module-level helper shared by
  `DataFrame` and `DataFrameLike` (behaviour-preserving; testpandas
  unchanged).
* tests/testdataframelike.py covering pandas / Polars / PyArrow / lazy
  / serialization; narwhals + polars added to test-only dependencies.
* Raise a clear ImportError naming the install command when the
  optional narwhals package is missing, instead of a bare
  ModuleNotFoundError (declaration-time fail-fast, matching how
  DataFrame fails on missing pandas).
* Document the serialization asymmetry (backend-neutral records out,
  pandas in) and that cuDF/Modin are Narwhals-supported but not run
  in CI (cuDF is GPU-only, Modin's pinned deps conflict with the
  test environment).
* Annotate the inherited in-place ordered defaulting as deliberate
  DataFrame parity.
* Add a skip-guarded Modin test and add narwhals + polars to the
  type-check environment so pyright validates the Narwhals API
  rather than skipping an unresolved import.
Remove cuDF/Modin name-drops from the docstring, error message and
tests. They are reachable through Narwhals like any other backend but
are not exercised here (no GPU; Modin's pinned deps conflict), so
naming them as features overclaims. The validation path is described
generically as "any Narwhals-supported backend" with pandas, Polars
and PyArrow as the tested set. Drops the permanently-skipped Modin
test.
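
The `columns` / `ordered` semantics mentioned in the commit notes above can be illustrated on plain lists of column names, such as those returned by `collect_schema().names()`. This is a sketch under assumptions: `check_columns` is a hypothetical name, and the rules shown (a list implies exact names in order, a set implies membership only) follow how `param.DataFrame` documents these slots rather than the PR's actual implementation.

```python
def check_columns(actual, declared, ordered=None):
    """Validate column names against a declared list or set.

    Hypothetical helper: ``actual`` is the frame's column names in order.
    """
    if ordered is None:
        # A list carries an order, a set does not.
        ordered = isinstance(declared, list)
    if isinstance(declared, set) and ordered:
        raise ValueError("a set of columns cannot be ordered")

    missing = set(declared) - set(actual)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if ordered and list(actual) != list(declared):
        raise ValueError(f"expected columns {list(declared)}, got {list(actual)}")
    return True


names = ["a", "b", "c"]
ok_set = check_columns(names, {"c", "a"})        # membership only
ok_list = check_columns(names, ["a", "b", "c"])  # exact names and order

try:
    check_columns(names, ["b", "a", "c"])        # same names, wrong order
    order_error = None
except ValueError as err:
    order_error = str(err)
```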
@ghostiee-11 ghostiee-11 marked this pull request as ready for review May 17, 2026 11:51
@ghostiee-11
Author

Hey @philippjfr! Looking for your views on this.

Thanks!!

Member

@hoxbro hoxbro left a comment

Left some comments.

I'm not entirely sure if we should keep the underlying DataFrame or convert it to a narwhals DataFrame/LazyFrame.

A user would need to do the conversion in every Parameterized class method that uses it, as the APIs are very different across the DataFrame libraries.

Also how much AI have you used for this PR?

Comment thread param/parameters.py
return narwhals


class DataFrameLike(ClassSelector[t.Any]):
Member

Not sure if we should call this DataFrameLike. The name makes sense for something like ArrayLike, as most packages conform to the NumPy syntax. Narwhals is not a global standard, and the name may confuse users who are not well-versed in the DataFrame ecosystem.

Author

Yeah, okay. Andrew's GSoC goal-planning doc names the parameter DataFrameLike, which is why I used it, but I can rename it. Will change this to ArrayLike.

Member

My point about ArrayLike was related to your other PR. I don't want this to be called ArrayLike.

Author

Oh okay, my bad.

Comment thread param/parameters.py Outdated
narwhals = _get_narwhals()
try:
    return narwhals.from_native(
        val, eager_only=not self.allow_lazy, pass_through=False
Member

Maybe we should just call it eager_only? Though not sure about this.

Author

I picked allow_lazy=False because the default is "reject lazy", and a positive-named flag reads better at the call site (allow_lazy=True is easier to parse than eager_only=False). But eager_only does mirror the narwhals from_native(eager_only=...) kwarg we delegate to, which is a real consistency win. Will flip it to eager_only=True (default) unless you'd rather keep the positive-named form.
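
For context, the behaviour of the underlying `from_native(eager_only=...)` flag can be probed with a small script. This is a sketch rather than the PR's validation code: `is_accepted` is a hypothetical helper, and it assumes that Narwhals signals a rejected frame with a `TypeError`. The script does nothing if `narwhals` (2.0 or later, for `narwhals.stable.v2`) or `polars` is not installed.

```python
try:
    import narwhals.stable.v2 as nw
    import polars as pl
except ImportError:  # optional dependencies; the sketch is skipped without them
    nw = pl = None


def is_accepted(value, eager_only=True):
    """Return True if Narwhals accepts ``value`` under the given flag."""
    try:
        nw.from_native(value, eager_only=eager_only, pass_through=False)
    except TypeError:  # assumption: rejection surfaces as TypeError
        return False
    return True


if nw is not None:
    eager = pl.DataFrame({"a": [1, 2]})
    lazy = pl.LazyFrame({"a": [1, 2]})
    results = (
        is_accepted(eager),                   # eager frame, default flag
        is_accepted(lazy),                    # lazy frame, default flag
        is_accepted(lazy, eager_only=False),  # lazy frame after opting out
    )
else:
    results = None
```

Under either naming the default is the same: with `eager_only=True` (equivalently `allow_lazy=False`) a lazy frame is expected to be rejected, and it is accepted only after opting out.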

Comment thread param/parameters.py Outdated
Comment on lines +3679 to +3681
# ``collect_schema().names()`` is the Narwhals-recommended way to read
# column names uniformly across eager and lazy frames; ``.columns`` on
# a lazy frame triggers a backend schema-resolution warning.
Member

Don't see the need for this comment.

Author

Okay!

Comment thread param/parameters.py Outdated
Comment on lines +3527 to +3539
Unlike :class:`DataFrame`, which is restricted to ``pandas.DataFrame``,
``DataFrameLike`` accepts any object supported by
`Narwhals <https://narwhals-dev.github.io>`_. pandas, Polars and PyArrow
are exercised in this project's test suite; any other Narwhals-supported
backend uses the identical code path. The value is passed through
unchanged, so reading the parameter returns the original native object
(no Narwhals wrapper). Authors who want a backend-agnostic API can call
``narwhals.from_native`` on the value themselves.

Narwhals is an optional dependency, imported on instantiation; a clear
``ImportError`` with the install command is raised if it is missing. The
structure of the frame can be constrained by the rows and columns
arguments:
Member

I think this is heavily influenced by an LLM and should be rewritten to be more compact.

For example, the link is wrong.

Author

Will rewrite. The correct URL is https://narwhals-dev.github.io/narwhals/. Compacting the whole docstring to roughly the same length as the existing DataFrame one, dropping the prose.

Comment thread param/parameters.py Outdated
# Row count requires materialising a lazy frame; skip for lazy so
# the frame is never implicitly collected.
if self.rows is not None and not is_lazy:
    _length_bounds_check(self, self.rows, nwframe.shape[0], 'row')
Member

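You can use count() to get the row count for a LazyFrame.

import polars as pl
import narwhals.stable.v2 as nw

N = 100
lf = nw.from_native(pl.LazyFrame(dict(A=["a"] * N, B=["b"] * N)))
# Read the column names from the schema, without collecting the frame.
cols = list(lf.collect_schema().names())
# Count one column; only this one-row aggregate is collected.
# Note: count() ignores nulls, so it equals the row count only for a null-free column.
lf.select(nw.col(cols[0]).count()).collect().item()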

Author

Thanks! Will look into this and fix it.

Comment thread tests/testdataframelike.py Outdated
Comment on lines +13 to +14
if os.getenv('PARAM_TEST_NARWHALS', '0') == '1':
    raise ImportError("PARAM_TEST_NARWHALS=1 but narwhals not available.")
Member

I think you should keep this simpler. No need to have an undocumented environment variable.

Author

Agreed, dropping the env var.

Comment thread tests/testdataframelike.py Outdated
Comment on lines +185 to +186
@skip_no_pandas
@skip_no_polars
Member

I think we can rewrite this to be polars only.

Author

Will do. TestDataFrameLikeAllowLazy only exercises lazy semantics, and Polars LazyFrame is the only backend that lets us assert "no implicit collect" deterministically. Dropping the pandas skip from the class.

Comment thread param/parameters.py Outdated
rebuild from the records form.
"""

__slots__ = ['rows', 'columns', 'ordered', 'allow_lazy']
Member

Maybe we should have something like stable_version?

Author

@ghostiee-11 ghostiee-11 May 27, 2026

Yes, will switch the import to import narwhals.stable.v2 as nw and use that throughout.

Comment thread tests/testdataframelike.py Outdated
skip_no_pyarrow = pytest.mark.skipif(pa is None, reason="pyarrow not available")


class TestDataFrameLikeDefaults(unittest.TestCase):
Member

I would rewrite these to not inherit from unittest.TestCase and use plain assert instead of the unittest assertion methods.
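
The suggested style change could look like the following. The function under test, `within_bounds`, is a made-up stand-in and not part of the PR. The point is only the contrast between a `unittest.TestCase` method and a plain function using `assert`, which pytest collects without any base class.

```python
import unittest


def within_bounds(length, bounds):
    """Toy function under test (hypothetical)."""
    lower, upper = bounds
    return (lower is None or length >= lower) and (upper is None or length <= upper)


# Before: unittest style, assertions go through TestCase methods.
class TestWithinBounds(unittest.TestCase):
    def test_open_upper_bound(self):
        self.assertTrue(within_bounds(5, (1, None)))


# After: a plain function with bare asserts; pytest discovers it by name.
def test_open_upper_bound():
    assert within_bounds(5, (1, None))
    assert not within_bounds(0, (1, None))
```

With bare asserts, pytest's assertion rewriting reports the compared values on failure, so the dedicated `assert*` methods are not needed for readable output.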

Author

Yeah, true. Will change these to use plain assert only.

Comment thread param/parameters.py
return numpy.asarray(value)


def _length_bounds_check(parameter, bounds, length, name):
Member

You could also make a _DataFrameMixin class which has the shared functionality between them.
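
One way to read this suggestion is sketched below. Only the name `_DataFrameMixin` and the idea of sharing `_length_bounds_check` come from the thread. The method body, the error wording, and the two stand-in classes are assumptions made for illustration, not the PR's code.

```python
class _DataFrameMixin:
    """Sketch of a mixin holding logic shared by both parameter types."""

    def _length_bounds_check(self, bounds, length, name):
        # A bare number means "exactly this many"; a tuple is (lower, upper),
        # where None leaves that side open.
        if not isinstance(bounds, tuple):
            ok = bounds == length
        else:
            lower, upper = bounds
            ok = (lower is None or length >= lower) and (
                upper is None or length <= upper
            )
        if not ok:
            raise ValueError(
                f"{type(self).__name__}: {name} length {length} "
                f"does not match declared bounds of {bounds}"
            )


class PandasOnlyParam(_DataFrameMixin):
    """Stand-in for the pandas-specific parameter (hypothetical)."""


class AnyBackendParam(_DataFrameMixin):
    """Stand-in for the backend-agnostic parameter (hypothetical)."""


AnyBackendParam()._length_bounds_check((1, 10), 3, "row")  # within bounds

try:
    PandasOnlyParam()._length_bounds_check(2, 3, "column")
    failure = None
except ValueError as err:
    failure = str(err)
```

Compared with a module-level helper, the mixin keeps the shared checks discoverable on the classes themselves, at the cost of one more base class in each parameter's hierarchy.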

Author

Yeah, thanks for this feedback, will try this!

Author

Will take this to a follow-up PR.


codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 90.80460% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.80%. Comparing base (77b09cc) to head (252a6f8).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| param/parameters.py | 92.59% | 6 Missing ⚠️ |
| param/_utils.py | 66.66% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1144      +/-   ##
==========================================
+ Coverage   86.75%   86.80%   +0.04%     
==========================================
  Files           9        9              
  Lines        5302     5380      +78     
==========================================
+ Hits         4600     4670      +70     
- Misses        702      710       +8     
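
The percentages in the report can be reproduced from the hit and line counts. The sketch below assumes the displayed figures are truncated (not rounded) to two decimals, which is what the 86.75% and 66.66% values suggest. The per-file and patch line totals (81, 6 and 87) are inferred from the "missing" counts and percentages; the report does not state them.

```python
def percent(hits, lines):
    """Coverage percentage truncated to two decimals, using integer maths."""
    return (hits * 10_000 // lines) / 100


base = percent(4600, 5302)  # main
head = percent(4670, 5380)  # this PR

# Inferred totals (assumption): 81 + 6 = 87 patch lines, 6 + 2 = 8 missing.
parameters_py = percent(81 - 6, 81)
utils_py = percent(6 - 2, 6)
patch = percent(87 - 8, 87)
```

The +0.04% in the diff header comes from the unrounded difference (about 0.043 percentage points), which is why it does not equal 86.80 minus 86.75.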


Comment thread param/parameters.py Outdated
return pandas.DataFrame(value)


def _get_narwhals():
Member

Probably want to move this into _utils.py

Author

Sure!! Moving it

Comment thread param/parameters.py Outdated
# ``collect_schema().names()`` is the Narwhals-recommended way to read
# column names uniformly across eager and lazy frames; ``.columns`` on
# a lazy frame triggers a backend schema-resolution warning.
cols = list(nwframe.collect_schema().names())
Member

Let's avoid collecting the cols unless needed.

Author

Okay, sure.
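
The idea in this thread, resolving the schema only when a constraint actually needs it, can be sketched with a stub that counts schema lookups. Everything here is hypothetical: `CountingFrame` merely imitates the `collect_schema().names()` call shape, and `validate` is a simplified stand-in for the parameter's validation.

```python
class _Schema(dict):
    """Minimal schema stand-in exposing a names() method."""

    def names(self):
        return list(self)


class CountingFrame:
    """Stand-in for a lazy frame that records schema lookups."""

    def __init__(self, names):
        self._names = list(names)
        self.schema_calls = 0

    def collect_schema(self):
        self.schema_calls += 1
        return _Schema.fromkeys(self._names)


def validate(frame, columns=None):
    """Touch the schema only when a column constraint was declared."""
    if columns is None:
        return  # nothing to check, so no schema resolution
    names = frame.collect_schema().names()
    missing = set(columns) - set(names)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")


unconstrained = CountingFrame(["a", "b"])
validate(unconstrained)

constrained = CountingFrame(["a", "b"])
validate(constrained, columns={"a"})
```

The unconstrained frame is never asked for its schema, while the constrained one is asked exactly once.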

@ghostiee-11
Author

> Left some comments.
>
> I'm not entirely sure if we should keep the underlying DataFrame or convert it to a narwhals DataFrame/LazyFrame.
>
> A user would need to do the conversion in every Parameterized class method that uses it, as the APIs are very different across the DataFrame libraries.
>
> Also how much AI have you used for this PR?

Thanks for the review.

On native vs. narwhals return: pass-through is intentional. Returning a narwhals.DataFrame would break any consumer that calls .iloc / .groupby / native methods on self.df, which defeats the point of a drop-in parameter. Authors who want the unified API can do nw.from_native(self.df) themselves (one line, the pattern Narwhals recommends). Validation already goes through narwhals internally, so the backend-agnostic surface is used where it matters. Happy to add an opt-in as_narwhals=True slot later if users ask for it.
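
As an illustration of the pass-through pattern described above, a consumer that wants the unified API can wrap the native value at the point of use. This is a sketch with a hypothetical function name, `column_mean`, and a plain Polars frame standing in for the parameter value. It is skipped if `narwhals` (2.0 or later) or `polars` is not installed.

```python
try:
    import narwhals.stable.v2 as nw
    import polars as pl
except ImportError:  # optional dependencies; nothing runs without them
    nw = pl = None


def column_mean(native_frame, column):
    """Compute a column mean on any eager frame Narwhals understands."""
    frame = nw.from_native(native_frame, eager_only=True)
    return frame.select(nw.col(column).mean()).item()


if nw is not None:
    native = pl.DataFrame({"x": [1.0, 2.0, 3.0]})  # stands in for self.df
    mean_x = column_mean(native, "x")
else:
    mean_x = None
```

The stored value stays a native frame, so code that relies on backend-specific methods keeps working, and only the functions that opt in pay for the wrapper.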

On AI usage: I used it as a drafting and refactoring assistant. The design is mine and matches what I argued for in #975. Will trim the LLM-flavoured prose in the docstring this round.

* Move _get_narwhals to _utils and use narwhals.stable.v2
* Rename allow_lazy to eager_only (default True)
* Validate row count on LazyFrame via narwhals .count() instead of skipping
* Skip collect_schema() unless columns/ordered or a lazy row check needs it
* Compact docstring, fix narwhals URL, drop noisy inline comment
* Rewrite tests as plain pytest functions, drop PARAM_TEST_NARWHALS env var,
  scope lazy tests to polars
…eckers

* tests/testdefaults.py: append DataFrameLike to skip list when narwhals
  is unavailable, matching the existing pandas/numpy pattern
* param/parameters.py: restructure DataFrameLike._validate so cols and
  schema have non-Optional types in the branches that use them, fixing
  pyrefly/pyright/ty errors flagged on CI
@ghostiee-11 ghostiee-11 requested a review from philippjfr June 2, 2026 06:21