feat(python/sedonadb): add DataFrame.drop#871
Draft
jiayuasu wants to merge 1 commit into
Pandas-style column drop on the lazy DataFrame, matching the
varargs/non-pandas-keyword pattern locked in the sort PR.
API:
df.drop("a")
df.drop("a", "b", "c")
- Varargs of column names. No `columns=` kwarg; Python's standard
unexpected-keyword TypeError covers misuse.
- Strings only. Expr arguments are rejected at the Python boundary —
drop is a schema op, not an expression op, and `df.drop(col("x") +
col("y"))` has no meaning.
- Empty args raise ValueError; non-str args raise TypeError.
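The argument rules above can be sketched in plain Python. This is a minimal illustration, not the code in this PR: `validate_drop_args` is a hypothetical helper name and the error message wording is assumed.

```python
def validate_drop_args(*cols):
    """Hypothetical helper: check the arguments of a drop(*cols) call."""
    if not cols:
        # Dropping nothing is almost certainly a mistake, so fail loudly.
        raise ValueError("drop() requires at least one column name")
    for c in cols:
        if not isinstance(c, str):
            # Covers expression objects and any other non-string value.
            raise TypeError(
                f"drop() expects column names as str, got {type(c).__name__}"
            )
    return list(cols)
```

Because the signature only accepts `*cols`, a call such as `validate_drop_args(columns=["a"])` fails with Python's own unexpected-keyword `TypeError` without any extra code.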
Unknown-column behavior: DataFusion's `drop_columns` is permissive
and silently no-ops on names that aren't in the schema, which hides
typos. We validate Python-side and raise a `KeyError` listing the
available columns instead — matching pandas. The exact KeyError
message is locked by `test_drop_unknown_column_raises_keyerror`.
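A rough sketch of the Python-side check described above, assuming the schema is available as a list of column names. `check_known_columns` and `drop_names` are hypothetical names, and the message text is illustrative only; the PR pins its own exact format in the test named above.

```python
def check_known_columns(cols, schema_names):
    """Hypothetical helper: raise KeyError if any requested column is absent."""
    available = list(schema_names)
    missing = [c for c in cols if c not in available]
    if missing:
        # Illustrative wording; not the message pinned by the PR's test.
        raise KeyError(
            f"column(s) not found: {missing}; available columns: {available}"
        )


def drop_names(cols, schema_names):
    """Return the schema names that remain, preserving their original order."""
    check_known_columns(cols, schema_names)
    dropped = set(cols)
    return [name for name in schema_names if name not in dropped]
```

With this sketch, `drop_names(["b"], ["a", "b", "c"])` returns `["a", "c"]`, while `drop_names(["typo"], ["a", "b", "c"])` raises `KeyError` instead of silently returning the schema unchanged.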
Rust side: `InternalDataFrame::drop_columns` is a thin wrapper that
materializes a `Vec<&str>` and calls DataFusion's `DataFrame::drop_columns`.
Step-by-step comments explain why we accept owned strings from
Python and borrow at the call boundary.
Tests cover single-column, multi-column, column-order preservation,
lazy return, both error paths (empty / non-str), Expr-arg rejection,
the kwarg rejection, and the typo-protecting KeyError.
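For illustration, the shape of such a test suite might look like the following. It runs against `StubFrame`, a hypothetical stand-in that only tracks column names, not against the real sedonadb `DataFrame`, and it uses the standard-library `unittest` rather than whatever runner the PR uses.

```python
import unittest


class StubFrame:
    """Hypothetical stand-in for the lazy DataFrame; only tracks column names."""

    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, *cols):
        if not cols:
            raise ValueError("drop() requires at least one column name")
        for c in cols:
            if not isinstance(c, str):
                raise TypeError(f"expected str, got {type(c).__name__}")
        missing = [c for c in cols if c not in self.columns]
        if missing:
            raise KeyError(f"{missing} not in {self.columns}")
        # Return a new frame; the original is left untouched.
        return StubFrame(c for c in self.columns if c not in cols)


class DropTests(unittest.TestCase):
    def setUp(self):
        self.df = StubFrame(["a", "b", "c"])

    def test_single_and_multi_column(self):
        self.assertEqual(self.df.drop("a").columns, ["b", "c"])
        self.assertEqual(self.df.drop("a", "c").columns, ["b"])

    def test_order_preserved_and_new_frame_returned(self):
        out = self.df.drop("b")
        self.assertIsInstance(out, StubFrame)
        self.assertEqual(out.columns, ["a", "c"])
        self.assertEqual(self.df.columns, ["a", "b", "c"])

    def test_error_paths(self):
        with self.assertRaises(ValueError):
            self.df.drop()  # empty args
        with self.assertRaises(TypeError):
            self.df.drop(1)  # non-str arg
        with self.assertRaises(TypeError):
            self.df.drop(columns=["a"])  # unexpected keyword
        with self.assertRaises(KeyError):
            self.df.drop("typo")  # unknown column
```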
Continues Phase P2 of #791 with `DataFrame.drop`, the smallest of the remaining P2 ops.

API

- Varargs of column names (no `columns=` kwarg). Same pattern as `sort()` from "feat(python/sedonadb): add DataFrame.sort with composable SortExpr" (#859): it matches DataFusion-python / Ibis / Polars and avoids the pandas-style keyword.
- `Expr` arguments are rejected at the Python boundary; "drop a computed expression" has no meaning at the schema level.

Unknown-column behavior

Worth calling out: DataFusion's `DataFrame::drop_columns` is permissive and silently no-ops on names that aren't in the schema. That hides typos. To match user expectations (and pandas' `KeyError` behavior), this PR validates the column names Python-side and raises a `KeyError` listing the available columns when one is missing. The exact message format is pinned by `test_drop_unknown_column_raises_keyerror`.

This is more restrictive than `select` / `filter` (where DataFusion validates at plan-build time and includes "Valid fields are X, Y" in the error). The asymmetry is forced by DataFusion's silent-no-op default on `drop_columns`; the workaround is one Python-side schema lookup per call, which is cheap.

Implementation

- `python/sedonadb/src/dataframe.rs`: `InternalDataFrame::drop_columns(Vec<String>)`. Materializes a `Vec<&str>` and calls DataFusion's `DataFrame::drop_columns`. Step-by-step comments.
- `python/sedonadb/python/sedonadb/dataframe.py`: `DataFrame.drop(*cols: str)`. Validates non-empty, str-only, and known columns.

Test plan

9 tests in `tests/expr/test_dataframe_drop.py`:

- Single-column, multi-column, column-order preservation, and lazy return (`isinstance(out, DataFrame)`).
- Error paths: empty args → `ValueError`; non-str arg → `TypeError`; `Expr` arg → `TypeError`; `columns=` kwarg → Python's unexpected-keyword `TypeError`; unknown column → `KeyError` with exact pinned message listing available columns.

Local: 9 unit + 19 doctests + `ruff format` + `ruff check` all clean.