feat(python/sedonadb): add DataFrame.drop#871

Draft
jiayuasu wants to merge 1 commit into apache:main from jiayuasu:feature/df-drop

@jiayuasu
Member

Continues Phase P2 of #791 with DataFrame.drop — the smallest of the remaining P2 ops.

API

df.drop("a")
df.drop("a", "b", "c")

Unknown-column behavior

Worth calling out — DataFusion's DataFrame::drop_columns is permissive: it silently no-ops on names that aren't in the schema. That hides typos. To match user expectations (and pandas' KeyError behavior), this PR validates the column names Python-side and raises a KeyError listing the available columns when one is missing. The exact message format is pinned by test_drop_unknown_column_raises_keyerror.

This is more restrictive than select/filter (where DataFusion validates at plan-build time and includes "Valid fields are X, Y" in the error). The asymmetry is forced by DataFusion's silent-no-op default on drop_columns; the workaround is one Python-side schema lookup per call, which is cheap.
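
A minimal sketch of the Python-side check described above, under stated assumptions. The helper name `validate_drop_columns`, its signature, and the error wording are hypothetical; the PR's real message is pinned by `test_drop_unknown_column_raises_keyerror` and is not reproduced here. `permissive_drop` is a toy model of a silent no-op drop over a list of names, not DataFusion's API.

```python
from typing import List, Sequence


def validate_drop_columns(cols: Sequence[object], available: Sequence[str]) -> List[str]:
    """Check drop() arguments against a schema before any engine call.

    Hypothetical helper: name, signature, and messages are assumptions.
    """
    if not cols:
        raise ValueError("drop() requires at least one column name")
    for col in cols:
        if not isinstance(col, str):
            raise TypeError(
                f"drop() expects column names as str, got {type(col).__name__}"
            )
    missing = [c for c in cols if c not in available]
    if missing:
        # Illustrative wording only; the real message format is pinned by the test.
        raise KeyError(
            f"Column(s) {missing} not found. Available columns: {list(available)}"
        )
    return list(cols)


def permissive_drop(schema: Sequence[str], cols: Sequence[str]) -> List[str]:
    """Toy model of a silent no-op drop: unknown names are simply ignored."""
    return [name for name in schema if name not in cols]


schema = ["id", "name", "geometry"]

# Without validation, a typo passes through and nothing is dropped.
typo_result = permissive_drop(schema, ["geomtry"])

# With validation first, known names are dropped and column order is preserved.
ok_result = permissive_drop(schema, validate_drop_columns(["name"], schema))
```

The check needs only the current column names, which is why it amounts to one schema lookup per call.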

Implementation

| File | Change |
| --- | --- |
| python/sedonadb/src/dataframe.rs | New InternalDataFrame::drop_columns(Vec<String>). Materializes a Vec<&str> and calls DataFusion's DataFrame::drop_columns. Step-by-step comments. |
| python/sedonadb/python/sedonadb/dataframe.py | DataFrame.drop(*cols: str). Validates non-empty, str-only, and known columns. |

Test plan

9 tests in tests/expr/test_dataframe_drop.py:

  • Positive: single-column, multi-column, column-order preservation.
  • Lazy return: isinstance(out, DataFrame).
  • Errors: empty args → ValueError; non-str arg → TypeError; Expr arg → TypeError; columns= kwarg → Python's unexpected-keyword TypeError; unknown column → KeyError with exact pinned message listing available columns.

Local: 9 unit + 19 doctests + ruff format + ruff check all clean.
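
The `columns=` case in the test plan needs no dedicated check, because a varargs-only signature rejects unknown keywords on its own. A small sketch, using a stand-in function named `drop` that is not the PR's method:

```python
def drop(*cols: str) -> tuple:
    """Stand-in with the same varargs-only shape as the API described above."""
    return cols


kwarg_error = None
try:
    # No parameter named `columns` exists, so Python raises TypeError
    # before the function body runs.
    drop(columns=["a"])
except TypeError as exc:
    kwarg_error = str(exc)
```

Positional calls such as `drop("a", "b")` are unaffected.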

Pandas-style column drop on the lazy DataFrame, matching the
varargs/non-pandas-keyword pattern locked in the sort PR.

API:

    df.drop("a")
    df.drop("a", "b", "c")

- Varargs of column names. No `columns=` kwarg; Python's standard
  unexpected-keyword TypeError covers misuse.
- Strings only. Expr arguments are rejected at the Python boundary —
  drop is a schema op, not an expression op, and `df.drop(col("x") +
  col("y"))` has no meaning.
- Empty args raise ValueError; non-str args raise TypeError.

Unknown-column behavior: DataFusion's `drop_columns` is permissive
and silently no-ops on names that aren't in the schema, which hides
typos. We validate Python-side and raise a `KeyError` listing the
available columns instead — matching pandas. The exact KeyError
message is locked by `test_drop_unknown_column_raises_keyerror`.

Rust side: `InternalDataFrame::drop_columns` is a thin wrapper that
materializes a `Vec<&str>` and calls DataFusion's `DataFrame::drop_columns`.
Step-by-step comments explain why we accept owned strings from
Python and borrow at the call boundary.

Tests cover single-column, multi-column, column-order preservation,
lazy return, both error paths (empty / non-str), Expr-arg rejection,
the kwarg rejection, and the typo-protecting KeyError.

github-actions bot requested a review from prantogg May 23, 2026 07:42