polars-janitor

Small janitorial helpers for Polars dataframes.

pip install polars-janitor

This project is inspired by R's janitor, but it is not a parity port. The aim is smaller: keep a few boring dataframe cleanup chores easy, predictable, and Polars-shaped.

Why this exists

Polars already has a strong API. Most cleanup work should stay plain Polars.

The rough spots this package tries to smooth out are the ones that show up around messy inputs: awkward column names from CSVs and spreadsheets, header rows hiding inside spreadsheet data, empty rows, all-null columns, constant columns, duplicate records by key, and quick schema checks before you combine frames. Those are janitorial jobs. They are not glamorous, but they happen often enough to deserve a small, sharp tool.

polars-janitor owes a lot to R's janitor and pyjanitor. Those projects made the case that small cleanup helpers are worth having. This package borrows that spirit, but keeps the API narrow and Polars-shaped.

The package does not register a dataframe namespace. Import it next to Polars:

import polars as pl
import polars_janitor as pj

Python usage

Clean column names

Use clean_names when you already have a DataFrame or LazyFrame.

df = pl.DataFrame(
    {
        "Customer ID": [1, 2],
        "% Complete": [0.5, 1.0],
        "OrderID": ["A001", "A002"],
    }
)

cleaned = pj.clean_names(df)

print(cleaned.columns)
# ["customer_id", "percent_complete", "order_id"]

clean_names also works on LazyFrame. It uses the static schema, so it does not need to collect the data.

lazy = pj.clean_names(df.lazy())
result = lazy.collect()

Use make_clean_names when you only need names back.

names = pj.make_clean_names(
    ["Customer ID", "Customer ID", "% Complete", "Mötley Crüe", "", None, "1st Sale"]
)

print(names)

[
    "customer_id",
    "customer_id_2",
    "percent_complete",
    "motley_crue",
    "x",
    "x_2",
    "x_1_st_sale",
]

Supported case styles are snake, camel, pascal, and constant.

pj.make_clean_names(["Customer ID", "% Complete"], case="camel")
# ["customerId", "percentComplete"]

pj.make_clean_names(["Customer ID", "% Complete"], case="constant")
# ["CUSTOMER_ID", "PERCENT_COMPLETE"]
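
The four styles differ only in how already-separated words are joined. A minimal plain-Python sketch of that last step, assuming the words have been split and lowercased beforehand; join_words is a hypothetical helper for illustration, not part of polars-janitor:

```python
def join_words(words, case="snake"):
    """Join word tokens in one of four case styles (illustrative only)."""
    words = [w.lower() for w in words if w]
    if not words:
        return ""
    if case == "snake":
        return "_".join(words)
    if case == "constant":
        return "_".join(words).upper()
    if case == "camel":
        # First word stays lowercase, later words are capitalized.
        return words[0] + "".join(w.capitalize() for w in words[1:])
    if case == "pascal":
        return "".join(w.capitalize() for w in words)
    raise ValueError(f"unsupported case style: {case!r}")


tokens = ["customer", "id"]
for style in ("snake", "camel", "pascal", "constant"):
    print(style, join_words(tokens, style))
# snake customer_id
# camel customerId
# pascal CustomerId
# constant CUSTOMER_ID
```

The pascal output here is what the sketch produces; the package's handling of acronyms such as ID may differ.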

Name cleaning is deterministic. It handles duplicate names, empty names, whitespace, symbols, mixed casing, common diacritics, and Python None. Other Python objects are converted with str(...).
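
To make those rules concrete, here is a rough plain-Python sketch of a simplified cleaner. It is an assumption-laden illustration, not the package's Rust logic: sketch_clean_names is a hypothetical name, only the % symbol is spelled out, and leading digits are not handled, so it will not match make_clean_names on every input.

```python
import re
import unicodedata


def sketch_clean_names(names):
    """Simplified, illustrative snake_case cleaner with duplicate suffixing."""
    seen = {}
    cleaned = []
    for raw in names:
        # None becomes an empty name; other objects go through str().
        text = "" if raw is None else str(raw)
        # Spell out a symbol that carries meaning before stripping the rest.
        text = text.replace("%", " percent ")
        # Decompose accented characters, then drop the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        # Split lower-to-upper boundaries such as "OrderID" -> "Order ID".
        text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
        # Collapse each run of non-alphanumerics into a single underscore.
        text = re.sub(r"[^0-9A-Za-z]+", "_", text).strip("_").lower()
        # Empty results get a placeholder name.
        text = text or "x"
        # Repeats get _2, _3, ... (collisions with existing names are ignored here).
        seen[text] = seen.get(text, 0) + 1
        cleaned.append(text if seen[text] == 1 else f"{text}_{seen[text]}")
    return cleaned


print(sketch_clean_names(["Customer ID", "Customer ID", "% Complete", "Mötley Crüe", "", None, "OrderID"]))
# ['customer_id', 'customer_id_2', 'percent_complete', 'motley_crue', 'x', 'x_2', 'order_id']
```

Because every step is a pure function of the input list and its order, the same names always produce the same output, which is what deterministic means here.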

Promote spreadsheet rows to names

Spreadsheet exports often put notes above the real header row. Use find_header to locate the first row where every cell is present and non-blank, then row_to_names to promote that row to cleaned column names.

raw = pl.DataFrame(
    {
        "column_1": [None, "Customer ID", "101", "101", "102"],
        "column_2": ["notes", "Order Date", "2026-01-01", "2026-01-01", "2026-01-02"],
        "column_3": ["", "% Complete", "0.5", "0.75", "1.0"],
    }
)

header = pj.find_header(raw)
cleaned = pj.row_to_names(raw, header)

print(header)
# 1

print(cleaned.columns)
# ["customer_id", "order_date", "percent_complete"]

row_to_names uses 0-based row numbers, like Python indexing. If you omit the row, it calls find_header for you.

cleaned = pj.row_to_names(raw)

You can also search for a known marker in one column.

pj.find_header(raw, value="Customer ID", column="column_1")
# 1

find_header and row_to_names are eager-only because they need to inspect values.
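
The default search rule, first row where every cell is present and non-blank, can be sketched in plain Python over a list of rows. This is an illustration under assumptions: sketch_find_header is a hypothetical name, whitespace-only cells are treated as blank, and returning None when nothing matches is a guess (the real helper may behave differently).

```python
def sketch_find_header(rows):
    """Return the 0-based index of the first row with no missing or blank cell."""
    for index, row in enumerate(rows):
        if all(cell is not None and str(cell).strip() != "" for cell in row):
            return index
    return None  # assumption: the real helper may raise instead


rows = [
    (None, "notes", ""),                           # note row above the header
    ("Customer ID", "Order Date", "% Complete"),   # the real header
    ("101", "2026-01-01", "0.5"),
]

print(sketch_find_header(rows))
# 1
```

The loop has to read cell values, which is why this kind of helper cannot work from a lazy schema alone.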

Remove empty rows and columns

Use remove_empty to drop rows where every selected column is null, columns where every value is null, or both.

df = pl.DataFrame(
    {
        "a": [None, None, 1],
        "b": [None, None, None],
        "c": ["x", None, "z"],
    }
)

pj.remove_empty(df, axis="rows")
pj.remove_empty(df, axis="cols")
pj.remove_empty(df, axis="both")

You can limit the check to a subset of columns.

pj.remove_empty(df, axis="rows", subset=["a", "c"])

Lazy support is intentionally smaller here. LazyFrame supports axis="rows" because the schema does not change. axis="cols" and axis="both" are eager-only because Polars would need to inspect the data before knowing which columns still exist.
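
A small plain-Python model of the two checks shows why they differ. The helper names are hypothetical and the data is a dict of column lists rather than a Polars frame; this is a sketch of the rule, not the package's implementation.

```python
def empty_columns(columns):
    """Names of columns whose values are all None."""
    return [name for name, values in columns.items() if all(v is None for v in values)]


def empty_rows(columns, subset=None):
    """0-based indices of rows that are None in every checked column."""
    checked = list(columns) if subset is None else list(subset)
    height = len(next(iter(columns.values()))) if columns else 0
    return [i for i in range(height) if all(columns[name][i] is None for name in checked)]


data = {"a": [None, None, 1], "b": [None, None, None], "c": ["x", None, "z"]}

print(empty_columns(data))                   # ['b']
print(empty_rows(data))                      # [1]
print(empty_rows(data, subset=["a", "b"]))   # [0, 1]
```

Dropping rows never changes the set of column names, so the output schema is known up front. Dropping columns produces a name list that depends on the values, which is the part a lazy plan cannot know without running.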

Remove constant columns

Use remove_constant to drop columns with one distinct value.

df = pl.DataFrame(
    {
        "constant": [1, 1, 1],
        "with_null": [1, None, 1],
        "varied": [1, 2, 1],
        "nulls": [None, None, None],
    }
)

pj.remove_constant(df)

By default, nulls count as a value. In the example above, with_null stays because it contains both 1 and null.

If you want nulls ignored during the constant check, pass ignore_nulls=True.

pj.remove_constant(df, ignore_nulls=True)

remove_constant is eager-only. Dropping constant columns from a LazyFrame would make the output schema depend on the data.
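
The difference between the two null modes can be sketched in plain Python. is_constant is a hypothetical helper, not the package's code. How an all-null column is treated under ignore_nulls=True is not stated above, so the sketch leaves that column out.

```python
def is_constant(values, ignore_nulls=False):
    """True when the column holds exactly one distinct value."""
    if ignore_nulls:
        values = [v for v in values if v is not None]
    # None is hashable, so by default it counts as a distinct value.
    return len(set(values)) == 1


columns = {
    "constant": [1, 1, 1],
    "with_null": [1, None, 1],
    "varied": [1, 2, 1],
}

default = {name: is_constant(vals) for name, vals in columns.items()}
ignoring = {name: is_constant(vals, ignore_nulls=True) for name, vals in columns.items()}

print(default)   # {'constant': True, 'with_null': False, 'varied': False}
print(ignoring)  # {'constant': True, 'with_null': True, 'varied': False}
```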

Get duplicate records

Use get_dupes to return every row whose key appears more than once.

df = pl.DataFrame(
    {
        "id": [1, 1, 2, 3, 3, 3],
        "value": ["a", "b", "c", "d", "e", "f"],
    }
)

dupes = pj.get_dupes(df, keys="id")
print(dupes)

The result includes a duplicate_count column by default.

shape: (5, 3)
┌─────┬───────┬─────────────────┐
│ id  ┆ value ┆ duplicate_count │
│ --- ┆ ---   ┆ ---             │
│ i64 ┆ str   ┆ u32             │
╞═════╪═══════╪═════════════════╡
│ 1   ┆ a     ┆ 2               │
│ 1   ┆ b     ┆ 2               │
│ 3   ┆ d     ┆ 3               │
│ 3   ┆ e     ┆ 3               │
│ 3   ┆ f     ┆ 3               │
└─────┴───────┴─────────────────┘

You can pass more than one key.

orders = pl.DataFrame(
    {
        "customer_id": [101, 101, 101, 102],
        "date": ["2026-01-01", "2026-01-01", "2026-01-02", "2026-01-01"],
        "amount": [10.0, 12.0, 9.0, 7.0],
    }
)

pj.get_dupes(orders, keys=["customer_id", "date"])

You can also omit the count column.

pj.get_dupes(df, keys="id", include_count=False)

get_dupes works with eager and lazy frames.
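
The rule behind get_dupes is a count per key followed by a filter. A plain-Python sketch over a list of row dicts, assuming input order is preserved; sketch_get_dupes is a hypothetical name and not the package's implementation:

```python
from collections import Counter


def sketch_get_dupes(rows, keys):
    """Return rows whose key tuple occurs more than once, with the count attached."""

    def key_of(row):
        return tuple(row[k] for k in keys)

    counts = Counter(key_of(row) for row in rows)
    return [
        {**row, "duplicate_count": counts[key_of(row)]}
        for row in rows
        if counts[key_of(row)] > 1
    ]


ids = [1, 1, 2, 3, 3, 3]
values = ["a", "b", "c", "d", "e", "f"]
rows = [{"id": i, "value": v} for i, v in zip(ids, values)]

for row in sketch_get_dupes(rows, keys=["id"]):
    print(row)
# {'id': 1, 'value': 'a', 'duplicate_count': 2}
# {'id': 1, 'value': 'b', 'duplicate_count': 2}
# {'id': 3, 'value': 'd', 'duplicate_count': 3}
# {'id': 3, 'value': 'e', 'duplicate_count': 3}
# {'id': 3, 'value': 'f', 'duplicate_count': 3}
```

The output columns are the input columns plus one count column regardless of the data, which fits with the helper being usable on lazy frames.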

Compare frame schemas

Use compare_df_cols when you want a small schema report before joining, concatenating, or handing frames to another pipeline.

left = pl.DataFrame({"id": [1], "amount": [10.0], "status": ["new"]})
right = pl.DataFrame({"id": [2], "amount": ["10.0"], "created_at": ["2026-01-01"]})

comparison = pj.compare_df_cols({"left": left, "right": right.lazy()})
print(comparison)

shape: (4, 3)
┌─────────────┬─────────┬────────┐
│ column_name ┆ left    ┆ right  │
│ ---         ┆ ---     ┆ ---    │
│ str         ┆ str     ┆ str    │
╞═════════════╪═════════╪════════╡
│ id          ┆ Int64   ┆ Int64  │
│ amount      ┆ Float64 ┆ String │
│ status      ┆ String  ┆ null   │
│ created_at  ┆ null    ┆ String │
└─────────────┴─────────┴────────┘

Filter to only matches or mismatches with return_.

pj.compare_df_cols({"left": left, "right": right}, return_="mismatch")

Use compare_df_cols_same when you only need a boolean.

pj.compare_df_cols_same({"left": left, "right": right})
# False

Schema comparison supports eager and lazy frames. It uses lazy schemas and does not collect lazy data.
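
Since only schemas are involved, the report can be modelled with plain dicts of column name to dtype name. The sketch below uses hypothetical helper names and assumes that a column missing from one frame counts as a mismatch; the package's exact rule for return_="mismatch" is not spelled out above.

```python
def sketch_compare_schemas(schemas):
    """schemas: mapping of frame label -> {column name: dtype name}."""
    # Keep first-seen column order across all frames.
    columns = list(dict.fromkeys(name for schema in schemas.values() for name in schema))
    return [
        {"column_name": name, **{label: schema.get(name) for label, schema in schemas.items()}}
        for name in columns
    ]


def sketch_mismatches(schemas):
    """Rows where the frames disagree on dtype or presence (assumed rule)."""
    return [
        row for row in sketch_compare_schemas(schemas)
        if len({row[label] for label in schemas}) > 1
    ]


schemas = {
    "left": {"id": "Int64", "amount": "Float64", "status": "String"},
    "right": {"id": "Int64", "amount": "String", "created_at": "String"},
}

for row in sketch_compare_schemas(schemas):
    print(row)
# {'column_name': 'id', 'left': 'Int64', 'right': 'Int64'}
# {'column_name': 'amount', 'left': 'Float64', 'right': 'String'}
# {'column_name': 'status', 'left': 'String', 'right': None}
# {'column_name': 'created_at', 'left': None, 'right': 'String'}

print(len(sketch_mismatches(schemas)) == 0)
# False
```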

What this is not

This is not a dataframe namespace package. There is no df.janitor.clean_names() registration on import.

This package also leaves out helpers that Polars already handles clearly:

rounding
string concatenation
value counts
pivot and crosstab wrappers
paste-style helpers

It also leaves out the more R-specific janitor surface:

tabyl
adorn_*
statistical tests
date parsing helpers

Those may be useful in R, but in Polars they either duplicate existing APIs or push the package toward a grab bag. The package should stay small enough that every public function earns its place.

Known limits

LazyFrame support is deliberately conservative. clean_names, remove_empty(..., axis="rows"), get_dupes, compare_df_cols, and compare_df_cols_same can work from lazy schemas or build lazy plans without collecting data. Helpers that need to inspect values are eager-only: find_header, row_to_names, remove_constant, and remove_empty with axis="cols" or axis="both".

The package supports CPython 3.10 through 3.14 and Python Polars 1.29.0 and newer. Compatibility tests run against that lower bound and the current lockfile version.

The project favors broad Python Polars compatibility over direct Rust deserialization of Python lazy plans. Most eager frame helpers cross through pyo3-polars; lazy frames keep their plans in Python Polars, with Rust deciding what public Polars plan to build. clean_names is a little different: Rust cleans the names, then Polars' public rename API applies them.

The compiled extension is CPython-version-specific. If import polars_janitor fails after changing Python versions, rebuild with maturin develop --release or reinstall from the wheel for that interpreter.

Benchmarks

These are local medians from a single Windows x64 machine running CPython 3.13.5, Polars 1.40.1, pyjanitor 0.32.23 with pandas 3.0.3, and R 4.6.0 with janitor 2.2.1. Setup is outside the timed loop. Treat them as directional, not as a universal performance claim or a dunk contest.

The R comparison uses base R data.frames because janitor is a data.frame/tibble package. pyjanitor has Polars methods for clean_names and row_to_names, so those are shown separately. Its compare_df_cols helper is pandas-only in the tested version.

Task                        Size            polars-janitor  pyjanitor/Polars  pyjanitor/pandas  R janitor
clean_names                 10,000 columns  14.25 ms        159.68 ms         38.27 ms          5030.00 ms
compare_df_cols             5,000 columns   15.53 ms        n/a               277.58 ms         70.00 ms
row_to_names + clean_names  2,000 columns   8.39 ms         32.45 ms          44.29 ms          950.00 ms

Run the same benchmark from a checkout:

uv run --extra dev --with pandas --with pyjanitor python benchmarks/benchmark_competitors.py

If R is installed and the janitor package is available to that R installation, the script includes the R column. Otherwise it prints the Python comparisons.
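
For readers who want to time their own inputs the same way, here is a minimal median-timing sketch with setup kept outside the timed loop. It is not the project's benchmark script: median_ms, the repeat count, and the stand-in workload are all assumptions for illustration.

```python
import statistics
import time


def median_ms(func, *, repeats=7):
    """Median wall-clock time of func() in milliseconds over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Setup happens once, outside the timed loop.
names = [f"Column {i} (%)" for i in range(10_000)]


def workload():
    # Stand-in task; swap in the call you actually want to measure.
    return [n.strip().lower().replace(" ", "_") for n in names]


elapsed = median_ms(workload)
print(f"{elapsed:.2f} ms")
```

A median is less sensitive than a mean to one slow run caused by background activity, which matters for short tasks in the millisecond range.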

Rust implementation

The public package is Python. The cleanup logic lives in Rust, with a thin Python layer where using Polars' own public API is faster or more compatible.

The Rust code is split into three modules:

names: name normalization, case conversion, Unicode cleanup, and duplicate suffixing
frame: eager Polars dataframe operations
python: PyO3 bindings, argument parsing, LazyFrame plan construction, and error mapping

This is not an expression plugin. These functions operate on schemas or whole frames, not on a single expression inside a query.

Generated build files are not source. Local development may create _rust.*.pyd, _rust.*.so, .pdb, __pycache__, .venv, dist/, and target/; the project ignores those.

Development

Build the extension into the local virtual environment:

uv run --extra dev maturin develop --release

Run the checks:

cargo fmt --check
cargo clippy --all-targets -- -D warnings
cargo test
ruff check .
uv run --extra dev pytest

Run the name-cleaning benchmark smoke test:

uv run --extra dev python benchmarks/benchmark_names.py

Run the competitor benchmark:

uv run --extra dev --with pandas --with pyjanitor python benchmarks/benchmark_competitors.py
