Commit a5a8a04

Author: Vladimir Vilimaitis
Add spreadsheet cleanup and schema inspection
1 parent 85828a8 commit a5a8a04

11 files changed

Lines changed: 1367 additions & 19 deletions

README.md

Lines changed: 119 additions & 6 deletions
@@ -8,7 +8,7 @@ This project is inspired by R's janitor, but it is not a parity port. The aim is
 
 Polars already has a strong API. Most cleanup work should stay plain Polars.
 
-The rough spots this package tries to smooth out are the ones that show up around messy inputs: awkward column names from CSVs and spreadsheets, empty rows, all-null columns, constant columns, and duplicate records by key. Those are janitorial jobs. They are not glamorous, but they happen often enough to deserve a small, sharp tool.
+The rough spots this package tries to smooth out are the ones that show up around messy inputs: awkward column names from CSVs and spreadsheets, header rows hiding inside spreadsheet data, empty rows, all-null columns, constant columns, duplicate records by key, and quick schema checks before you combine frames. Those are janitorial jobs. They are not glamorous, but they happen often enough to deserve a small, sharp tool.
 
 The package does not register a dataframe namespace. Import it next to Polars:
 
@@ -95,6 +95,44 @@ pj.make_clean_names(["Customer ID", "% Complete"], case="constant")
 
 Name cleaning is deterministic. It handles duplicate names, empty names, whitespace, symbols, mixed casing, common diacritics, and Python `None`. Other Python objects are converted with `str(...)`.
 
+### Promote spreadsheet rows to names
+
+Spreadsheet exports often put notes above the real header row. Use `find_header` to locate the first row where every cell is present and non-blank, then `row_to_names` to promote that row to cleaned column names.
+
+```python
+raw = pl.DataFrame(
+    {
+        "column_1": [None, "Customer ID", "101", "101", "102"],
+        "column_2": ["notes", "Order Date", "2026-01-01", "2026-01-01", "2026-01-02"],
+        "column_3": ["", "% Complete", "0.5", "0.75", "1.0"],
+    }
+)
+
+header = pj.find_header(raw)
+cleaned = pj.row_to_names(raw, header)
+
+print(header)
+# 1
+
+print(cleaned.columns)
+# ["customer_id", "order_date", "percent_complete"]
+```
+
+`row_to_names` uses 0-based row numbers, like Python indexing. If you omit the row, it calls `find_header` for you.
+
+```python
+cleaned = pj.row_to_names(raw)
+```
+
+You can also search for a known marker in one column.
+
+```python
+pj.find_header(raw, value="Customer ID", column="column_1")
+# 1
+```
+
+`find_header` and `row_to_names` are eager-only because they need to inspect values.
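
The header rule described above (the first row where every cell is present and non-blank) can be illustrated in plain Python. This is only a sketch of the rule, not the package's Rust implementation: `sketch_find_header` is a hypothetical name, and treating whitespace-only cells as blank and returning `None` when no row qualifies are assumptions.

```python
from typing import Optional, Sequence


def sketch_find_header(rows: Sequence[Sequence[Optional[str]]]) -> Optional[int]:
    """Return the 0-based index of the first row with no missing or blank cells."""
    for index, row in enumerate(rows):
        # A cell counts as present only if it is not None and contains
        # at least one non-whitespace character (assumption).
        if all(cell is not None and cell.strip() != "" for cell in row):
            return index
    return None  # assumption: no fully populated row means no header


# Same values as the `raw` frame above, written row by row.
rows = [
    (None, "notes", ""),
    ("Customer ID", "Order Date", "% Complete"),
    ("101", "2026-01-01", "0.5"),
    ("101", "2026-01-01", "0.75"),
    ("102", "2026-01-02", "1.0"),
]

header_index = sketch_find_header(rows)
print(header_index)
# 1
```

Row 0 fails because it contains `None` and an empty string, so the scan stops at row 1, matching the `find_header(raw)` result shown earlier.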
+
 ### Remove empty rows and columns
 
 Use `remove_empty` to drop rows where every selected column is null, columns where every value is null, or both.
@@ -184,7 +222,15 @@ shape: (5, 3)
 You can pass more than one key.
 
 ```python
-pj.get_dupes(df, keys=["customer_id", "date"])
+orders = pl.DataFrame(
+    {
+        "customer_id": [101, 101, 101, 102],
+        "date": ["2026-01-01", "2026-01-01", "2026-01-02", "2026-01-01"],
+        "amount": [10.0, 12.0, 9.0, 7.0],
+    }
+)
+
+pj.get_dupes(orders, keys=["customer_id", "date"])
 ```
 
 You can also omit the count column.
@@ -195,6 +241,47 @@ pj.get_dupes(df, keys="id", include_count=False)
 
 `get_dupes` works with eager and lazy frames.
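
For intuition, "duplicate records by key" can be sketched in plain Python using the `orders` values from the multi-key example. This is an illustration of the idea only: `sketch_get_dupes` and the `"count"` field name are hypothetical, and the package's actual output columns and row order may differ.

```python
from collections import Counter

# Same values as the `orders` frame in the multi-key example, one dict per row.
orders_rows = [
    {"customer_id": 101, "date": "2026-01-01", "amount": 10.0},
    {"customer_id": 101, "date": "2026-01-01", "amount": 12.0},
    {"customer_id": 101, "date": "2026-01-02", "amount": 9.0},
    {"customer_id": 102, "date": "2026-01-01", "amount": 7.0},
]


def sketch_get_dupes(records, keys):
    """Return records whose key tuple occurs more than once, plus that count."""

    def key_of(record):
        return tuple(record[k] for k in keys)

    counts = Counter(key_of(record) for record in records)
    return [
        {**record, "count": counts[key_of(record)]}  # "count" is a made-up name
        for record in records
        if counts[key_of(record)] > 1
    ]


dupes = sketch_get_dupes(orders_rows, ["customer_id", "date"])
for row in dupes:
    print(row)
# {'customer_id': 101, 'date': '2026-01-01', 'amount': 10.0, 'count': 2}
# {'customer_id': 101, 'date': '2026-01-01', 'amount': 12.0, 'count': 2}
```

Only the two rows sharing `(101, "2026-01-01")` are returned; the other key combinations appear once each.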
 
+### Compare frame schemas
+
+Use `compare_df_cols` when you want a small schema report before joining, concatenating, or handing frames to another pipeline.
+
+```python
+left = pl.DataFrame({"id": [1], "amount": [10.0], "status": ["new"]})
+right = pl.DataFrame({"id": [2], "amount": ["10.0"], "created_at": ["2026-01-01"]})
+
+comparison = pj.compare_df_cols({"left": left, "right": right.lazy()})
+print(comparison)
+```
+
+```text
+shape: (4, 3)
+┌─────────────┬─────────┬────────┐
+│ column_name ┆ left    ┆ right  │
+│ ---         ┆ ---     ┆ ---    │
+│ str         ┆ str     ┆ str    │
+╞═════════════╪═════════╪════════╡
+│ id          ┆ Int64   ┆ Int64  │
+│ amount      ┆ Float64 ┆ String │
+│ status      ┆ String  ┆ null   │
+│ created_at  ┆ null    ┆ String │
+└─────────────┴─────────┴────────┘
+```
+
+Filter to only matches or mismatches with `return_`.
+
+```python
+pj.compare_df_cols({"left": left, "right": right}, return_="mismatch")
+```
+
+Use `compare_df_cols_same` when you only need a boolean.
+
+```python
+pj.compare_df_cols_same({"left": left, "right": right})
+# False
+```
+
+Schema comparison supports eager and lazy frames. It uses lazy schemas and does not collect lazy data.
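
The shape of that report can be sketched in plain Python from schema dictionaries alone, which is why no data needs to be collected. This is an illustration under assumptions, not the package's implementation: `sketch_compare_cols` is a hypothetical name, and the dtype names are copied from the printed report above.

```python
def sketch_compare_cols(schemas):
    """Build one report row per column name, in first-seen order.

    `schemas` maps a frame label to a {column_name: dtype_name} dict.
    """
    column_names = []
    for schema in schemas.values():
        for name in schema:
            if name not in column_names:
                column_names.append(name)

    report = []
    for name in column_names:
        row = {"column_name": name}
        for label, schema in schemas.items():
            row[label] = schema.get(name)  # None when the frame lacks the column
        report.append(row)
    return report


# Schemas matching the `left` and `right` frames above.
schemas = {
    "left": {"id": "Int64", "amount": "Float64", "status": "String"},
    "right": {"id": "Int64", "amount": "String", "created_at": "String"},
}

report = sketch_compare_cols(schemas)
for row in report:
    print(row)
# {'column_name': 'id', 'left': 'Int64', 'right': 'Int64'}
# {'column_name': 'amount', 'left': 'Float64', 'right': 'String'}
# {'column_name': 'status', 'left': 'String', 'right': None}
# {'column_name': 'created_at', 'left': None, 'right': 'String'}
```

The sketch only reads column names and dtype names, so it works the same whether the schema came from an eager or a lazy frame.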
+
 ## Example
 
 Run the small messy-dataframe example from a checkout:
@@ -203,13 +290,13 @@ Run the small messy-dataframe example from a checkout:
 uv run --extra dev python examples\messy_dataframe.py
 ```
 
-The example cleans names, removes empty rows and columns, drops constant columns, and then returns duplicate customer records.
+The example promotes a spreadsheet header row, cleans names, removes empty rows and columns, drops constant columns, returns duplicate customer records, and compares schemas.
 
 ## What this is not
 
 This is not a dataframe namespace package. There is no `df.janitor.clean_names()` registration on import.
 
-This MVP also leaves out helpers that Polars already handles clearly:
+This package also leaves out helpers that Polars already handles clearly:
 
 - rounding
 - string concatenation
@@ -228,14 +315,34 @@ Those may be useful in R, but in Polars they either duplicate existing APIs or p
 
 ## Known limits
 
-LazyFrame support is deliberately conservative. `clean_names`, `remove_empty(..., axis="rows")`, and `get_dupes` can build lazy plans without collecting data. Column-removing helpers that need to inspect values are eager-only.
+LazyFrame support is deliberately conservative. `clean_names`, `remove_empty(..., axis="rows")`, `get_dupes`, `compare_df_cols`, and `compare_df_cols_same` can work from lazy schemas or build lazy plans without collecting data. Helpers that need to inspect values are eager-only: `find_header`, `row_to_names`, `remove_constant`, and `remove_empty(..., axis="cols" | "both")`.
 
 The package supports Python Polars `1.29.0` and newer. Compatibility tests run against that lower bound and the current lockfile version.
 
 The project favors broad Python Polars compatibility over direct Rust deserialization of Python lazy plans. Eager frames cross through `pyo3-polars`; lazy frames keep their plans in Python Polars, with Rust deciding what public Polars plan to build.
 
 The compiled extension is CPython-version-specific. If `import polars_janitor` fails after changing Python versions, rebuild with `maturin develop --release` or reinstall from the wheel for that interpreter.
 
+## Benchmarks
+
+These are local medians from this Windows x64 machine using CPython 3.13.5, Polars 1.40.1, pyjanitor 0.32.23 with pandas 3.0.3, and R 4.6.0 with janitor 2.2.1. Setup is outside the timed loop. Treat them as directional, not as a universal performance claim.
+
+The R comparison uses base R `data.frame`s because janitor is a data.frame/tibble package. The pyjanitor comparison uses pandas for the same reason.
+
+| Task | Size | polars-janitor | pyjanitor/pandas | R janitor |
+| --- | ---: | ---: | ---: | ---: |
+| clean_names | 10,000 columns | 45.38 ms | 34.89 ms | 4710.00 ms |
+| compare_df_cols | 5,000 columns | 14.51 ms | 302.32 ms | 70.00 ms |
+| row_to_names + clean_names | 2,000 columns | 8.43 ms | 46.45 ms | 940.00 ms |
+
+Run the same benchmark from a checkout:
+
+```powershell
+uv run --extra dev --with pandas --with pyjanitor python benchmarks\benchmark_competitors.py
+```
+
+If R is installed and the `janitor` package is available to that R installation, the script includes the R column. Otherwise it prints the Python comparisons.
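
As a rough picture of what "medians with setup outside the timed loop" means, here is a minimal standard-library timing sketch. It is not the code in `benchmarks\benchmark_competitors.py`, which may be organized differently; `median_ms`, `make_names`, and `lower_all` are hypothetical names, and the workload is a stand-in rather than a real cleanup task.

```python
import statistics
import time


def median_ms(task, setup, repeats=7):
    """Return the median wall-clock time of task(data) in milliseconds."""
    timings = []
    for _ in range(repeats):
        data = setup()  # input is built before the clock starts
        start = time.perf_counter()
        task(data)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def make_names():
    # Stand-in input: 10,000 column-like names.
    return [f"Column Name {i}" for i in range(10_000)]


def lower_all(names):
    # Stand-in workload, not one of the benchmarked functions.
    return [name.lower() for name in names]


elapsed = median_ms(lower_all, make_names)
print(f"{elapsed:.2f} ms")
```

Using the median rather than the mean keeps a single slow run (for example, a cold cache) from dominating the reported number.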
+
 ## Rust implementation
 
 The public package is Python, but the implementation is Rust.
@@ -268,8 +375,14 @@ ruff check .
 uv run --extra dev pytest
 ```
 
-Run the benchmark smoke test:
+Run the name-cleaning benchmark smoke test:
 
 ```powershell
 uv run --extra dev python benchmarks\benchmark_names.py
 ```
+
+Run the competitor benchmark:
+
+```powershell
+uv run --extra dev --with pandas --with pyjanitor python benchmarks\benchmark_competitors.py
+```
