README.md: 119 additions & 6 deletions
@@ -8,7 +8,7 @@ This project is inspired by R's janitor, but it is not a parity port. The aim is
Polars already has a strong API. Most cleanup work should stay plain Polars.
-The rough spots this package tries to smooth out are the ones that show up around messy inputs: awkward column names from CSVs and spreadsheets, empty rows, all-null columns, constant columns, and duplicate records by key. Those are janitorial jobs. They are not glamorous, but they happen often enough to deserve a small, sharp tool.
+The rough spots this package tries to smooth out are the ones that show up around messy inputs: awkward column names from CSVs and spreadsheets, header rows hiding inside spreadsheet data, empty rows, all-null columns, constant columns, duplicate records by key, and quick schema checks before you combine frames. Those are janitorial jobs. They are not glamorous, but they happen often enough to deserve a small, sharp tool.
The package does not register a dataframe namespace. Import it next to Polars:
Name cleaning is deterministic. It handles duplicate names, empty names, whitespace, symbols, mixed casing, common diacritics, and Python `None`. Other Python objects are converted with `str(...)`.
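To make the rules above concrete, here is a minimal pure-Python sketch of deterministic name cleaning. It is not the package's Rust implementation: `clean_names_sketch` is a hypothetical helper, and the snake_case output, the `x` placeholder for empty names, and the `_2`, `_3` suffixes for duplicates are assumptions chosen for illustration.

```python
import re
import unicodedata


def clean_names_sketch(names):
    """Hypothetical helper: return deterministic, unique snake_case names."""
    used = set()
    cleaned = []
    for raw in names:
        # None becomes an empty name; other objects go through str(...).
        text = "" if raw is None else str(raw)
        # Strip common diacritics: decompose, then drop combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        # Split mixed casing such as "firstName" into "first_Name".
        text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", text)
        # Collapse whitespace and symbols into single underscores.
        text = re.sub(r"[^0-9A-Za-z]+", "_", text).strip("_").lower()
        if not text:
            text = "x"  # assumed placeholder for empty names
        # Resolve duplicates with a numeric suffix (assumed convention).
        candidate, n = text, 1
        while candidate in used:
            n += 1
            candidate = f"{text}_{n}"
        used.add(candidate)
        cleaned.append(candidate)
    return cleaned


print(clean_names_sketch(["First Name", "firstName", "Café (€)", None, "", "ID"]))
```

Because the output depends only on the input order and values, running it twice on the same header gives the same names.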
+### Promote spreadsheet rows to names
+
+Spreadsheet exports often put notes above the real header row. Use `find_header` to locate the first row where every cell is present and non-blank, then `row_to_names` to promote that row to cleaned column names.
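The header rule described above can be sketched on plain Python rows. This is an illustration of the idea, not the package's API: `find_header_sketch` and `row_to_names_sketch` are hypothetical names, and dropping the rows above the header is an assumption. Name cleaning is left out to keep the sketch short.

```python
def find_header_sketch(rows):
    """Hypothetical helper: index of the first row with no missing or blank cells."""
    for index, row in enumerate(rows):
        if row and all(cell is not None and str(cell).strip() != "" for cell in row):
            return index
    return None  # no row qualifies as a header


def row_to_names_sketch(rows, index):
    """Hypothetical helper: use rows[index] as names and keep the rows below it."""
    names = [str(cell).strip() for cell in rows[index]]
    data = rows[index + 1:]  # assumption: rows above the header are discarded
    return names, data


rows = [
    ["Quarterly export", None, None],  # note above the real header
    [None, None, None],                # empty spacer row
    ["Customer ID", "Name", "Total"],  # first fully populated row
    ["c1", "Ada", 10],
]
header_index = find_header_sketch(rows)
names, data = row_to_names_sketch(rows, header_index)
print(header_index, names, data)
```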
Schema comparison supports eager and lazy frames. It uses lazy schemas and does not collect lazy data.
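Schema comparison only needs column names and dtypes, which is why it can avoid collecting lazy data. In Python Polars a lazy schema can be read with `LazyFrame.collect_schema()`. The sketch below shows the comparison idea on plain dictionaries of column name to dtype name. `compare_schemas_sketch` and `mismatched_columns_sketch` are hypothetical helpers, and their output shape is an assumption, not the return type of `compare_df_cols`.

```python
def compare_schemas_sketch(schemas):
    """Hypothetical helper.

    schemas maps a frame label to a mapping of column name -> dtype name.
    Returns column name -> {frame label: dtype name or None if absent}.
    """
    columns = []
    for schema in schemas.values():
        for name in schema:
            if name not in columns:
                columns.append(name)  # keep first-seen column order
    return {
        name: {label: schema.get(name) for label, schema in schemas.items()}
        for name in columns
    }


def mismatched_columns_sketch(schemas):
    """Hypothetical helper: columns that are missing somewhere or differ in dtype."""
    table = compare_schemas_sketch(schemas)
    return [name for name, dtypes in table.items() if len(set(dtypes.values())) > 1]


schemas = {
    "left": {"id": "Int64", "name": "String"},
    "right": {"id": "String", "name": "String", "total": "Float64"},
}
print(compare_schemas_sketch(schemas))
print(mismatched_columns_sketch(schemas))
```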
+
## Example
Run the small messy-dataframe example from a checkout:
@@ -203,13 +290,13 @@ Run the small messy-dataframe example from a checkout:
uv run --extra dev python examples\messy_dataframe.py
```
-The example cleans names, removes empty rows and columns, drops constant columns, and then returns duplicate customer records.
+The example promotes a spreadsheet header row, cleans names, removes empty rows and columns, drops constant columns, returns duplicate customer records, and compares schemas.
## What this is not
This is not a dataframe namespace package. There is no `df.janitor.clean_names()` registration on import.
-This MVP also leaves out helpers that Polars already handles clearly:
+This package also leaves out helpers that Polars already handles clearly:
- rounding
- string concatenation
@@ -228,14 +315,34 @@ Those may be useful in R, but in Polars they either duplicate existing APIs or p
## Known limits
-LazyFrame support is deliberately conservative. `clean_names`, `remove_empty(..., axis="rows")`, and `get_dupes` can build lazy plans without collecting data. Column-removing helpers that need to inspect values are eager-only.
+LazyFrame support is deliberately conservative. `clean_names`, `remove_empty(..., axis="rows")`, `get_dupes`, `compare_df_cols`, and `compare_df_cols_same` can work from lazy schemas or build lazy plans without collecting data. Helpers that need to inspect values are eager-only: `find_header`, `row_to_names`, `remove_constant`, and `remove_empty(..., axis="cols" | "both")`.
The package supports Python Polars `1.29.0` and newer. Compatibility tests run against that lower bound and the current lockfile version.
The project favors broad Python Polars compatibility over direct Rust deserialization of Python lazy plans. Eager frames cross through `pyo3-polars`; lazy frames keep their plans in Python Polars, with Rust deciding what public Polars plan to build.
The compiled extension is CPython-version-specific. If `import polars_janitor` fails after changing Python versions, rebuild with `maturin develop --release` or reinstall from the wheel for that interpreter.
+## Benchmarks
+
+These are local medians from this Windows x64 machine using CPython 3.13.5, Polars 1.40.1, pyjanitor 0.32.23 with pandas 3.0.3, and R 4.6.0 with janitor 2.2.1. Setup is outside the timed loop. Treat them as directional, not as a universal performance claim.
+
+The R comparison uses base R `data.frame`s because janitor is a data.frame/tibble package. The pyjanitor comparison uses pandas for the same reason.
+| clean_names | 10,000 columns | 45.38 ms | 34.89 ms | 4710.00 ms |
+| compare_df_cols | 5,000 columns | 14.51 ms | 302.32 ms | 70.00 ms |
+| row_to_names + clean_names | 2,000 columns | 8.43 ms | 46.45 ms | 940.00 ms |
+
+Run the same benchmark from a checkout:
+
+```powershell
+uv run --extra dev --with pandas --with pyjanitor python benchmarks\benchmark_competitors.py
+```
+
+If R is installed and the `janitor` package is available to that R installation, the script includes the R column. Otherwise it prints the Python comparisons.
+
## Rust implementation
The public package is Python, but the implementation is Rust.
@@ -268,8 +375,14 @@ ruff check .
uv run --extra dev pytest
```
-Run the benchmark smoke test:
+Run the name-cleaning benchmark smoke test:
```powershell
uv run --extra dev python benchmarks\benchmark_names.py
```
+
+Run the competitor benchmark:
+
+```powershell
+uv run --extra dev --with pandas --with pyjanitor python benchmarks\benchmark_competitors.py
+```