Commit 26f438d

Optimize clean_names and refresh benchmark docs

Author: Vladimir Vilimaitis
Parent: 7552f6b

5 files changed

Lines changed: 191 additions & 26 deletions

README.md

Lines changed: 8 additions & 6 deletions
@@ -10,6 +10,8 @@ Polars already has a strong API. Most cleanup work should stay plain Polars.
 
 The rough spots this package tries to smooth out are the ones that show up around messy inputs: awkward column names from CSVs and spreadsheets, header rows hiding inside spreadsheet data, empty rows, all-null columns, constant columns, duplicate records by key, and quick schema checks before you combine frames. Those are janitorial jobs. They are not glamorous, but they happen often enough to deserve a small, sharp tool.
 
+`polars-janitor` owes a lot to R's janitor and pyjanitor. Those projects made the case that small cleanup helpers are worth having. This package borrows that spirit, but keeps the API narrow and Polars-shaped.
+
 The package does not register a dataframe namespace. Import it next to Polars:
 
 ```python
@@ -319,21 +321,21 @@ LazyFrame support is deliberately conservative. `clean_names`, `remove_empty(...
 
 The package supports Python Polars `1.29.0` and newer. Compatibility tests run against that lower bound and the current lockfile version.
 
-The project favors broad Python Polars compatibility over direct Rust deserialization of Python lazy plans. Eager frames cross through `pyo3-polars`; lazy frames keep their plans in Python Polars, with Rust deciding what public Polars plan to build.
+The project favors broad Python Polars compatibility over direct Rust deserialization of Python lazy plans. Most eager frame helpers cross through `pyo3-polars`; lazy frames keep their plans in Python Polars, with Rust deciding what public Polars plan to build. `clean_names` is a little different: Rust cleans the names, then Polars' public `rename` API applies them.
 
 The compiled extension is CPython-version-specific. If `import polars_janitor` fails after changing Python versions, rebuild with `maturin develop --release` or reinstall from the wheel for that interpreter.
 
 ## Benchmarks
 
-These are local medians from this Windows x64 machine using CPython 3.13.5, Polars 1.40.1, pyjanitor 0.32.23 with pandas 3.0.3, and R 4.6.0 with janitor 2.2.1. Setup is outside the timed loop. Treat them as directional, not as a universal performance claim.
+These are local medians from this Windows x64 machine using CPython 3.13.5, Polars 1.40.1, pyjanitor 0.32.23 with pandas 3.0.3, and R 4.6.0 with janitor 2.2.1. Setup is outside the timed loop. Treat them as directional, not as a universal performance claim or a dunk contest.
 
 The R comparison uses base R `data.frame`s because janitor is a data.frame/tibble package. pyjanitor has Polars methods for `clean_names` and `row_to_names`, so those are shown separately. Its `compare_df_cols` helper is pandas-only in the tested version.
 
 | Task | Size | polars-janitor | pyjanitor/Polars | pyjanitor/pandas | R janitor |
 | --- | ---: | ---: | ---: | ---: | ---: |
-| clean_names | 10,000 columns | 45.49 ms | 139.01 ms | 36.94 ms | 5690.00 ms |
-| compare_df_cols | 5,000 columns | 14.47 ms | n/a | 384.17 ms | 80.00 ms |
-| row_to_names + clean_names | 2,000 columns | 8.78 ms | 32.13 ms | 44.04 ms | 970.00 ms |
+| clean_names | 10,000 columns | 14.25 ms | 159.68 ms | 38.27 ms | 5030.00 ms |
+| compare_df_cols | 5,000 columns | 15.53 ms | n/a | 277.58 ms | 70.00 ms |
+| row_to_names + clean_names | 2,000 columns | 8.39 ms | 32.45 ms | 44.29 ms | 950.00 ms |
 
 Run the same benchmark from a checkout:
 
@@ -345,7 +347,7 @@ If R is installed and the `janitor` package is available to that R installation,
 
 ## Rust implementation
 
-The public package is Python, but the implementation is Rust.
+The public package is Python. The cleanup logic lives in Rust, with a thin Python layer where using Polars' own public API is faster or more compatible.
 
 The Rust code is split into three modules:
 
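
For reading the benchmark rows in the README hunk above, the before/after ratio for the `polars-janitor` column can be worked out directly from the removed and added table rows. The sketch below only restates those published medians; the variable names are illustrative, and since both runs are single local measurements, the ratios are directional at best.

```python
# polars-janitor medians in milliseconds, copied from the removed (-) and
# added (+) README table rows in this commit.
before_ms = {
    "clean_names": 45.49,
    "compare_df_cols": 14.47,
    "row_to_names + clean_names": 8.78,
}
after_ms = {
    "clean_names": 14.25,
    "compare_df_cols": 15.53,
    "row_to_names + clean_names": 8.39,
}

# A ratio above 1.0 means the task got faster; below 1.0 means it got slower.
speedups = {task: before_ms[task] / after_ms[task] for task in before_ms}

for task, ratio in speedups.items():
    print(f"{task}: {ratio:.2f}x")
```

On these numbers `clean_names` improves by roughly 3.2x, while the other two rows move by a few percent in either direction, which is small enough to be plausibly run-to-run variation.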

rust/src/frame.rs

Lines changed: 56 additions & 1 deletion
@@ -161,7 +161,31 @@ pub fn compare_df_cols_same_schemas(schemas: &[FrameSchema]) -> JanitorResult<bo
         .all(|column| schema_column_matches(schemas, column)))
 }
 
-fn rename_columns_df(df: DataFrame, cleaned: Vec<String>) -> JanitorResult<DataFrame> {
+fn rename_columns_df(mut df: DataFrame, cleaned: Vec<String>) -> JanitorResult<DataFrame> {
+    let original = dataframe_column_names(&df);
+    if can_use_rename_many(&original, &cleaned) {
+        df.rename_many(
+            original
+                .iter()
+                .zip(&cleaned)
+                .filter(|(from, to)| from.as_str() != to.as_str())
+                .map(|(from, to)| (from.as_str(), to.as_str().into())),
+        )?;
+        return Ok(df);
+    }
+
+    rebuild_with_renamed_columns(df, cleaned)
+}
+
+fn can_use_rename_many(original: &[String], cleaned: &[String]) -> bool {
+    let original_names = original.iter().map(String::as_str).collect::<HashSet<_>>();
+    original
+        .iter()
+        .zip(cleaned)
+        .all(|(from, to)| from == to || !original_names.contains(to.as_str()))
+}
+
+fn rebuild_with_renamed_columns(df: DataFrame, cleaned: Vec<String>) -> JanitorResult<DataFrame> {
     let height = df.height();
     let columns = df
         .columns()
@@ -679,6 +703,37 @@ mod tests {
         );
     }
 
+    #[test]
+    fn rename_columns_falls_back_when_targets_exist_in_original_schema() {
+        let df = df!("a" => [1, 2], "b" => [3, 4]).unwrap();
+
+        let result = rename_columns_df(df, vec![String::from("b"), String::from("a")]).unwrap();
+
+        assert_eq!(dataframe_column_names(&result), ["b", "a"]);
+        assert_eq!(
+            result
+                .column("b")
+                .unwrap()
+                .as_materialized_series()
+                .i32()
+                .unwrap()
+                .into_no_null_iter()
+                .collect::<Vec<_>>(),
+            [1, 2]
+        );
+        assert_eq!(
+            result
+                .column("a")
+                .unwrap()
+                .as_materialized_series()
+                .i32()
+                .unwrap()
+                .into_no_null_iter()
+                .collect::<Vec<_>>(),
+            [3, 4]
+        );
+    }
+
     #[test]
     fn compare_df_cols_reports_matches_and_mismatches() {
         let left = FrameSchema {
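
The guard in `can_use_rename_many` above can be restated in a few lines of Python to show which inputs take the fast path. This is a sketch of the predicate only. The second function, `rename_one_at_a_time`, is a hypothetical toy model of a rename that rejects collisions with existing names; it is there to motivate the fallback and is not a description of how Polars' `rename_many` behaves internally.

```python
def can_use_rename_many(original: list[str], cleaned: list[str]) -> bool:
    """Mirror of the Rust predicate: every target is either unchanged
    or a name that does not appear anywhere in the original schema."""
    original_names = set(original)
    return all(
        src == dst or dst not in original_names
        for src, dst in zip(original, cleaned)
    )


def rename_one_at_a_time(columns: list[str], cleaned: list[str]) -> list[str]:
    """Toy model (assumption): rename columns sequentially and refuse any
    target that is currently another column's name."""
    current = list(columns)
    for index, dst in enumerate(cleaned):
        if dst != current[index] and dst in current:
            raise ValueError(f"target {dst!r} already exists")
        current[index] = dst
    return current


# Fresh targets: the fast path is allowed.
print(can_use_rename_many(["a b", "c"], ["a_b", "c"]))
# A swap reuses original names, so the guard sends it to the rebuild path.
print(can_use_rename_many(["a", "b"], ["b", "a"]))
```

Under the toy model, the swap case from the new Rust test (`["a", "b"]` to `["b", "a"]`) fails on the first step, which is the kind of situation the rebuild fallback sidesteps by constructing the frame with all new names at once.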

rust/src/names.rs

Lines changed: 101 additions & 18 deletions
@@ -1,3 +1,4 @@
+use std::borrow::Cow;
 use std::collections::{HashMap, HashSet};
 use std::error::Error;
 use std::fmt::{Display, Formatter};
@@ -60,6 +61,10 @@ fn parse_case(case: &str) -> Result<CaseStyle, NameError> {
 }
 
 fn clean_one(name: &str, case: CaseStyle) -> String {
+    if case == CaseStyle::Snake {
+        return clean_one_snake(name);
+    }
+
     let tokens = tokens(name);
     let tokens = if tokens.is_empty() {
         vec!["x".to_string()]
@@ -70,12 +75,46 @@ fn clean_one(name: &str, case: CaseStyle) -> String {
 }
 
 fn tokens(name: &str) -> Vec<String> {
-    let translated = translate_common_symbols_and_letters(name.trim());
-    let normalized = translated
-        .nfkd()
-        .filter(|character| !is_combining_mark(*character))
-        .collect::<String>();
-    tokenize_ascii(&normalized)
+    let prepared = prepare_name(name);
+    tokenize_ascii(&prepared)
+}
+
+fn clean_one_snake(name: &str) -> String {
+    let prepared = prepare_name(name);
+    let mut cleaned = tokenize_ascii_to_snake(&prepared);
+    if cleaned.is_empty() {
+        return String::from("x");
+    }
+    if cleaned.starts_with(|character: char| character.is_ascii_digit()) {
+        let mut prefixed = String::with_capacity(cleaned.len() + 2);
+        prefixed.push_str("x_");
+        prefixed.push_str(&cleaned);
+        cleaned = prefixed;
+    }
+    cleaned
+}
+
+fn prepare_name(name: &str) -> Cow<'_, str> {
+    let trimmed = name.trim();
+    if !trimmed.chars().any(needs_translation_or_normalization) {
+        return Cow::Borrowed(trimmed);
+    }
+
+    let translated = translate_common_symbols_and_letters(trimmed);
+    if translated.is_ascii() {
+        Cow::Owned(translated)
+    } else {
+        Cow::Owned(
+            translated
+                .nfkd()
+                .filter(|character| !is_combining_mark(*character))
+                .collect::<String>(),
+        )
+    }
+}
+
+fn needs_translation_or_normalization(character: char) -> bool {
+    !character.is_ascii() || matches!(character, '%' | '#' | '&' | '@' | '+')
 }
 
 fn translate_common_symbols_and_letters(value: &str) -> String {
@@ -133,6 +172,40 @@ fn tokenize_ascii(value: &str) -> Vec<String> {
     output
 }
 
+fn tokenize_ascii_to_snake(value: &str) -> String {
+    let chars = value.chars().collect::<Vec<_>>();
+    let mut output = String::with_capacity(value.len());
+    let mut current_token_has_chars = false;
+    let mut output_has_token = false;
+
+    for (index, character) in chars.iter().copied().enumerate() {
+        if !character.is_ascii_alphanumeric() {
+            current_token_has_chars = false;
+            continue;
+        }
+
+        if current_token_has_chars {
+            let previous = chars[index - 1];
+            let next = chars.get(index + 1).copied();
+            if is_token_boundary(previous, character, next) {
+                current_token_has_chars = false;
+            }
+        }
+
+        if !current_token_has_chars {
+            if output_has_token {
+                output.push('_');
+            }
+            output_has_token = true;
+            current_token_has_chars = true;
+        }
+
+        output.push(character.to_ascii_lowercase());
+    }
+
+    output
+}
+
 fn is_token_boundary(previous: char, current: char, next: Option<char>) -> bool {
     if previous.is_ascii_uppercase()
         && current.is_ascii_uppercase()
@@ -188,17 +261,21 @@ fn ensure_identifier_start(name: &str, case: CaseStyle) -> String {
     if !name.starts_with(|character: char| character.is_ascii_digit()) {
         return name.to_string();
     }
-    match case {
-        CaseStyle::Constant => format!("X_{name}"),
-        CaseStyle::Pascal => format!("X{name}"),
-        CaseStyle::Camel => format!("x{name}"),
-        CaseStyle::Snake => format!("x_{name}"),
-    }
+    let prefix = match case {
+        CaseStyle::Constant => "X_",
+        CaseStyle::Pascal => "X",
+        CaseStyle::Camel => "x",
+        CaseStyle::Snake => "x_",
+    };
+    let mut prefixed = String::with_capacity(prefix.len() + name.len());
+    prefixed.push_str(prefix);
+    prefixed.push_str(name);
+    prefixed
 }
 
 fn dedupe(names: &[String], case: CaseStyle) -> Vec<String> {
-    let mut used = HashSet::new();
-    let mut next_suffix_by_base: HashMap<&str, usize> = HashMap::new();
+    let mut used = HashSet::with_capacity(names.len());
+    let mut next_suffix_by_base: HashMap<&str, usize> = HashMap::with_capacity(names.len());
     let mut output = Vec::with_capacity(names.len());
 
     for name in names {
@@ -219,10 +296,16 @@ fn dedupe(names: &[String], case: CaseStyle) -> Vec<String> {
 }
 
 fn with_suffix(name: &str, suffix: usize, case: CaseStyle) -> String {
-    match case {
-        CaseStyle::Camel | CaseStyle::Pascal => format!("{name}{suffix}"),
-        CaseStyle::Snake | CaseStyle::Constant => format!("{name}_{suffix}"),
-    }
+    let suffix = suffix.to_string();
+    let separator = match case {
+        CaseStyle::Camel | CaseStyle::Pascal => "",
+        CaseStyle::Snake | CaseStyle::Constant => "_",
+    };
+    let mut suffixed = String::with_capacity(name.len() + separator.len() + suffix.len());
+    suffixed.push_str(name);
+    suffixed.push_str(separator);
+    suffixed.push_str(&suffix);
+    suffixed
 }
 
 #[cfg(test)]
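
The snake-case fast path added in this file has two parts: `prepare_name` skips translation and normalization for plain ASCII names, and `tokenize_ascii_to_snake` writes the result straight into one string instead of building a token list. A rough Python sketch of that shape follows. It is deliberately simplified and rests on two assumptions: the symbol translation step (`translate_common_symbols_and_letters`) is omitted because its body is not shown in the diff, and token boundaries are taken only at non-alphanumeric characters, whereas the Rust code also consults `is_token_boundary` for case changes.

```python
import unicodedata

# Characters that force the slow path even though they are ASCII,
# as listed in needs_translation_or_normalization.
SPECIAL = set("%#&@+")


def prepare_name(name: str) -> str:
    trimmed = name.strip()
    # Fast path: plain ASCII with no special symbols is returned as-is.
    if all(ch.isascii() and ch not in SPECIAL for ch in trimmed):
        return trimmed
    # Slow path (simplified): NFKD-decompose and drop combining marks.
    decomposed = unicodedata.normalize("NFKD", trimmed)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_snake(prepared: str) -> str:
    out: list[str] = []
    in_token = False
    for ch in prepared:
        if not (ch.isascii() and ch.isalnum()):
            in_token = False  # separators end the current token
            continue
        if not in_token:
            if out:
                out.append("_")  # join tokens with a single underscore
            in_token = True
        out.append(ch.lower())
    cleaned = "".join(out)
    if not cleaned:
        return "x"  # same placeholder the Rust code uses for empty names
    if cleaned[0].isdigit():
        return "x_" + cleaned  # keep the result a valid identifier start
    return cleaned


for raw in ["  Order ID ", "Café Total", "2024 sales", "!!!"]:
    print(repr(raw), "->", to_snake(prepare_name(raw)))
```

Because of the simplified boundary rule, an input like `orderID` stays one token here; the Rust version may split it depending on what `is_token_boundary` decides.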

src/polars_janitor/__init__.py

Lines changed: 16 additions & 1 deletion
@@ -2,8 +2,9 @@
 
 from __future__ import annotations
 
+import polars as pl
+
 from polars_janitor._rust import (
-    clean_names,
     compare_df_cols,
     compare_df_cols_same,
     find_header,
@@ -14,6 +15,20 @@
     row_to_names,
 )
 
+
+def clean_names(frame: pl.DataFrame | pl.LazyFrame, *, case: str = "snake") -> pl.DataFrame | pl.LazyFrame:
+    """Clean column names using Rust name normalization and Polars' native rename path."""
+    if isinstance(frame, pl.DataFrame):
+        columns = frame.columns
+    elif isinstance(frame, pl.LazyFrame):
+        columns = frame.collect_schema().names()
+    else:
+        raise TypeError("frame must be a polars DataFrame or LazyFrame")
+
+    cleaned = make_clean_names(columns, case)
+    return frame.rename(dict(zip(columns, cleaned, strict=True)))
+
+
 __all__ = [
     "clean_names",
     "compare_df_cols",

tests/test_frame.py

Lines changed: 10 additions & 0 deletions
@@ -32,6 +32,16 @@ def test_clean_names_supports_lazyframe() -> None:
     assert result.columns == ["customer_id", "order_id"]
 
 
+def test_clean_names_handles_targets_that_exist_in_original_schema() -> None:
+    """Public clean_names can rename into names that existed before cleaning."""
+    df = pl.DataFrame({"a b": [1, 2], "a_b": [3, 4]})
+
+    result = pj.clean_names(df)
+
+    assert result.columns == ["a_b", "a_b_2"]
+    assert result.to_dict(as_series=False) == {"a_b": [1, 2], "a_b_2": [3, 4]}
+
+
 def test_find_header_and_row_to_names_clean_messy_spreadsheet_rows() -> None:
     """A discovered header row can be promoted to cleaned column names."""
     df = pl.DataFrame(