Optimize clean_names and refresh benchmark docs

Vladimir Vilimaitis · Vladimir Vilimaitis · commit 26f438d9f8e5 · 2026-05-18T01:42:52.000+02:00
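
The Rust change below only takes the in-place rename path when no changed column is renamed onto a name that already exists in the frame; otherwise it rebuilds the columns. A minimal Python sketch of that check, assuming it mirrors `can_use_rename_many` from the diff (the name `can_rename_in_place` is hypothetical and not part of the package):

```python
def can_rename_in_place(original: list[str], cleaned: list[str]) -> bool:
    """Return True when no changed column targets a name already in the frame."""
    existing = set(original)
    # Pairs are matched positionally; a column that keeps its name is always fine.
    return all(
        old == new or new not in existing
        for old, new in zip(original, cleaned)
    )


# A swap targets names that are still in use, so it needs the rebuild path.
swap_is_safe = can_rename_in_place(["a", "b"], ["b", "a"])

# Cleaning "a b" to "a_b" while "a_b" already exists is the same kind of collision.
overlap_is_safe = can_rename_in_place(["a b", "a_b"], ["a_b", "a_b_2"])

# Fresh targets cannot collide with any original name.
fresh_is_safe = can_rename_in_place(
    ["Customer ID", "Order ID"], ["customer_id", "order_id"]
)
```

The two unsafe cases correspond to the tests added in this commit: the `a`/`b` swap in `rust/src/frame.rs` and the `"a b"`/`"a_b"` frame in `tests/test_frame.py`.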
diff --git a/README.md b/README.md
@@ -10,6 +10,8 @@ Polars already has a strong API. Most cleanup work should stay plain Polars.
 
 The rough spots this package tries to smooth out are the ones that show up around messy inputs: awkward column names from CSVs and spreadsheets, header rows hiding inside spreadsheet data, empty rows, all-null columns, constant columns, duplicate records by key, and quick schema checks before you combine frames. Those are janitorial jobs. They are not glamorous, but they happen often enough to deserve a small, sharp tool.
 
+`polars-janitor` owes a lot to R's janitor and pyjanitor. Those projects made the case that small cleanup helpers are worth having. This package borrows that spirit, but keeps the API narrow and Polars-shaped.
+
 The package does not register a dataframe namespace. Import it next to Polars:
 
 ```python
@@ -319,21 +321,21 @@ LazyFrame support is deliberately conservative. `clean_names`, `remove_empty(...
 
 The package supports Python Polars `1.29.0` and newer. Compatibility tests run against that lower bound and the current lockfile version.
 
-The project favors broad Python Polars compatibility over direct Rust deserialization of Python lazy plans. Eager frames cross through `pyo3-polars`; lazy frames keep their plans in Python Polars, with Rust deciding what public Polars plan to build.
+The project favors broad Python Polars compatibility over direct Rust deserialization of Python lazy plans. Most eager frame helpers cross through `pyo3-polars`; lazy frames keep their plans in Python Polars, with Rust deciding what public Polars plan to build. `clean_names` is a little different: Rust cleans the names, then Polars' public `rename` API applies them.
 
 The compiled extension is CPython-version-specific. If `import polars_janitor` fails after changing Python versions, rebuild with `maturin develop --release` or reinstall from the wheel for that interpreter.
 
 ## Benchmarks
 
-These are local medians from this Windows x64 machine using CPython 3.13.5, Polars 1.40.1, pyjanitor 0.32.23 with pandas 3.0.3, and R 4.6.0 with janitor 2.2.1. Setup is outside the timed loop. Treat them as directional, not as a universal performance claim.
+These are local medians from this Windows x64 machine using CPython 3.13.5, Polars 1.40.1, pyjanitor 0.32.23 with pandas 3.0.3, and R 4.6.0 with janitor 2.2.1. Setup is outside the timed loop. Treat them as directional, not as a universal performance claim or a dunk contest.
 
 The R comparison uses base R `data.frame`s because janitor is a data.frame/tibble package. pyjanitor has Polars methods for `clean_names` and `row_to_names`, so those are shown separately. Its `compare_df_cols` helper is pandas-only in the tested version.
 
 | Task | Size | polars-janitor | pyjanitor/Polars | pyjanitor/pandas | R janitor |
 | --- | ---: | ---: | ---: | ---: | ---: |
-| clean_names | 10,000 columns | 45.49 ms | 139.01 ms | 36.94 ms | 5690.00 ms |
-| compare_df_cols | 5,000 columns | 14.47 ms | n/a | 384.17 ms | 80.00 ms |
-| row_to_names + clean_names | 2,000 columns | 8.78 ms | 32.13 ms | 44.04 ms | 970.00 ms |
+| clean_names | 10,000 columns | 14.25 ms | 159.68 ms | 38.27 ms | 5030.00 ms |
+| compare_df_cols | 5,000 columns | 15.53 ms | n/a | 277.58 ms | 70.00 ms |
+| row_to_names + clean_names | 2,000 columns | 8.39 ms | 32.45 ms | 44.29 ms | 950.00 ms |
 
 Run the same benchmark from a checkout:
 
@@ -345,7 +347,7 @@ If R is installed and the `janitor` package is available to that R installation,
 
 ## Rust implementation
 
-The public package is Python, but the implementation is Rust.
+The public package is Python. The cleanup logic lives in Rust, with a thin Python layer where using Polars' own public API is faster or more compatible.
 
 The Rust code is split into three modules:
 
diff --git a/rust/src/frame.rs b/rust/src/frame.rs
@@ -161,7 +161,31 @@ pub fn compare_df_cols_same_schemas(schemas: &[FrameSchema]) -> JanitorResult<bo
         .all(|column| schema_column_matches(schemas, column)))
 }
 
-fn rename_columns_df(df: DataFrame, cleaned: Vec<String>) -> JanitorResult<DataFrame> {
+fn rename_columns_df(mut df: DataFrame, cleaned: Vec<String>) -> JanitorResult<DataFrame> {
+    let original = dataframe_column_names(&df);
+    if can_use_rename_many(&original, &cleaned) {
+        df.rename_many(
+            original
+                .iter()
+                .zip(&cleaned)
+                .filter(|(from, to)| from.as_str() != to.as_str())
+                .map(|(from, to)| (from.as_str(), to.as_str().into())),
+        )?;
+        return Ok(df);
+    }
+
+    rebuild_with_renamed_columns(df, cleaned)
+}
+
+fn can_use_rename_many(original: &[String], cleaned: &[String]) -> bool {
+    let original_names = original.iter().map(String::as_str).collect::<HashSet<_>>();
+    original
+        .iter()
+        .zip(cleaned)
+        .all(|(from, to)| from == to || !original_names.contains(to.as_str()))
+}
+
+fn rebuild_with_renamed_columns(df: DataFrame, cleaned: Vec<String>) -> JanitorResult<DataFrame> {
     let height = df.height();
     let columns = df
         .columns()
@@ -679,6 +703,37 @@ mod tests {
         );
     }
 
+    #[test]
+    fn rename_columns_falls_back_when_targets_exist_in_original_schema() {
+        let df = df!("a" => [1, 2], "b" => [3, 4]).unwrap();
+
+        let result = rename_columns_df(df, vec![String::from("b"), String::from("a")]).unwrap();
+
+        assert_eq!(dataframe_column_names(&result), ["b", "a"]);
+        assert_eq!(
+            result
+                .column("b")
+                .unwrap()
+                .as_materialized_series()
+                .i32()
+                .unwrap()
+                .into_no_null_iter()
+                .collect::<Vec<_>>(),
+            [1, 2]
+        );
+        assert_eq!(
+            result
+                .column("a")
+                .unwrap()
+                .as_materialized_series()
+                .i32()
+                .unwrap()
+                .into_no_null_iter()
+                .collect::<Vec<_>>(),
+            [3, 4]
+        );
+    }
+
     #[test]
     fn compare_df_cols_reports_matches_and_mismatches() {
         let left = FrameSchema {
diff --git a/rust/src/names.rs b/rust/src/names.rs
@@ -1,3 +1,4 @@
+use std::borrow::Cow;
 use std::collections::{HashMap, HashSet};
 use std::error::Error;
 use std::fmt::{Display, Formatter};
@@ -60,6 +61,10 @@ fn parse_case(case: &str) -> Result<CaseStyle, NameError> {
 }
 
 fn clean_one(name: &str, case: CaseStyle) -> String {
+    if case == CaseStyle::Snake {
+        return clean_one_snake(name);
+    }
+
     let tokens = tokens(name);
     let tokens = if tokens.is_empty() {
         vec!["x".to_string()]
@@ -70,12 +75,46 @@ fn clean_one(name: &str, case: CaseStyle) -> String {
 }
 
 fn tokens(name: &str) -> Vec<String> {
-    let translated = translate_common_symbols_and_letters(name.trim());
-    let normalized = translated
-        .nfkd()
-        .filter(|character| !is_combining_mark(*character))
-        .collect::<String>();
-    tokenize_ascii(&normalized)
+    let prepared = prepare_name(name);
+    tokenize_ascii(&prepared)
+}
+
+fn clean_one_snake(name: &str) -> String {
+    let prepared = prepare_name(name);
+    let mut cleaned = tokenize_ascii_to_snake(&prepared);
+    if cleaned.is_empty() {
+        return String::from("x");
+    }
+    if cleaned.starts_with(|character: char| character.is_ascii_digit()) {
+        let mut prefixed = String::with_capacity(cleaned.len() + 2);
+        prefixed.push_str("x_");
+        prefixed.push_str(&cleaned);
+        cleaned = prefixed;
+    }
+    cleaned
+}
+
+fn prepare_name(name: &str) -> Cow<'_, str> {
+    let trimmed = name.trim();
+    if !trimmed.chars().any(needs_translation_or_normalization) {
+        return Cow::Borrowed(trimmed);
+    }
+
+    let translated = translate_common_symbols_and_letters(trimmed);
+    if translated.is_ascii() {
+        Cow::Owned(translated)
+    } else {
+        Cow::Owned(
+            translated
+                .nfkd()
+                .filter(|character| !is_combining_mark(*character))
+                .collect::<String>(),
+        )
+    }
+}
+
+fn needs_translation_or_normalization(character: char) -> bool {
+    !character.is_ascii() || matches!(character, '%' | '#' | '&' | '@' | '+')
 }
 
 fn translate_common_symbols_and_letters(value: &str) -> String {
@@ -133,6 +172,40 @@ fn tokenize_ascii(value: &str) -> Vec<String> {
     output
 }
 
+fn tokenize_ascii_to_snake(value: &str) -> String {
+    let chars = value.chars().collect::<Vec<_>>();
+    let mut output = String::with_capacity(value.len());
+    let mut current_token_has_chars = false;
+    let mut output_has_token = false;
+
+    for (index, character) in chars.iter().copied().enumerate() {
+        if !character.is_ascii_alphanumeric() {
+            current_token_has_chars = false;
+            continue;
+        }
+
+        if current_token_has_chars {
+            let previous = chars[index - 1];
+            let next = chars.get(index + 1).copied();
+            if is_token_boundary(previous, character, next) {
+                current_token_has_chars = false;
+            }
+        }
+
+        if !current_token_has_chars {
+            if output_has_token {
+                output.push('_');
+            }
+            output_has_token = true;
+            current_token_has_chars = true;
+        }
+
+        output.push(character.to_ascii_lowercase());
+    }
+
+    output
+}
+
 fn is_token_boundary(previous: char, current: char, next: Option<char>) -> bool {
     if previous.is_ascii_uppercase()
         && current.is_ascii_uppercase()
@@ -188,17 +261,21 @@ fn ensure_identifier_start(name: &str, case: CaseStyle) -> String {
     if !name.starts_with(|character: char| character.is_ascii_digit()) {
         return name.to_string();
     }
-    match case {
-        CaseStyle::Constant => format!("X_{name}"),
-        CaseStyle::Pascal => format!("X{name}"),
-        CaseStyle::Camel => format!("x{name}"),
-        CaseStyle::Snake => format!("x_{name}"),
-    }
+    let prefix = match case {
+        CaseStyle::Constant => "X_",
+        CaseStyle::Pascal => "X",
+        CaseStyle::Camel => "x",
+        CaseStyle::Snake => "x_",
+    };
+    let mut prefixed = String::with_capacity(prefix.len() + name.len());
+    prefixed.push_str(prefix);
+    prefixed.push_str(name);
+    prefixed
 }
 
 fn dedupe(names: &[String], case: CaseStyle) -> Vec<String> {
-    let mut used = HashSet::new();
-    let mut next_suffix_by_base: HashMap<&str, usize> = HashMap::new();
+    let mut used = HashSet::with_capacity(names.len());
+    let mut next_suffix_by_base: HashMap<&str, usize> = HashMap::with_capacity(names.len());
     let mut output = Vec::with_capacity(names.len());
 
     for name in names {
@@ -219,10 +296,16 @@ fn dedupe(names: &[String], case: CaseStyle) -> Vec<String> {
 }
 
 fn with_suffix(name: &str, suffix: usize, case: CaseStyle) -> String {
-    match case {
-        CaseStyle::Camel | CaseStyle::Pascal => format!("{name}{suffix}"),
-        CaseStyle::Snake | CaseStyle::Constant => format!("{name}_{suffix}"),
-    }
+    let suffix = suffix.to_string();
+    let separator = match case {
+        CaseStyle::Camel | CaseStyle::Pascal => "",
+        CaseStyle::Snake | CaseStyle::Constant => "_",
+    };
+    let mut suffixed = String::with_capacity(name.len() + separator.len() + suffix.len());
+    suffixed.push_str(name);
+    suffixed.push_str(separator);
+    suffixed.push_str(&suffix);
+    suffixed
 }
 
 #[cfg(test)]
diff --git a/src/polars_janitor/__init__.py b/src/polars_janitor/__init__.py
@@ -2,8 +2,9 @@
 
 from __future__ import annotations
 
+import polars as pl
+
 from polars_janitor._rust import (
-    clean_names,
     compare_df_cols,
     compare_df_cols_same,
     find_header,
@@ -14,6 +15,20 @@
     row_to_names,
 )
 
+
+def clean_names(frame: pl.DataFrame | pl.LazyFrame, *, case: str = "snake") -> pl.DataFrame | pl.LazyFrame:
+    """Clean column names using Rust name normalization and Polars' native rename path."""
+    if isinstance(frame, pl.DataFrame):
+        columns = frame.columns
+    elif isinstance(frame, pl.LazyFrame):
+        columns = frame.collect_schema().names()
+    else:
+        raise TypeError("frame must be a polars DataFrame or LazyFrame")
+
+    cleaned = make_clean_names(columns, case)
+    return frame.rename(dict(zip(columns, cleaned, strict=True)))
+
+
 __all__ = [
     "clean_names",
     "compare_df_cols",
diff --git a/tests/test_frame.py b/tests/test_frame.py
@@ -32,6 +32,16 @@ def test_clean_names_supports_lazyframe() -> None:
     assert result.columns == ["customer_id", "order_id"]
 
 
+def test_clean_names_handles_targets_that_exist_in_original_schema() -> None:
+    """Public clean_names can rename into names that existed before cleaning."""
+    df = pl.DataFrame({"a b": [1, 2], "a_b": [3, 4]})
+
+    result = pj.clean_names(df)
+
+    assert result.columns == ["a_b", "a_b_2"]
+    assert result.to_dict(as_series=False) == {"a_b": [1, 2], "a_b_2": [3, 4]}
+
+
 def test_find_header_and_row_to_names_clean_messy_spreadsheet_rows() -> None:
     """A discovered header row can be promoted to cleaned column names."""
     df = pl.DataFrame(