|
2 | 2 |
|
3 | 3 | ## Home Page (Home.md) |
4 | 4 |
|
| 5 | +# Welcome to StatClean |
5 | 6 |
|
6 | 7 | StatClean is a comprehensive statistical data preprocessing and outlier detection library with formal statistical testing and publication-quality reporting. |
7 | 8 | As of v0.1.3, remover methods return the cleaner instance for chaining; access results via `cleaner.clean_df` and `cleaner.outlier_info`. |
@@ -248,4 +249,117 @@ Instructions for setting up GitHub Wiki: |
248 | 249 | 3. Click "Create the first page" |
249 | 250 | 4. Copy the content above for each page |
250 | 251 | 5. Create pages with the exact names shown in parentheses |
251 | | -6. Set "Home" as the main wiki page |
| 252 | +6. Set "Home" as the main wiki page |
| 253 | + |
| 254 | +--- |
| 255 | + |
| 256 | +## Statistical Methods Guide (Statistical-Methods-Guide.md) |
| 257 | + |
| 258 | +# Statistical Methods Guide |
| 259 | + |
| 260 | +## Univariate Methods |
| 261 | +- IQR: robust to non-normal data; configure lower/upper factors. |
| 262 | +- Z-score: assumes approximate normality; configurable threshold. |
| 263 | +- Modified Z-score: robust via MAD; default threshold 3.5. |
| 264 | + |
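The three univariate rules above can be sketched in plain Python. This is a minimal illustration using only the standard library, not StatClean's implementation. The function names, the default factors of 1.5, the z-score threshold of 3.0, and the 0.6745 scaling constant are assumptions that follow common conventions.

```python
import statistics

def iqr_outliers(values, lower_factor=1.5, upper_factor=1.5):
    """Flag values outside [Q1 - lower_factor*IQR, Q3 + upper_factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - lower_factor * iqr, q3 + upper_factor * iqr
    return [v < low or v > high for v in values]

def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [abs(v - mean) / sd > threshold for v in values]

def modified_zscore_outliers(values, threshold=3.5):
    """Flag values using the median and MAD (assumes MAD is non-zero)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [0.6745 * abs(v - med) / mad > threshold for v in values]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(iqr_outliers(data)[-1])              # True
print(any(zscore_outliers(data)))          # False
print(modified_zscore_outliers(data)[-1])  # True
```

On this nine-point sample the z-score rule flags nothing: with a sample standard deviation, a single point's z-score cannot exceed (n-1)/sqrt(n), about 2.67 for n = 9. The IQR and MAD-based rules both flag the extreme value, which is the practical meaning of "robust" here.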
| 265 | +## Multivariate Methods |
| 266 | +- Mahalanobis distance: detects multivariate outliers using covariance structure. |
| 267 | + - `chi2_threshold`: percentile in (0,1] or absolute chi-square statistic. |
| 268 | + - `use_shrinkage=True` enables Ledoit–Wolf covariance shrinkage when scikit-learn is available. |
| 269 | + |
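As a rough illustration of the Mahalanobis check with a percentile threshold, the sketch below handles exactly two columns using only the standard library. It is not StatClean's code: the function name and the `percentile` parameter are hypothetical, it uses the plain sample covariance (no shrinkage), and it relies on the fact that the chi-square quantile with 2 degrees of freedom has the closed form -2*ln(1 - p).

```python
import math
import statistics

def mahalanobis_outliers_2d(xs, ys, percentile=0.975):
    """Flag rows whose squared Mahalanobis distance exceeds the
    chi-square quantile with 2 degrees of freedom."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    # Sample covariance matrix entries (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # assumed non-zero (columns not collinear)
    # Closed-form chi-square quantile, valid only for 2 degrees of freedom.
    cutoff = -2.0 * math.log(1.0 - percentile)
    flags = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T for the 2x2 case.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        flags.append(d2 > cutoff)
    return flags

# 19 points near the line y = x, plus one point far above it.
xs = list(range(19)) + [9]
ys = [x + (1 if x % 2 == 0 else -1) for x in range(19)] + [21]
flags = mahalanobis_outliers_2d(xs, ys)
print(sum(flags), flags[-1])  # 1 True
```

The last point has an unremarkable `x` value and a `y` value that only stands out relative to the joint trend, which is the case a covariance-aware distance is meant to catch. For more than two columns a general chi-square quantile function and a matrix inverse would be needed.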
| 270 | +## Formal Tests |
| 271 | +- Grubbs' test: single outlier detection with p-value and critical value. |
| 272 | +- Dixon's Q-test: for small n (<30); approximate p-value reported. |
| 273 | + |
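The test statistics behind both formal tests are simple to compute. The sketch below, with hypothetical function names, returns only the statistics. Turning them into a decision requires a critical value or p-value (from the t-distribution for Grubbs, from published tables for Dixon), which is deliberately left out here.

```python
import statistics

def grubbs_statistic(values):
    """Return (G, index), where G = max|x_i - mean| / s."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    idx = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[idx] - mean) / sd, idx

def dixon_q_statistic(values):
    """Return (Q, suspect) using the gap/range ratio.

    Assumes at least three values and a non-zero range.
    """
    data = sorted(values)
    spread = data[-1] - data[0]
    gap_low = data[1] - data[0]
    gap_high = data[-1] - data[-2]
    if gap_high >= gap_low:
        return gap_high / spread, data[-1]
    return gap_low / spread, data[0]

sample = [7.1, 7.2, 7.0, 7.3, 9.5]
g, g_index = grubbs_statistic(sample)
q, suspect = dixon_q_statistic(sample)
print(round(g, 3), g_index)   # 1.779 4
print(round(q, 2), suspect)   # 0.88 9.5
```

Both statistics point at the same value here. Each test targets a single suspect observation, so applying them repeatedly to peel off several outliers changes their error rates.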
| 274 | +## Transformations |
| 275 | +- Box-Cox (positive data): optimal lambda estimated; preserves NaNs. |
| 276 | +- Log (natural, base 10, base 2): shifts applied for non-positive values. |
| 277 | +- Square-root: shifts applied for negatives. |
| 278 | + |
| 279 | +Best practices: drop NaNs before running tests that cannot handle them; subsample large datasets before running the Shapiro-Wilk normality test. |
| 280 | + |
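To make "shifts applied for non-positive values" and "preserves NaNs" concrete, here is a minimal standard-library sketch of a shifted log transform. The shift rule (move the minimum to 1), the base keys, and the shape of the returned info dict are assumptions for illustration; StatClean's actual choices may differ.

```python
import math

def shifted_log(values, base="natural"):
    """Log-transform values, shifting first if any value is <= 0.

    NaNs are passed through unchanged. Returns (transformed, info).
    """
    finite = [v for v in values if not math.isnan(v)]
    smallest = min(finite)
    # Assumed rule: shift so the smallest value becomes exactly 1.
    shift = 1.0 - smallest if smallest <= 0 else 0.0
    log_fn = {"natural": math.log, "10": math.log10, "2": math.log2}[base]
    out = [v if math.isnan(v) else log_fn(v + shift) for v in values]
    return out, {"shift": shift, "base": base}

transformed, info = shifted_log([-3.0, 0.0, 6.0, float("nan")], base="10")
print(info["shift"])    # 4.0
print(transformed[0])   # 0.0
print(transformed[2])   # 1.0
```

Recording the shift matters because the transform cannot be inverted without it.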
| 281 | +--- |
| 282 | + |
| 283 | +## API Reference (API-Reference.md) |
| 284 | + |
| 285 | +# API Reference |
| 286 | + |
| 287 | +## Core Class |
| 288 | +- `StatClean(df: DataFrame, preserve_index: bool = True)` |
| 289 | + - `set_data`, `set_thresholds`, `get_thresholds`, `reset`, `get_summary_report` |
| 290 | + |
| 291 | +## Detection (non-destructive) |
| 292 | +- `detect_outliers_iqr(column, lower_factor=None, upper_factor=None)` → Series |
| 293 | +- `detect_outliers_zscore(column, threshold=None)` → Series |
| 294 | +- `detect_outliers_modified_zscore(column, threshold=None)` → Series |
| 295 | +- `detect_outliers_mahalanobis(columns=None, chi2_threshold=None, use_shrinkage=False)` → Series |
| 296 | + |
| 297 | +## Removal / Winsorizing (chained) |
| 298 | +- `remove_outliers_iqr(column, ...)` → self |
| 299 | +- `remove_outliers_zscore(column, threshold=None)` → self |
| 300 | +- `remove_outliers_modified_zscore(column, threshold=None)` → self |
| 301 | +- `remove_outliers_mahalanobis(columns=None, chi2_threshold=None, use_shrinkage=False)` → self |
| 302 | +- `winsorize_outliers_iqr(column, ...)` → self |
| 303 | +- `winsorize_outliers_zscore(column, threshold=None)` → self |
| 304 | +- `winsorize_outliers_percentile(column, lower_percentile=5, upper_percentile=95)` → self |
| 305 | + |
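Winsorizing differs from removal in that extreme values are clipped to a bound rather than dropped, so the row count is unchanged. A minimal standard-library sketch of percentile winsorizing follows. The function is hypothetical and only mirrors the `lower_percentile=5, upper_percentile=95` defaults listed above; it assumes integer percentiles between 1 and 99.

```python
import statistics

def winsorize_percentile(values, lower_percentile=5, upper_percentile=95):
    """Clip values to the given percentile bounds instead of dropping them."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_percentile - 1], cuts[upper_percentile - 1]
    return [min(max(v, low), high) for v in values]

values = list(range(1, 102))  # 1..101
clipped = winsorize_percentile(values)
print(len(clipped) == len(values))            # True
print(clipped[0], clipped[50], clipped[-1])   # 6.0 51 96.0
```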
| 306 | +## Analysis & Utilities |
| 307 | +- `analyze_distribution(column)` → dict (skewness, kurtosis, normality, recommendation) |
| 308 | +- `compare_methods(columns=None, methods=None, ...)` → dict summary |
| 309 | +- `get_outlier_stats(columns=None, methods=['iqr','zscore'], ...)` → DataFrame |
| 310 | +- `plot_outlier_analysis(columns=None, methods=None, figsize=(15,5))` → dict[str, Figure] |
| 311 | +- `visualize_outliers(column)` → None |
| 312 | + |
| 313 | +## Transformations |
| 314 | +- `transform_boxcox(column, lambda_param=None)` → (self, info) |
| 315 | +- `transform_log(column, base='natural')` → (self, info) |
| 316 | +- `transform_sqrt(column)` → (self, info) |
| 317 | +- `recommend_transformation(column)` → dict |
| 318 | + |
| 319 | +## Utils (module `statclean.utils`) |
| 320 | +- `plot_outliers(series, outliers_mask, title=None)` |
| 321 | +- `plot_distribution(series, outliers_mask=None, title=None)` |
| 322 | +- `plot_boxplot(series, title=None)` |
| 323 | +- `plot_qq(series, outliers_mask=None, title=None)` |
| 324 | +- `plot_outlier_analysis(data, outliers=None)` |
| 325 | + |
| 326 | +Notes: |
| 327 | +- Remover methods return `self` for chaining; access data via `cleaner.clean_df`. |
| 328 | +- Mahalanobis supports percentile thresholds and shrinkage covariance. |
| 329 | + |
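The chaining behaviour in the notes above comes down to each remover returning `self` and storing its results on the instance. The toy class below is a hypothetical stand-in, not StatClean; its method and attribute names were picked only to show the pattern.

```python
class MiniCleaner:
    """Toy stand-in (hypothetical, not StatClean) showing method chaining."""

    def __init__(self, values):
        self.clean_values = list(values)
        self.outlier_info = {}

    def remove_above(self, limit):
        kept = [v for v in self.clean_values if v <= limit]
        self.outlier_info["above"] = len(self.clean_values) - len(kept)
        self.clean_values = kept
        return self  # returning self is what makes chaining possible

    def remove_below(self, limit):
        kept = [v for v in self.clean_values if v >= limit]
        self.outlier_info["below"] = len(self.clean_values) - len(kept)
        self.clean_values = kept
        return self

cleaner = MiniCleaner([1, 2, 3, 100, -50]).remove_above(10).remove_below(0)
print(cleaner.clean_values)  # [1, 2, 3]
print(cleaner.outlier_info)  # {'above': 1, 'below': 1}
```

The trade-off of this design is that a remover's return value is no longer the cleaned data, so code written against an older return convention has to read the result from the instance instead.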
| 330 | +--- |
| 331 | + |
| 332 | +## Advanced Examples (Advanced-Examples.md) |
| 333 | + |
| 334 | +# Advanced Examples |
| 335 | + |
| 336 | +### California Housing End-to-End |
| 337 | +```python |
| 338 | +import pandas as pd |
| 339 | +from sklearn.datasets import fetch_california_housing |
| 340 | +from statclean import StatClean |
| 341 | + |
| 342 | +housing = fetch_california_housing() |
| 343 | +df = pd.DataFrame(housing.data, columns=housing.feature_names) |
| 344 | +df['PRICE'] = housing.target |
| 345 | + |
| 346 | +cleaner = StatClean(df, preserve_index=True) |
| 347 | + |
| 348 | +# Analyze & clean selected features |
| 349 | +features = ['MedInc', 'AveRooms', 'PRICE'] |
| 350 | +cleaned_df, info = cleaner.clean_columns(features, method='auto', show_progress=True) |
| 351 | + |
| 352 | +# Multivariate check |
| 353 | +mv_outliers = cleaner.detect_outliers_mahalanobis(['MedInc', 'AveRooms', 'PRICE'], chi2_threshold=0.975) |
| 354 | +print('Multivariate outliers:', mv_outliers.sum()) |
| 355 | + |
| 356 | +# Visualization grid |
| 357 | +figs = cleaner.plot_outlier_analysis(features) |
| 358 | +``` |
| 359 | + |
| 360 | +### Modified Z-score Visualization |
| 361 | +```python |
| 362 | +outliers = cleaner.detect_outliers_modified_zscore('PRICE')  # reuses `cleaner` from the California Housing example |
| 363 | +cleaner.remove_outliers_modified_zscore('PRICE') |
| 364 | +cleaner.visualize_outliers('PRICE') |
| 365 | +``` |