Commit 2a419f5

docs(wiki): restructure and expand wiki content
1 parent ff32ad4 commit 2a419f5

1 file changed: wiki_content.md

Lines changed: 115 additions & 1 deletion
@@ -2,6 +2,7 @@

## Home Page (Home.md)

# Welcome to StatClean

StatClean is a comprehensive statistical data preprocessing and outlier detection library with formal statistical testing and publication-quality reporting.
As of v0.1.3, remover methods return the cleaner instance for chaining; access results via `cleaner.clean_df` and `cleaner.outlier_info`.
@@ -248,4 +249,117 @@ Instructions for setting up GitHub Wiki:
3. Click "Create the first page"
4. Copy the content above for each page
5. Create pages with the exact names shown in parentheses
6. Set "Home" as the main wiki page

---

## Statistical Methods Guide (Statistical-Methods-Guide.md)

# Statistical Methods Guide

## Univariate Methods
- IQR: robust to non-normal data; configure lower/upper factors.
- Z-score: assumes approximate normality; configurable threshold.
- Modified Z-score: robust via MAD; default threshold 3.5.
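
The three univariate rules can be sketched with the standard library alone. This is an illustration of the underlying statistics, not StatClean's implementation: the function names are hypothetical, and the IQR factor of 1.5 and z-score threshold of 3.0 are assumed conventional defaults (only the modified z-score default of 3.5 is stated above).

```python
import statistics

def iqr_outliers(values, lower_factor=1.5, upper_factor=1.5):
    """Flag values outside [Q1 - lower_factor*IQR, Q3 + upper_factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - lower_factor * iqr, q3 + upper_factor * iqr
    return [v < low or v > high for v in values]

def zscore_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds threshold sample std devs."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [abs(v - mean) / std > threshold for v in values]

def modified_zscore_outliers(values, threshold=3.5):
    """Flag values using the median and MAD instead of the mean and std dev.

    Raises ZeroDivisionError when the MAD is zero (more than half the values equal).
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    # 0.6745 rescales the MAD so the score is comparable to a z-score under normality.
    return [0.6745 * abs(v - med) / mad > threshold for v in values]

data = [10, 11, 12, 12, 13, 13, 14, 15, 100]
print("IQR:", sum(iqr_outliers(data)))
print("Z-score:", sum(zscore_outliers(data)))
print("Modified Z-score:", sum(modified_zscore_outliers(data)))
```

On this small sample the IQR and modified z-score rules flag the value 100, while the plain z-score rule flags nothing: a single extreme value inflates the mean and standard deviation it is measured against. That is the practical meaning of "robust" in the list above.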

## Multivariate Methods
- Mahalanobis distance: detects multivariate outliers using covariance structure.
  - `chi2_threshold`: percentile in (0,1] or absolute chi-square statistic.
  - `use_shrinkage=True`: enables Ledoit–Wolf shrinkage covariance when scikit-learn is available.
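
A minimal sketch of the Mahalanobis rule, restricted to two features so it runs on the standard library. All function names are hypothetical. Two things are assumptions rather than documented StatClean behavior: that a `chi2_threshold` in (0,1] is read as a percentile and anything larger as an absolute cutoff, and that 0.975 is a reasonable default (it is the value used in the housing example, not a confirmed library default). Shrinkage covariance is not shown.

```python
import math
import statistics

def chi2_cutoff_2d(chi2_threshold):
    """Cutoff on squared Mahalanobis distance for exactly two features."""
    if 0 < chi2_threshold < 1:
        # Chi-square with 2 degrees of freedom has CDF 1 - exp(-x / 2),
        # so its quantile function is -2 * ln(1 - p).
        return -2.0 * math.log(1.0 - chi2_threshold)
    if chi2_threshold == 1:
        return math.inf  # the 100th percentile flags nothing
    return float(chi2_threshold)  # assumed to be an absolute statistic

def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) pair from the sample mean."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx, syy = statistics.variance(xs), statistics.variance(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    det = sxx * syy - sxy ** 2  # zero when the two features are perfectly collinear
    # Closed-form inverse of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    return [
        (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        for dx, dy in ((x - mx, y - my) for x, y in zip(xs, ys))
    ]

def mahalanobis_outliers_2d(xs, ys, chi2_threshold=0.975):
    cutoff = chi2_cutoff_2d(chi2_threshold)
    return [d2 > cutoff for d2 in mahalanobis_sq_2d(xs, ys)]

xs = [2.0, 2.5, 3.1, 3.6, 4.0, 4.4, 5.1, 5.5, 6.0, 6.4, 3.0]
ys = [4.1, 5.0, 6.3, 7.1, 8.2, 8.7, 10.1, 11.2, 12.0, 12.9, 12.5]
print("cutoff:", chi2_cutoff_2d(0.975))
print("flagged:", sum(mahalanobis_outliers_2d(xs, ys)))
```

For more than two features the chi-square quantile has no such closed form; `scipy.stats.chi2.ppf(p, df=k)` is the usual way to obtain it.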

## Formal Tests
- Grubbs' test: single outlier detection with p-value and critical value.
- Dixon's Q-test: for small n (<30); approximate p-value reported.
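
To make the Q-test concrete, here is a sketch of the simplest form of Dixon's statistic (gap to the nearest neighbor divided by the range) for 3 to 10 observations. It is not StatClean's implementation: the function name is hypothetical, and instead of an approximate p-value it compares against the commonly tabulated two-sided 95% critical values, which should be checked against a statistical reference before being relied on.

```python
# Commonly tabulated two-sided 95% critical values for Dixon's Q (assumed table).
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

def dixon_q(values):
    """Return (suspect value, Q statistic, reject at 95%?) for the more extreme end."""
    data = sorted(values)
    n = len(data)
    if n not in Q_CRIT_95:
        raise ValueError("this sketch only covers 3 <= n <= 10")
    spread = data[-1] - data[0]
    if spread == 0:
        raise ValueError("all values are identical")
    q_low = (data[1] - data[0]) / spread      # gap below the second-smallest value
    q_high = (data[-1] - data[-2]) / spread   # gap above the second-largest value
    if q_high >= q_low:
        suspect, q = data[-1], q_high
    else:
        suspect, q = data[0], q_low
    return suspect, q, q > Q_CRIT_95[n]

print(dixon_q([10, 11, 12, 13, 30]))
```

The test examines only one suspect value per run, so it is not a substitute for the detection methods above when several outliers may be present.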

## Transformations
- Box-Cox (positive data): optimal lambda estimated; preserves NaNs.
- Log (natural, base 10, base 2): shifts applied for non-positive values.
- Square-root: shifts applied for negatives.
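
One way a shifted log transform can work is sketched below for the natural log only. The function name is hypothetical and the shift rule (move the minimum to 1 when any value is non-positive) is an assumption; StatClean's actual shift may differ. NaNs pass through unchanged, mirroring the NaN-preserving behavior described for Box-Cox.

```python
import math

def shifted_log(values):
    """Natural log with an automatic shift; returns (transformed values, shift used)."""
    finite = [v for v in values if not math.isnan(v)]
    lowest = min(finite, default=1.0)
    # Assumed rule: shift so the smallest value maps to 1, hence log(1) = 0.
    shift = 1.0 - lowest if lowest <= 0 else 0.0
    transformed = [v if math.isnan(v) else math.log(v + shift) for v in values]
    return transformed, shift

values = [-3.0, 0.0, float("nan"), 5.0]
transformed, shift = shifted_log(values)
print(shift, transformed)
```

Returning the shift alongside the data matters because the transform cannot be inverted without it, which is presumably why the library's transform methods return an info object as well.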

Best practices: drop NaNs before running tests that cannot handle them; for large datasets, run the Shapiro-Wilk test on a sample.

---

## API Reference (API-Reference.md)

# API Reference

## Core Class
- `StatClean(df: DataFrame, preserve_index: bool = True)`
- `set_data`, `set_thresholds`, `get_thresholds`, `reset`, `get_summary_report`

## Detection (non-destructive)
- `detect_outliers_iqr(column, lower_factor=None, upper_factor=None)` → Series
- `detect_outliers_zscore(column, threshold=None)` → Series
- `detect_outliers_modified_zscore(column, threshold=None)` → Series
- `detect_outliers_mahalanobis(columns=None, chi2_threshold=None, use_shrinkage=False)` → Series

## Removal / Winsorizing (chained)
- `remove_outliers_iqr(column, ...)` → self
- `remove_outliers_zscore(column, threshold=None)` → self
- `remove_outliers_modified_zscore(column, threshold=None)` → self
- `remove_outliers_mahalanobis(columns=None, chi2_threshold=None, use_shrinkage=False)` → self
- `winsorize_outliers_iqr(column, ...)` → self
- `winsorize_outliers_zscore(column, threshold=None)` → self
- `winsorize_outliers_percentile(column, lower_percentile=5, upper_percentile=95)` → self
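
Winsorizing differs from removal in that extreme values are clipped to a bound rather than dropped, so the row count is unchanged. A standard-library sketch of the percentile variant follows; the function name is hypothetical, it accepts whole-number percentiles only, and the linear interpolation used for the bounds is an assumption that may not match StatClean's.

```python
import statistics

def winsorize_percentile(values, lower_percentile=5, upper_percentile=95):
    """Clip values to the given lower and upper percentiles (integers 1..99)."""
    if not 1 <= lower_percentile < upper_percentile <= 99:
        raise ValueError("percentiles must satisfy 1 <= lower < upper <= 99")
    # quantiles(n=100) returns the 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_percentile - 1], cuts[upper_percentile - 1]
    return [min(max(v, low), high) for v in values]

values = list(range(1, 102))          # 1, 2, ..., 101
clipped = winsorize_percentile(values)
print(len(clipped), min(clipped), max(clipped))
```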

## Analysis & Utilities
- `analyze_distribution(column)` → dict (skewness, kurtosis, normality, recommendation)
- `compare_methods(columns=None, methods=None, ...)` → dict summary
- `get_outlier_stats(columns=None, methods=['iqr','zscore'], ...)` → DataFrame
- `plot_outlier_analysis(columns=None, methods=None, figsize=(15,5))` → dict[str, Figure]
- `visualize_outliers(column)` → None

## Transformations
- `transform_boxcox(column, lambda_param=None)` → (self, info)
- `transform_log(column, base='natural')` → (self, info)
- `transform_sqrt(column)` → (self, info)
- `recommend_transformation(column)` → dict

## Utils (module `statclean.utils`)
- `plot_outliers(series, outliers_mask, title=None)`
- `plot_distribution(series, outliers_mask=None, title=None)`
- `plot_boxplot(series, title=None)`
- `plot_qq(series, outliers_mask=None, title=None)`
- `plot_outlier_analysis(data, outliers=None)`

Notes:
- Remover methods return `self` for chaining; access data via `cleaner.clean_df`.
- Mahalanobis supports percentile thresholds and shrinkage covariance.
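
The return-`self` convention can be illustrated with a toy class. `MiniCleaner` and its methods are hypothetical stand-ins written for this note, not part of StatClean; they only show why a method that returns the instance can be chained and why results are then read from attributes rather than return values.

```python
class MiniCleaner:
    """Hypothetical stand-in for a cleaner whose remover methods return self."""

    def __init__(self, values):
        self.clean_values = list(values)  # plays the role of cleaner.clean_df
        self.outlier_info = {}            # plays the role of cleaner.outlier_info

    def remove_above(self, limit):
        kept = [v for v in self.clean_values if v <= limit]
        self.outlier_info[f"above_{limit}"] = len(self.clean_values) - len(kept)
        self.clean_values = kept
        return self  # returning the instance is what makes chaining possible

    def remove_below(self, limit):
        kept = [v for v in self.clean_values if v >= limit]
        self.outlier_info[f"below_{limit}"] = len(self.clean_values) - len(kept)
        self.clean_values = kept
        return self

cleaner = MiniCleaner([1, 5, 7, 50, -20]).remove_above(10).remove_below(0)
print(cleaner.clean_values)
print(cleaner.outlier_info)
```

Code written against an API where removers returned the cleaned data directly would need updating under this convention, since the return value is now the cleaner itself.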

---

## Advanced Examples (Advanced-Examples.md)

# Advanced Examples

### California Housing End-to-End
```python
import pandas as pd
from sklearn.datasets import fetch_california_housing
from statclean import StatClean

housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['PRICE'] = housing.target

cleaner = StatClean(df, preserve_index=True)

# Analyze & clean selected features
features = ['MedInc', 'AveRooms', 'PRICE']
cleaned_df, info = cleaner.clean_columns(features, method='auto', show_progress=True)

# Multivariate check
mv_outliers = cleaner.detect_outliers_mahalanobis(['MedInc', 'AveRooms', 'PRICE'], chi2_threshold=0.975)
print('Multivariate outliers:', mv_outliers.sum())

# Visualization grid
figs = cleaner.plot_outlier_analysis(features)
```
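
### Modified Z-score Visualization
```python
# Continues from the example above: `cleaner` wraps the housing DataFrame.
# Detection is non-destructive and returns a Series flagging outliers.
outliers = cleaner.detect_outliers_modified_zscore('PRICE')
# Removal returns the cleaner itself; the cleaned data is in cleaner.clean_df.
cleaner.remove_outliers_modified_zscore('PRICE')
cleaner.visualize_outliers('PRICE')
```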
