Commit 351aba3

Add comprehensive documentation structure
- Create GitHub Pages documentation with Jekyll
- Add detailed API reference with all methods
- Include statistical methods guide with mathematical foundations
- Provide extensive examples and tutorials
- Add installation guide with troubleshooting
- Set up documentation infrastructure for GitHub Pages
1 parent a17295e commit 351aba3

6 files changed, +942 -0 lines changed

docs/_config.yml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
title: StatClean Documentation
description: Comprehensive statistical data preprocessing and outlier detection library
theme: minima
url: "https://subaashnair.github.io"
baseurl: "/StatClean"

plugins:
  - jekyll-feed
  - jekyll-sitemap

markdown: kramdown
highlighter: rouge

navigation:
  - title: Home
    url: /
  - title: API Reference
    url: /api-reference
  - title: Statistical Methods
    url: /statistical-methods
  - title: Examples
    url: /examples
  - title: Installation
    url: /installation

github:
  repository_url: "https://github.com/SubaashNair/StatClean"

docs/api-reference.md

Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@
# API Reference

## StatClean Class

### Initialization

```python
StatClean(df=None, preserve_index=True)
```

**Parameters:**
- `df` (pandas.DataFrame, optional): The DataFrame to clean
- `preserve_index` (bool, default=True): Whether to preserve the original index

### Detection Methods

#### `detect_outliers_iqr(column, lower_factor=1.5, upper_factor=1.5)`
Detect outliers using the Interquartile Range method.

**Parameters:**
- `column` (str): Column name to analyze
- `lower_factor` (float): Lower bound multiplier for IQR
- `upper_factor` (float): Upper bound multiplier for IQR

**Returns:**
- `pandas.Series`: Boolean mask indicating outliers
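
As a rough illustration of the rule behind this method, the sketch below flags values that fall outside `Q1 - lower_factor * IQR` and `Q3 + upper_factor * IQR`, using only the standard library. The helper name `iqr_outlier_mask` and the choice of linearly interpolated quartiles are assumptions made for this example; StatClean's own implementation may differ on both points.

```python
import statistics


def iqr_outlier_mask(values, lower_factor=1.5, upper_factor=1.5):
    """Return one boolean per value, True where the value lies outside the IQR fences."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - lower_factor * iqr
    upper_fence = q3 + upper_factor * iqr
    return [v < lower_fence or v > upper_fence for v in values]


# Same income values as the Quick Start example in docs/examples.md.
incomes = [25000, 30000, 35000, 40000, 500000, 45000, 50000]
mask = iqr_outlier_mask(incomes)
print([v for v, flagged in zip(incomes, mask) if flagged])
```

Separate `lower_factor` and `upper_factor` arguments make it possible to use asymmetric fences, for example a wider upper fence on right-skewed data.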

#### `detect_outliers_zscore(column, threshold=3.0)`
Detect outliers using the Z-score method.

**Parameters:**
- `column` (str): Column name to analyze
- `threshold` (float): Z-score threshold for outlier detection

**Returns:**
- `pandas.Series`: Boolean mask indicating outliers
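
The sketch below shows the Z-score rule in plain Python: a value is flagged when its distance from the mean, measured in standard deviations, exceeds `threshold`. The helper name `zscore_outlier_mask` and the use of the sample standard deviation are assumptions for illustration, not a description of StatClean's internals. One property worth keeping in mind: with the sample standard deviation, no value in a sample of size n can have an absolute Z-score above (n - 1) / sqrt(n), so on very small samples a threshold of 3.0 may never trigger.

```python
import statistics


def zscore_outlier_mask(values, threshold=3.0):
    """Return one boolean per value, True where |z| exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return [abs(v - mean) / std > threshold for v in values]


# Same income values as the Quick Start example in docs/examples.md.
incomes = [25000, 30000, 35000, 40000, 500000, 45000, 50000]

strict = zscore_outlier_mask(incomes)                  # default threshold 3.0
relaxed = zscore_outlier_mask(incomes, threshold=2.0)  # lower threshold

print("flagged at 3.0:", sum(strict))
print("flagged at 2.0:", sum(relaxed))
```

With only seven observations the bound is 6 / sqrt(7), roughly 2.27, which is why this sketch needs a lower threshold before the extreme income is flagged.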

#### `detect_outliers_modified_zscore(column, threshold=3.5)`
Detect outliers using the Modified Z-score (MAD-based) method.

**Parameters:**
- `column` (str): Column name to analyze
- `threshold` (float): Modified Z-score threshold

**Returns:**
- `pandas.Series`: Boolean mask indicating outliers
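
For comparison, here is a minimal sketch of the commonly used modified Z-score, `0.6745 * (x - median) / MAD`, where MAD is the median absolute deviation from the median. The helper name `modified_zscore_mask` and the handling of a zero MAD are assumptions for this example; how StatClean treats that edge case is not stated in this reference.

```python
import statistics


def modified_zscore_mask(values, threshold=3.5):
    """Return one boolean per value, True where the modified Z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # Degenerate case: more than half the values equal the median.
        # Flagging nothing is a simplification chosen for this sketch.
        return [False] * len(values)
    return [abs(0.6745 * (v - median) / mad) > threshold for v in values]


# Same income values as the Quick Start example in docs/examples.md.
incomes = [25000, 30000, 35000, 40000, 500000, 45000, 50000]
mask = modified_zscore_mask(incomes)
print([v for v, flagged in zip(incomes, mask) if flagged])
```

Because the median and MAD are barely affected by a single extreme value, this rule can flag an outlier in a small sample where the ordinary Z-score cannot.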

#### `detect_outliers_mahalanobis(columns, chi2_threshold=0.95)`
Detect multivariate outliers using Mahalanobis distance.

**Parameters:**
- `columns` (list): List of column names for multivariate analysis
- `chi2_threshold` (float): Chi-square threshold percentile

**Returns:**
- `pandas.Series`: Boolean mask indicating outliers
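
To make the idea concrete, the following sketch computes squared Mahalanobis distances for exactly two columns and compares them with a chi-square cutoff. It is a simplified stand-in, not StatClean's implementation: the helper name `mahalanobis_2d` and the sample data are hypothetical, and the closed-form cutoff `-2 * ln(1 - p)` is valid only for two degrees of freedom. A general version would need a chi-square quantile function such as `scipy.stats.chi2.ppf`.

```python
import math
import statistics


def mahalanobis_2d(xs, ys, chi2_threshold=0.95):
    """Return (squared distances, outlier mask) for paired observations."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    # Chi-square quantile for 2 degrees of freedom (closed form).
    cutoff = -2.0 * math.log(1.0 - chi2_threshold)
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # (dx, dy) S^-1 (dx, dy)^T with the 2x2 inverse written out.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances, [d > cutoff for d in distances]


# Hypothetical data: ten points near the line y = x plus one point far off it.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 5]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9, 7.2, 7.8, 9.1, 9.9, 20.0]
distances, mask = mahalanobis_2d(xs, ys)
print([round(d, 2) for d in distances])
print(mask)
```

The last point is unremarkable in `x` alone but sits far from the joint pattern of the other points, which is the kind of case a multivariate distance is meant to catch.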

### Treatment Methods

#### `remove_outliers_iqr(column, lower_factor=1.5, upper_factor=1.5)`
Remove outliers using the IQR method.

**Returns:**
- `StatClean`: Self (enables method chaining)

#### `remove_outliers_zscore(column, threshold=3.0)`
Remove outliers using the Z-score method.

**Returns:**
- `StatClean`: Self (enables method chaining)

#### `winsorize_outliers_iqr(column, lower_factor=1.5, upper_factor=1.5)`
Cap outliers at the IQR bounds instead of removing them.

**Returns:**
- `StatClean`: Self (enables method chaining)
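
The difference between removing and capping can be sketched in a few lines. The helper below clamps every value to the IQR fences, so the row count is unchanged. The name `winsorize_iqr` and the quartile interpolation are assumptions for this illustration and may not match StatClean exactly.

```python
import statistics


def winsorize_iqr(values, lower_factor=1.5, upper_factor=1.5):
    """Clamp each value to [Q1 - lower_factor*IQR, Q3 + upper_factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - lower_factor * iqr
    upper_fence = q3 + upper_factor * iqr
    return [min(max(v, lower_fence), upper_fence) for v in values]


# Same income values as the Quick Start example in docs/examples.md.
incomes = [25000, 30000, 35000, 40000, 500000, 45000, 50000]
capped = winsorize_iqr(incomes)
print(capped)
```

Capping keeps every observation, which can matter when other columns in the same row are still informative.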

### Statistical Testing

#### `grubbs_test(column, alpha=0.05, two_sided=True)`
Perform Grubbs' test for outliers at significance level `alpha`.

**Parameters:**
- `column` (str): Column name to test
- `alpha` (float): Significance level
- `two_sided` (bool): Whether to perform a two-sided test

**Returns:**
- `dict`: Test results including p-value, test statistic, critical value
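
The test statistic itself is simple to compute: G is the largest absolute deviation from the mean divided by the sample standard deviation. The sketch below computes only G and the index of the most extreme value; the helper name `grubbs_statistic` is hypothetical. The critical value is deliberately left out, because it requires a Student t quantile (for the two-sided test, G is compared with `((n-1)/sqrt(n)) * sqrt(t**2 / (n - 2 + t**2))`, where `t` is the upper `alpha/(2n)` quantile with `n - 2` degrees of freedom), and that is not available in the standard library.

```python
import math
import statistics


def grubbs_statistic(values):
    """Return (G, index of the most extreme value)."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation
    index = max(range(len(values)), key=lambda i: abs(values[i] - mean))
    return abs(values[index] - mean) / std, index


# Same income values as the Quick Start example in docs/examples.md.
incomes = [25000, 30000, 35000, 40000, 500000, 45000, 50000]
g, suspect = grubbs_statistic(incomes)

# G can never exceed (n - 1) / sqrt(n) for a sample of size n.
g_max = (len(incomes) - 1) / math.sqrt(len(incomes))
print(f"G = {g:.3f} for value {incomes[suspect]} (upper bound {g_max:.3f})")
```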

#### `dixon_q_test(column, alpha=0.05)`
Perform Dixon's Q-test for small samples (n < 30).

**Parameters:**
- `column` (str): Column name to test
- `alpha` (float): Significance level

**Returns:**
- `dict`: Test results including Q-statistic, critical value, p-value
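
In its basic form, Dixon's Q is the gap between a suspect extreme value and its nearest neighbour, divided by the full range of the data. The sketch below computes that ratio for both ends of the sorted sample. The helper name `dixon_q_statistics` is hypothetical, and the comparison against tabulated critical values is omitted; whether StatClean uses this basic ratio or one of the other Dixon variants is not stated here.

```python
def dixon_q_statistics(values):
    """Return (Q for the smallest value, Q for the largest value)."""
    ordered = sorted(values)
    if len(ordered) < 3:
        raise ValueError("Dixon's Q needs at least three observations")
    spread = ordered[-1] - ordered[0]
    if spread == 0:
        raise ValueError("all values are identical")
    q_low = (ordered[1] - ordered[0]) / spread
    q_high = (ordered[-1] - ordered[-2]) / spread
    return q_low, q_high


# Same columns as the Quick Start example in docs/examples.md.
ages = [25, 30, 35, 40, 35, 45, 50]
incomes = [25000, 30000, 35000, 40000, 500000, 45000, 50000]

print("age Q (low, high):", dixon_q_statistics(ages))
print("income Q (low, high):", dixon_q_statistics(incomes))
```

A Q value close to 1 means the suspect value accounts for almost the entire range, as the largest income does here.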

### Data Transformations

#### `transform_boxcox(column, lambda_param=None)`
Apply a Box-Cox transformation. When `lambda_param` is omitted, lambda is estimated automatically.

**Parameters:**
- `column` (str): Column name to transform
- `lambda_param` (float, optional): Transformation parameter

**Returns:**
- `dict`: Transformation results including optimal lambda
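
For a fixed lambda, the Box-Cox transform is `(x**lambda - 1) / lambda`, with `ln(x)` as the limiting case at lambda = 0, and it is defined only for strictly positive values. The sketch below applies that formula directly; the helper name `boxcox_transform` is hypothetical, and the automatic lambda estimation (typically done by maximum likelihood) is not reproduced.

```python
import math


def boxcox_transform(values, lam):
    """Apply the Box-Cox transform with a fixed lambda to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


# Same income values as the Quick Start example in docs/examples.md.
incomes = [25000, 30000, 35000, 40000, 500000, 45000, 50000]
log_incomes = boxcox_transform(incomes, 0)       # log transform
sqrt_like = boxcox_transform(incomes, 0.5)       # square-root-like transform
print([round(v, 2) for v in log_incomes])
```

Values of lambda below 1 compress the right tail, which is why the transform is often tried on right-skewed columns before outlier detection.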

#### `recommend_transformation(column)`
Recommend the best transformation based on the column's distribution.

**Parameters:**
- `column` (str): Column name to analyze

**Returns:**
- `dict`: Recommendations including best transformation and improvement metrics

### Analysis Methods

#### `analyze_distribution(column)`
Analyze the distribution of a column, including statistical tests.

**Parameters:**
- `column` (str): Column name to analyze

**Returns:**
- `dict`: Distribution analysis including skewness, kurtosis, normality test
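
As a small illustration of one of the reported quantities, the sketch below computes the moment-based sample skewness `m3 / m2**1.5`. The helper name `sample_skewness` is hypothetical, and this is the uncorrected estimator; libraries often apply a small-sample bias correction, so the value StatClean reports may differ.

```python
import statistics


def sample_skewness(values):
    """Return the moment-based skewness m3 / m2**1.5 (no bias correction)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Same income values as the Quick Start example in docs/examples.md.
incomes = [25000, 30000, 35000, 40000, 500000, 45000, 50000]
print(f"income skewness: {sample_skewness(incomes):.3f}")
print(f"symmetric data:  {sample_skewness([1, 2, 3, 4, 5]):.3f}")
```

A clearly positive value indicates a long right tail, which is the situation where a mean-based rule such as the Z-score tends to be less reliable than IQR- or MAD-based rules.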

#### `compare_methods(columns, methods=['iqr', 'zscore', 'modified_zscore'])`
Compare agreement between different detection methods.

**Parameters:**
- `columns` (list): Column names to compare
- `methods` (list): Detection methods to compare

**Returns:**
- `dict`: Method comparison results and agreement statistics

### Visualization

#### `plot_outlier_analysis(columns=None, figsize=(15, 5))`
Generate comprehensive outlier analysis plots.

**Parameters:**
- `columns` (list, optional): Columns to plot (defaults to all numeric)
- `figsize` (tuple): Base figure size for each subplot

**Returns:**
- `dict`: Dictionary of matplotlib figures keyed by column names

### Utility Methods

#### `get_outlier_stats(columns=None, include_indices=False)`
Get comprehensive outlier statistics without removing data.

**Parameters:**
- `columns` (list, optional): Columns to analyze
- `include_indices` (bool): Whether to include outlier indices

**Returns:**
- `pandas.DataFrame`: Statistics for each column and method

#### `set_thresholds(**kwargs)`
Configure default thresholds for detection methods.

**Parameters:**
- `iqr_lower_factor` (float): IQR lower bound multiplier
- `iqr_upper_factor` (float): IQR upper bound multiplier
- `zscore_threshold` (float): Z-score threshold
- `modified_zscore_threshold` (float): Modified Z-score threshold

**Returns:**
- `StatClean`: Self (enables method chaining)
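
The "returns self" convention is what makes chained calls possible. The toy class below is a hypothetical stand-in, not the StatClean class: it only shows how a `set_thresholds(**kwargs)` style method can validate its keyword names, update stored defaults, and return the same object so that further calls can follow.

```python
class ThresholdSettings:
    """Toy stand-in that mimics a chainable threshold setter."""

    def __init__(self):
        # Defaults mirror the signatures listed in this reference.
        self.thresholds = {
            "iqr_lower_factor": 1.5,
            "iqr_upper_factor": 1.5,
            "zscore_threshold": 3.0,
            "modified_zscore_threshold": 3.5,
        }

    def set_thresholds(self, **kwargs):
        unknown = set(kwargs) - set(self.thresholds)
        if unknown:
            raise KeyError(f"unknown threshold(s): {sorted(unknown)}")
        self.thresholds.update(kwargs)
        return self  # returning self is what enables chaining


settings = (
    ThresholdSettings()
    .set_thresholds(zscore_threshold=2.5)
    .set_thresholds(iqr_lower_factor=2.0, iqr_upper_factor=2.0)
)
print(settings.thresholds)
```

Rejecting unknown keyword names is a design choice of this sketch; it catches typos such as `zscore_treshold` early instead of silently ignoring them.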

## Utility Functions

### `plot_outliers(data, outliers, title="Outlier Analysis", figsize=(10, 6))`
Create scatter plot highlighting outliers.

### `plot_distribution(data, outliers, title="Distribution Analysis", figsize=(10, 6))`
Plot KDE distribution with outlier separation.

### `plot_boxplot(data, outliers, title="Box Plot Analysis", figsize=(10, 6))`
Enhanced box plot with outlier overlay.

### `plot_qq(data, outliers, title="Q-Q Plot", figsize=(10, 6))`
Q-Q plot for normality assessment.

### `plot_outlier_analysis(data, outliers, title="Comprehensive Analysis", figsize=(12, 10))`
2x2 comprehensive analysis dashboard.

docs/examples.md

Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
# Examples

## Quick Start Example

```python
import pandas as pd
from statclean import StatClean

# Sample data with outliers
df = pd.DataFrame({
    'income': [25000, 30000, 35000, 40000, 500000, 45000, 50000],
    'age': [25, 30, 35, 40, 35, 45, 50]
})

# Initialize StatClean
cleaner = StatClean(df)

# Basic outlier removal
cleaner.remove_outliers_zscore('income')
cleaned_df = cleaner.clean_df

print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {cleaned_df.shape}")
```

## Statistical Testing Example

```python
# Formal statistical testing (reuses `cleaner` from the Quick Start example)
grubbs_result = cleaner.grubbs_test('income', alpha=0.05)
print(f"P-value: {grubbs_result['p_value']:.6f}")
print(f"Outlier detected: {grubbs_result['outlier_detected']}")

# Dixon's Q-test for small samples
dixon_result = cleaner.dixon_q_test('age', alpha=0.05)
print(f"Q-statistic: {dixon_result['q_statistic']:.3f}")
```

## Multivariate Analysis Example

```python
# Mahalanobis distance for multivariate outliers
outliers = cleaner.detect_outliers_mahalanobis(['income', 'age'])
print(f"Multivariate outliers detected: {outliers.sum()}")

# Remove multivariate outliers
cleaner.remove_outliers_mahalanobis(['income', 'age'])
```

## Data Transformation Example

```python
# Automatic transformation recommendation
recommendation = cleaner.recommend_transformation('income')
print(f"Best transformation: {recommendation['best_transformation']}")

# Apply Box-Cox transformation
transformed = cleaner.transform_boxcox('income')
print(f"Optimal lambda: {transformed['lambda']:.3f}")
```

## Method Chaining Example

```python
# Fluent API with method chaining
result = (cleaner
          .set_thresholds(zscore_threshold=2.5)
          .add_zscore_columns(['income'])
          .winsorize_outliers_iqr('income')
          .clean_df)
```

## Comprehensive Analysis Example

```python
# Distribution analysis
analysis = cleaner.analyze_distribution('income')
print(f"Skewness: {analysis['skewness']:.3f}")
print(f"Recommended method: {analysis['recommended_method']}")

# Compare detection methods
comparison = cleaner.compare_methods(['income'])
print("Method Agreement:")
for method, stats in comparison['income']['method_stats'].items():
    print(f" {method}: {stats['outliers_detected']} outliers")
```

## Visualization Example

```python
import matplotlib.pyplot as plt
from statclean.utils import plot_outliers, plot_distribution

# Comprehensive analysis plots
figures = cleaner.plot_outlier_analysis(['income', 'age'])

# Individual visualization components
outliers = cleaner.detect_outliers_zscore('income')
plot_outliers(df['income'], outliers, title='Income Analysis')
plot_distribution(df['income'], outliers, title='Income Distribution')

plt.show()
```

## Real Dataset Example

```python
from sklearn.datasets import fetch_california_housing
import pandas as pd
from statclean import StatClean

# Load California Housing dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['PRICE'] = housing.target

print(f"Dataset shape: {df.shape}")

# Initialize with index preservation
cleaner = StatClean(df, preserve_index=True)

# Analyze key features
features = ['MedInc', 'AveRooms', 'PRICE']
for feature in features:
    analysis = cleaner.analyze_distribution(feature)
    print(f"\n{feature} Analysis:")
    print(f" Skewness: {analysis['skewness']:.3f}")
    print(f" Recommended method: {analysis['recommended_method']}")

# Comprehensive cleaning
cleaned_df, info = cleaner.clean_columns(
    columns=features,
    method='auto',
    show_progress=True
)

print("\nResults:")
print(f"Original: {df.shape}")
print(f"Cleaned: {cleaned_df.shape}")
```

## Advanced Statistical Example

```python
# Batch processing with detailed reporting
# (reuses `cleaner` from the Real Dataset example)
columns_to_clean = ['MedInc', 'AveRooms', 'Population', 'PRICE']

# Get outlier statistics without removal
stats = cleaner.get_outlier_stats(columns_to_clean, include_indices=True)
print(stats)

# Apply custom cleaning strategy
strategy = {
    'MedInc': {'method': 'modified_zscore', 'threshold': 3.0},
    'AveRooms': {'method': 'iqr', 'lower_factor': 2.0, 'upper_factor': 2.0},
    'Population': {'method': 'zscore', 'threshold': 2.5},
    'PRICE': {'method': 'auto'}
}

cleaned_df = cleaner.apply_cleaning_strategy(strategy)

# Generate summary report
report = cleaner.get_summary_report()
print(report)
```
