Commit a17295e

Initial release of StatClean v0.1.0
Complete rebranding from OutlierCleaner to StatClean with expanded statistical capabilities:

Features:
- Formal Statistical Testing: Grubbs' test, Dixon's Q-test with p-values
- Multivariate Analysis: Mahalanobis distance outlier detection
- Data Transformations: Box-Cox, logarithmic, square-root with auto recommendations
- Method Chaining: Fluent API for streamlined statistical workflows
- Publication-Quality Reporting: Statistical significance testing and effect sizes

Enhanced Functionality:
- Advanced Distribution Analysis: Automatic normality testing and method recommendations
- Batch Processing: Multi-column processing with progress tracking
- Comprehensive Visualization: 3-in-1 analysis plots and standalone plotting functions
- Type Safety: Complete type annotations for enhanced IDE support

0 parents  commit a17295e

15 files changed: 3809 additions, 0 deletions

.github/workflows/publish.yml

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+name: Publish to PyPI and GitHub Release
+
+on:
+  push:
+    tags:
+      - 'v*'
+
+permissions:
+  contents: write
+
+jobs:
+  deploy:
+    runs-on: ubuntu-latest
+    permissions:
+      id-token: write  # IMPORTANT: this permission is mandatory for trusted publishing
+      contents: write
+    steps:
+    - uses: actions/checkout@v3
+
+    - name: Set up Python
+      uses: actions/setup-python@v4
+      with:
+        python-version: '3.x'
+
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install build
+
+    - name: Create GitHub Release
+      uses: softprops/action-gh-release@v1
+      env:
+        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+      with:
+        name: Release ${{ github.ref_name }}
+        body: |
+          StatClean ${{ github.ref_name }} - A comprehensive statistical data preprocessing and outlier detection library.
+
+          See [CHANGELOG.md](https://github.com/SubaashNair/StatClean/blob/main/CHANGELOG.md) for detailed release notes.
+
+          ## Installation
+          ```
+          pip install statclean==${{ github.ref_name }}
+          ```
+
+    - name: Build package
+      run: python -m build
+
+    - name: Publish package distributions to PyPI
+      uses: pypa/gh-action-pypi-publish@release/v1

.gitignore

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Virtual Environment
+venv/
+env/
+ENV/
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+
+# Distribution
+dist/
+build/
+*.egg-info/
+
+# Misc
+.DS_Store
+.env

CHANGELOG.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
+# Changelog
+
+All notable changes to StatClean will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.1.0] - 2025-08-06
+
+### 🎉 Initial Release of StatClean
+
+This marks the initial public release of StatClean, a comprehensive statistical data preprocessing and outlier detection library. The project has been completely rebranded from OutlierCleaner to StatClean with expanded statistical capabilities.
+
+### Added
+
+#### **Formal Statistical Testing**
+- **Grubbs' Test**: Single outlier detection with p-values and critical values
+- **Dixon's Q-Test**: Outlier detection for small samples (n < 30) with statistical significance
+- **Distribution Analysis**: Automatic normality testing using Shapiro-Wilk test
+- **Statistical Validation**: P-values, confidence intervals, and critical value calculations
+
+#### **Detection Methods**
+- **Univariate Methods**:
+  - IQR (Interquartile Range) with configurable factors
+  - Z-score method with customizable thresholds
+  - Modified Z-score using MAD (robust to non-normal distributions)
+- **Multivariate Methods**:
+  - Mahalanobis distance outlier detection with chi-square thresholds
+- **Batch Detection**: Multi-column outlier detection with progress tracking
+
+#### **Treatment Options**
+- **Outlier Removal**: Remove detected outliers with statistical validation
+- **Winsorizing**: Cap outliers at specified bounds instead of removal
+  - IQR-based winsorizing
+  - Z-score based winsorizing
+  - Percentile-based winsorizing
+- **Data Transformations**:
+  - Box-Cox transformation with automatic lambda estimation
+  - Logarithmic transformations (natural, base 10, base 2)
+  - Square root transformation
+  - Automatic transformation recommendation based on distribution analysis
+
+#### **Advanced Visualization**
+- **Comprehensive Analysis Plots**: 3-in-1 analysis dashboard (boxplot, distribution, Q-Q plot)
+- **Standalone Plotting Functions**:
+  - `plot_outliers()`: Scatter plots with outlier highlighting
+  - `plot_distribution()`: KDE distribution plots with outlier separation
+  - `plot_boxplot()`: Enhanced box plots with outlier overlay
+  - `plot_qq()`: Q-Q plots for normality assessment
+  - `plot_outlier_analysis()`: 2x2 comprehensive analysis grid
+- **Publication-Ready Figures**: Professional styling with customizable parameters
+
+#### **Developer Experience Features**
+- **Method Chaining**: Fluent API enabling streamlined workflows
+- **Type Safety**: Comprehensive type hints for enhanced IDE support and error detection
+- **Progress Tracking**: Built-in progress bars using tqdm for batch operations
+- **Flexible Configuration**: Customizable thresholds and statistical parameters
+- **Memory Efficiency**: Statistics caching and lazy evaluation for performance
+
+#### **Analysis and Reporting**
+- **Distribution Analysis**: Comprehensive statistical analysis including:
+  - Skewness and kurtosis calculation
+  - Automatic method recommendation based on distribution characteristics
+  - Normality testing with statistical significance
+- **Method Comparison**: Statistical agreement analysis between different detection methods
+- **Batch Processing**: Multi-column processing with detailed reporting and progress tracking
+- **Summary Reports**: Publication-quality statistical summaries
+
+#### **Utility Features**
+- **Index Preservation**: Configurable index handling during data cleaning
+- **Missing Value Handling**: Robust handling of NaN values and edge cases
+- **Data Validation**: Automatic data type validation and error handling
+- **Statistics Caching**: Efficient caching for repeated statistical operations
+
+### Technical
+
+#### **Core Architecture**
+- **StatClean Class**: Main class with 40+ methods for comprehensive statistical preprocessing
+- **Modular Design**: Separate utils module for standalone visualization functions
+- **Robust Error Handling**: Comprehensive edge case handling for statistical computations
+- **Performance Optimization**: Lazy evaluation and efficient memory usage
+
+#### **Dependencies**
+- **Core**: numpy, pandas, matplotlib, seaborn, scipy
+- **Statistical**: scipy for advanced statistical tests and distributions
+- **Progress**: tqdm for user-friendly progress tracking
+- **Development**: Complete type annotations for static analysis
+
+#### **API Design**
+- **Intuitive Interface**: Clear method naming and consistent parameter patterns
+- **Flexible Parameters**: Configurable thresholds and statistical significance levels
+- **Return Types**: Comprehensive return dictionaries with statistical metadata
+- **Documentation**: Extensive docstrings with mathematical explanations
+
+### Package Information
+
+- **Package Name**: `statclean` (renamed from `outlier-cleaner`)
+- **Main Class**: `StatClean` (renamed from `OutlierCleaner`)
+- **Backward Compatibility**: Alias maintained for transition (`OutlierCleaner = StatClean`)
+- **Version**: 0.1.0 (semantic versioning reset for new package)
+- **Python Support**: ≥3.7
+- **Development Status**: Production/Stable
+
+### Migration Notes
+
+For users migrating from OutlierCleaner:
+
+#### **Package Installation**
+```bash
+# Old
+pip install outlier-cleaner
+
+# New
+pip install statclean
+```
+
+#### **Import Changes**
+```python
+# Old
+from outlier_cleaner import OutlierCleaner
+
+# New (recommended)
+from statclean import StatClean
+
+# Backward compatible (temporary)
+from statclean import OutlierCleaner  # Will be deprecated
+```
+
+#### **Enhanced Capabilities**
+- All existing functionality preserved and enhanced
+- New statistical testing methods available
+- Improved visualization with more plot types
+- Method chaining support for complex workflows
+- Enhanced error handling and edge case management
+
+### Future Roadmap
+
+Planned features for upcoming releases:
+
+- **Additional Statistical Tests**: Anderson-Darling, Kolmogorov-Smirnov
+- **Advanced Multivariate Methods**: Isolation Forest, Local Outlier Factor
+- **Performance Optimizations**: Parallel processing for large datasets
+- **Interactive Visualizations**: Plotly integration for interactive analysis
+- **Export Capabilities**: Statistical reports in multiple formats (PDF, HTML, LaTeX)
+
+---
+
+*This changelog follows the principles of [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) for clear communication of changes to users and developers.*
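
Note on the CHANGELOG entry "Modified Z-score using MAD": the changelog names the method but this diff does not show its implementation. As a rough illustration only, the standard technique can be sketched with the Python standard library. The function names `modified_z_scores` and `flag_outliers` are hypothetical and are not StatClean's API; the 0.6745 scaling constant and the 3.5 cutoff are the conventional textbook defaults, assumed here rather than taken from the package.

```python
from statistics import median


def modified_z_scores(values):
    """Return the modified Z-score of each value (hypothetical helper, not StatClean's API)."""
    med = median(values)
    # MAD: median of the absolute deviations from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; modified Z-scores are undefined")
    # 0.6745 rescales the MAD so scores are comparable to ordinary
    # Z-scores when the data are roughly normal.
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Flag values whose absolute modified Z-score exceeds the threshold (assumed default 3.5)."""
    return [abs(score) > threshold for score in modified_z_scores(values)]


data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 15.0]
flags = flag_outliers(data)
outliers = [v for v, is_out in zip(data, flags) if is_out]
print(outliers)
```

Because the median and the MAD are barely moved by a single extreme value, the score of that value stays large, whereas an ordinary Z-score is dampened by the inflated standard deviation in small samples.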

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 Subashanan Nair
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
