|
| 1 | +# Changelog |
| 2 | + |
| 3 | +All notable changes to StatClean will be documented in this file. |
| 4 | + |
| 5 | +The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), |
| 6 | +and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). |
| 7 | + |
| 8 | +## [0.1.0] - 2025-08-06 |
| 9 | + |
| 10 | +### 🎉 Initial Release of StatClean |
| 11 | + |
| 12 | +This is the first public release under the StatClean name. StatClean is a library for statistical data preprocessing and outlier detection; the project was previously published as OutlierCleaner and has been renamed, with expanded statistical capabilities. |
| 13 | + |
| 14 | +### Added |
| 15 | + |
| 16 | +#### **Formal Statistical Testing** |
| 17 | +- **Grubbs' Test**: Single outlier detection with p-values and critical values |
| 18 | +- **Dixon's Q-Test**: Outlier detection for small samples (n < 30) with statistical significance |
| 19 | +- **Distribution Analysis**: Automatic normality testing using Shapiro-Wilk test |
| 20 | +- **Statistical Validation**: P-values, confidence intervals, and critical value calculations |
| 21 | + |
| 22 | +#### **Detection Methods** |
| 23 | +- **Univariate Methods**: |
| 24 | + - IQR (Interquartile Range) with configurable factors |
| 25 | + - Z-score method with customizable thresholds |
| 26 | + - Modified Z-score using the median absolute deviation (MAD), which is robust to extreme values and skewed data |
| 27 | +- **Multivariate Methods**: |
| 28 | + - Mahalanobis distance outlier detection with chi-square thresholds |
| 29 | +- **Batch Detection**: Multi-column outlier detection with progress tracking |
| 30 | + |
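The two robust univariate rules listed above can be sketched in a few lines of standard-library Python. This is an illustration of the general techniques only, not StatClean's implementation: the function names `iqr_outliers` and `modified_zscore_outliers`, the default factor of 1.5, and the 3.5 cutoff are assumptions chosen for the example, and `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    # "inclusive" interpolates linearly between the min and max of the data.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v < lower or v > upper for v in values]


def modified_zscore_outliers(values, threshold=3.5):
    """Flag values whose modified Z-score exceeds the threshold.

    Modified Z-score: 0.6745 * (x - median) / MAD, where MAD is the
    median of absolute deviations from the median.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Hypothetical sample: six readings near 10 and one extreme value.
data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
iqr_flags = iqr_outliers(data)
mz_flags = modified_zscore_outliers(data)
```

Both rules rely on order statistics (quartiles, median) rather than the mean and standard deviation, so a single extreme value does not inflate the threshold used to detect it.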
| 31 | +#### **Treatment Options** |
| 32 | +- **Outlier Removal**: Remove detected outliers with statistical validation |
| 33 | +- **Winsorizing**: Cap outliers at specified bounds instead of removal |
| 34 | + - IQR-based winsorizing |
| 35 | + - Z-score based winsorizing |
| 36 | + - Percentile-based winsorizing |
| 37 | +- **Data Transformations**: |
| 38 | + - Box-Cox transformation with automatic lambda estimation |
| 39 | + - Logarithmic transformations (natural, base 10, base 2) |
| 40 | + - Square root transformation |
| 41 | + - Automatic transformation recommendation based on distribution analysis |
| 42 | + |
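Percentile-based winsorizing, as listed above, caps values at chosen percentiles instead of dropping rows. The sketch below shows the idea with the standard library only; `winsorize_percentile` and its parameters are hypothetical names for illustration and are not taken from StatClean's API. It assumes integer percentiles between 1 and 99 and Python 3.8 or later.

```python
import statistics


def winsorize_percentile(values, lower_pct=5, upper_pct=95):
    """Cap values below the lower percentile and above the upper percentile."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    # Clamp each value into [low, high]; the number of observations is unchanged.
    return [min(max(v, low), high) for v in values]


# Hypothetical sample: the integers 1 through 101.
sample = list(range(1, 102))
capped = winsorize_percentile(sample)
```

Unlike removal, the output has the same length as the input, which matters when rows must stay aligned with other columns.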
| 43 | +#### **Advanced Visualization** |
| 44 | +- **Comprehensive Analysis Plots**: 3-in-1 analysis dashboard (boxplot, distribution, Q-Q plot) |
| 45 | +- **Standalone Plotting Functions**: |
| 46 | + - `plot_outliers()`: Scatter plots with outlier highlighting |
| 47 | + - `plot_distribution()`: KDE distribution plots with outlier separation |
| 48 | + - `plot_boxplot()`: Enhanced box plots with outlier overlay |
| 49 | + - `plot_qq()`: Q-Q plots for normality assessment |
| 50 | + - `plot_outlier_analysis()`: 2x2 comprehensive analysis grid |
| 51 | +- **Publication-Ready Figures**: Professional styling with customizable parameters |
| 52 | + |
| 53 | +#### **Developer Experience Features** |
| 54 | +- **Method Chaining**: Fluent API enabling streamlined workflows |
| 55 | +- **Type Safety**: Comprehensive type hints for enhanced IDE support and error detection |
| 56 | +- **Progress Tracking**: Built-in progress bars using tqdm for batch operations |
| 57 | +- **Flexible Configuration**: Customizable thresholds and statistical parameters |
| 58 | +- **Memory Efficiency**: Lazy evaluation, with cached statistics reused across repeated operations |
| 59 | + |
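Method chaining in a fluent API usually means that each processing step returns the object itself, so steps can be written as one expression. The sketch below illustrates that pattern with a hypothetical `ChainableCleaner` class; it is not the `StatClean` class, and its method names are assumptions made for the example.

```python
import statistics


class ChainableCleaner:
    """Hypothetical stand-in used to illustrate chaining; not StatClean."""

    def __init__(self, values):
        self.values = list(values)
        self.steps = []  # record of applied steps, in order

    def drop_missing(self):
        """Remove None entries, then return self so calls can be chained."""
        self.values = [v for v in self.values if v is not None]
        self.steps.append("drop_missing")
        return self

    def remove_outliers_iqr(self, factor=1.5):
        """Keep values within the IQR fences, then return self."""
        q1, _, q3 = statistics.quantiles(self.values, n=4, method="inclusive")
        spread = q3 - q1
        lower, upper = q1 - factor * spread, q3 + factor * spread
        self.values = [v for v in self.values if lower <= v <= upper]
        self.steps.append("remove_outliers_iqr")
        return self


raw = [10.1, None, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
cleaned = ChainableCleaner(raw).drop_missing().remove_outliers_iqr()
```

Returning `self` keeps intermediate state on one object, which is also a natural place to hold cached statistics between steps.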
| 60 | +#### **Analysis and Reporting** |
| 61 | +- **Distribution Analysis**: Comprehensive statistical analysis including: |
| 62 | + - Skewness and kurtosis calculation |
| 63 | + - Automatic method recommendation based on distribution characteristics |
| 64 | + - Normality testing with statistical significance |
| 65 | +- **Method Comparison**: Statistical agreement analysis between different detection methods |
| 66 | +- **Batch Processing**: Multi-column processing with detailed reporting and progress tracking |
| 67 | +- **Summary Reports**: Publication-quality statistical summaries |
| 68 | + |
| 69 | +#### **Utility Features** |
| 70 | +- **Index Preservation**: Configurable index handling during data cleaning |
| 71 | +- **Missing Value Handling**: Robust handling of NaN values and edge cases |
| 72 | +- **Data Validation**: Automatic data type validation and error handling |
| 73 | +- **Statistics Caching**: Efficient caching for repeated statistical operations |
| 74 | + |
| 75 | +### Technical |
| 76 | + |
| 77 | +#### **Core Architecture** |
| 78 | +- **StatClean Class**: Main class with 40+ methods for comprehensive statistical preprocessing |
| 79 | +- **Modular Design**: Separate utils module for standalone visualization functions |
| 80 | +- **Robust Error Handling**: Comprehensive edge case handling for statistical computations |
| 81 | +- **Performance Optimization**: Lazy evaluation and efficient memory usage |
| 82 | + |
| 83 | +#### **Dependencies** |
| 84 | +- **Core**: numpy, pandas, matplotlib, seaborn, scipy |
| 85 | +- **Statistical**: scipy for advanced statistical tests and distributions |
| 86 | +- **Progress**: tqdm for user-friendly progress tracking |
| 87 | +- **Development**: Complete type annotations for static analysis |
| 88 | + |
| 89 | +#### **API Design** |
| 90 | +- **Intuitive Interface**: Clear method naming and consistent parameter patterns |
| 91 | +- **Flexible Parameters**: Configurable thresholds and statistical significance levels |
| 92 | +- **Return Types**: Comprehensive return dictionaries with statistical metadata |
| 93 | +- **Documentation**: Extensive docstrings with mathematical explanations |
| 94 | + |
| 95 | +### Package Information |
| 96 | + |
| 97 | +- **Package Name**: `statclean` (renamed from `outlier-cleaner`) |
| 98 | +- **Main Class**: `StatClean` (renamed from `OutlierCleaner`) |
| 99 | +- **Backward Compatibility**: Alias maintained for transition (`OutlierCleaner = StatClean`) |
| 100 | +- **Version**: 0.1.0 (semantic versioning reset for new package) |
| 101 | +- **Python Support**: ≥3.7 |
| 102 | +- **Development Status**: Production/Stable |
| 103 | + |
| 104 | +### Migration Notes |
| 105 | + |
| 106 | +For users migrating from OutlierCleaner: |
| 107 | + |
| 108 | +#### **Package Installation** |
| 109 | +```bash |
| 110 | +# Old |
| 111 | +pip install outlier-cleaner |
| 112 | + |
| 113 | +# New |
| 114 | +pip install statclean |
| 115 | +``` |
| 116 | + |
| 117 | +#### **Import Changes** |
| 118 | +```python |
| 119 | +# Old |
| 120 | +from outlier_cleaner import OutlierCleaner |
| 121 | + |
| 122 | +# New (recommended) |
| 123 | +from statclean import StatClean |
| 124 | + |
| 125 | +# Backward compatible (temporary) |
| 126 | +from statclean import OutlierCleaner # Will be deprecated |
| 127 | +``` |
| 128 | + |
| 129 | +#### **Enhanced Capabilities** |
| 130 | +- All existing functionality preserved and enhanced |
| 131 | +- New statistical testing methods available |
| 132 | +- Improved visualization with more plot types |
| 133 | +- Method chaining support for complex workflows |
| 134 | +- Enhanced error handling and edge case management |
| 135 | + |
| 136 | +### Future Roadmap |
| 137 | + |
| 138 | +Planned features for upcoming releases: |
| 139 | + |
| 140 | +- **Additional Statistical Tests**: Anderson-Darling, Kolmogorov-Smirnov |
| 141 | +- **Advanced Multivariate Methods**: Isolation Forest, Local Outlier Factor |
| 142 | +- **Performance Optimizations**: Parallel processing for large datasets |
| 143 | +- **Interactive Visualizations**: Plotly integration for interactive analysis |
| 144 | +- **Export Capabilities**: Statistical reports in multiple formats (PDF, HTML, LaTeX) |
| 145 | + |
| 146 | +--- |
| 147 | + |
| 148 | +*This changelog follows the principles of [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) for clear communication of changes to users and developers.* |