| 1 | +# API Reference |
| 2 | + |
| 3 | +## StatClean Class |
| 4 | + |
| 5 | +### Initialization |
| 6 | + |
| 7 | +```python |
| 8 | +StatClean(df=None, preserve_index=True) |
| 9 | +``` |
| 10 | + |
| 11 | +**Parameters:** |
| 12 | +- `df` (pandas.DataFrame, optional): The DataFrame to clean |
| 13 | +- `preserve_index` (bool, default=True): Whether to preserve original index |
| 14 | + |
| 15 | +### Detection Methods |
| 16 | + |
| 17 | +#### `detect_outliers_iqr(column, lower_factor=1.5, upper_factor=1.5)` |
| 18 | +Detect outliers using the Interquartile Range method. |
| 19 | + |
| 20 | +**Parameters:** |
| 21 | +- `column` (str): Column name to analyze |
| 22 | +- `lower_factor` (float): Lower bound multiplier for IQR |
| 23 | +- `upper_factor` (float): Upper bound multiplier for IQR |
| 24 | + |
| 25 | +**Returns:** |
| 26 | +- `pandas.Series`: Boolean mask indicating outliers |
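The IQR rule described above can be sketched with the standard library alone. This is an illustrative sketch, not StatClean's implementation: the helper name `iqr_outlier_mask` is hypothetical, it returns a plain list rather than a `pandas.Series`, and the quartile interpolation (`method="inclusive"`) is an assumption that may differ from what the library uses.

```python
from statistics import quantiles


def iqr_outlier_mask(values, lower_factor=1.5, upper_factor=1.5):
    """Return a list of booleans, True where a value lies outside the IQR fences."""
    # "inclusive" interpolates linearly between order statistics (assumed choice).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - lower_factor * iqr  # lower fence
    upper = q3 + upper_factor * iqr  # upper fence
    return [v < lower or v > upper for v in values]


sample = [10, 12, 11, 13, 12, 11, 95]
mask = iqr_outlier_mask(sample)  # only the value 95 is flagged
```

Separate `lower_factor` and `upper_factor` values allow asymmetric fences, which can help with skewed data.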
| 27 | + |
| 28 | +#### `detect_outliers_zscore(column, threshold=3.0)` |
| 29 | +Detect outliers using the Z-score method. |
| 30 | + |
| 31 | +**Parameters:** |
| 32 | +- `column` (str): Column name to analyze |
| 33 | +- `threshold` (float): Z-score threshold for outlier detection |
| 34 | + |
| 35 | +**Returns:** |
| 36 | +- `pandas.Series`: Boolean mask indicating outliers |
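A minimal sketch of Z-score flagging, for orientation only. The helper `zscore_outlier_mask` is hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption; the library may normalize differently.

```python
from statistics import mean, stdev


def zscore_outlier_mask(values, threshold=3.0):
    """Return True for values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (assumed convention)
    if sigma == 0:
        # Constant data has no spread, so nothing can be flagged.
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


sample = [10.0] * 19 + [100.0]
mask = zscore_outlier_mask(sample)  # only the final value is flagged
```

With the sample standard deviation, the largest possible absolute Z-score in a sample of size n is (n - 1) / sqrt(n), so a threshold of 3.0 can never be exceeded when n is 10 or smaller.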
| 37 | + |
| 38 | +#### `detect_outliers_modified_zscore(column, threshold=3.5)` |
| 39 | +Detect outliers using the Modified Z-score (MAD-based) method. |
| 40 | + |
| 41 | +**Parameters:** |
| 42 | +- `column` (str): Column name to analyze |
| 43 | +- `threshold` (float): Modified Z-score threshold |
| 44 | + |
| 45 | +**Returns:** |
| 46 | +- `pandas.Series`: Boolean mask indicating outliers |
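The Modified Z-score replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The sketch below uses the commonly cited scaling constant 0.6745; the helper name `modified_zscore_mask` is hypothetical, and the handling of a zero MAD is an assumption rather than the library's documented behavior.

```python
from statistics import median


def modified_zscore_mask(values, threshold=3.5):
    """Return True for values whose MAD-based score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values equal the median; scores are undefined here.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


sample = [10, 12, 11, 13, 12, 11, 95]
mask = modified_zscore_mask(sample)  # only the value 95 is flagged
```

Because the median and MAD are barely affected by a single extreme value, this score stays usable on small samples where the ordinary Z-score saturates.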
| 47 | + |
| 48 | +#### `detect_outliers_mahalanobis(columns, chi2_threshold=0.95)` |
| 49 | +Detect multivariate outliers using Mahalanobis distance. |
| 50 | + |
| 51 | +**Parameters:** |
| 52 | +- `columns` (list): List of column names for multivariate analysis |
| 53 | +- `chi2_threshold` (float): Chi-square quantile level, given as a probability between 0 and 1 (e.g. 0.95), used to derive the distance cutoff |
| 54 | + |
| 55 | +**Returns:** |
| 56 | +- `pandas.Series`: Boolean mask indicating outliers |
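To illustrate how a quantile level such as 0.95 turns into a distance cutoff, here is a stdlib-only sketch restricted to exactly two columns. It is not StatClean's implementation: the helper `mahalanobis_mask_2d` is hypothetical, and it relies on the fact that the chi-square quantile with 2 degrees of freedom has the closed form -2 * ln(1 - p). More columns would need a general matrix inverse and chi-square quantile function.

```python
import math
from statistics import mean


def mahalanobis_mask_2d(xs, ys, chi2_threshold=0.95):
    """Flag points whose squared Mahalanobis distance exceeds the chi-square cutoff."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample covariance matrix entries (n - 1 in the denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    # Chi-square quantile for 2 degrees of freedom: -2 * ln(1 - p).
    cutoff = -2.0 * math.log(1.0 - chi2_threshold)
    mask = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # Squared distance using the explicit inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        mask.append(d2 > cutoff)
    return mask


points = [(1, 1), (1, -1), (-1, 1), (-1, -1)] * 4 + [(8, 8)]
xs = [p[0] for p in points]
ys = [p[1] for p in points]
mask = mahalanobis_mask_2d(xs, ys)  # only the point (8, 8) is flagged
```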
| 57 | + |
| 58 | +### Treatment Methods |
| 59 | + |
| 60 | +#### `remove_outliers_iqr(column, lower_factor=1.5, upper_factor=1.5)` |
| 61 | +Remove outliers using the IQR method. |
| 62 | + |
| 63 | +**Returns:** |
| 64 | +- `StatClean`: Self (enables method chaining) |
| 65 | + |
| 66 | +#### `remove_outliers_zscore(column, threshold=3.0)` |
| 67 | +Remove outliers using the Z-score method. |
| 68 | + |
| 69 | +**Returns:** |
| 70 | +- `StatClean`: Self (enables method chaining) |
| 71 | + |
| 72 | +#### `winsorize_outliers_iqr(column, lower_factor=1.5, upper_factor=1.5)` |
| 73 | +Cap outliers at the IQR bounds instead of removing them. |
| 74 | + |
| 75 | +**Returns:** |
| 76 | +- `StatClean`: Self (enables method chaining) |
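Capping at the IQR bounds amounts to clipping each value to the lower and upper fences. The sketch below shows that idea on a plain list; `winsorize_iqr` is a hypothetical helper, and the quartile interpolation is an assumption, so results may differ slightly from the library's.

```python
from statistics import quantiles


def winsorize_iqr(values, lower_factor=1.5, upper_factor=1.5):
    """Clip values to the IQR fences, keeping the number of rows unchanged."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - lower_factor * iqr
    upper = q3 + upper_factor * iqr
    # Values inside the fences pass through untouched.
    return [min(max(v, lower), upper) for v in values]


sample = [10, 12, 11, 13, 12, 11, 95]
capped = winsorize_iqr(sample)  # 95 is pulled down to the upper fence
```

Unlike removal, this keeps every row, which matters when other columns of the same rows are still needed.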
| 77 | + |
| 78 | +### Statistical Testing |
| 79 | + |
| 80 | +#### `grubbs_test(column, alpha=0.05, two_sided=True)` |
| 81 | +Perform Grubbs' test, which checks whether the most extreme value is a statistically significant outlier at significance level `alpha`. |
| 82 | + |
| 83 | +**Parameters:** |
| 84 | +- `column` (str): Column name to test |
| 85 | +- `alpha` (float): Significance level |
| 86 | +- `two_sided` (bool): Whether to perform a two-sided test |
| 87 | + |
| 88 | +**Returns:** |
| 89 | +- `dict`: Test results including p-value, test statistic, critical value |
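For reference, the two-sided Grubbs statistic is the largest absolute deviation from the mean divided by the sample standard deviation. The sketch below computes only that statistic; `grubbs_statistic` is a hypothetical helper, and the critical value (which depends on a t-distribution quantile) is deliberately left out because the standard library does not provide one.

```python
from statistics import mean, stdev


def grubbs_statistic(values):
    """Return (G, suspect) where G = max|x - mean| / s and suspect is that value."""
    mu = mean(values)
    s = stdev(values)  # sample standard deviation
    if s == 0:
        raise ValueError("all values are identical")
    suspect = max(values, key=lambda v: abs(v - mu))
    return abs(suspect - mu) / s, suspect


sample = [10, 12, 11, 13, 12, 11, 95]
g, suspect = grubbs_statistic(sample)  # suspect is 95
```

The value is declared an outlier when G exceeds the critical value for the chosen `alpha` and sample size.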
| 90 | + |
| 91 | +#### `dixon_q_test(column, alpha=0.05)` |
| 92 | +Perform Dixon's Q-test for small samples (n < 30). |
| 93 | + |
| 94 | +**Parameters:** |
| 95 | +- `column` (str): Column name to test |
| 96 | +- `alpha` (float): Significance level |
| 97 | + |
| 98 | +**Returns:** |
| 99 | +- `dict`: Test results including Q-statistic, critical value, p-value |
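The basic form of Dixon's Q statistic is the gap between the suspect value and its nearest neighbor, divided by the full range. The sketch below computes that ratio for both ends of the sorted data and reports the larger one. `dixon_q_statistic` is a hypothetical helper; other ratio variants exist for larger samples, and which one the library uses is not stated here. The critical-value lookup is omitted.

```python
def dixon_q_statistic(values):
    """Return (Q, suspect) using the gap / range ratio at each end of the data."""
    ordered = sorted(values)
    if len(ordered) < 3:
        raise ValueError("need at least 3 values")
    data_range = ordered[-1] - ordered[0]
    if data_range == 0:
        raise ValueError("all values are identical")
    q_low = (ordered[1] - ordered[0]) / data_range     # smallest value as suspect
    q_high = (ordered[-1] - ordered[-2]) / data_range  # largest value as suspect
    if q_high >= q_low:
        return q_high, ordered[-1]
    return q_low, ordered[0]


sample = [10, 12, 11, 13, 12, 11, 95]
q, suspect = dixon_q_statistic(sample)  # suspect is 95, Q = 82 / 85
```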
| 100 | + |
| 101 | +### Data Transformations |
| 102 | + |
| 103 | +#### `transform_boxcox(column, lambda_param=None)` |
| 104 | +Apply a Box-Cox transformation with automatic lambda estimation. |
| 105 | + |
| 106 | +**Parameters:** |
| 107 | +- `column` (str): Column name to transform |
| 108 | +- `lambda_param` (float, optional): Transformation parameter (lambda); estimated from the data when omitted |
| 109 | + |
| 110 | +**Returns:** |
| 111 | +- `dict`: Transformation results including optimal lambda |
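The Box-Cox transform itself is (x^lambda - 1) / lambda for a nonzero lambda and ln(x) when lambda is 0, and it is defined only for strictly positive data. The sketch below applies the formula for a given lambda. The helper `boxcox` is hypothetical, and automatic lambda estimation (typically done by maximum likelihood) is not shown.

```python
import math


def boxcox(values, lambda_param):
    """Apply the Box-Cox formula for a fixed lambda to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lambda_param == 0:
        # Limiting case of the formula as lambda approaches 0.
        return [math.log(v) for v in values]
    return [(v ** lambda_param - 1.0) / lambda_param for v in values]


transformed = boxcox([1, 4, 9], 0.5)   # square-root-like transform
logged = boxcox([1, math.e], 0)        # natural log
```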
| 112 | + |
| 113 | +#### `recommend_transformation(column)` |
| 114 | +Automatically recommend the best transformation based on the column's distribution. |
| 115 | + |
| 116 | +**Parameters:** |
| 117 | +- `column` (str): Column name to analyze |
| 118 | + |
| 119 | +**Returns:** |
| 120 | +- `dict`: Recommendations including best transformation and improvement metrics |
| 121 | + |
| 122 | +### Analysis Methods |
| 123 | + |
| 124 | +#### `analyze_distribution(column)` |
| 125 | +Run a comprehensive distribution analysis with statistical tests. |
| 126 | + |
| 127 | +**Parameters:** |
| 128 | +- `column` (str): Column name to analyze |
| 129 | + |
| 130 | +**Returns:** |
| 131 | +- `dict`: Distribution analysis including skewness, kurtosis, normality test |
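As a reminder of what skewness and kurtosis measure, the sketch below computes the simple moment-based versions (no bias correction) and reports excess kurtosis, so a normal distribution scores about 0 on both. `skewness_and_kurtosis` is a hypothetical helper, and the estimators the library reports may use different corrections.

```python
from statistics import mean


def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from central moments m2, m3, m4."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


skew, kurt = skewness_and_kurtosis([1, 2, 3, 4, 5])  # symmetric, flat-topped data
```

Positive skewness indicates a longer right tail, and positive excess kurtosis indicates heavier tails than a normal distribution.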
| 132 | + |
| 133 | +#### `compare_methods(columns, methods=['iqr', 'zscore', 'modified_zscore'])` |
| 134 | +Compare agreement between different detection methods. |
| 135 | + |
| 136 | +**Parameters:** |
| 137 | +- `columns` (list): Column names to compare |
| 138 | +- `methods` (list): Detection methods to compare |
| 139 | + |
| 140 | +**Returns:** |
| 141 | +- `dict`: Method comparison results and agreement statistics |
| 142 | + |
| 143 | +### Visualization |
| 144 | + |
| 145 | +#### `plot_outlier_analysis(columns=None, figsize=(15, 5))` |
| 146 | +Generate comprehensive outlier analysis plots. |
| 147 | + |
| 148 | +**Parameters:** |
| 149 | +- `columns` (list, optional): Columns to plot (defaults to all numeric) |
| 150 | +- `figsize` (tuple): Base figure size for each subplot |
| 151 | + |
| 152 | +**Returns:** |
| 153 | +- `dict`: Dictionary of matplotlib figures keyed by column names |
| 154 | + |
| 155 | +### Utility Methods |
| 156 | + |
| 157 | +#### `get_outlier_stats(columns=None, include_indices=False)` |
| 158 | +Get comprehensive outlier statistics without removing data. |
| 159 | + |
| 160 | +**Parameters:** |
| 161 | +- `columns` (list, optional): Columns to analyze |
| 162 | +- `include_indices` (bool): Whether to include outlier indices |
| 163 | + |
| 164 | +**Returns:** |
| 165 | +- `pandas.DataFrame`: Statistics for each column and method |
| 166 | + |
| 167 | +#### `set_thresholds(**kwargs)` |
| 168 | +Configure default thresholds for detection methods. |
| 169 | + |
| 170 | +**Parameters:** |
| 171 | +- `iqr_lower_factor` (float): IQR lower bound multiplier |
| 172 | +- `iqr_upper_factor` (float): IQR upper bound multiplier |
| 173 | +- `zscore_threshold` (float): Z-score threshold |
| 174 | +- `modified_zscore_threshold` (float): Modified Z-score threshold |
| 175 | + |
| 176 | +**Returns:** |
| 177 | +- `StatClean`: Self (enables method chaining) |
| 178 | + |
| 179 | +## Utility Functions |
| 180 | + |
| 181 | +### `plot_outliers(data, outliers, title="Outlier Analysis", figsize=(10, 6))` |
| 182 | +Create scatter plot highlighting outliers. |
| 183 | + |
| 184 | +### `plot_distribution(data, outliers, title="Distribution Analysis", figsize=(10, 6))` |
| 185 | +Plot KDE distribution with outlier separation. |
| 186 | + |
| 187 | +### `plot_boxplot(data, outliers, title="Box Plot Analysis", figsize=(10, 6))` |
| 188 | +Enhanced box plot with outlier overlay. |
| 189 | + |
| 190 | +### `plot_qq(data, outliers, title="Q-Q Plot", figsize=(10, 6))` |
| 191 | +Q-Q plot for normality assessment. |
| 192 | + |
| 193 | +### `plot_outlier_analysis(data, outliers, title="Comprehensive Analysis", figsize=(12, 10))` |
| 194 | +2x2 comprehensive analysis dashboard. |