|
| 1 | +# Advanced Examples |
| 2 | + |
| 3 | +## California Housing End-to-End |
| 4 | +```python |
| 5 | +import pandas as pd |
| 6 | +from sklearn.datasets import fetch_california_housing |
| 7 | +from statclean import StatClean |
| 8 | + |
| 9 | +housing = fetch_california_housing() |
| 10 | +df = pd.DataFrame(housing.data, columns=housing.feature_names) |
| 11 | +df['PRICE'] = housing.target |
| 12 | + |
| 13 | +cleaner = StatClean(df, preserve_index=True) |
| 14 | + |
| 15 | +# Analyze & clean selected features |
| 16 | +features = ['MedInc', 'AveRooms', 'PRICE'] |
| 17 | +cleaned_df, info = cleaner.clean_columns(features, method='auto', show_progress=True) |
| 18 | + |
| 19 | +# Multivariate check |
| 20 | +mv_outliers = cleaner.detect_outliers_mahalanobis(['MedInc', 'AveRooms', 'PRICE'], chi2_threshold=0.975) |
| 21 | +print('Multivariate outliers:', mv_outliers.sum()) |
| 22 | + |
| 23 | +# Visualization grid |
| 24 | +figs = cleaner.plot_outlier_analysis(features) |
| 25 | +``` |
| 26 | + |
| 27 | +## Financial Data Preprocessing |
| 28 | +```python |
| 29 | +import pandas as pd |
| 30 | +import numpy as np |
| 31 | +from statclean import StatClean |
| 32 | + |
| 33 | +# Simulate financial returns data |
| 34 | +np.random.seed(42) |
| 35 | +returns = np.random.normal(0.001, 0.02, 1000) # Daily returns |
| 36 | +prices = 100 * np.cumprod(1 + returns) |
| 37 | +volumes = np.random.lognormal(15, 1, 1000) |
| 38 | + |
| 39 | +# Add some outliers (market crashes/spikes) |
| 40 | +returns[250] = -0.15 # Market crash |
| 41 | +returns[500] = 0.08 # Large gain |
| 42 | +volumes[100] = volumes[100] * 50 # Volume spike |
| 43 | + |
| 44 | +df = pd.DataFrame({ |
| 45 | + 'returns': returns, |
| 46 | + 'prices': prices, |
| 47 | + 'volume': volumes, |
| 48 | + 'volatility': pd.Series(returns).rolling(20).std() |
| 49 | +}) |
| 50 | + |
| 51 | +cleaner = StatClean(df.dropna(), preserve_index=True) |
| 52 | + |
| 53 | +# Financial outlier detection with domain-specific thresholds |
| 54 | +financial_features = ['returns', 'volume', 'volatility'] |
| 55 | + |
| 56 | +# Statistical significance testing for returns |
| 57 | +grubbs_results = {} |
| 58 | +for feature in financial_features: |
| 59 | + result = cleaner.grubbs_test(feature, alpha=0.01) # Stricter alpha for finance |
| 60 | + grubbs_results[feature] = result |
| 61 | + print(f"{feature}: Outlier detected = {result['is_outlier']}, p-value = {result['p_value']:.6f}") |
| 62 | + |
| 63 | +# Conservative cleaning with winsorization (preserve extreme but valid movements) |
| 64 | +cleaner.winsorize_outliers_percentile('volume', lower_percentile=1, upper_percentile=99) |
| 65 | +cleaner.winsorize_outliers_percentile('volatility', lower_percentile=5, upper_percentile=95) |
| 66 | + |
| 67 | +# More aggressive cleaning for returns (likely data errors) |
| 68 | +cleaner.remove_outliers_modified_zscore('returns', threshold=4.0) # Conservative threshold |
| 69 | + |
| 70 | +cleaned_df = cleaner.clean_df |
| 71 | +print(f"Original shape: {df.shape}, Cleaned shape: {cleaned_df.shape}") |
| 72 | +``` |
| 73 | + |
| 74 | +## Time Series Sensor Data |
| 75 | +```python |
| 76 | +import pandas as pd |
| 77 | +import numpy as np |
| 78 | +from datetime import datetime, timedelta |
| 79 | +from statclean import StatClean |
| 80 | + |
| 81 | +# Simulate IoT sensor data |
| 82 | +np.random.seed(123) |
| 83 | +dates = pd.date_range(start='2024-01-01', periods=2000, freq='h') |
| 84 | +base_temp = 20 + 10 * np.sin(2 * np.pi * np.arange(2000) / 24) # Daily cycle |
| 85 | +noise = np.random.normal(0, 2, 2000) |
| 86 | +temperatures = base_temp + noise |
| 87 | + |
| 88 | +# Add sensor malfunctions and anomalies |
| 89 | +temperatures[500:510] = -999 # Sensor error (impossible temperature) |
| 90 | +temperatures[1000] = 150 # Sensor spike |
| 91 | +temperatures[1500:1505] = np.nan # Missing readings |
| 92 | + |
| 93 | +humidity = np.clip(50 + 30 * np.sin(2 * np.pi * np.arange(2000) / 24) + np.random.normal(0, 5, 2000), 0, 100) |
| 94 | +pressure = 1013 + np.random.normal(0, 15, 2000) |
| 95 | + |
| 96 | +df = pd.DataFrame({ |
| 97 | + 'timestamp': dates, |
| 98 | + 'temperature': temperatures, |
| 99 | + 'humidity': humidity, |
| 100 | + 'pressure': pressure |
| 101 | +}) |
| 102 | + |
| 103 | +# Handle time series specific preprocessing |
| 104 | +df = df[df['temperature'] > -50]  # Remove impossible sensor readings first (also drops the NaN readings, since NaN comparisons are False) |
| 105 | +cleaner = StatClean(df, preserve_index=True) |
| 106 | + |
| 107 | +# Time series outlier detection with domain knowledge |
| 108 | +sensor_features = ['temperature', 'humidity', 'pressure'] |
| 109 | + |
| 110 | +# Distribution analysis for each sensor |
| 111 | +for feature in sensor_features: |
| 112 | + analysis = cleaner.analyze_distribution(feature) |
| 113 | + print(f"\n{feature} Analysis:") |
| 114 | + print(f" Skewness: {analysis['skewness']:.3f}") |
| 115 | + print(f" Recommended method: {analysis['recommended_method']}") |
| 116 | + |
| 117 | + # Apply recommended transformation if highly skewed |
| 118 | + if abs(analysis['skewness']) > 2: |
| 119 | + cleaner.transform_boxcox(feature) |
| 120 | + |
| 121 | +# Gentle cleaning for sensor data (preserve natural variation) |
| 122 | +cleaned_df, info = cleaner.clean_columns( |
| 123 | + sensor_features, |
| 124 | + method='modified_zscore', # Robust to occasional spikes |
| 125 | + show_progress=True |
| 126 | +) |
| 127 | + |
| 128 | +# Time series specific visualization |
| 129 | +for feature in sensor_features: |
| 130 | + print(f"\n{feature} Cleaning Results:") |
| 131 | + print(f" Method used: {info[feature]['method_used']}") |
| 132 | + print(f" Outliers removed: {info[feature]['outliers_removed']}") |
| 133 | + |
| 134 | +# Generate comprehensive plots for time series data |
| 135 | +figs = cleaner.plot_outlier_analysis(sensor_features) |
| 136 | +``` |
| 137 | + |
| 138 | +## Modified Z-score Visualization |
| 139 | +```python |
| 140 | +outliers = cleaner.detect_outliers_modified_zscore('PRICE') |
| 141 | +cleaner.remove_outliers_modified_zscore('PRICE') |
| 142 | +cleaner.visualize_outliers('PRICE') |
| 143 | +``` |
| 144 | + |
| 145 | +## Method Comparison for Research Data |
| 146 | +```python |
| 147 | +import pandas as pd |
| 148 | +import numpy as np |
| 149 | +from statclean import StatClean |
| 149 | + |
| 150 | +# Simulate experimental research data |
| 151 | +np.random.seed(456) |
| 152 | +df = pd.DataFrame({ |
| 153 | + 'reaction_time': np.random.gamma(2, 0.15, 500), # Skewed distribution |
| 154 | + 'accuracy': np.random.beta(8, 2, 500) * 100, # Bounded data |
| 155 | + 'confidence': np.random.normal(7, 1.5, 500) # Normal-ish data |
| 156 | +}) |
| 157 | + |
| 158 | +# Add some experimental outliers |
| 159 | +df.loc[50:52, 'reaction_time'] *= 5 # Participant distraction |
| 160 | +df.loc[100, 'accuracy'] = 30 # Data entry error |
| 161 | +df.loc[200:205, 'confidence'] = np.nan # Missing responses |
| 162 | + |
| 163 | +cleaner = StatClean(df.dropna(), preserve_index=True) |
| 164 | + |
| 165 | +# Compare detection methods for research validity |
| 166 | +research_features = ['reaction_time', 'accuracy', 'confidence'] |
| 167 | +comparison = cleaner.compare_methods( |
| 168 | + research_features, |
| 169 | + methods=['iqr', 'zscore', 'modified_zscore', 'grubbs'] |
| 170 | +) |
| 171 | + |
| 172 | +# Statistical reporting for publication |
| 173 | +print("Method Agreement Analysis for Research Data:") |
| 174 | +for feature in research_features: |
| 175 | + print(f"\n{feature}:") |
| 176 | + print(f" {comparison[feature]['summary']}") |
| 177 | + |
| 178 | + # Formal statistical tests |
| 179 | + grubbs_result = cleaner.grubbs_test(feature, alpha=0.05) |
| 180 | + dixon_result = cleaner.dixon_q_test(feature, alpha=0.05) |
| 181 | + |
| 182 | + print(f" Grubbs test: p = {grubbs_result['p_value']:.6f}") |
| 183 | + print(f" Dixon Q test: p = {dixon_result['p_value']:.6f}") |
| 184 | + |
| 185 | +# Generate publication-quality report |
| 186 | +summary_report = cleaner.get_summary_report() |
| 187 | +print("\nPublication Summary:") |
| 188 | +print(summary_report) |
| 189 | +``` |
| 190 | + |
| 191 | +[Back to top](#advanced-examples) |
0 commit comments