
Commit 377e4e0 (parent 4953c18)

chore: prepare v0.1.3 release

- docs: update README, wiki content, changelog
- feat: add PyPI publish workflow via Trusted Publisher (OIDC)
- fix: seaborn y= for utils; clarify Mahalanobis & transformations
- chore: bump version to 0.1.3; downgrade classifier to Beta

File tree: 11 files changed, +264 -88 lines

.github/workflows/publish.yml

Lines changed: 48 additions & 0 deletions
@@ -1,3 +1,51 @@
+name: Publish to PyPI
+
+on:
+  push:
+    tags:
+      - 'v*'
+
+permissions:
+  contents: read
+  id-token: write
+
+jobs:
+  test-build-publish:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out code
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements.txt
+          pip install pytest build
+
+      - name: Run tests (headless)
+        env:
+          MPLBACKEND: Agg
+        run: pytest -q
+
+      - name: Verify tag matches version
+        run: |
+          VERSION=$(python -c "import re; print(re.search(r\"__version__\s*=\s*'([^']+)'\", open('statclean/__init__.py').read()).group(1))")
+          TAG=${GITHUB_REF_NAME#v}
+          echo "Version: $VERSION | Tag: $TAG"
+          if [ "$VERSION" != "$TAG" ]; then echo "Tag ($TAG) does not match version ($VERSION)" && exit 1; fi
+
+      - name: Build package
+        run: python -m build
+
+      - name: Publish to PyPI (Trusted Publisher)
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          skip-existing: true
 name: Publish to PyPI and GitHub Release
 
 on:
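For reference, the tag/version guard in the workflow above can be reproduced locally before pushing a release tag. A minimal sketch (assumptions: run from the repository root, with the tag name passed by hand instead of coming from `GITHUB_REF_NAME`):

```python
# Local sketch of the "Verify tag matches version" step above.
# Assumption: executed from the repo root, e.g. `python check_tag.py v0.1.3`.
import re
import sys

def check_tag(tag: str) -> None:
    source = open('statclean/__init__.py').read()
    version = re.search(r"__version__\s*=\s*'([^']+)'", source).group(1)
    expected = tag[1:] if tag.startswith('v') else tag  # mirrors ${GITHUB_REF_NAME#v}
    print(f"Version: {version} | Tag: {expected}")
    if version != expected:
        sys.exit(f"Tag ({expected}) does not match version ({version})")

if __name__ == '__main__':
    check_tag(sys.argv[1] if len(sys.argv) > 1 else 'v0.1.3')
```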

CHANGELOG.md

Lines changed: 21 additions & 0 deletions
@@ -5,6 +5,27 @@ All notable changes to StatClean will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.1.3] - 2025-08-08
+
+### Changed
+- Align docs/examples with actual API: remover methods return `self`; retrieve cleaned data via `cleaner.clean_df`.
+- Grubbs/Dixon docs updated to use keys `statistic` and `is_outlier`.
+- Clarified Mahalanobis `chi2_threshold` semantics; now accepts percentile in (0,1] or absolute chi-square statistic.
+- Seaborn plotting updated to explicit `y=`/`x=` to improve compatibility.
+- Transformations (`Box-Cox`, `log`, `sqrt`) preserve NaN positions; Box-Cox now computes on non-NA values only.
+- `analyze_distribution` is NaN-safe and limits Shapiro sample size.
+- Improved Mahalanobis stability: condition checks, pseudoinverse fallback, optional shrinkage covariance (`use_shrinkage`).
+- Replaced prints with `warnings.warn` where appropriate.
+
+### Added
+- GitHub Actions workflow for building and publishing to PyPI on release tags.
+
+### Fixed
+- Example unpacking errors and empty DataFrame init expectation in examples/tests.
+- Modified Z-score visualization now computes and labels bounds correctly.
+
+---
+
 ## [0.1.0] - 2025-08-06
 
 ### 🎉 Initial Release of StatClean
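The first `### Changed` bullet is the main behavioural change in this release. A minimal sketch of the v0.1.3 calling pattern it describes (the DataFrame values here are illustrative, not taken from the package's docs):

```python
import pandas as pd
from statclean import StatClean

# Illustrative data: one obvious income outlier.
df = pd.DataFrame({'income': [30000, 32000, 31000, 500000],
                   'age': [25, 30, 35, 40]})
cleaner = StatClean(df)

# Remover methods now return the cleaner itself rather than a (df, info) tuple...
cleaner.remove_outliers_iqr('income')

# ...so cleaned data and per-column details are read back from attributes.
print(cleaner.clean_df.shape)
print(cleaner.outlier_info['income'])
```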

README.md

Lines changed: 52 additions & 10 deletions
@@ -55,6 +55,11 @@ df = pd.DataFrame({
     'age': [25, 30, 35, 40, 35, 45, 50]
 })
 
+"""
+Note: As of v0.1.3, remover methods return the cleaner instance for method chaining.
+Access cleaned data via `cleaner.clean_df` and details via `cleaner.outlier_info`.
+"""
+
 # Initialize StatClean
 cleaner = StatClean(df)
 
@@ -73,21 +78,23 @@ print(f"Outliers removed: {info['income']['outliers_removed']}")
 ```python
 # Grubbs' test for outliers with statistical significance
 result = cleaner.grubbs_test('income', alpha=0.05)
-print(f"Test statistic: {result['test_statistic']:.3f}")
+print(f"Test statistic: {result['statistic']:.3f}")
 print(f"P-value: {result['p_value']:.6f}")
-print(f"Outlier detected: {result['outlier_detected']}")
+print(f"Outlier detected: {result['is_outlier']}")
 
 # Dixon's Q-test for small samples
 result = cleaner.dixon_q_test('age', alpha=0.05)
-print(f"Q statistic: {result['q_statistic']:.3f}")
+print(f"Q statistic: {result['statistic']:.3f}")
 print(f"Critical value: {result['critical_value']:.3f}")
 ```
 
 ### Multivariate Outlier Detection
 
 ```python
 # Mahalanobis distance for multivariate outliers
-outliers = cleaner.detect_outliers_mahalanobis(['income', 'age'], chi2_threshold=0.95)
+# chi2_threshold can be a percentile (0<val<=1) or absolute chi-square statistic
+# use_shrinkage=True uses Ledoit–Wolf shrinkage covariance if scikit-learn is installed
+outliers = cleaner.detect_outliers_mahalanobis(['income', 'age'], chi2_threshold=0.95, use_shrinkage=True)
 print(f"Multivariate outliers detected: {outliers.sum()}")
 
 # Remove multivariate outliers
@@ -99,12 +106,12 @@ cleaned_df = cleaner.remove_outliers_mahalanobis(['income', 'age'])
 ```python
 # Automatic transformation recommendation
 recommendation = cleaner.recommend_transformation('income')
-print(f"Recommended transformation: {recommendation['best_transformation']}")
-print(f"Improvement in skewness: {recommendation['skewness_improvement']:.3f}")
+print(f"Recommended transformation: {recommendation['recommended_method']}")
+print(f"Improvement in skewness: {recommendation['expected_improvement']:.3f}")
 
 # Apply Box-Cox transformation
-transformed_df = cleaner.transform_boxcox('income')
-print(f"Optimal lambda: {transformed_df['lambda']:.3f}")
+_, info = cleaner.transform_boxcox('income')
+print(f"Optimal lambda: {info['lambda']:.3f}")
 
 # Method chaining for complex workflows
 result = (cleaner
@@ -263,10 +270,19 @@ for feature in features:
 - **seaborn**: ≥0.11.0
 - **scipy**: ≥1.6.0 (for statistical tests)
 - **tqdm**: ≥4.60.0 (for progress bars)
-- **scikit-learn**: ≥0.24.0 (optional, for examples)
+- **scikit-learn**: ≥0.24.0 (optional, for shrinkage covariance in Mahalanobis)
 
 ## Changelog
 
+### Version 0.1.3 (2025-08-08)
+
+- Align docs/examples with actual API: remover methods return `self`; use `cleaner.clean_df` and `cleaner.outlier_info`.
+- Grubbs/Dixon result keys clarified: `statistic`, `is_outlier`.
+- Mahalanobis `chi2_threshold` accepts percentile (0<val<=1) or absolute chi-square statistic; added `use_shrinkage` option.
+- Transformations preserve NaNs; Box-Cox computed on non-NA values only.
+- Seaborn plotting calls updated for compatibility; analysis functions made NaN-safe.
+- Added GitHub Actions workflow to publish to PyPI on releases.
+
 ### Version 0.1.0 (2025-08-06)
 
 **🎉 Initial Release of StatClean**
@@ -319,4 +335,30 @@ MIT License
 
 ---
 
-*StatClean: Where statistical rigor meets practical data science.*
+*StatClean: Where statistical rigor meets practical data science.*
+
+## Development: Run Tests in Headless Mode and Capture Logs
+
+```bash
+# Ensure a headless matplotlib backend and run tests quietly
+export MPLBACKEND=Agg
+pytest -q
+
+# Save a timestamped test log (example)
+LOG=cursor_logs/test_log.md
+mkdir -p cursor_logs
+echo "==== $(date) ====\n" >> "$LOG"
+MPLBACKEND=Agg pytest -q 2>&1 | tee -a "$LOG"
+```
+
+## Continuous Delivery: Publish to PyPI (Trusted Publisher)
+
+This repository includes a GitHub Actions workflow using PyPI Trusted Publisher (OIDC).
+
+Setup (one-time on PyPI):
+- Add this GitHub repo as a Trusted Publisher in the PyPI project settings.
+
+Release steps:
+1. Bump version in `statclean/__init__.py` and `setup.py` (already `0.1.3`).
+2. Push a tag matching the version, e.g., `git tag v0.1.3 && git push origin v0.1.3`.
+3. Workflow will run tests, build, and publish to PyPI without storing credentials.
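The method-chaining snippet in the README above is cut off at `result = (cleaner`. A short sketch of what such a chain can look like now that removers return the cleaner (the column choices are illustrative):

```python
# Illustrative chain: each remover returns the cleaner, so calls compose left to right.
result = (cleaner
          .remove_outliers_zscore('age')
          .remove_outliers_iqr('income'))

cleaned_df = result.clean_df      # cleaned data lives on the instance
details = result.outlier_info     # per-column removal details
```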

docs/api-reference.md

Lines changed: 5 additions & 4 deletions
@@ -45,12 +45,13 @@ Detect outliers using Modified Z-score (MAD-based) method.
 **Returns:**
 - `pandas.Series`: Boolean mask indicating outliers
 
-#### `detect_outliers_mahalanobis(columns, chi2_threshold=0.95)`
+#### `detect_outliers_mahalanobis(columns, chi2_threshold=None, use_shrinkage=False)`
 Detect multivariate outliers using Mahalanobis distance.
 
 **Parameters:**
 - `columns` (list): List of column names for multivariate analysis
-- `chi2_threshold` (float): Chi-square threshold percentile
+- `chi2_threshold` (float): If `None`, defaults to 97.5th percentile; if `0 < value <= 1`, treated as percentile; otherwise treated as absolute chi-square threshold
+- `use_shrinkage` (bool): Use Ledoit–Wolf shrinkage covariance estimator when available (requires scikit-learn); falls back to sample covariance otherwise
 
 **Returns:**
 - `pandas.Series`: Boolean mask indicating outliers
@@ -86,7 +87,7 @@ Perform Grubbs' test for outliers with statistical significance.
 - `two_sided` (bool): Whether to perform two-sided test
 
 **Returns:**
-- `dict`: Test results including p-value, test statistic, critical value
+- `dict`: Test results including `statistic`, `p_value`, `critical_value`, `is_outlier`, `outlier_value`, `outlier_index`
 
 #### `dixon_q_test(column, alpha=0.05)`
 Perform Dixon's Q-test for small samples (n < 30).
@@ -96,7 +97,7 @@ Perform Dixon's Q-test for small samples (n < 30).
 - `alpha` (float): Significance level
 
 **Returns:**
-- `dict`: Test results including Q-statistic, critical value, p-value
+- `dict`: Test results including `statistic`, `critical_value`, `p_value`, `is_outlier`
 
 ### Data Transformations
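A short sketch of the two `chi2_threshold` forms documented above. Hedged: it assumes the percentile form is evaluated against a chi-square distribution with degrees of freedom equal to the number of columns (the usual convention for squared Mahalanobis distances), and that `cleaner` is a `StatClean` instance as in the other examples:

```python
from scipy.stats import chi2

# Percentile form: a value in (0, 1] is read as a chi-square percentile.
outliers_pct = cleaner.detect_outliers_mahalanobis(['income', 'age'],
                                                   chi2_threshold=0.975)

# Absolute form: a value > 1 is read as the chi-square cutoff itself.
# With 2 columns, chi2.ppf(0.975, df=2) ≈ 7.38, so this call is intended to
# flag the same rows as the percentile form above.
outliers_abs = cleaner.detect_outliers_mahalanobis(['income', 'age'],
                                                   chi2_threshold=chi2.ppf(0.975, df=2))
```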

docs/examples.md

Lines changed: 7 additions & 8 deletions
@@ -29,11 +29,11 @@ print(f"Cleaned shape: {cleaned_df.shape}")
 # Formal statistical testing
 grubbs_result = cleaner.grubbs_test('income', alpha=0.05)
 print(f"P-value: {grubbs_result['p_value']:.6f}")
-print(f"Outlier detected: {grubbs_result['outlier_detected']}")
+print(f"Outlier detected: {grubbs_result['is_outlier']}")
 
 # Dixon's Q-test for small samples
 dixon_result = cleaner.dixon_q_test('age', alpha=0.05)
-print(f"Q-statistic: {dixon_result['q_statistic']:.3f}")
+print(f"Statistic: {dixon_result['statistic']:.3f}")
 ```
 
 ## Multivariate Analysis Example
@@ -52,11 +52,11 @@ cleaner.remove_outliers_mahalanobis(['income', 'age'])
 ```python
 # Automatic transformation recommendation
 recommendation = cleaner.recommend_transformation('income')
-print(f"Best transformation: {recommendation['best_transformation']}")
+print(f"Best transformation: {recommendation['recommended_method']}")
 
 # Apply Box-Cox transformation
-transformed = cleaner.transform_boxcox('income')
-print(f"Optimal lambda: {transformed['lambda']:.3f}")
+_, info = cleaner.transform_boxcox('income')
+print(f"Optimal lambda: {info['lambda']:.3f}")
 ```
 
 ## Method Chaining Example
@@ -80,9 +80,8 @@ print(f"Recommended method: {analysis['recommended_method']}")
 
 # Compare detection methods
 comparison = cleaner.compare_methods(['income'])
-print("Method Agreement:")
-for method, stats in comparison['income']['method_stats'].items():
-    print(f"  {method}: {stats['outliers_detected']} outliers")
+print("Method Comparison Summary:")
+print(comparison['income']['summary'])
 ```
 
 ## Visualization Example

examples/comprehensive_demo.py

Lines changed: 13 additions & 7 deletions
@@ -89,10 +89,13 @@ def test_initialization(results: TestResults, df: pd.DataFrame):
         cleaner_preserve = StatClean(df, preserve_index=True)
         results.add_pass("Initialization with preserve_index=True")
 
-        # Test with empty DataFrame
-        empty_df = pd.DataFrame()
-        cleaner_empty = StatClean(empty_df)
-        results.add_pass("Initialization with empty DataFrame")
+        # Test with empty DataFrame should raise ValueError
+        try:
+            empty_df = pd.DataFrame()
+            StatClean(empty_df)
+            results.add_fail("Initialization with empty DataFrame", "Expected ValueError not raised")
+        except ValueError:
+            results.add_pass("Initialization with empty DataFrame", "Correctly raised ValueError")
 
     except Exception as e:
         results.add_fail("Initialization", str(e))
@@ -147,23 +150,26 @@ def test_outlier_detection_methods(results: TestResults, df: pd.DataFrame):
 
     # Test IQR method
     try:
-        cleaned_df, info = cleaner.remove_outliers_iqr(test_column)
+        cleaner.remove_outliers_iqr(test_column)
+        info = cleaner.outlier_info[test_column]
         results.add_pass("remove_outliers_iqr",
                          f"Removed {info['num_outliers']} outliers, {info['percent_removed']:.1f}%")
     except Exception as e:
         results.add_fail("remove_outliers_iqr", str(e))
 
     # Test Z-score method
    try:
-        cleaned_df, info = cleaner.remove_outliers_zscore(test_column)
+        cleaner.remove_outliers_zscore(test_column)
+        info = cleaner.outlier_info[test_column]
         results.add_pass("remove_outliers_zscore",
                          f"Removed {info['num_outliers']} outliers, {info['percent_removed']:.1f}%")
     except Exception as e:
         results.add_fail("remove_outliers_zscore", str(e))
 
     # Test Modified Z-score method
     try:
-        cleaned_df, info = cleaner.remove_outliers_modified_zscore(test_column)
+        cleaner.remove_outliers_modified_zscore(test_column)
+        info = cleaner.outlier_info[test_column]
         results.add_pass("remove_outliers_modified_zscore",
                          f"Removed {info['num_outliers']} outliers, {info['percent_removed']:.1f}%")
     except Exception as e:

setup.py

Lines changed: 2 additions & 2 deletions
@@ -4,7 +4,7 @@
 
 setup(
     name="statclean",
-    version="0.1.2",
+    version="0.1.3",
     author="Subashanan Nair",
     author_email="[email protected]",
     description="A comprehensive statistical data preprocessing and outlier detection library with formal statistical testing and publication-quality reporting",
@@ -21,7 +21,7 @@
     },
     packages=find_packages(),
     classifiers=[
-        "Development Status :: 5 - Production/Stable",
+        "Development Status :: 4 - Beta",
         "Intended Audience :: Science/Research",
         "License :: OSI Approved :: MIT License",
         "Operating System :: OS Independent",

statclean/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
 Designed for academic research, data science, and statistical analysis.
 """
 
-__version__ = '0.1.0'
+__version__ = '0.1.3'
 __author__ = 'Subashanan Nair'
 
 from .cleaner import StatClean
