A comprehensive toolkit for detecting data quality issues, label leakage, and spurious correlations in machine learning datasets. This project integrates multiple validation libraries and research methods to provide a unified interface for dataset auditing.
Why Dataset Autopsy? Data leakage and spurious correlations are among the most common causes of ML model failures in production. This toolkit helps catch these issues before they impact your models.
- Data Integrity Checks: Detect missing values, duplicates, and schema violations
- Label Leakage Detection: Identify features that directly or indirectly encode target information
- Spurious Correlation Detection: Find hidden biases and shortcut features
- Multi-Format Support: Works with tabular, image, and text datasets
- Pipeline Integration: Designed to run as pre-training validation steps
This toolkit integrates the following open-source libraries:
- Deepchecks: Feature-label correlation checks, identifier-label correlation detection
- Cleanlab Datalab: Automated data auditing with spurious correlation detection
- EvalML Data Checks: Target leakage detection using mutual information
- TensorFlow Data Validation (TFDV): Schema validation and anomaly detection
- Great Expectations: Custom data validation rules
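To make the mutual-information approach concrete (the idea behind EvalML-style target leakage checks, per the list above), here is a minimal sketch using scikit-learn's `mutual_info_classif`. The synthetic data, column names, and the 0.5 threshold are illustrative assumptions, not the toolkit's actual defaults:

```python
# Sketch: mutual-information-based leakage screening (illustrative only).
# A feature that encodes the target shows near-maximal MI with it.
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = pd.DataFrame({
    'noise': rng.normal(size=500),                   # unrelated feature
    'leaky': y + rng.normal(scale=0.01, size=500),   # near-copy of the target
})

mi = mutual_info_classif(X, y, random_state=0)
scores = dict(zip(X.columns, mi))
leaks = [col for col, s in scores.items() if s > 0.5]  # illustrative threshold
print(leaks)
```

For a binary target, mutual information is capped at ln 2 ≈ 0.69 nats, so a score approaching that ceiling is a strong leakage signal.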
```bash
pip install -r requirements.txt
```

```python
import pandas as pd
from dataset_autopsy import DatasetAuditor

# Load your dataset
data = pd.read_csv('your_dataset.csv')
labels = data['target']
features = data.drop(columns=['target'])

# Run a comprehensive audit
auditor = DatasetAuditor()
results = auditor.audit(
    data=features,
    labels=labels,
    dataset_type='tabular'
)

# Generate a detailed report
auditor.report(results, output_path='audit_report.txt')
```

- ✅ Label Leakage: Features that directly encode target information
- ✅ Spurious Correlations: Hidden biases and shortcut features
- ✅ Data Quality Issues: Missing values, duplicates, schema violations
- ✅ Train-Test Contamination: Data leakage between splits
- ✅ Identifier Leakage: Time/ID fields correlating with targets
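One simple way to surface train-test contamination, the kind of check listed above, is to look for rows that appear verbatim in both splits. This is a minimal sketch of the idea, not the toolkit's actual implementation:

```python
# Sketch: flag exact duplicate rows shared between train and test splits.
import pandas as pd

train = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
test = pd.DataFrame({'a': [3, 4], 'b': ['z', 'w']})

# Inner merge on all shared columns keeps only rows present in both splits.
overlap = train.merge(test, how='inner')
print(len(overlap))  # one contaminated row in this toy example
```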
```
dataset-autopsy/
├── dataset_autopsy/          # Main package
│   ├── __init__.py
│   ├── deepchecks_integration.py
│   ├── cleanlab_integration.py
│   ├── evalml_integration.py
│   ├── tfdv_integration.py
│   └── utils.py
├── examples/                 # Example notebooks and scripts
├── tests/                    # Unit tests
├── requirements.txt
├── LICENSE
└── README.md
```
MIT License - see LICENSE file for details.
```python
config = {
    'evalml': {'leakage_threshold': 0.90},
    'deepchecks': {'run_train_test': True}
}
auditor = DatasetAuditor(config=config)
```

```python
from dataset_autopsy.evalml_integration import EvalMLValidator

validator = EvalMLValidator({'leakage_threshold': 0.95})
results = validator.check_leakage(features, labels)
```

This toolkit demonstrates:
- Library Integration: Seamless integration of multiple ML validation frameworks
- Unified API Design: Clean interface abstracting complexity
- Type Safety: Full type hints for better IDE support
- Error Handling: Robust error handling and validation
- Extensibility: Easy to add new validators
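To illustrate the extensibility point, here is a sketch of what a custom validator might look like. The class name, method signature, and result shape are hypothetical, assumed for illustration rather than taken from the package's actual plugin API:

```python
# Sketch: a custom validator flagging constant columns (hypothetical interface).
import pandas as pd

class ConstantColumnValidator:
    """Flags columns with a single unique value, which carry no signal."""

    def validate(self, data: pd.DataFrame, labels: pd.Series) -> dict:
        constant = [c for c in data.columns if data[c].nunique(dropna=False) <= 1]
        return {'constant_columns': constant, 'passed': not constant}

df = pd.DataFrame({'ok': [1, 2, 3], 'flat': [7, 7, 7]})
result = ConstantColumnValidator().validate(df, pd.Series([0, 1, 0]))
print(result['constant_columns'])  # ['flat']
```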
Contributions welcome! See CONTRIBUTING.md for guidelines.
Raaga Karumanchi
- GitHub: @raagakarumanchi