
Dataset Autopsy 🔍

Python 3.8+ · License: MIT · Code style: black

A comprehensive toolkit for detecting data quality issues, label leakage, and spurious correlations in machine learning datasets. This project integrates multiple validation libraries and research methods to provide a unified interface for dataset auditing.

Why Dataset Autopsy?

Data leakage and spurious correlations are among the most common causes of ML model failures in production. This toolkit helps catch these issues before they impact your models.

Features

  • Data Integrity Checks: Detect missing values, duplicates, and schema violations
  • Label Leakage Detection: Identify features that directly or indirectly encode target information
  • Spurious Correlation Detection: Find hidden biases and shortcut features
  • Multi-Format Support: Works with tabular, image, and text datasets
  • Pipeline Integration: Designed to run as a pre-training validation step (a gate sketch follows this list)
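
As a pre-training gate, a minimal sketch might look like the following. The shape of the results object (treated here as a dict with an 'issues' list) is an assumption for illustration, not the documented API:

# audit_gate.py - hypothetical CI gate around DatasetAuditor.
# The structure of `results` is assumed, not documented.
import sys

import pandas as pd
from dataset_autopsy import DatasetAuditor

data = pd.read_csv('train.csv')
auditor = DatasetAuditor()
results = auditor.audit(
    data=data.drop(columns=['target']),
    labels=data['target'],
    dataset_type='tabular',
)
auditor.report(results, output_path='audit_report.txt')

# Abort the pipeline if anything was flagged (result shape is assumed).
if results.get('issues'):
    print(f"Audit flagged {len(results['issues'])} issue(s); stopping training.")
    sys.exit(1)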

Supported Libraries

This toolkit integrates the following open-source libraries (matching the *_integration modules in the package):

  • Deepchecks: data integrity and train-test validation suites
  • Cleanlab: label-quality and label-error detection
  • EvalML: target (label) leakage checks
  • TensorFlow Data Validation (TFDV): schema inference and anomaly detection

Installation

pip install -r requirements.txt
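
Starting from a fresh clone (the URL is inferred from the repository name), the full sequence is:

git clone https://github.com/raagakarumanchi/dataset-autopsy.git
cd dataset-autopsy
pip install -r requirements.txt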

Quick Start

import pandas as pd
from dataset_autopsy import DatasetAuditor

# Load your dataset
data = pd.read_csv('your_dataset.csv')
labels = data['target']
features = data.drop(columns=['target'])

# Run comprehensive audit
auditor = DatasetAuditor()
results = auditor.audit(
    data=features,
    labels=labels,
    dataset_type='tabular'
)

# Generate detailed report
auditor.report(results, output_path='audit_report.txt')

What It Detects

  • Label Leakage: Features that directly encode target information
  • Spurious Correlations: Hidden biases and shortcut features
  • Data Quality Issues: Missing values, duplicates, schema violations
  • Train-Test Contamination: Data leakage between splits (see the sketch after this list)
  • Identifier Leakage: Time/ID fields correlating with targets
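
To exercise the train-test checks, a sketch under stated assumptions: the 'run_train_test' config key comes from the Advanced Usage section below, while the test_data and test_labels keywords are guesses at how splits might be supplied, not a documented signature.

import pandas as pd
from dataset_autopsy import DatasetAuditor

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# 'run_train_test' appears under Advanced Usage; the split keywords are assumed.
auditor = DatasetAuditor(config={'deepchecks': {'run_train_test': True}})
results = auditor.audit(
    data=train.drop(columns=['target']),
    labels=train['target'],
    dataset_type='tabular',
    test_data=test.drop(columns=['target']),   # assumed keyword
    test_labels=test['target'],                # assumed keyword
)
auditor.report(results, output_path='train_test_report.txt')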

Project Structure

dataset-autopsy/
├── dataset_autopsy/          # Main package
│   ├── __init__.py
│   ├── deepchecks_integration.py
│   ├── cleanlab_integration.py
│   ├── evalml_integration.py
│   ├── tfdv_integration.py
│   └── utils.py
├── examples/                  # Example notebooks and scripts
├── tests/                     # Unit tests
├── requirements.txt
├── LICENSE
└── README.md

Advanced Usage

Custom Configuration

config = {
    'evalml': {'leakage_threshold': 0.90},
    'deepchecks': {'run_train_test': True}
}
auditor = DatasetAuditor(config=config)

Individual Validator Usage

from dataset_autopsy.evalml_integration import EvalMLValidator

validator = EvalMLValidator({'leakage_threshold': 0.95})
results = validator.check_leakage(features, labels)

Technical Details

This toolkit demonstrates:

  • Library Integration: Seamless integration of multiple ML validation frameworks
  • Unified API Design: Clean interface abstracting complexity
  • Type Safety: Full type hints for better IDE support
  • Error Handling: Robust error handling and validation
  • Extensibility: Easy to add new validators (an illustrative sketch follows)
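
As an illustration of that extension point, a new validator could mirror the existing *_integration modules. The class below is hypothetical; how it gets wired into DatasetAuditor is not documented here.

# Hypothetical validator following the pattern of the *_integration modules.
import pandas as pd

class ConstantColumnValidator:
    """Flags features with a single unique value (no predictive signal)."""

    def __init__(self, config=None):
        self.config = config or {}

    def check(self, features: pd.DataFrame, labels: pd.Series) -> dict:
        constant = [
            col for col in features.columns
            if features[col].nunique(dropna=False) <= 1
        ]
        return {'validator': 'constant_columns', 'issues': constant}

Following the layout of the package, such a class would live in its own module under dataset_autopsy/ and be registered with the auditor.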

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE file for details.

Author

Raaga Karumanchi
