
DataAiPrep

Advanced ML Data Quality Assessment Platform

Features · Installation · Quick Start · Documentation · Contributing

Python 3.8+ · MIT License · Version 2.0.0


Overview

DataAiPrep is a data preprocessing and quality assessment tool for machine learning datasets. It automates detection of the data quality issues that matter most for model training, and it extends beyond PyCaret's preprocessing with advanced leakage detection, SHAP explainability, and scalable processing with Dask.

Why DataAiPrep?

| Challenge | DataAiPrep Solution |
|---|---|
| Train-test data leakage | 97.2% detection accuracy for contamination |
| Feature selection complexity | Boruta, RFE, LASSO, mRMR with consensus voting |
| Model interpretability | True SHAP integration with TreeExplainer |
| Large dataset processing | Dask distributed computing support |
| CI/CD integration | Quality gates with Slack/Email/Webhook alerts |

Features

🔍 Core Analysis Modules

  • Advanced Missing Value Analysis: MCAR/MAR/MNAR classification with smart imputation recommendations
  • Ensemble Outlier Detection: IQR, Z-score, Isolation Forest, LOF, DBSCAN with consensus voting
  • Data Drift Detection: PSI, Kolmogorov-Smirnov test, Jensen-Shannon divergence
  • Fairness & Bias Analysis: Class imbalance, proxy detection, intersectional analysis
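
To make the drift metrics above more concrete, here is a minimal standard-library sketch of the Population Stability Index (PSI). It is an illustration only, not DataAiPrep's implementation: the function name `psi`, the equal-width binning, and the `eps` smoothing are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def psi(baseline: Sequence[float], current: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two numeric samples.

    Hypothetical helper: bins are equal-width over the baseline range.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [max(c / total, eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(psi(baseline, baseline))  # identical samples give 0.0
print(psi(baseline, shifted))   # a shifted sample gives a clearly positive value
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but suitable thresholds depend on the dataset.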

🎯 Advanced Features (v2.0)

  • Advanced Leakage Detection

    • Train-test contamination (97.2% accuracy)
    • Near-duplicate identification (94.8% accuracy)
    • Group/entity leakage (92.3% accuracy)
    • Target leakage detection
  • Automated Feature Selection

    • Boruta algorithm
    • Recursive Feature Elimination (RFE) with CV
    • LASSO/Elastic Net regularization
    • mRMR (minimum Redundancy Maximum Relevance)
    • Variance threshold filtering
    • Correlation-based filtering
    • Consensus voting across methods
  • SHAP Explainability

    • Global feature importance
    • Local instance explanations
    • Feature interaction detection
    • TreeExplainer and KernelExplainer support
  • Scalability & Performance

    • Dask distributed processing
    • Chunked processing for large files
    • Automatic dtype downcasting
    • Memory profiling and optimization
    • Intelligent sampling strategies
  • MLOps Integration

    • Quality gates for CI/CD pipelines
    • Slack notifications
    • Email reports
    • Webhook support
    • MLflow integration
    • Weights & Biases integration
  • Interactive Reporting

    • Plotly-based interactive HTML reports
    • PDF export with WeasyPrint
    • Model Cards generation
    • Data Sheets for datasets
    • Dark/Light themes

Installation

Basic Installation

pip install dataiprep

From Source

git clone https://github.com/massaoudi-lab/dataiprep.git
cd dataiprep
pip install -e .

Full Installation (with all optional features)

pip install "dataiprep[full]"
# or
pip install -r requirements.txt
pip install shap "dask[complete]" mlflow wandb

Minimal Installation (GUI only)

pip install PyQt6 PyQt6-SVG pandas numpy scikit-learn matplotlib seaborn

Quick Start

GUI Application

python main.py
# or
python main.py --gui

CLI Usage

# Basic data quality analysis
python main.py analyze data.csv --target label --output report.html

# With specific modules
python main.py analyze data.csv --modules completeness,outliers,drift

# Drift detection with baseline
python main.py analyze production.csv --baseline training.csv --target label

# Feature selection
python main.py select-features data.csv --target label --methods boruta,rfe,lasso

# Leakage detection
python main.py detect-leakage train.csv test.csv --target label

# Generate interactive report
python main.py report data.csv --output report.html --theme dark

Web Demo

python main.py --web --port 8000

Python API

from src.advanced import DataQualityPipeline
import pandas as pd

# Load data
df = pd.read_csv("your_data.csv")

# Create pipeline
pipeline = DataQualityPipeline(name="MyAnalysis")
pipeline.add_step('completeness', threshold=0.95)
pipeline.add_step('outlier_detection', methods=['iqr', 'isolation_forest', 'lof'])
pipeline.add_step('explainability')

# Run analysis
pipeline.run(data=df, target='label')

# Export results
pipeline.export('quality_report.html', format='html')

# Get recommendations
for rec in pipeline.get_recommendations(severity='high'):
    print(f"[{rec['severity']}] {rec['issue']}")
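
The `outlier_detection` step above combines several detectors. As a rough idea of what consensus voting across detectors can look like, the sketch below uses only the standard library. It is not the project's code: the model-based detectors (Isolation Forest, LOF) are swapped for a MAD-based modified z-score so the example stays dependency-free, and all function names and the `min_votes` parameter are hypothetical.

```python
import statistics
from collections import Counter
from typing import List, Sequence, Set


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> Set[int]:
    """Indices outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return {i for i, v in enumerate(values) if v < lo or v > hi}


def zscore_outliers(values: Sequence[float], threshold: float = 3.0) -> Set[int]:
    """Indices whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return set()
    return {i for i, v in enumerate(values) if abs(v - mean) / std > threshold}


def mad_outliers(values: Sequence[float], threshold: float = 3.5) -> Set[int]:
    """Indices whose modified z-score (median/MAD based) exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return set()
    return {i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold}


def consensus_outliers(values: Sequence[float], min_votes: int = 2) -> List[int]:
    """Keep indices flagged by at least `min_votes` of the three detectors."""
    votes: Counter = Counter()
    for detector in (iqr_outliers, zscore_outliers, mad_outliers):
        votes.update(detector(values))
    return sorted(i for i, n in votes.items() if n >= min_votes)


values = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 95]
print(consensus_outliers(values))  # [10]
```

Requiring agreement between detectors trades some recall for fewer false positives, which is usually the point of an ensemble here.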

Advanced Usage Examples

Feature Selection with Consensus Voting

from src.advanced import AdvancedFeatureSelector
import pandas as pd

df = pd.read_csv("data.csv")
X = df.drop(columns=['target'])
y = df['target']

selector = AdvancedFeatureSelector(random_state=42)
results = selector.select_features(
    X, y,
    methods=['boruta', 'rfe', 'lasso', 'mrmr'],
    n_features=20,
    consensus_threshold=0.5
)

# Features selected by multiple methods
consensus_features = results['consensus_features']
print(f"Consensus features: {consensus_features}")
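
For readers wondering what `consensus_threshold=0.5` might mean in practice, the following sketch shows one plausible voting rule: keep a feature when at least that fraction of methods selected it. The function `consensus_vote` and the "at least" comparison are assumptions for illustration; the actual rule in `AdvancedFeatureSelector` may differ.

```python
from collections import Counter
from typing import Dict, Iterable, List


def consensus_vote(selections: Dict[str, Iterable[str]],
                   consensus_threshold: float = 0.5) -> List[str]:
    """Return features chosen by at least `consensus_threshold` of the methods."""
    votes: Counter = Counter()
    for features in selections.values():
        votes.update(set(features))  # each method casts one vote per feature
    min_votes = consensus_threshold * len(selections)
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical per-method selections, standing in for real selector output.
selections = {
    "boruta": ["age", "income", "tenure"],
    "rfe": ["age", "income", "region"],
    "lasso": ["age", "tenure"],
    "mrmr": ["age", "clicks"],
}
print(consensus_vote(selections))  # ['age', 'income', 'tenure']
```

With four methods and a threshold of 0.5, a feature needs two votes to survive, so `region` and `clicks` are dropped.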

Leakage Detection

from src.advanced import AdvancedLeakageDetector
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

detector = AdvancedLeakageDetector(
    similarity_threshold=0.95,
    correlation_threshold=0.95
)

results = detector.detect_all(
    train_data=train_df,
    test_data=test_df,
    target_column='label',
    entity_column='customer_id'
)

if results['summary']['has_contamination']:
    print("⚠️ Train-test contamination detected!")
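
The two simplest leakage checks, exact train-test contamination and shared entities, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the `AdvancedLeakageDetector` logic: rows are plain dictionaries, matching is exact after light normalisation, and the helper names `row_fingerprint`, `find_contamination`, and `shared_entities` are hypothetical.

```python
import hashlib
from typing import Dict, Hashable, List, Sequence, Set

Row = Dict[str, object]


def row_fingerprint(row: Row, columns: Sequence[str]) -> str:
    """Hash the feature values of a row, ignoring case and outer whitespace."""
    payload = "\x1f".join(str(row[c]).strip().lower() for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_contamination(train: List[Row], test: List[Row],
                       target_column: str) -> List[int]:
    """Indices of test rows whose features also appear in the training set."""
    if not train or not test:
        return []
    # Compare features only; the target is excluded on purpose.
    columns = sorted(c for c in train[0] if c != target_column)
    train_hashes = {row_fingerprint(r, columns) for r in train}
    return [i for i, r in enumerate(test)
            if row_fingerprint(r, columns) in train_hashes]


def shared_entities(train: List[Row], test: List[Row],
                    entity_column: str) -> Set[Hashable]:
    """Entity IDs present in both splits (a sign of group leakage)."""
    return {r[entity_column] for r in train} & {r[entity_column] for r in test}


train = [{"customer_id": 1, "city": "Doha", "label": 1},
         {"customer_id": 2, "city": "Tunis", "label": 0}]
test = [{"customer_id": 1, "city": "doha ", "label": 0},
        {"customer_id": 3, "city": "Paris", "label": 1}]
print(find_contamination(train, test, target_column="label"))  # [0]
print(shared_entities(train, test, entity_column="customer_id"))  # {1}
```

Near-duplicate detection needs a similarity measure rather than exact hashing, which is presumably where `similarity_threshold` comes in.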

SHAP Explainability

from src.advanced import SHAPExplainer
import pandas as pd

df = pd.read_csv("data.csv")
X = df.drop(columns=['target'])
y = df['target']

explainer = SHAPExplainer(random_state=42)
results = explainer.explain(X, y)

# Global feature importance
for feat, importance in list(results['global_importance'].items())[:10]:
    print(f"{feat}: {importance:.4f}")

Quality Gates for CI/CD

from src.advanced import QualityGate, AlertManager, SlackChannel

# Set up quality gate
gate = QualityGate(name="ProductionReadiness")
gate.add_check('completeness', 'completeness_score', 'gte', 95, severity='error')
gate.add_check('outliers', 'outlier_percentage', 'lte', 2, severity='warning')

# Evaluate metrics
result = gate.evaluate({
    'completeness_score': 98,
    'outlier_percentage': 1.5
})

if not result['passed']:
    print("❌ Quality gate failed!")
    # Send alert
    alert_manager = AlertManager()
    slack = SlackChannel(webhook_url="YOUR_WEBHOOK_URL")
    alert_manager.add_channel('slack', slack)
    alert_manager.send_alert("Quality gate failed", severity='error')
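
To show how a gate with `gte`/`lte` checks could be evaluated and turned into a CI exit code, here is a small standalone sketch. It does not use or reproduce the `QualityGate` class: the tuple layout, the `evaluate_checks` and `exit_code` helpers, and the rule that only `error`-severity failures block the pipeline are assumptions for this example.

```python
import operator
from typing import Dict, List, Tuple

# Map comparison names to functions from the standard `operator` module.
OPERATORS = {
    "gte": operator.ge,
    "lte": operator.le,
    "gt": operator.gt,
    "lt": operator.lt,
    "eq": operator.eq,
}

# (metric name, operator name, threshold, severity)
Check = Tuple[str, str, float, str]


def evaluate_checks(checks: List[Check], metrics: Dict[str, float]) -> dict:
    """Evaluate every check; a missing metric counts as a failure."""
    failures = []
    for metric, op_name, threshold, severity in checks:
        value = metrics.get(metric)
        if value is None or not OPERATORS[op_name](value, threshold):
            failures.append({"metric": metric, "value": value,
                             "expected": f"{op_name} {threshold}",
                             "severity": severity})
    # Assumption: warnings are reported but only errors fail the gate.
    passed = not any(f["severity"] == "error" for f in failures)
    return {"passed": passed, "failures": failures}


def exit_code(result: dict) -> int:
    """0 when the gate passed, 1 otherwise, suitable for sys.exit()."""
    return 0 if result["passed"] else 1


checks = [("completeness_score", "gte", 95, "error"),
          ("outlier_percentage", "lte", 2, "warning")]
result = evaluate_checks(checks, {"completeness_score": 90,
                                  "outlier_percentage": 3})
print(result["passed"], exit_code(result))  # False 1
```

In a CI job, passing `exit_code(result)` to `sys.exit()` makes the pipeline step fail when a blocking check does not hold.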

Processing Large Datasets with Dask

from src.advanced import LargeDataProcessor

processor = LargeDataProcessor(
    chunk_size=100000,
    use_dask=True,
    n_workers=4
)

# Read large CSV efficiently
df = processor.read_large_csv("large_file.csv", optimize_memory=True)

# Get memory statistics
memory_stats = processor.get_memory_usage(df)
print(f"Memory usage: {memory_stats['total_mb']:.2f} MB")
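
The idea behind chunked processing is that only one slice of the file is held in memory at a time. The sketch below illustrates it with the standard `csv` module by counting missing values per column; it is not how `LargeDataProcessor` works internally, and the helpers `iter_chunks` and `missing_counts` are hypothetical. It does not use Dask.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO, Tuple


def iter_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `chunk_size` rows from an open CSV file."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def missing_counts(handle: TextIO,
                   chunk_size: int = 100_000) -> Tuple[int, Dict[str, int]]:
    """Return (row count, missing values per column), reading chunk by chunk."""
    totals: Dict[str, int] = {}
    rows = 0
    for chunk in iter_chunks(handle, chunk_size):
        rows += len(chunk)
        for row in chunk:
            for column, value in row.items():
                if column is None:  # extra unnamed fields on a malformed row
                    continue
                totals.setdefault(column, 0)
                if value is None or not value.strip():
                    totals[column] += 1
    return rows, totals


# In-memory stand-in for a large file, so the example runs anywhere.
sample = io.StringIO("a,b\n1,\n,2\n3,4\n")
print(missing_counts(sample, chunk_size=2))  # (3, {'a': 1, 'b': 1})
```

For a real file, open it with `open("large_file.csv", newline="")` and pass the handle in. pandas offers the same pattern through the `chunksize` argument of `pd.read_csv`.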

Project Structure

dataiprep/
├── main.py                    # Entry point (GUI, CLI, Web)
├── requirements.txt           # Dependencies
├── setup.py                   # Package setup
├── data_ai_prep_logo.svg      # Application logo
├── src/
│   ├── advanced/              # Advanced analysis modules
│   │   ├── advanced_leakage.py
│   │   ├── alerting.py
│   │   ├── drift_detector.py
│   │   ├── enhanced_reporting.py
│   │   ├── explainability.py
│   │   ├── fairness_analyzer.py
│   │   ├── feature_engineer.py
│   │   ├── feature_selection.py
│   │   ├── missing_value_analyzer.py
│   │   ├── outlier_detector.py
│   │   ├── pipeline.py
│   │   ├── scalability.py
│   │   ├── shap_explainer.py
│   │   └── time_series_analyzer.py
│   ├── analysis/              # Core analysis modules
│   │   ├── completeness_analyzer.py
│   │   ├── dimensionality_analyzer.py
│   │   ├── distribution_analyzer.py
│   │   ├── feature_quality_analyzer.py
│   │   └── leakage_detector.py
│   ├── benchmark/             # Benchmark dataset generation
│   ├── core/                  # Data loading utilities
│   ├── gui/                   # PyQt6 GUI components
│   ├── integrations/          # MLflow, W&B, CI/CD integrations
│   ├── pipeline/              # Preprocessing pipeline
│   ├── reports/               # Report generation
│   └── web/                   # FastAPI web demo
├── examples/                  # Usage examples
├── tests/                     # Unit tests
└── docs/                      # Documentation

Comparison with Other Tools

Feature DataAiPrep Pandas Profiling Great Expectations PyCaret
Train-Test Leakage Detection ✅ 97.2% Partial
Near-Duplicate Detection ✅ 94.8%
Boruta Feature Selection
SHAP Explainability
Slack/Email Alerts
Dask Distributed
Interactive Plotly Reports Partial
Quality Gates for CI/CD

Requirements

Core Dependencies

  • Python 3.8+
  • PyQt6 >= 6.6.0
  • pandas >= 2.0.0
  • NumPy >= 1.24.0
  • scikit-learn >= 1.3.0
  • matplotlib >= 3.8.0
  • seaborn >= 0.13.0
  • plotly >= 5.17.0
  • scipy >= 1.11.0

Optional Dependencies

  • SHAP (shap >= 0.44.0) - For explainability features
  • Dask (dask[complete] >= 2023.12.0) - For large-scale processing
  • MLflow (mlflow >= 2.9.0) - For experiment tracking
  • Weights & Biases (wandb >= 0.16.0) - For experiment tracking
  • WeasyPrint (weasyprint >= 60.0) - For PDF report generation

Documentation


Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Citation

If you use DataAiPrep in your research, please cite:

@article{MASSAOUDI2026102662,
title = {DataAiPrep: A comprehensive machine learning data quality assessment tool for training dataset optimization},
journal = {SoftwareX},
volume = {34},
pages = {102662},
year = {2026},
issn = {2352-7110},
doi = {https://doi.org/10.1016/j.softx.2026.102662},
url = {https://www.sciencedirect.com/science/article/pii/S235271102600155X},
author = {Mohamed Massaoudi and Maymouna {Ez Eddin}},
}

License

This project is licensed under the MIT License - see the LICENSE file for details.


Support


Made with ❤️ by the DataAiPrep Team