- Getting Started
- GUI Application
- Command Line Interface
- Python API
- Analysis Modules
- Advanced Features
- Configuration
- Troubleshooting
## Getting Started

Install from PyPI or from source:

```bash
# Basic installation
pip install dataiprep

# Full installation with all features (quoted so shells like zsh do not expand the brackets)
pip install "dataiprep[full]"

# From source
git clone https://github.com/massaoudi-lab/dataiprep.git
cd dataiprep
pip install -e .
```

Run a quick analysis from Python:

```python
from src.advanced import DataQualityPipeline
import pandas as pd

# Load your data
df = pd.read_csv("your_data.csv")

# Create and run pipeline
pipeline = DataQualityPipeline(name="QuickAnalysis")
pipeline.add_step('completeness')
pipeline.add_step('outlier_detection')
pipeline.run(data=df)

# View results
print(pipeline.get_summary())
```

## GUI Application

Launch the GUI:

```bash
python main.py
# or
python main.py --gui
```

Main features:

- Data Import: Drag and drop, or use the File menu, to load CSV, Excel, or Parquet files
- Analysis Tabs: View results for each analysis module
- Export: Save reports in HTML, JSON, or PDF format

Typical workflow:

- Load your data file
- Select a target column (optional, for supervised analysis)
- Choose the analysis modules
- Click the "Analyze" button
- View the results in the respective tabs

## Command Line Interface

```bash
# Analyze a CSV file
python main.py analyze data.csv --target label --output report.html

# Specify modules
python main.py analyze data.csv --modules completeness,outliers,drift

# With baseline for drift detection
python main.py analyze current.csv --baseline train.csv --target label
```

Available commands:

| Command | Description |
|---|---|
| `analyze` | Run data quality analysis |
| `select-features` | Advanced feature selection |
| `detect-leakage` | Train-test leakage detection |
| `report` | Generate interactive report |

Feature selection from the command line:

```bash
python main.py select-features data.csv \
    --target label \
    --methods boruta,rfe,lasso,mrmr \
    --n-features 20 \
    --output features.json
```

Leakage detection from the command line:

```bash
python main.py detect-leakage train.csv test.csv \
    --target label \
    --entity customer_id \
    --output leakage_report.json
```

## Python API

The main entry point for comprehensive analysis:

```python
from src.advanced import DataQualityPipeline

# df is the DataFrame to analyze; train_data is the baseline DataFrame
# used for drift detection. Both are assumed to be loaded beforehand.
pipeline = DataQualityPipeline(name="MyAnalysis")

# Add analysis steps
pipeline.add_step('completeness', threshold=0.95)
pipeline.add_step('outlier_detection', methods=['iqr', 'zscore', 'isolation_forest'])
pipeline.add_step('drift_detection', baseline=train_data)
pipeline.add_step('fairness', protected_cols=['gender', 'race'])

# Run pipeline
pipeline.run(data=df, target='label', verbose=True)

# Get results
summary = pipeline.get_summary()
recommendations = pipeline.get_recommendations()

# Export
pipeline.export('report.html', format='html')
```

The individual analyzers can also be used on their own:

```python
from src.advanced import (
    AdvancedMissingAnalyzer,
    EnsembleOutlierDetector,
    DriftAnalyzer,
    FairnessAnalyzer
)

# Missing value analysis
analyzer = AdvancedMissingAnalyzer()
results = analyzer.analyze(df)

# Outlier detection
detector = EnsembleOutlierDetector(methods=['iqr', 'zscore', 'isolation_forest'])
results = detector.detect(df)

# Drift detection (current_data and baseline_data are DataFrames loaded beforehand)
drift = DriftAnalyzer()
results = drift.analyze(current_data, baseline_data)

# Fairness analysis
fairness = FairnessAnalyzer(protected_columns=['gender'])
results = fairness.analyze(df, target='label')
```

## Analysis Modules

The missing value module analyzes missing values and their patterns:

- Total missing percentage
- Missing by column
- MCAR/MAR/MNAR classification
- Imputation recommendations
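
The first two metrics in this list are easy to reproduce by hand. The following is a minimal sketch, not DataIPrep's own implementation: the helper names `is_missing` and `missing_summary` are hypothetical, and the input is assumed to be a mapping of column name to values (the shape that `df.to_dict('list')` produces). `None` and NaN are both counted as missing.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_summary(columns):
    """Return total and per-column missing percentages.

    `columns` maps column name -> list of values (hypothetical input shape).
    """
    per_column = {}
    total_cells = 0
    total_missing = 0
    for name, values in columns.items():
        n_missing = sum(1 for v in values if is_missing(v))
        per_column[name] = 100.0 * n_missing / len(values) if values else 0.0
        total_cells += len(values)
        total_missing += n_missing
    total_pct = 100.0 * total_missing / total_cells if total_cells else 0.0
    return {"total_missing_pct": total_pct, "missing_by_column": per_column}


data = {
    "age": [34, None, 51, 29],
    "income": [52000.0, float("nan"), float("nan"), 61000.0],
}
summary = missing_summary(data)
print(summary)
```

Classifying the mechanism (MCAR/MAR/MNAR) needs statistical tests on top of these counts and is outside the scope of this sketch.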

Outlier detection uses a multi-method ensemble approach:
- IQR (Interquartile Range)
- Z-score
- Isolation Forest
- Local Outlier Factor (LOF)
- DBSCAN clustering
- Consensus voting
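
To illustrate consensus voting, here is a minimal sketch that combines only the IQR and Z-score detectors from the list above, using the standard library. It is not the `EnsembleOutlierDetector` implementation; the function names, the `min_votes` parameter, and the default thresholds are assumptions made for illustration.

```python
import statistics


def iqr_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def zscore_flags(values, threshold=2.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [False] * len(values)
    return [abs(v - mean) / std > threshold for v in values]


def consensus_outliers(values, min_votes=2, z_threshold=2.0):
    """Return indices flagged by at least `min_votes` of the detectors."""
    votes = [iqr_flags(values), zscore_flags(values, threshold=z_threshold)]
    return [i for i in range(len(values)) if sum(f[i] for f in votes) >= min_votes]


readings = [10, 11, 12, 11, 10, 12, 11, 10, 12, 95]
print(consensus_outliers(readings))
```

The sketch uses a z-score threshold of 2.0 rather than the more common 3.0 because a single extreme value inflates the standard deviation in a small sample, which caps how large any z-score can get. Requiring agreement between detectors trades some sensitivity for fewer false positives.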

Drift detection identifies distribution changes between baseline and current data:
- Population Stability Index (PSI)
- Kolmogorov-Smirnov test
- Jensen-Shannon divergence
- Chi-square test (categorical)
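
As an illustration of the first metric, the sketch below computes the Population Stability Index for one numeric column. It is a simplified stand-in, not the `DriftAnalyzer` implementation: the function name `psi`, the equal-width binning over the baseline range, and the `eps` floor for empty bins are all assumptions of this example.

```python
import math


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are equal-width over the baseline range; current values
    outside that range fall into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = list(range(100))
shifted = [v + 30 for v in baseline]
print(psi(baseline, baseline), psi(baseline, shifted))
```

A commonly cited rule of thumb treats PSI below 0.1 as little change and above 0.25 as a significant shift, though suitable cut-offs depend on the binning and the application.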

Fairness analysis covers bias and fairness assessment:
- Class imbalance
- Difference in proportions
- Proxy detection
- Intersectional analysis
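
The difference-in-proportions check can be sketched in a few lines: compute the positive-label rate within each group of a protected attribute, then take the gap between the highest and lowest rates. This is an illustrative sketch only; `positive_rates` and `max_rate_gap` are hypothetical helpers, not part of `FairnessAnalyzer`.

```python
from collections import defaultdict


def positive_rates(groups, labels, positive=1):
    """Share of positive labels within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for g, y in zip(groups, labels):
        totals[g] += 1
        positives[g] += int(y == positive)
    return {g: positives[g] / totals[g] for g in totals}


def max_rate_gap(rates):
    """Largest difference in positive rate between any two groups."""
    return max(rates.values()) - min(rates.values())


gender = ["F", "F", "F", "F", "M", "M", "M", "M"]
label = [1, 0, 0, 0, 1, 1, 1, 0]
rates = positive_rates(gender, label)
print(rates, max_rate_gap(rates))
```

A gap near zero means the groups receive the positive label at similar rates in the data; a large gap is a signal to investigate, not proof of bias on its own.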

## Advanced Features

Feature selection:

```python
from src.advanced import AdvancedFeatureSelector

# X is the feature matrix and y the target, both prepared beforehand
selector = AdvancedFeatureSelector(random_state=42)
results = selector.select_features(
    X, y,
    methods=['boruta', 'rfe', 'lasso', 'mrmr'],
    n_features=20,
    consensus_threshold=0.5
)

# Get consensus features
consensus = results['consensus_features']
```

SHAP explanations:

```python
from src.advanced import SHAPExplainer

explainer = SHAPExplainer(random_state=42)
results = explainer.explain(X, y)

# Global importance
for feat, importance in results['global_importance'].items():
    print(f"{feat}: {importance:.4f}")
```

Train-test leakage detection:

```python
from src.advanced import AdvancedLeakageDetector

detector = AdvancedLeakageDetector()
results = detector.detect_all(
    train_data,
    test_data,
    target_column='label',
    entity_column='customer_id'
)

if results['summary']['has_contamination']:
    print("Warning: Train-test contamination detected!")
```

Large datasets:

```python
from src.advanced import LargeDataProcessor

processor = LargeDataProcessor(
    chunk_size=100000,
    use_dask=True,
    n_workers=4
)
df = processor.read_large_csv("large_file.csv", optimize_memory=True)
```

Quality gates:

```python
from src.advanced import QualityGate

gate = QualityGate(name="ProductionReadiness")
gate.add_check('completeness', 'completeness_score', 'gte', 95, severity='error')
gate.add_check('outliers', 'outlier_percentage', 'lte', 2, severity='warning')

# metrics holds the quality metrics computed beforehand
result = gate.evaluate(metrics)
if not result['passed']:
    print("Quality gate failed!")
```

## Configuration

Environment variables:

| Variable | Description | Default |
|---|---|---|
| `DATAIPREP_LOG_LEVEL` | Logging level | `INFO` |
| `DATAIPREP_THEME` | GUI theme | `dark` |
| `DATAIPREP_WORKERS` | Dask workers | `4` |
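
How DataIPrep reads these variables internally is not documented here. The sketch below only shows the usual pattern for resolving such variables against the defaults from the table; `DEFAULTS` and `load_settings` are hypothetical names, and converting `DATAIPREP_WORKERS` to an integer is an assumption.

```python
import os

# Defaults taken from the table above
DEFAULTS = {
    "DATAIPREP_LOG_LEVEL": "INFO",
    "DATAIPREP_THEME": "dark",
    "DATAIPREP_WORKERS": "4",
}


def load_settings(environ=None):
    """Resolve each variable from the environment, falling back to its default."""
    environ = os.environ if environ is None else environ
    settings = {name: environ.get(name, default) for name, default in DEFAULTS.items()}
    # The worker count is assumed to be an integer
    settings["DATAIPREP_WORKERS"] = int(settings["DATAIPREP_WORKERS"])
    return settings


print(load_settings({"DATAIPREP_THEME": "light"}))
```

Passing a plain dict instead of `os.environ` makes the resolution easy to check without touching the real environment.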

Create `dataiprep.yaml`:

```yaml
analysis:
  completeness_threshold: 0.95
  outlier_methods:
    - iqr
    - zscore
    - isolation_forest
reporting:
  theme: dark
  format: html
alerting:
  slack_webhook: "https://hooks.slack.com/..."
  email_recipients:
    - team@example.com
```

## Troubleshooting

PyQt6 not found:

```bash
pip install PyQt6 PyQt6-SVG
```

SHAP not working:

```bash
pip install shap
```

Memory errors with large files:

```python
from src.advanced import LargeDataProcessor

processor = LargeDataProcessor(use_dask=True)
df = processor.read_large_csv("large_file.csv")
```

Import errors:

```bash
# Ensure you're in the correct directory
cd dataiprep
pip install -e .
```
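
Before reinstalling anything, it can help to check which modules the active Python environment can actually find. The following is a small diagnostic sketch using only the standard library; `check_modules` is a hypothetical helper, and the module list is an assumption based on the packages mentioned in this section.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# PyQt6 and shap are the optional dependencies mentioned above
status = check_modules(["PyQt6", "shap", "pandas"])
for name, available in status.items():
    print(f"{name}: {'ok' if available else 'missing'}")
```

If a module shows as missing even after installation, the `pip` used for the install likely belongs to a different environment than the `python` running the check.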