A fully modular, reproducible machine-learning pipeline for distinguishing electron neutrinos (signal) from muon neutrinos (background) in the MiniBooNE particle-physics dataset.
This project provides:
- Automated download, loading, cleaning, processing, and splitting of MiniBooNE PID data
- Publication-quality, physics-aware visualizations
- Machine-learning support and extensibility (XGBoost, future deep models)
- A comprehensive, structured test suite (unit + integration + performance)
- Modern project architecture following PyPA and scientific-computing best practices
- A research workflow with data lineage, reproducibility, and structured configuration
- Unified data ingestion (Kaggle or local files)
- Robust cleaning of NaNs, MiniBooNE sentinel values, and duplicates
- Physics-aware preprocessing and feature transformations
- Flexible outputs: NumPy arrays or Pandas DataFrames
- Signal–background separation plots (KDE, histograms)
- Correlation analysis and feature summary plots
- PCA, t-SNE, and other embedding visualizations
- Publication-ready scientific styling (LaTeX optional)
- Effect size computation (Cohen's d, rank-biserial; see the sketch after this list)
- Hypothesis testing and multi-comparison correction
- Feature separability scoring and ranking
- Mocked external dependencies (Kaggle API, I/O)
- Comprehensive unit tests for loader, cleaner, processor, and plotters
- Statistical validation tests
- Integration, smoke, and performance tests
- CI/CD via GitHub Actions
- YAML-driven logging (colored console, JSON optional)
- Standardized Makefile automation
- Modular and research-oriented
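The statistics bullets above map onto well-known formulas. As a hedged sketch (not the project's exact `statistical_analysis.py` API), Cohen's d and the rank-biserial correlation for a single feature could be computed like this:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

def rank_biserial(x, y):
    """Rank-biserial correlation derived from the Mann-Whitney U statistic."""
    u, _ = mannwhitneyu(x, y, alternative="two-sided")
    return 1.0 - 2.0 * u / (len(x) * len(y))

# Toy stand-ins for one feature's signal and background values
rng = np.random.default_rng(0)
signal_vals = rng.normal(loc=0.5, scale=1.0, size=1000)
background_vals = rng.normal(loc=0.0, scale=1.0, size=1000)
print(f"Cohen's d = {cohens_d(signal_vals, background_vals):.3f}")
print(f"rank-biserial = {rank_biserial(signal_vals, background_vals):.3f}")
```

In the project, the same quantities are computed per feature and ranked to score separability; the toy arrays here simply stand in for one feature's signal and background values.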
Project architecture:
```
miniboone-classification/
├── data/
│   ├── external/                     # Raw third-party datasets (e.g., MiniBooNE PID CSV)
│   └── processed/                    # Cleaned + transformed datasets ready for modeling
├── notebooks/                        # Exploratory notebooks (numbered for execution order)
│   └── 01_data_exploration.ipynb
├── src/
│   ├── config/                       # Centralized configuration & presets
│   │   ├── config.py                 # Pydantic-based config management
│   │   ├── logging.yaml              # Logging configuration
│   │   └── presets.py                # Plotting & model presets
│   ├── data/
│   │   ├── data_loader.py            # Kaggle downloader & local loader
│   │   ├── data_cleaner.py           # Missing data, outlier logic, physics adjustments
│   │   ├── data_processor.py         # Feature builders, scaling pipeline, splits
│   │   └── data_handler.py           # High-level pipeline wrapper (load → clean → process)
│   │
│   ├── plotter/
│   │   ├── base_plotter.py           # Scientific plotting setup + LaTeX styles
│   │   ├── neutrino_plotter.py       # Physics-aware plots (signal vs background, correlations)
│   │   └── dimensionality_plotter.py # PCA, t-SNE, embeddings
│   ├── stats/
│   │   └── statistical_analysis.py   # Statistical tests, effect sizes, corrections
│   ├── styles/
│   │   └── plot_style.py             # Global Matplotlib/SciPlot styling
│   └── utils/
│       ├── logger.py                 # Global logger loader (YAML-driven)
│       └── paths.py                  # Project-root resolution utilities
├── tests/
│   ├── conftest.py                   # Shared fixtures for all tests
│   ├── integration/                  # Combined integration tests
│   ├── unit/                         # Unit tests for individual module parts
│   ├── output/                       # Temporary files generated during testing
│   ├── reports/                      # Coverage & diagnostics (HTML reports)
│   └── test_smokes.py                # Quick, fast-running smoke tests
├── models/                           # Trained ML models, exports, metadata
├── logs/                             # Log files (if enabled in logging.yaml)
├── output/                           # Generated files from running the module
├── tmp/                              # Temporary scripts & scratch files
├── Makefile                          # Build, test, clean, format commands
├── pyproject.toml                    # Dependency & build configuration
├── setup.cfg                         # Linting/formatting settings
├── LICENSE
└── README.md
```

Clone the repository:

```bash
git clone https://github.com/mchadolias/miniboone-classification
cd miniboone-classification
```

Configure your Kaggle API credentials so the downloader can fetch the dataset:

```bash
mkdir -p ~/.config/kaggle
cp kaggle.json ~/.config/kaggle/
chmod 600 ~/.config/kaggle/kaggle.json
```

Install the dependencies:

```bash
uv sync   # or: pip install -r requirements.txt
```

Run the full data pipeline:

```python
from src.data.data_handler import MiniBooNEDataHandler

handler = MiniBooNEDataHandler()

# Run full pipeline
df_clean, splits, pipeline = handler.run()
X_train, y_train = splits["train"]
```

Plot signal vs background feature separation:

```python
from src.plotter.neutrino_plotter import NeutrinoPlotter

plotter = NeutrinoPlotter()
plotter.plot_feature_separation(df_clean, features=["feature_1", "feature_5"])
```

Create dimensionality-reduction embeddings:

```python
from src.plotter.dimensionality_reduction_plotter import DimensionalityReductionPlotter

dr = DimensionalityReductionPlotter()
fig = dr.plot_tsne_embedding(df_clean)
```

Common development tasks are automated via the Makefile:

```bash
make test      # full test suite
make test-dev  # fast local tests
make test-cov  # coverage
make lint      # static analysis
make format    # format with black/isort
```

The MiniBooNE detector dataset contains:
- 50 reconstructed PMT & hit-structure features
- ~93k muon-neutrino background events
- ~36k electron-neutrino signal events
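If you are working from the original UCI text release rather than the Kaggle CSV, a minimal hypothetical loader might look like the sketch below (assuming the standard UCI layout: a first line with the signal and background event counts, followed by the signal rows and then the background rows). The project's own `data_loader.py` handles the Kaggle download instead.

```python
import numpy as np
import pandas as pd

def load_raw_miniboone(path="MiniBooNE_PID.txt"):
    """Hypothetical standalone loader for the raw UCI-format text file."""
    with open(path) as f:
        # First line: number of signal events, number of background events
        n_signal, n_background = map(int, f.readline().split())
        data = np.loadtxt(f)  # remaining lines: 50 PID features per event
    df = pd.DataFrame(data, columns=[f"feature_{i + 1}" for i in range(data.shape[1])])
    # Signal (nu_e) events come first, followed by background (nu_mu) events
    df["signal"] = np.r_[np.ones(n_signal), np.zeros(n_background)].astype(int)
    return df
```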
This project handles the common MiniBooNE preprocessing steps:
- Replace MiniBooNE's sentinel missing value (-999)
- Column-wise median imputation
- Feature scaling and (optional) transforms
- Train/val/test splitting with reproducible seeds
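A minimal sketch of these preprocessing steps with scikit-learn, assuming a `signal` label column and a fixed 60/20/20 split (the project's `data_cleaner.py` and `data_processor.py` implement the configurable version):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 42  # reproducible splits

def preprocess(df):
    # Replace MiniBooNE's sentinel missing value (-999) with NaN
    X = df.drop(columns=["signal"]).replace(-999, np.nan)
    y = df["signal"].to_numpy()

    # Stratified train/val/test split with a fixed seed (60/20/20)
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=SEED
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=SEED
    )

    # Column-wise median imputation, then feature scaling (fit on train only)
    imputer = SimpleImputer(strategy="median")
    scaler = StandardScaler()
    X_train = scaler.fit_transform(imputer.fit_transform(X_train))
    X_val = scaler.transform(imputer.transform(X_val))
    X_test = scaler.transform(imputer.transform(X_test))
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```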
- Full data pipeline orchestration
- Physics-aware plotting module
- YAML logging system
- Advanced statistics module
- ML training pipeline (XGBoost, tabular NN; a baseline sketch follows this list)
- Feature importance + SHAP
- MLflow experiment tracking
- Hyperparameter search
- Add real detector-inspired feature engineering
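For the planned training step, a baseline could look roughly like the following sketch. The hyperparameters are illustrative placeholders, `splits` is the dictionary returned by `MiniBooNEDataHandler().run()` in the quickstart above, and the `"val"` key name is an assumption.

```python
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# Splits from the pipeline quickstart above (key names assumed)
X_train, y_train = splits["train"]
X_val, y_val = splits["val"]

# Illustrative XGBoost baseline; hyperparameters are placeholders, not tuned values
model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    eval_metric="auc",
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("Validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
```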
MIT license; see LICENSE.
Date: 06/12/2025
- Michalis Chadolias
- Email: mchadolias[@]gmail.com
- GitHub: https://github.com/mchadolias