MiniBooNE Particle Classification

A fully modular, reproducible machine-learning pipeline for distinguishing electron neutrinos (signal) from muon neutrinos (background) in the MiniBooNE particle-physics dataset. The project uses uv-managed environments, scikit-learn and XGBoost models, Optuna for hyperparameter optimisation, and MLflow for experiment tracking and model logging.

This project provides:

  • 🚀 Automated download, loading, cleaning, processing, and splitting of MiniBooNE PID data
  • 📊 Publication-quality, physics-aware visualizations
  • 🧠 Machine-learning support and extensibility (XGBoost, future deep models)
  • 🧪 A comprehensive, structured test suite (unit + integration + performance)
  • 🛠 Modern project architecture following PyPA and scientific-computing best practices
  • 🔍 A research workflow with data lineage, reproducibility, and structured configuration

🚀 Key Features

🧬 Pipeline

  • Unified data ingestion (Kaggle or local files)
  • Robust cleaning of NaNs, MiniBooNE sentinel values, and duplicates
  • Physics-aware preprocessing and feature transformations
  • Flexible outputs: NumPy arrays or Pandas DataFrames

📊 Visualization

  • Signal–background separation plots (KDE, histograms)
  • Correlation analysis and feature summary plots
  • PCA, t-SNE, and other embedding visualizations
  • Publication-ready scientific styling (LaTeX optional)

πŸ“ Statistical Toolkit

  • Effect size computation (Cohen's d, rank-biserial), sketched after this list
  • Hypothesis testing and multi-comparison correction
  • Feature separability scoring and ranking
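
A minimal, self-contained sketch of the two effect sizes listed above is shown here; the function names are illustrative placeholders rather than the module's real API, which lives in src/stats/statistical_analysis.py.

# Illustrative effect-size helpers (placeholder names, not the module's actual API).
import numpy as np
from scipy import stats

def cohens_d(signal: np.ndarray, background: np.ndarray) -> float:
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(signal), len(background)
    pooled_var = ((n1 - 1) * signal.var(ddof=1) + (n2 - 1) * background.var(ddof=1)) / (n1 + n2 - 2)
    return float((signal.mean() - background.mean()) / np.sqrt(pooled_var))

def rank_biserial(signal: np.ndarray, background: np.ndarray) -> float:
    """Rank-biserial correlation derived from the Mann-Whitney U statistic."""
    u, _ = stats.mannwhitneyu(signal, background, alternative="two-sided")
    return float(1.0 - 2.0 * u / (len(signal) * len(background)))

Larger |d| (or a rank-biserial value further from 0) indicates stronger signal-background separation for a feature, which is the kind of score a feature separability ranking can build on.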

🧪 Testing Framework

  • Mocked external dependencies (Kaggle API, I/O)
  • Comprehensive unit tests for loader, cleaner, processor, and plotters
  • Statistical validation tests
  • Integration, smoke, and performance tests

πŸ— Engineering & Workflow

  • CI/CD via GitHub Actions
  • YAML-driven logging (colored console, JSON optional); a loader sketch follows this list
  • Standardized Makefile automation
  • Modular and research-oriented src/ architecture
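
For orientation, here is a minimal sketch of how a YAML-driven logger setup typically works; the paths and logger name are assumptions, and the project's real loader is src/utils/logger.py reading src/config/logging.yaml.

# Minimal sketch of YAML-driven logging (paths and logger name are assumptions).
import logging
import logging.config
from pathlib import Path

import yaml  # PyYAML

def setup_logging(config_path: Path = Path("src/config/logging.yaml")) -> logging.Logger:
    """Load a dictConfig-style YAML file and return the project logger."""
    with config_path.open() as fh:
        logging.config.dictConfig(yaml.safe_load(fh))
    return logging.getLogger("miniboone")

Keeping handlers and formatters in logging.yaml means the colored-console and JSON options can be toggled without touching code.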

πŸ“ Project Structure

miniboone-classification/
├── data/
│   ├── external/                        # Raw third-party datasets (e.g., MiniBooNE PID CSV)
│   └── processed/                       # Cleaned + transformed datasets ready for modeling
├── notebooks/                           # Exploratory notebooks (numbered for execution order)
│   └── 01_data_exploration.ipynb
├── src/
│   ├── config/                          # Centralized configuration & presets
│   │   ├── config.py                    # Pydantic-based config management
│   │   ├── logging.yaml                 # Logging configuration
│   │   └── presets.py                   # Plotting & model presets
│   ├── data/
│   │   ├── data_loader.py               # Kaggle downloader & local loader
│   │   ├── data_cleaner.py              # Missing data, outlier logic, physics adjustments
│   │   ├── data_processor.py            # Feature builders, scaling pipeline, splits
│   │   └── data_handler.py              # High-level pipeline wrapper (load → clean → process)
│   │
│   ├── plotter/
│   │   ├── base_plotter.py              # Scientific plotting setup + LaTeX styles
│   │   ├── neutrino_plotter.py          # Physics-aware plots (signal vs background, correlations)
│   │   └── dimensionality_plotter.py    # PCA, t-SNE, embeddings
│   ├── stats/
│   │   └── statistical_analysis.py      # Statistical tests, effect sizes, corrections
│   ├── styles/
│   │   └── plot_style.py                # Global Matplotlib/SciPlot styling
│   └── utils/
│       ├── logger.py                    # Global logger loader (YAML-driven)
│       └── paths.py                     # Project-root resolution utilities
├── tests/
│   ├── conftest.py                      # Shared fixtures for all tests
│   ├── integration/                     # Combined integration tests
│   ├── unit/                            # Unit tests for individual module parts
│   ├── output/                          # Temporary files generated during testing
│   ├── reports/                         # Coverage & diagnostics (HTML reports)
│   └── test_smokes.py                   # Quick, fast-running smoke tests
├── models/                              # Trained ML models, exports, metadata
├── logs/                                # Log files (if enabled in logging.yaml)
├── output/                              # Generated files from running the module
├── tmp/                                 # Temporary scripts & scratch files
├── Makefile                             # Build, test, clean, format commands
├── pyproject.toml                       # Dependency & build configuration
├── setup.cfg                            # Linting/formatting settings
├── LICENSE
└── README.md

🛠 Installation

git clone https://github.com/mchadolias/miniboone-classification
cd miniboone-classification

Kaggle setup

mkdir -p ~/.config/kaggle
cp kaggle.json ~/.config/kaggle/
chmod 600 ~/.config/kaggle/kaggle.json

Install dependencies

uv sync    # or: pip install -r requirements.txt

🎯 Usage Example

Load → Clean → Process the dataset

from src.data.data_handler import MiniBooNEDataHandler

handler = MiniBooNEDataHandler()

# Run full pipeline
df_clean, splits, pipeline = handler.run()

X_train, y_train = splits["train"]

Generate physics plots

from src.plotter.neutrino_plotter import NeutrinoPlotter

plotter = NeutrinoPlotter()
plotter.plot_feature_separation(df_clean, features=["feature_1", "feature_5"])

Dimensionality reduction

from src.plotter.dimensionality_reduction_plotter import DimensionalityReductionPlotter

dr = DimensionalityReductionPlotter()
fig = dr.plot_tsne_embedding(df_clean)

🧪 Testing Command Sheet

make test            # full test suite
make test-dev        # fast local tests
make test-cov        # coverage
make lint            # static analysis
make format          # format with black/isort

📊 About the Data

The MiniBooNE detector dataset contains:

  • 50 reconstructed PMT & hit-structure features
  • ~93k muon-neutrino background events
  • ~36k electron-neutrino signal events

This project handles the common MiniBooNE preprocessing steps (a condensed sketch follows the list):

  • Replace MiniBooNE's sentinel missing value (-999)
  • Column-wise median imputation
  • Feature scaling and (optional) transforms
  • Train/val/test splitting with reproducible seeds
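
A condensed sketch of these steps with pandas and scikit-learn is shown here; the column names, split sizes, and the function itself are placeholders, while the project's own implementation lives in src/data/data_cleaner.py and src/data/data_processor.py.

# Placeholder sketch of the preprocessing steps above (pandas + scikit-learn).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame, label_col: str = "label", seed: int = 42):
    # Replace the MiniBooNE sentinel value -999 with NaN so it can be imputed.
    X = df.drop(columns=[label_col]).replace(-999, np.nan)
    y = df[label_col]

    # Column-wise median imputation followed by feature scaling.
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])

    # Reproducible, stratified split (a validation split is made the same way).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    X_train = pipeline.fit_transform(X_train)  # fit only on the training fold
    X_test = pipeline.transform(X_test)
    return (X_train, y_train), (X_test, y_test), pipeline

Fitting the imputer and scaler on the training fold only, then applying them to the held-out folds, keeps the evaluation free of data leakage.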

📚 Roadmap

  • Full data pipeline orchestration
  • Physics-aware plotting module
  • YAML logging system
  • Advanced statistics module
  • ML training pipeline (XGBoost, tabular NN); an illustrative sketch follows this list
  • Feature importance + SHAP
  • MLflow experiment tracking
  • Hyperparameter search
  • Add real detector-inspired feature engineering
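
As an illustration of how the planned training items above could fit together, here is a hedged sketch combining XGBoost, Optuna, and MLflow; every name, metric, and parameter range is a placeholder rather than the project's final design.

# Illustrative sketch of the planned training workflow (XGBoost + Optuna + MLflow).
import mlflow
import optuna
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def objective(trial, X_train, y_train, X_val, y_val):
    # Placeholder search space; the real one would be tuned to the dataset.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    with mlflow.start_run(nested=True):
        model = XGBClassifier(**params, eval_metric="logloss")
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        mlflow.log_params(params)
        mlflow.log_metric("val_auc", auc)
    return auc

# study = optuna.create_study(direction="maximize")
# study.optimize(lambda t: objective(t, X_tr, y_tr, X_va, y_va), n_trials=50)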

πŸ“ License

MIT; see LICENSE.

📅 Last Updated

Date: 06/12/2025

👤 Author

mchadolias (https://github.com/mchadolias)