A complete end-to-end machine learning pipeline for classifying pulsar stars using the HTRU2 dataset. Includes automated data processing, model training, evaluation, comparison plots, and optional deployment via FastAPI.
- Project Overview
- Quick Start
- Usage
- Project Structure
- Features
- Configuration
- Model Performance
- Deployment
- Troubleshooting
- References
- License
- Contributing
This project classifies pulsar stars using the HTRU2 dataset from Kaggle. Pulsars are rare and scientifically valuable astronomical objects, and accurate classification is crucial for astronomical research. The project implements a complete machine learning pipeline, from data acquisition to model deployment.
Pulsars are rare neutron stars that produce valuable scientific data, but manual classification of candidates is time-consuming and error-prone. This project automates pulsar classification with machine learning to help astronomers separate genuine pulsar signals from noise.
- Source: HTRU2 Dataset from Kaggle
- Samples: 17,898 total instances
- Features: 8 numerical features derived from pulsar candidate profiles
- Target: Binary classification (0 = non-pulsar, 1 = pulsar)
- Class Distribution: Highly imbalanced (~90.84% non-pulsars, ~9.16% pulsars)
The HTRU2 dataset contains 8 features derived from the integrated pulse profile and DM-SNR curve:
- Integrated Profile Features:
  - `ip_mean`: Mean of the integrated profile
  - `ip_std`: Standard deviation of the integrated profile
  - `ip_kurtosis`: Kurtosis of the integrated profile
  - `ip_skewness`: Skewness of the integrated profile
- DM-SNR Curve Features:
  - `dm_mean`: Mean of the DM-SNR curve
  - `dm_std`: Standard deviation of the DM-SNR curve
  - `dm_kurtosis`: Kurtosis of the DM-SNR curve
  - `dm_skewness`: Skewness of the DM-SNR curve
- Target:
  - `signal`: Class label (0: noise, 1: pulsar)
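For orientation, here is a minimal sketch of loading the raw data under these column names. It assumes the raw Kaggle file is `HTRU_2.csv` without a header row; the project's actual loading and renaming logic lives in `src/data_handler.py`:

```python
import pandas as pd

# Column names matching the feature documentation above; the raw
# HTRU2 CSV ships without a header row, so names are supplied explicitly.
COLUMNS = [
    "ip_mean", "ip_std", "ip_kurtosis", "ip_skewness",
    "dm_mean", "dm_std", "dm_kurtosis", "dm_skewness",
    "signal",
]

df = pd.read_csv("data/external/HTRU_2.csv", header=None, names=COLUMNS)
print(df["signal"].value_counts(normalize=True))  # ~90.84% non-pulsar, ~9.16% pulsar
```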
- Python 3.10+
- [uv package manager](https://docs.astral.sh/uv/)
- Kaggle API credentials (`~/.kaggle/kaggle.json`)
Clone the repository:

```bash
git clone git@github.com:mchadolias/pulsar-classification.git
cd pulsar-classification
```

Set up Kaggle credentials:

```bash
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
```

Install dependencies:

```bash
# Core dependencies
uv sync

# Development tools
uv sync --extra dev

# Training dependencies
uv sync --extra training

# Everything (recommended)
uv sync --extra dev --extra training
```

Create the project directory structure:

```bash
uv run python scripts/setup_directories.py
```

This project includes a self-documented Makefile. Run:

```bash
make help
```

| Command | Description |
|---|---|
| `make train` | Run the full ML pipeline (download → preprocess → train → evaluate) |
| `make train-test` | Run a minimal version of the full pipeline for quick test runs |
| `make plot` | Generate all comparison plots (PNG + PDF) |
| `make plot-pdf` | Generate only the PDF comparison report |
| `make plot-png` | Generate only the PNG comparison plots |
| `make data` | Set up directories and run a dry-run pipeline |
| `make clean` | Remove Python caches |
| `make clean-outputs` | Remove models, metrics, and plots |
| `make install-all` | Install all dependencies |
| `make install-dev` | Install development dependencies |
| `make install-training` | Install training dependencies |
To run the full pipeline and generate comparison plots:

```bash
make train
make plot
```

```
pulsar-classification/
├── configs/ # TOML configuration files
│ ├── model.toml # Main model and training configuration
│ └── test_model.toml # Lightweight config for test/debug runs
├── data/
│ ├── external/ # Raw data downloaded from Kaggle
│ └── processed/ # Cleaned & split dataset used for training
├── deployment/
│ ├── client.py # CLI client for sending API prediction requests
│ ├── deploy_to_hf.sh # Helper script for pushing to Hugging Face Spaces
│ ├── predict.py # Local FastAPI prediction app
│ ├── examples/ # Example prediction payloads for testing the API
│ │ ├── batch_prediction.json
│ │ ├── single_prediction.json
│ │ └── test_cases.json
│ └── huggingface-space/ # Full deployable Hugging Face Space
│ ├── app.py # Main application entrypoint for HF Space
│ ├── deployment/predict.py # HF-specific prediction wrapper
│ ├── Dockerfile # HF container specification
│ ├── models/ # Prepackaged model for Space execution
│ ├── README.md
│ └── requirements.txt
├── docs/
│ ├── DEPLOYMENT.md # Instructions for Docker/HF deployment
│ └── MODEL_PERFOMANCE.md # Detailed model performance report
├── logs/ # Logs generated during pipeline execution
├── notebooks/
│ ├── 01_data_exploration.ipynb # Exploratory data analysis notebook
│ └── 02_kaggle_submission.ipynb # Notebook used for Kaggle submission
├── outputs/
│ ├── metrics/ # Saved metrics, histories, and poller state
│ │ ├── data_balance.json
│ │ ├── final_results.json
│ │ ├── model_comparison.json
│ │ ├── model_prediction_history.json
│ │ ├── training_history.json
│ │ └── poller/poller.json # Central metric store for plotting
│ ├── models/ # Serialized trained models
│ ├── plots/
│ │ ├── training/ # Per-model training plots (ROC/PR/threshold)
│ │ └── comparison/ # Combined plots from ModelComparisonPlotter
│ │ └── model_comparison_report.pdf
│ ├── predictions/ # Test predictions & probability files
│ │ ├── *_test_predictions.csv
│ │ └── *_prediction_probabilities.csv
│ └── screenshot/ # Screenshots used for documentation
├── scripts/
│ ├── run_training.py # Main training pipeline (full workflow)
│ ├── run_plotting.py # Standalone comparison plotting script
│ ├── setup_directories.py # Creates folder structure for data/outputs
│ └── debug_toml.py # Helper tool for debugging TOML configs
├── src/
│ ├── config.py # Central configuration loader
│ ├── data_handler.py # Data downloading, validation, and preprocessing
│ ├── plotter.py # ModelComparisonPlotter (all combined plots)
│ ├── poller.py # Poller: stores metrics & curves to JSON
│ ├── training.py # ModelTrainer: training, evaluation, tuning
│ ├── utils.py # Logging, saving utilities, helpers
│ └── __init__.py
├── Makefile # UV-aware workflow with train/plot/clean commands
├── Dockerfile # Production API container definition
├── pyproject.toml # Project metadata & dependency management
├── requirements.txt # Optional pinned requirements
├── uv.lock # UV lockfile (exact dependency versions)
├── setup.cfg # Linting & formatting configuration
├── LICENSE
└── README.md
```
- Automated download (Kaggle API)
- Column renaming + numerical rounding
- Missing value validation
- Train/Val/Test stratified split
- JSON export of class balance
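The stratified split and class-balance export can be sketched as follows. This is a minimal illustration assuming scikit-learn's `train_test_split`; the file paths and the 70/15/15 ratios below are hypothetical, as the real values come from the TOML config and `src/data_handler.py`:

```python
import json

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical processed file; the real path comes from the TOML config.
df = pd.read_csv("data/processed/htru2.csv")
X, y = df.drop(columns=["signal"]), df["signal"]

# Illustrative 70/15/15 stratified split preserving the ~9% pulsar rate.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, stratify=y_trainval, random_state=42
)

# Export per-split class balance, mirroring outputs/metrics/data_balance.json.
balance = {
    name: part.value_counts(normalize=True).round(4).to_dict()
    for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]
}
with open("outputs/metrics/data_balance.json", "w") as f:
    json.dump(balance, f, indent=2)
```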
- Logistic Regression, Random Forest, Gradient Boosting, XGBoost
- Hyperparameter optimisation (Grid / Random / Halving)
- F₂-optimised threshold selection (see the sketch at the end of this section)
- Feature importance extraction
- Calibration curves
- ROC & PR curves
- Confusion matrices
- F1/F2 threshold curves
- Correlation heatmap
- Summary comparison bar charts
- Poller JSON capturing all model metadata
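The F₂-optimised threshold selection mentioned above can be sketched as follows. This is a minimal illustration assuming scikit-learn's `precision_recall_curve` and hypothetical variable names, not the project's exact routine (which lives in `src/training.py`):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f2_threshold(y_val, proba_val):
    """Pick the decision threshold that maximises F2 on the validation set."""
    precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
    # F_beta = (1 + b^2) * P * R / (b^2 * P + R); beta = 2 weights recall higher.
    f2 = 5 * precision * recall / np.clip(4 * precision + recall, 1e-12, None)
    # precision/recall have one more entry than thresholds; drop the last point.
    best = np.argmax(f2[:-1])
    return thresholds[best], f2[best]

# Usage (hypothetical names):
#   proba_val = model.predict_proba(X_val)[:, 1]
#   threshold, f2 = best_f2_threshold(y_val, proba_val)
```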
Example model configuration:

```toml
[logistic_regression]
max_iter = 500
solver = "lbfgs"

[random_forest]
n_estimators = 300
max_depth = 10
```

The configuration files also control:

- Input/output paths
- Train/val/test ratios
- Random seeds
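A minimal sketch of reading these settings, assuming the standard-library `tomllib` (Python 3.11+; on 3.10 the `tomli` backport offers the same API). The project's actual loader is `src/config.py`:

```python
import tomllib  # Python 3.11+; on 3.10, `import tomli as tomllib` instead

# Parse the model configuration; tomllib requires binary mode.
with open("configs/model.toml", "rb") as f:
    config = tomllib.load(f)

rf_params = config["random_forest"]  # {'n_estimators': 300, 'max_depth': 10}
print(rf_params["n_estimators"])     # 300
```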
Selected best model: Random Forest (recall-focused, F₂-optimised)
Test set performance (optimal threshold ≈ 0.36):
- F₂-score: 0.8923
- F₁-score: 0.8819
- Recall: 0.8994
- Precision: 0.8651
- ROC–AUC: 0.9747
- PR–AUC: 0.9306
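As a consistency check, the reported scores follow directly from the Fβ definition with the listed precision and recall:

$$
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}
\quad\Rightarrow\quad
F_2 = \frac{5 \cdot 0.8651 \cdot 0.8994}{4 \cdot 0.8651 + 0.8994} \approx 0.892,
\qquad
F_1 = \frac{2 \cdot 0.8651 \cdot 0.8994}{0.8651 + 0.8994} \approx 0.882.
$$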
All four classifiers (Logistic Regression, Random Forest, Gradient Boosting, XGBoost) reach ROC–AUC values around 0.97 on the validation set, which suggests that the HTRU2 dataset is relatively easy to separate using the chosen features.
Top features for the selected Random Forest model:
1. `ip_kurtosis`
2. `ip_mean`
3. `ip_skewness`
See the full report in `docs/MODEL_PERFOMANCE.md`.
```bash
# Build and run the Docker container
docker build -t pulsar-classification-api:latest .
docker run -it -p 9696:9696 pulsar-classification-api:latest
```

API endpoints:

- `POST /predict` - Single sample prediction
- `POST /predict_batch` - Batch prediction
- `GET /health` - Service health check
- `GET /features` - Expected feature names
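A quick way to exercise the single-prediction endpoint from Python. The payload shape and feature values below are hypothetical; see `deployment/examples/single_prediction.json` for the authoritative format the API expects:

```python
import requests

# Hypothetical feature values; the payload format the API actually expects
# is documented in deployment/examples/single_prediction.json.
sample = {
    "ip_mean": 111.08, "ip_std": 45.55, "ip_kurtosis": 0.32, "ip_skewness": 0.21,
    "dm_mean": 3.12, "dm_std": 18.40, "dm_kurtosis": 8.30, "dm_skewness": 74.24,
}

resp = requests.post("http://localhost:9696/predict", json=sample, timeout=10)
resp.raise_for_status()
print(resp.json())
```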
- Docker: Local container deployment
- Fly.io: Cloud deployment (demonstrated)
- Hugging Face Spaces: mchadolias/pulsar-classification-htru2
Full Deployment Guide: See `docs/DEPLOYMENT.md` for detailed instructions.
- Port already in use

  ```bash
  docker run -it -p 9698:9696 pulsar-classification-api:latest
  ```

- Model file not found
  - Ensure `best_xgboost_model.pkl` exists in `outputs/models/`
- Feature validation errors
  - Verify that exactly 8 features are provided
  - Check that the feature order matches the `/features` endpoint
- Docker build failures
  - Build issues have been observed when using a VPN; disconnect from the VPN before rebuilding:

  ```bash
  docker build --no-cache -t pulsar-classification-api:latest .
  ```

- Kaggle API errors
  - Verify `kaggle.json` credentials
  - Check your internet connection
  - Ensure the dataset is publicly accessible
- Dependency conflicts
  - Use UV for isolated environments
  - Check Python version compatibility
  - Review dependency versions in `pyproject.toml`
```bash
# Check container status
docker ps

# View logs
docker logs <container_id>

# Health check
curl http://localhost:9696/health
```

- HTRU2 Dataset Repository (UC Irvine)
- Lyon, R. J., Stappers, B. W., et al. - Pulsar Classification
- Kaggle Pulsar Dataset
- FastAPI Documentation - Modern web framework for building APIs
- XGBoost Documentation - Scalable and accurate gradient boosting
- Scikit-learn Documentation - Machine learning in Python
- Pydantic Documentation - Data validation using Python type annotations
- UV Documentation - Fast Python package and project manager
- Docker Documentation - Containerization platform
- Uvicorn Documentation - ASGI web server implementation
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Last Updated: 08-12-2025
Last Pipeline Execution: 07-12-2025
Author: @mchadolias