🌌 Pulsar Star Classification Project

A complete end-to-end machine learning pipeline for classifying pulsar stars using the HTRU2 dataset. Includes automated data processing, model training, evaluation, comparison plots, and optional deployment via FastAPI.


📋 Project Overview

This project focuses on classifying pulsar stars using the HTRU2 dataset from Kaggle. Pulsars are rare and valuable astronomical objects, and accurate classification is crucial for astronomical research. The project implements a complete machine learning pipeline from data acquisition to model deployment.

🎯 Business Problem

Pulsars are rare neutron stars that produce valuable scientific data. Manual classification is time-consuming and prone to error. This project aims to automate pulsar classification using machine learning to assist astronomers in identifying genuine pulsar signals from noise.

📊 Dataset

  • Source: HTRU2 Dataset from Kaggle
  • Samples: 17,898 total instances
  • Features: 8 numerical features derived from pulsar candidate profiles
  • Target: Binary classification (0 = non-pulsar, 1 = pulsar)
  • Class Distribution: Highly imbalanced (~90.84% non-pulsars, ~9.16% pulsars)

Dataset Features

The HTRU2 dataset contains 8 features derived from the integrated pulse profile and DM-SNR curve:

  1. Integrated Profile Features:

    • ip_mean: Mean of the integrated profile
    • ip_std: Standard deviation of the integrated profile
    • ip_kurtosis: Kurtosis of the integrated profile
    • ip_skewness: Skewness of the integrated profile
  2. DM-SNR Curve Features:

    • dm_mean: Mean of the DM-SNR curve
    • dm_std: Standard deviation of the DM-SNR curve
    • dm_kurtosis: Kurtosis of the DM-SNR curve
    • dm_skewness: Skewness of the DM-SNR curve
  3. Target:

    • signal: Class label (0: noise, 1: pulsar)
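
For orientation, a single candidate can be represented as a flat record keyed by the feature names above. A minimal sketch in Python (the numeric values are made-up placeholders, not real HTRU2 measurements):

# Hypothetical candidate record using the feature names above.
# Values are illustrative placeholders only.
candidate = {
    "ip_mean": 111.08,
    "ip_std": 48.35,
    "ip_kurtosis": 0.02,
    "ip_skewness": -0.25,
    "dm_mean": 2.84,
    "dm_std": 18.20,
    "dm_kurtosis": 8.30,
    "dm_skewness": 74.80,
}
# signal (0 or 1) is the label and is not part of the model input.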

🚀 Quick Start

Prerequisites

  • Python with the uv package manager
  • A Kaggle account and API token (kaggle.json)
  • Docker (optional, for API deployment)
  • make (optional, for the Makefile workflow)

Installation

Clone the repository:

git clone git@github.com:mchadolias/pulsar-classification.git
cd pulsar-classification

Set up Kaggle API

mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
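
With the credentials in place, the pipeline downloads the dataset automatically. The same download can also be triggered from Python via the official kaggle package, as in this minimal sketch (the dataset slug is a placeholder; check the pipeline code for the one actually used):

# Sketch: download a Kaggle dataset into data/external/ (slug is a placeholder).
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json
api.dataset_download_files("<owner>/<dataset>", path="data/external", unzip=True)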

Install dependencies via uv

# Core dependencies
uv sync

# Development tools
uv sync --extra dev

# Training dependencies
uv sync --extra training

# Everything (recommended)
uv sync --extra dev --extra training

Create data directories

uv run python scripts/setup_directories.py

Makefile Commands

This project includes a self-documenting Makefile. Run:

make help

Common commands

| Command | Description |
| --- | --- |
| make train | Run the full ML pipeline (download → preprocess → train → evaluate) |
| make train-test | Run a minimal pipeline for quick test/debug runs (same stages, lightweight config) |
| make plot | Generate all comparison plots (PNG + PDF) |
| make plot-pdf | Generate only the PDF comparison report |
| make plot-png | Generate only the PNG comparison plots |
| make data | Set up directories and run a dry-run pipeline |
| make clean | Remove Python caches |
| make clean-outputs | Remove models, metrics, and plots |
| make install-all | Install all dependencies |
| make install-dev | Install development dependencies |
| make install-training | Install training dependencies |

Usage

Full training pipeline

make train

Generate comparison plots without re-training

make plot

Project Structure

pulsar-classification/
├── configs/                         # TOML configuration files
│   ├── model.toml                   # Main model and training configuration
│   └── test_model.toml              # Lightweight config for test/debug runs
├── data/
│   ├── external/                    # Raw data downloaded from Kaggle
│   └── processed/                   # Cleaned & split dataset used for training
├── deployment/
│   ├── client.py                    # CLI client for sending API prediction requests
│   ├── deploy_to_hf.sh              # Helper script for pushing to Hugging Face Spaces
│   ├── predict.py                   # Local FastAPI prediction app
│   ├── examples/                    # Example prediction payloads for testing the API
│   │   ├── batch_prediction.json
│   │   ├── single_prediction.json
│   │   └── test_cases.json
│   └── huggingface-space/           # Full deployable Hugging Face Space
│       ├── app.py                   # Main application entrypoint for HF Space
│       ├── deployment/predict.py    # HF-specific prediction wrapper
│       ├── Dockerfile               # HF container specification
│       ├── models/                  # Prepackaged model for Space execution
│       ├── README.md
│       └── requirements.txt
├── docs/
│   ├── DEPLOYMENT.md                # Instructions for Docker/HF deployment
│   └── MODEL_PERFOMANCE.md          # Detailed model performance report
├── logs/                            # Logs generated during pipeline execution
├── notebooks/
│   ├── 01_data_exploration.ipynb    # Exploratory data analysis notebook
│   └── 02_kaggle_submission.ipynb   # Notebook used for Kaggle submission
├── outputs/
│   ├── metrics/                     # Saved metrics, histories, and poller state
│   │   ├── data_balance.json
│   │   ├── final_results.json
│   │   ├── model_comparison.json
│   │   ├── model_prediction_history.json
│   │   ├── training_history.json
│   │   └── poller/poller.json       # Central metric store for plotting
│   ├── models/                      # Serialized trained models
│   ├── plots/
│   │   ├── training/                # Per-model training plots (ROC/PR/threshold)
│   │   └── comparison/              # Combined plots from ModelComparisonPlotter
│   │       └── model_comparison_report.pdf
│   ├── predictions/                 # Test predictions & probability files
│   │   ├── *_test_predictions.csv
│   │   └── *_prediction_probabilities.csv
│   └── screenshot/                  # Screenshots used for documentation
├── scripts/
│   ├── run_training.py              # Main training pipeline (full workflow)
│   ├── run_plotting.py              # Standalone comparison plotting script
│   ├── setup_directories.py         # Creates folder structure for data/outputs
│   └── debug_toml.py                # Helper tool for debugging TOML configs
├── src/
│   ├── config.py                    # Central configuration loader
│   ├── data_handler.py              # Data downloading, validation, and preprocessing
│   ├── plotter.py                   # ModelComparisonPlotter (all combined plots)
│   ├── poller.py                    # Poller: stores metrics & curves to JSON
│   ├── training.py                  # ModelTrainer: training, evaluation, tuning
│   ├── utils.py                     # Logging, saving utilities, helpers
│   └── __init__.py
├── Makefile                         # UV-aware workflow with train/plot/clean commands
├── Dockerfile                       # Production API container definition
├── pyproject.toml                   # Project metadata & dependency management
├── requirements.txt                 # Optional pinned requirements
├── uv.lock                          # UV lockfile (exact dependency versions)
├── setup.cfg                        # Linting & formatting configuration
├── LICENSE
└── README.md

🔧 Features

Data Pipeline

  • Automated download (Kaggle API)
  • Column renaming + numerical rounding
  • Missing value validation
  • Train/Val/Test stratified split (see the sketch after this list)
  • JSON export of class balance
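
A minimal sketch of the stratified split step, assuming a pandas DataFrame df with a signal target column and illustrative 70/15/15 ratios (the real ratios come from the configuration):

# Sketch: two-stage stratified split into train/validation/test sets.
from sklearn.model_selection import train_test_split

train_df, temp_df = train_test_split(
    df, test_size=0.30, stratify=df["signal"], random_state=42
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, stratify=temp_df["signal"], random_state=42
)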

Model Training

  • Logistic Regression, Random Forest, Gradient Boosting, XGBoost
  • Hyperparameter optimisation (Grid / Random / Halving)
  • F₂-optimised threshold selection (see the sketch after this list)
  • Feature importance extraction
  • Calibration curves
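
A minimal sketch of what F₂-optimised threshold selection can look like, assuming validation labels y_val and predicted probabilities proba (the names are illustrative, not the project's actual API):

# Sketch: choose the probability threshold that maximises F2 on validation data.
import numpy as np
from sklearn.metrics import fbeta_score

thresholds = np.linspace(0.05, 0.95, 91)
scores = [fbeta_score(y_val, (proba >= t).astype(int), beta=2) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]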

Outputs

  • ROC & PR curves
  • Confusion matrices
  • F1/F2 threshold curves
  • Correlation heatmap
  • Summary comparison bar charts
  • Poller JSON capturing all model metadata

Configuration

configs/model.toml

[logistic_regression]
max_iter = 500
solver = "lbfgs"

[random_forest]
n_estimators = 300
max_depth = 10

Data configuration (src/config.py)

  • Input/output paths
  • Train/val/test ratios
  • Random seeds
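
A minimal sketch of reading such a TOML section with the standard-library tomllib (Python 3.11+); the exact loader in src/config.py may differ:

# Sketch: load per-model hyperparameters from the TOML config.
import tomllib

with open("configs/model.toml", "rb") as f:
    config = tomllib.load(f)

rf_params = config["random_forest"]  # e.g. {"n_estimators": 300, "max_depth": 10}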

📈 Model Performance

Selected best model: Random Forest (recall-focused, F₂-optimised)

Test set performance (optimal threshold ≈ 0.36)

  • F₂-score: 0.8923
  • F₁-score: 0.8819
  • Recall: 0.8994
  • Precision: 0.8651
  • ROC–AUC: 0.9747
  • PR–AUC: 0.9306
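
As a quick sanity check, the reported F₂ value is consistent with the precision/recall pair via the standard F-beta formula:

# F-beta from precision and recall: (1 + b^2) * P * R / (b^2 * P + R)
precision, recall, beta = 0.8651, 0.8994, 2
f2 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(round(f2, 4))  # ≈ 0.8923, matching the reported F2-score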

All four classifiers (Logistic Regression, Random Forest, Gradient Boosting, XGBoost) reach ROC–AUC values around 0.97 on the validation set, which suggests that the HTRU2 dataset is relatively easy to separate using the chosen features.

Top features for the selected Random Forest model:

  1. ip_kurtosis
  2. ip_mean
  3. ip_skewness

See the full report in docs/MODEL_PERFOMANCE.md.


🐳 Model Deployment

Quick API Deployment

# Build and run Docker container
docker build -t pulsar-classification-api:latest .
docker run -it -p 9696:9696 pulsar-classification-api:latest

API Endpoints

  • POST /predict - Single sample prediction
  • POST /predict_batch - Batch prediction
  • GET /health - Service health check
  • GET /features - Expected feature names
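
A minimal sketch of calling the single-prediction endpoint with Python's requests library, assuming the container above is running locally. The payload keys follow the dataset feature names, but the authoritative request schema is defined by the API and the payloads in deployment/examples/:

# Sketch: send one candidate to POST /predict and print the response.
import requests

sample = {
    "ip_mean": 111.08, "ip_std": 48.35, "ip_kurtosis": 0.02, "ip_skewness": -0.25,
    "dm_mean": 2.84, "dm_std": 18.20, "dm_kurtosis": 8.30, "dm_skewness": 74.80,
}
response = requests.post("http://localhost:9696/predict", json=sample)
print(response.json())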

Deployment Options

  • Local Docker container (see the Quick API Deployment commands above)
  • Hugging Face Spaces, via deployment/deploy_to_hf.sh and the deployment/huggingface-space/ directory

Full Deployment Guide: See docs/DEPLOYMENT.md for detailed instructions.

🔧 Troubleshooting

Common Issues

  1. Port already in use

    docker run -it -p 9698:9696 pulsar-classification-api:latest
  2. Model file not found

    • Ensure best_xgboost_model.pkl exists in outputs/models/
  3. Feature validation errors

    • Verify exactly 8 features are provided
    • Check feature order matches /features endpoint
  4. Docker build failures

    • Docker build issues have been observed when connected to a VPN; disconnect from the VPN before rebuilding:
    docker build --no-cache -t pulsar-classification-api:latest .
  5. Kaggle API Errors

    • Verify kaggle.json credentials
    • Check internet connection
    • Ensure dataset is publicly accessible
  6. Dependency Conflicts

    • Use UV for isolated environments
    • Check Python version compatibility
    • Review dependency versions in pyproject.toml

Monitoring

# Check container status
docker ps

# View logs
docker logs <container_id>

# Health check
curl http://localhost:9696/health

📚 References

Academic & Dataset References

  1. HTRU2 Dataset Repository (UC Irvine)
  2. Lyon, R. J., Stappers, B. W., et al. (2016). Fifty Years of Pulsar Candidate Selection: From Simple Filters to a New Principled Real-Time Classification Approach. MNRAS, 459(1), 1104–1123.
  3. Kaggle Pulsar Dataset

Key Library Documentation

  1. FastAPI Documentation - Modern web framework for building APIs
  2. XGBoost Documentation - Scalable and accurate gradient boosting
  3. Scikit-learn Documentation - Machine learning in Python
  4. Pydantic Documentation - Data validation using Python type annotations
  5. UV Documentation - Fast Python package and project manager
  6. Docker Documentation - Containerization platform
  7. Uvicorn Documentation - ASGI web server implementation

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Last Updated: 08-12-2025
Last Pipeline Execution: 07-12-2025
Author: @mchadolias
