An end-to-end MLOps pipeline for predicting student academic risk (Graduate, Dropout, Enrolled). Features data versioning, experiment tracking (MLflow), hyperparameter tuning, FastAPI deployment, Docker containerization, and CI/CD automation with GitHub Actions.
This project implements a machine learning pipeline that predicts student academic risk in higher education. The model classifies each student into one of three categories: Graduate, Dropout, or Enrolled.
It is built with a focus on MLOps best practices, demonstrating how to move from a raw dataset to a deployable, scalable API. The system includes automated training, hyperparameter tuning, experiment tracking, and containerized deployment.
- Modular Codebase: Clean separation of concerns (data loading, preprocessing, training, tuning, deployment).
- Robust Preprocessing: Custom feature engineering and scikit-learn pipelines for data transformation.
- Experiment Tracking: Integration with MLflow to log parameters, metrics, and model artifacts.
- Hyperparameter Tuning: Automated optimization using RandomizedSearchCV.
- REST API: A high-performance FastAPI application for real-time predictions.
- Containerization: Fully Dockerized application for consistent deployment.
- CI/CD: Automated build and push workflows using GitHub Actions.
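The preprocessing and tuning features listed above can be illustrated with a minimal sketch. It is not the code in src/preprocessor.py or src/tune.py: the synthetic data, the column layout, the RandomForestClassifier estimator, and the search space are all assumptions chosen only to show how a scikit-learn pipeline and RandomizedSearchCV fit together.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): two numeric columns and one
# categorical column encoded as integer codes. Not the real dataset.
rng = np.random.default_rng(42)
n_samples = 150
X = np.column_stack([
    rng.normal(size=n_samples),
    rng.normal(size=n_samples),
    rng.integers(0, 3, size=n_samples),
])
y = rng.choice(["Graduate", "Dropout", "Enrolled"], size=n_samples)

# Scale the numeric columns and one-hot encode the categorical column.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
])

# Bundling preprocessing with the model means each cross-validation
# fold fits the transformers on its own training split only.
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(random_state=42)),  # assumed estimator
])

# Randomly sample 4 of the 9 possible parameter combinations.
search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "model__n_estimators": [25, 50, 100],
        "model__max_depth": [3, 5, None],
    },
    n_iter=4,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the labels here are random, the reported score carries no meaning; the sketch only shows the wiring. The `model__` prefix routes each parameter to the pipeline step named `model`.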
student_risk_predictor/
├── .github/
│ └── workflows/
│     └── ci-cd.yml # GitHub Actions workflow for CI/CD
├── app/ # FastAPI Application
│ ├── __init__.py
│ ├── main.py # API server logic
│ └── schemas.py # Pydantic models for data validation
├── artifacts/ # Generated files (models, encoders, metrics)
│ └── (Populated automatically by scripts)
├── data/ # Raw Data
│ ├── train.csv # Training dataset
│ └── test.csv # Test dataset (optional)
├── mlruns/ # MLflow tracking data (auto-generated)
├── notebooks/ # Jupyter Notebooks
│ └── 1-Data-Exploration.ipynb
├── src/ # Core ML Source Code
│ ├── __init__.py
│ ├── data_loader.py # Data loading and splitting logic
│ ├── preprocessor.py # Preprocessing pipeline definition
│ ├── train.py # Model training and selection script
│ ├── tune.py # Hyperparameter tuning script
│ └── utils.py # Helper functions
├── .gitignore
├── Dockerfile # Docker image configuration
├── params.yaml # Configuration file for parameters
├── requirements.txt # Python dependencies
└── README.md # Project documentation
- Python 3.8+
- Git
- Docker (optional for local dev, required for containerization)
- Clone the repository:

      git clone https://github.com/yourusername/student-risk-predictor.git
      cd student-risk-predictor

- Create and activate a virtual environment:

      python -m venv venv
      # Windows
      venv\Scripts\activate
      # Mac/Linux
      source venv/bin/activate

- Install dependencies:

      pip install -r requirements.txt

- Data Setup: Ensure the train.csv file is placed inside the data/ directory.
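A quick way to confirm the data setup step is a small standard-library check like the sketch below. The helper name `check_training_data` is hypothetical and not part of this repository; the default path assumes the script runs from the repository root.

```python
import csv
from pathlib import Path


def check_training_data(path="data/train.csv"):
    """Return (column_names, row_count) for the CSV file at `path`.

    Hypothetical helper for illustration only.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Expected training data at {csv_path}")

    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)  # first row holds the column names
        if header is None:
            raise ValueError(f"{csv_path} is empty")
        row_count = sum(1 for _ in reader)  # remaining rows are records

    return header, row_count
```

Calling `check_training_data()` before training surfaces a missing or empty file early, instead of partway through a pipeline run.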
Follow these steps to reproduce the entire training and deployment process.
Run the Jupyter notebook to understand the dataset distribution and correlations.
# Open the notebook in your editor or Jupyter Lab
notebooks/1-Data-Exploration.ipynb
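As a rough illustration of the kind of distribution check such a notebook might contain, the sketch below computes each class's share of the labels. It is not taken from the notebook: the function name `class_distribution` and the four-item sample are made up for this example.

```python
from collections import Counter


def class_distribution(labels):
    """Return each label's share of the total, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.most_common()}


# Made-up sample; real labels would come from the target column of train.csv.
sample = ["Graduate", "Graduate", "Dropout", "Enrolled"]
print(class_distribution(sample))
# → {'Graduate': 0.5, 'Dropout': 0.25, 'Enrolled': 0.25}
```

A strongly skewed distribution would suggest stratified splitting or class weighting during training.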