This project builds an end-to-end pipeline for predicting happiness scores of countries using machine learning and real-time data streaming. It incorporates:
- 📊 Exploratory Data Analysis (EDA) for insights and transformations.
- 🧠 Machine Learning regression to predict happiness scores.
- 🚀 Kafka-based streaming for real-time processing.
- 🗄️ Database integration for storing predictions.
- 🛠️ API for on-demand predictions hosted on a scalable platform.
The pipeline is modular and automated, enabling seamless updates and experimentation with models.
🔍 More details on each phase and each node of the pipeline can be found in this document.
The repository is organized as follows:

```
├── api/
│   ├── ...                # All of the Happiness Prediction API
├── context/
│   ├── Workshop 3 Machine learning and Data streaming.pdf   # Project rubric
├── dags/
│   ├── ...                # Files for the Airflow pipeline
├── docs/
│   ├── api/
│   │   ├── Happiness Prediction API - Methods.postman_collection.json   # API methods
│   ├── ...                # Copy of the documentation
├── kafka/
│   ├── consumer/          # Kafka config for the consumer container
│   ├── producer/          # Kafka config for the producer container
├── models/
│   ├── 00_happiness_score_prediction_model.pkl   # Base model
├── notebooks/
│   ├── ...                # Notebooks for all the project steps
├── sql/
│   ├── queries/
│   ├── api/
├── src/
│   ├── connections/
│   ├── utils/
├── Dockerfile             # Docker configuration for containerizing Airflow
├── docker-compose.yml     # docker-compose configuration
├── pyproject.toml         # Poetry configuration
└── README.md              # Documentation (you're reading it now!)
```
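The Postman collection in `docs/api/` documents the API's methods. Purely as an illustration (the endpoint path, port, and feature names below are assumptions, not taken from the collection), a prediction request could look like this:

```python
import requests

# Hypothetical endpoint and payload -- check the Postman collection in
# docs/api/ for the real method names, URL, and feature schema.
API_URL = "http://localhost:8000/predict"

features = {
    "gdp_per_capita": 1.34,
    "social_support": 1.59,
    "healthy_life_expectancy": 0.99,
    "freedom_to_make_life_choices": 0.59,
    "generosity": 0.15,
    "perceptions_of_corruption": 0.39,
}

response = requests.post(API_URL, json=features, timeout=10)
response.raise_for_status()
print(response.json())  # e.g. {"happiness_score_prediction": 7.02}
```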
Clone the repository:

```bash
git clone https://github.com/DCajiao/workshop003_Machine_learning_and_Data_streaming.git
cd workshop003_Machine_learning_and_Data_streaming
```
Create a `.env` file in `src/` and add the following:

```
DBNAME=...
DBUSER=...
DBPASS=...
DBHOST=...
DBPORT=5432
```
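These are the PostgreSQL credentials used to store the predictions (5432 is PostgreSQL's default port). As a minimal sketch of how they might be consumed, assuming the helpers in `src/connections/` read them from the environment (the function name and the `python-dotenv` usage below are illustrative assumptions, not the project's actual API):

```python
import os

import psycopg2
from dotenv import load_dotenv  # pip install python-dotenv psycopg2-binary

load_dotenv("src/.env")  # read DBNAME, DBUSER, ... into the environment


def get_connection():
    """Open a PostgreSQL connection from the .env credentials (illustrative)."""
    return psycopg2.connect(
        dbname=os.getenv("DBNAME"),
        user=os.getenv("DBUSER"),
        password=os.getenv("DBPASS"),
        host=os.getenv("DBHOST"),
        port=os.getenv("DBPORT", "5432"),
    )
```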
Run the following commands:

```bash
docker-compose up airflow-init
docker-compose up -d
```
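Here, `airflow-init` runs Airflow's one-time setup (typically the metadata database migrations and the default `airflow`/`airflow` user used below), and `docker-compose up -d` then starts all the services defined in `docker-compose.yml` in detached mode.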
- Go to http://localhost:8080.
- Log in with:
  - Username: `airflow`
  - Password: `airflow`
- Activate and trigger the DAG to execute the pipeline (a sketch of a DAG's general shape follows below).
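For orientation, this is roughly the shape of an Airflow DAG like the ones in `dags/`; the DAG id, task names, and callables below are made up for illustration and are not the project's actual DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Placeholder: load the raw happiness datasets."""


def transform():
    """Placeholder: apply the EDA-driven cleaning and feature selection."""


def stream():
    """Placeholder: hand the prepared rows to the Kafka producer."""


with DAG(
    dag_id="happiness_pipeline_sketch",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,              # triggered manually from the UI
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    stream_task = PythonOperator(task_id="stream", python_callable=stream)

    extract_task >> transform_task >> stream_task
```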
Once the DAG is running you can:
- Watch in real time how data is sent from the producer.
- Watch in real time how the consumer sends requests to the prediction API and inserts the features, together with the prediction, into the `predicted_data` table (a minimal sketch of this loop follows below).
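The gist of that consumer loop, sketched with `kafka-python`. The topic name, API URL, response key, and table schema below are hypothetical; the real configuration lives in `kafka/consumer/` and `sql/`:

```python
import json
import os

import psycopg2
import requests
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker address.
consumer = KafkaConsumer(
    "happiness_features",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

conn = psycopg2.connect(  # credentials from the .env described above
    dbname=os.getenv("DBNAME"),
    user=os.getenv("DBUSER"),
    password=os.getenv("DBPASS"),
    host=os.getenv("DBHOST"),
    port=os.getenv("DBPORT", "5432"),
)

for message in consumer:
    features = message.value
    # Request a prediction for this row from the API.
    resp = requests.post("http://localhost:8000/predict", json=features, timeout=10)
    score = resp.json()["happiness_score_prediction"]  # hypothetical key
    # Store the features together with the prediction; this two-column
    # schema is illustrative -- see sql/ for the real table definition.
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO predicted_data (features, prediction) VALUES (%s, %s)",
            (json.dumps(features), score),
        )
```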
If you want to learn more about how this project works, you can find a more detailed analysis in the final report.