workshop003_Machine_learning_and_Data_streaming

📋 Overview

This project builds an end-to-end pipeline for predicting happiness scores of countries using machine learning and real-time data streaming. It incorporates:

  • 📊 Exploratory Data Analysis (EDA) for insights and transformations.
  • 🧠 Machine Learning regression to predict happiness scores.
  • 🚀 Kafka-based streaming for real-time processing.
  • 🗄️ Database integration for storing predictions.
  • 🛠️ API for on-demand predictions hosted on a scalable platform.

The pipeline is modular and automated, enabling seamless updates and experimentation with models.


🛠️ Tools and Technologies

| Technology   | Purpose                                        |
|--------------|------------------------------------------------|
| Python       | Development and scripting                      |
| Docker       | Containerization of services                   |
| Kafka        | Real-time data streaming                       |
| Airflow      | Workflow orchestration and pipeline automation |
| PostgreSQL   | Database for storing predictions               |
| Scikit-learn | Machine learning model development             |
| Matplotlib   | Data visualization                             |
| Plotly       | Advanced visualizations                        |
| Pickle       | Model serialization                            |
| Poetry       | Dependency management and packaging            |

🏗️ Project Nodes

[Diagram of the project nodes]

🔍 More details on each phase and each node can be found in this document.


📂 Repository Contents

├── api/                  # The Happiness Prediction API
├── context/
│   ├── Workshop 3 Machine learning and Data streaming.pdf  # Project rubric
├── dags/
│   ├── ...               # Files for the Airflow pipeline
├── docs/
│   ├── api/
│   │   ├── Happiness Prediction API - Methods.postman_collection.json  # API methods
│   ├── ...               # Copy of the documentation
├── kafka/
│   ├── consumer/         # Kafka config for the consumer container
│   ├── producer/         # Kafka config for the producer container
├── models/
│   ├── 00_happiness_score_prediction_model.pkl  # Base model
├── notebooks/
│   ├── ...               # Notebooks for all the project steps
├── sql/
│   ├── queries/
│   ├── api/
├── src/
│   ├── connections/
│   ├── utils/
├── Dockerfile            # Docker configuration for containerizing Airflow
├── docker-compose.yml    # docker-compose configuration
├── pyproject.toml        # Poetry configuration
├── README.md             # Documentation (you're reading it now!)
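As a quick illustration of the serialized model, it can be loaded and queried directly with pickle. The feature names below are hypothetical placeholders; the actual columns and their order come from the training notebooks:

```python
import pickle

import pandas as pd

# Load the base model from the path shown in the tree above.
with open("models/00_happiness_score_prediction_model.pkl", "rb") as f:
    model = pickle.load(f)

# Hypothetical feature set; match the columns used during training.
sample = pd.DataFrame([{
    "gdp_per_capita": 1.34,
    "social_support": 1.49,
    "healthy_life_expectancy": 0.99,
    "freedom": 0.60,
    "generosity": 0.15,
    "corruption_perception": 0.39,
}])

print(model.predict(sample))  # predicted happiness score
```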

🚀 How to Run the Project

1️⃣ Clone the Repository

git clone https://github.com/DCajiao/workshop003_Machine_learning_and_Data_streaming.git
cd workshop003_Machine_learning_and_Data_streaming

2️⃣ Configure Environment Variables

Create a .env file in src/ and add the following:

DBNAME=...
DBUSER=...
DBPASS=...
DBHOST=...
DBPORT=5432
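A minimal sketch of how these variables might be consumed in Python, assuming python-dotenv and psycopg2-binary are available (the project's actual connection helpers live in src/connections/):

```python
import os

import psycopg2
from dotenv import load_dotenv

# Load DBNAME, DBUSER, DBPASS, DBHOST and DBPORT from src/.env.
load_dotenv("src/.env")

conn = psycopg2.connect(
    dbname=os.environ["DBNAME"],
    user=os.environ["DBUSER"],
    password=os.environ["DBPASS"],
    host=os.environ["DBHOST"],
    port=os.environ["DBPORT"],
)

# Quick connectivity check.
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
```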

3️⃣ Start the Services

Run the following commands:

docker-compose up airflow-init
docker-compose up -d

4️⃣ Access Airflow to run the training pipeline

  • Go to http://localhost:8080.
  • Log in with:
    • Username: airflow
    • Password: airflow
  • Activate and trigger the DAG to execute the pipeline.
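Besides the UI, the DAG can also be triggered through Airflow's stable REST API (available when the basic-auth backend is enabled, as in the standard docker-compose setup). A sketch with a hypothetical DAG id; substitute the id shown in the Airflow UI:

```python
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"
DAG_ID = "happiness_training_pipeline"  # hypothetical: use the real id from the UI

# Trigger a new DAG run with the default airflow/airflow credentials.
resp = requests.post(
    f"{AIRFLOW_API}/dags/{DAG_ID}/dagRuns",
    auth=("airflow", "airflow"),
    json={"conf": {}},
)
resp.raise_for_status()
print(resp.json()["state"])  # typically "queued"
```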

5️⃣ Check the consumer and producer container logs

  • Tail them with docker-compose logs -f followed by the service names defined in docker-compose.yml.
  • Watch in real time how the producer streams data to Kafka.
  • Watch in real time how the consumer calls the prediction API and inserts the features, together with the prediction, into the predicted_data table in the database (a sketch of this flow follows below).
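For orientation, here is a minimal sketch of the consumer's loop using kafka-python and requests. The topic name, bootstrap server, API endpoint, and payload fields are hypothetical placeholders; the real configuration lives in kafka/consumer/:

```python
import json

import requests
from kafka import KafkaConsumer

# Hypothetical names: check kafka/consumer/ for the real topic and endpoint.
TOPIC = "happiness_features"
PREDICT_URL = "http://localhost:8000/predict"

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    features = message.value
    # Ask the API for a prediction for this record of features.
    prediction = requests.post(PREDICT_URL, json=features).json()
    # The real consumer then inserts features + prediction into the
    # predicted_data table; this sketch just prints them.
    print(features, prediction)
```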

If you want to learn more about how this project works, you can find a more detailed analysis in the final report.

🌟 Enjoy exploring the automated happiness prediction pipeline! 😊