Student Performance ETL Pipeline

This project implements an ETL pipeline to process and load student performance data into Snowflake using Python, Pandas, and SQLAlchemy. The project consists of a Flask web service to expose the dataset, followed by the extraction, transformation, and loading of data into Snowflake for further analysis.

Project Overview:

Source Data: The student performance dataset contains student scores across math, reading, and writing, along with demographic information such as gender, race/ethnicity, and parental education.
ETL Pipeline: The ETL pipeline is built using a combination of Flask for serving the CSV data, Pandas for transformation, and SQLAlchemy for loading data into Snowflake.
Data Storage: The transformed data is stored in a Snowflake database for further analysis and querying.

Features:

Flask API:

Serves the CSV file via an endpoint.
Provides both CSV and JSON formats for data retrieval.

Data Transformation:

Renames columns for better readability and standardization (e.g., race/ethnicity to RACE_ETHNICITY).
Uses Pandas to perform the transformation.

Loading to Snowflake:

Data is loaded into a Snowflake table using SQLAlchemy and Snowflake Connector for Python.
Automatically verifies the data insertion by querying the Snowflake table.

Data Transformation & Loading steps (ETL Process)

Extraction:

Data is fetched from the Flask API (/csvdata).

Transformation:

Renames columns to uppercase and standardizes field names.
Inspects data types for consistency.

Loading:

Uses SQLAlchemy to connect to Snowflake.
Loads the transformed data into a Snowflake table.
Verifies data loading by running a simple SELECT query.

How to Run:

Ensure the Flask API is running.
Run the Load.py script to execute the ETL process.

SQL Scripts for Data Staging and Transformation

Files:
- staging_data.sql: SQL scripts for staging raw data in Snowflake.
- transformation.sql: SQL scripts for transforming staged data into the final table structure.
- load.sql: SQL scripts for loading transformed data into the final table.

Technologies Used:

Flask: Web framework used to serve CSV and JSON data.
Pandas: For data transformation and preparation.
Snowflake: Cloud-based data warehousing solution where the data is stored and queried.
SQLAlchemy: Used to connect to and interact with Snowflake.
Snowflake Connector: For loading data into Snowflake directly from a Pandas DataFrame.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
Load.py		Load.py
README.md		README.md
app.py		app.py
load.sql		load.sql
staging_data.sql		staging_data.sql
transformation.sql		transformation.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Student Performance ETL Pipeline

Project Overview:

Features:

Flask API:

Data Transformation:

Loading to Snowflake:

Data Transformation & Loading steps (ETL Process)

Extraction:

Transformation:

Loading:

How to Run:

SQL Scripts for Data Staging and Transformation

Technologies Used:

About

Releases

Packages

Languages

miladkhanlou/Student-Performance-ETL-Pipeline-snowflake

Folders and files

Latest commit

History

Repository files navigation

Student Performance ETL Pipeline

Project Overview:

Features:

Flask API:

Data Transformation:

Loading to Snowflake:

Data Transformation & Loading steps (ETL Process)

Extraction:

Transformation:

Loading:

How to Run:

SQL Scripts for Data Staging and Transformation

Technologies Used:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages