Airline Data Ingestion & Processing on AWS

Project Overview

The Airline Data Ingestion & Processing Project is a cloud-based ETL pipeline built on AWS to process airline flight data. The project automates data ingestion, transformation, and storage using AWS S3, Glue, Redshift, EventBridge, Step Functions, and SNS. The processed data is stored in Amazon Redshift for further querying and analysis.

Project Architecture

The architecture consists of:

Data Source: The airline client uploads flight data (airport_dim and flight_raw) in CSV format to an S3 bucket, and flight_raw is copied to Redshift.
Glue Crawler: Crawls the airport_dim data and the processed flight_raw data from Redshift and stores their table metadata in the Glue Data Catalog.
EventBridge Rule: Amazon EventBridge detects new file uploads and triggers a Step Functions workflow.
Step Functions: Automates the ETL process by orchestrating Glue Crawler and Glue Jobs.
AWS Glue ETL: Processes and transforms raw flight data.
Amazon Redshift: Stores the cleaned and processed flight data for querying.
Amazon SNS: Sends notifications about job status.

Project Execution on AWS

Setting Up EventBridge

EventBridge monitors the S3 bucket for new CSV files and triggers Step Functions.

EventBridge Rule Configuration (event_bridge_rule.json):

{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["airlines-data-ingestion-project"]
    },
    "object": {
      "key": [{
        "suffix": ".csv"
      }]
    }
  }
}
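
The pattern above matches on four fields: the event source, the detail type, the bucket name, and a `.csv` key suffix. The following is a minimal local sketch of that matching logic, useful for checking which object keys would trigger the workflow. It is not the EventBridge engine itself; `matches_rule` and `sample_event` are hypothetical helper names, and the event shape is reduced to the fields the pattern inspects. Note that S3 only delivers these events if EventBridge notifications are enabled on the bucket.

```python
# Local approximation of the event pattern in event_bridge_rule.json.
# Helper names are illustrative; only the fields used by the pattern are modelled.
RULE_BUCKET = "airlines-data-ingestion-project"
RULE_SUFFIX = ".csv"


def matches_rule(event: dict) -> bool:
    """Return True if the event satisfies every condition in the pattern."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.s3"
        and event.get("detail-type") == "Object Created"
        and detail.get("bucket", {}).get("name") == RULE_BUCKET
        # Suffix matching is case-sensitive, so "FLIGHTS.CSV" does not match.
        and detail.get("object", {}).get("key", "").endswith(RULE_SUFFIX)
    )


def sample_event(key: str, bucket: str = RULE_BUCKET) -> dict:
    """Build a trimmed-down S3 'Object Created' event for testing."""
    return {
        "source": "aws.s3",
        "detail-type": "Object Created",
        "detail": {"bucket": {"name": bucket}, "object": {"key": key}},
    }
```

For example, `matches_rule(sample_event("daily/flights.csv"))` is true, while a `.json` key or a different bucket name is rejected.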

Setting Up AWS Glue ETL Job

The Glue job (glue_etl_job.py) extracts flight data from S3, enriches it with airport details, and loads it into Redshift.

Key steps:

• Read raw flight data from S3.

• Join with airport codes for enrichment.

• Store processed data in Amazon Redshift.
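
The enrichment step can be sketched in plain Python to show the join logic on its own. This is not the contents of glue_etl_job.py, which would normally use Spark DataFrames or Glue DynamicFrames; it only illustrates joining each flight to the airport data twice, once for departure and once for arrival. The input column names (`Carrier`, `OriginAirportID`, `DestAirportID`, `DepDelay`, `ArrDelay`, `airport_id`, `city`, `state`, `name`) and the sample rows are assumptions, while the output columns follow the Redshift table defined in this document.

```python
import csv
import io

# Hypothetical sample inputs; real flight_raw / airport_dim columns may differ.
AIRPORTS_CSV = """airport_id,city,state,name
10397,Atlanta,GA,Hartsfield-Jackson Atlanta International
12478,New York,NY,John F. Kennedy International
"""

FLIGHTS_CSV = """Carrier,OriginAirportID,DestAirportID,DepDelay,ArrDelay
DL,10397,12478,12,5
AA,12478,99999,0,-3
"""


def enrich_flights(flights_csv: str, airports_csv: str) -> list:
    """Join flights to airports on both the departure and arrival airport IDs."""
    airports = {
        row["airport_id"]: row
        for row in csv.DictReader(io.StringIO(airports_csv))
    }
    enriched = []
    for flight in csv.DictReader(io.StringIO(flights_csv)):
        dep = airports.get(flight["OriginAirportID"])
        arr = airports.get(flight["DestAirportID"])
        if dep is None or arr is None:
            # Inner-join behaviour: skip flights whose airports are unknown.
            continue
        enriched.append({
            "carrier": flight["Carrier"],
            "dep_airport": dep["name"],
            "arr_airport": arr["name"],
            "dep_city": dep["city"],
            "arr_city": arr["city"],
            "dep_state": dep["state"],
            "arr_state": arr["state"],
            "dep_delay": int(flight["DepDelay"]),
            "arr_delay": int(flight["ArrDelay"]),
        })
    return enriched
```

Dropping unmatched flights is one possible design choice; a left join that keeps them with empty airport fields would be equally reasonable, depending on how the analysis treats unknown airports.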

Step Functions Orchestration

Step Functions automate the execution of Glue Crawlers and Glue Jobs.

Configuration (step_function_config.json):

• Starts Glue Crawlers to catalog raw data.

• Checks for crawler completion.

• Runs Glue ETL job to transform data.

• Sends success/failure notifications via SNS.
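
The four steps above map naturally onto an Amazon States Language definition. The sketch below builds one possible definition as a Python dictionary; it is not the project's step_function_config.json, and the state names, the 30-second wait, and the function name `build_state_machine` are assumptions. The crawler, job, and topic names are passed in rather than guessed.

```python
import json


def build_state_machine(crawler_name: str, job_name: str, topic_arn: str) -> dict:
    """Return an illustrative state machine: crawl, poll, run the job, notify."""
    return {
        "Comment": "Illustrative ETL orchestration sketch",
        "StartAt": "StartCrawler",
        "States": {
            "StartCrawler": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
                "Parameters": {"Name": crawler_name},
                "Next": "GetCrawlerStatus",
            },
            "GetCrawlerStatus": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:getCrawler",
                "Parameters": {"Name": crawler_name},
                "Next": "IsCrawlerDone",
            },
            "IsCrawlerDone": {
                "Type": "Choice",
                # A crawler reports READY once it is no longer running.
                "Choices": [{
                    "Variable": "$.Crawler.State",
                    "StringEquals": "READY",
                    "Next": "RunGlueJob",
                }],
                "Default": "WaitForCrawler",
            },
            "WaitForCrawler": {
                "Type": "Wait",
                "Seconds": 30,  # assumed polling interval
                "Next": "GetCrawlerStatus",
            },
            "RunGlueJob": {
                "Type": "Task",
                # The .sync suffix makes the state wait for the job run to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "Next": "NotifySuccess",
            },
            "NotifySuccess": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {"TopicArn": topic_arn, "Message": "Glue job succeeded"},
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {"TopicArn": topic_arn, "Message": "Glue job failed"},
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder names; substitute the real crawler, job, and topic.
    definition = build_state_machine("my-crawler", "my-glue-job", "arn:aws:sns:...")
    print(json.dumps(definition, indent=2))
```

Polling with a Wait state is used here because starting a crawler returns immediately; the Glue job, by contrast, can rely on the `.sync` integration to block until the run completes.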

Amazon Redshift for Processed Data Storage

Redshift tables store the cleaned flight data for analysis.

Schema & Table Creation (redshift_create_table_commands.txt):

CREATE TABLE airlines.daily_flights_processed (
    carrier VARCHAR(10),
    dep_airport VARCHAR(200),
    arr_airport VARCHAR(200),
    dep_city VARCHAR(100),
    arr_city VARCHAR(100),
    dep_state VARCHAR(100),
    arr_state VARCHAR(100),
    dep_delay BIGINT,
    arr_delay BIGINT
);

Querying processed data:

SELECT * FROM airlines.daily_flights_processed LIMIT 5;
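
To experiment with the table shape and an aggregate query without a Redshift cluster, the schema can be recreated locally. The sketch below uses Python's built-in `sqlite3` as a stand-in, which is an assumption made purely for local testing: SQLite accepts this DDL but does not behave like Redshift in types, performance, or SQL dialect. The inserted rows are invented sample values, not project data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database named "airlines" so the
# schema-qualified table name from the DDL above works unchanged.
conn.execute("ATTACH DATABASE ':memory:' AS airlines")
conn.execute("""
    CREATE TABLE airlines.daily_flights_processed (
        carrier VARCHAR(10),
        dep_airport VARCHAR(200),
        arr_airport VARCHAR(200),
        dep_city VARCHAR(100),
        arr_city VARCHAR(100),
        dep_state VARCHAR(100),
        arr_state VARCHAR(100),
        dep_delay BIGINT,
        arr_delay BIGINT
    )
""")

# Invented sample rows for illustration only.
sample_rows = [
    ("DL", "Airport A", "Airport B", "Atlanta", "New York", "GA", "NY", 12, 5),
    ("DL", "Airport B", "Airport A", "New York", "Atlanta", "NY", "GA", 4, -1),
    ("AA", "Airport A", "Airport B", "Atlanta", "New York", "GA", "NY", 0, -3),
]
conn.executemany(
    "INSERT INTO airlines.daily_flights_processed VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    sample_rows,
)

# Average departure and arrival delay per carrier.
avg_delays = conn.execute("""
    SELECT carrier, AVG(dep_delay) AS avg_dep_delay, AVG(arr_delay) AS avg_arr_delay
    FROM airlines.daily_flights_processed
    GROUP BY carrier
    ORDER BY carrier
""").fetchall()
```

The same `GROUP BY` query should run against the Redshift table as written, since it uses only standard SQL.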

S3 Data Storage

Flight data and airport codes are stored in S3 before processing.

SNS Notifications for Job Status

Amazon SNS sends notifications on Glue job success/failure.
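
If notifications are published from Python code rather than directly from the state machine, a sketch along the following lines could build and send the message. The function names, the message wording, and the `detail` parameter are assumptions; `sns_client` is expected to be a boto3 SNS client, whose `publish` call accepts `TopicArn`, `Subject`, and `Message`.

```python
def build_notification(topic_arn: str, job_name: str, succeeded: bool, detail: str = "") -> dict:
    """Build keyword arguments for an SNS publish call (illustrative wording)."""
    status = "SUCCEEDED" if succeeded else "FAILED"
    message = f"Glue job {job_name} finished with status {status}."
    if detail:
        message += f" Details: {detail}"
    return {
        "TopicArn": topic_arn,
        # SNS restricts subject length; truncating to 99 characters is a
        # conservative choice to stay under the limit.
        "Subject": f"[{status}] {job_name}"[:99],
        "Message": message,
    }


def send_notification(sns_client, topic_arn: str, job_name: str, succeeded: bool, detail: str = ""):
    """Publish the job status using a client such as boto3.client("sns")."""
    return sns_client.publish(**build_notification(topic_arn, job_name, succeeded, detail))
```

Keeping message construction separate from publishing makes the wording easy to test without AWS credentials.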

How to Run the Project

1. Upload new flight data to the S3 bucket (airlines-data-ingestion-project).
2. EventBridge triggers the Step Functions workflow.
3. The Glue ETL job processes the data.
4. Processed data is stored in Amazon Redshift.
5. Query Redshift for flight insights.
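
The upload that starts the pipeline can be scripted. The sketch below is a hedged example: `upload_flight_file` is a hypothetical helper, and the key layout is up to the caller. It checks the `.csv` suffix first, because the EventBridge rule in this document only matches keys ending in `.csv`, then calls the boto3 S3 client's `upload_file` method.

```python
def upload_flight_file(s3_client, local_path: str, key: str,
                       bucket: str = "airlines-data-ingestion-project") -> str:
    """Upload a CSV file so that the EventBridge rule can pick it up.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    if not key.endswith(".csv"):
        # Other suffixes would upload fine but never trigger the workflow.
        raise ValueError(f"Key {key!r} does not end in .csv; the rule would not match it")
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

A call might look like `upload_flight_file(boto3.client("s3"), "flights.csv", "daily/flights.csv")`, where both paths are placeholders.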

Conclusion

This project automates airline data ingestion, transformation, and storage using AWS services, making it scalable and efficient. 🚀
