Tucker459/conair

Conair

Description

For my CS598 Cloud Computing Capstone, we were tasked with creating two data pipelines: one using batch-style processing and another using stream-style processing. Conair is my batch-processing project. I was tasked with architecting a big data processing solution to answer a multitude of questions using airline data from 1988 to 2008.

I decided to use Apache Hive on AWS EMR as my main application. The benefit of Hive for me was HQL (HiveQL), its SQL dialect, which let me abstract away many of the finer details of MapReduce and focus on the core logic.
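To make that abstraction concrete, here is a minimal sketch in Go of the map, shuffle, and reduce steps that a single HQL `GROUP BY` query replaces. It is an illustration only, not code from this project: the `Flight` type, its fields, and the table and column names in the comment are hypothetical, and the real dataset has many more columns.

```go
package main

import (
	"fmt"
	"sort"
)

// Flight is a hypothetical, simplified record used only for this sketch.
type Flight struct {
	Carrier  string
	ArrDelay float64 // arrival delay in minutes
}

// mapPhase emits one (carrier, delay) pair per flight and groups the pairs
// by key, standing in for the map and shuffle steps of a MapReduce job.
func mapPhase(flights []Flight) map[string][]float64 {
	grouped := make(map[string][]float64)
	for _, f := range flights {
		grouped[f.Carrier] = append(grouped[f.Carrier], f.ArrDelay)
	}
	return grouped
}

// reducePhase averages the delays collected for each carrier.
func reducePhase(grouped map[string][]float64) map[string]float64 {
	avg := make(map[string]float64, len(grouped))
	for carrier, delays := range grouped {
		sum := 0.0
		for _, d := range delays {
			sum += d
		}
		avg[carrier] = sum / float64(len(delays))
	}
	return avg
}

// AvgDelayByCarrier does by hand what one HQL statement would express
// (table and column names assumed):
//
//	SELECT carrier, AVG(arr_delay) FROM flights GROUP BY carrier;
func AvgDelayByCarrier(flights []Flight) map[string]float64 {
	return reducePhase(mapPhase(flights))
}

func main() {
	flights := []Flight{
		{"AA", 10}, {"AA", 20}, {"DL", 5}, {"DL", -5}, {"UA", 30},
	}
	avg := AvgDelayByCarrier(flights)

	// Sort the keys so the output order is stable.
	carriers := make([]string, 0, len(avg))
	for c := range avg {
		carriers = append(carriers, c)
	}
	sort.Strings(carriers)
	for _, c := range carriers {
		fmt.Printf("%s %.1f\n", c, avg[c])
	}
}
```

With Hive, the grouping and aggregation logic above collapses into the one-line query shown in the comment, and the engine plans the underlying jobs.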

The main applications and services I used for this project were EMR, Hadoop, Hive, DynamoDB, S3, EMRFS, and AWS Data Pipeline for orchestration. In my report, I describe my entire process, from extracting and cleaning the data to the optimizations I used to speed up queries.

Read The Report

Watch The Video

For data extraction and cleaning, please take a look at some of my handyScripts.

Follow along as I tried different optimizations and configurations in my notes and random commands. I also took screenshots as I tried out different optimizations in Hive when querying the tables.

I also wrote Go CLI scripts to query the DynamoDB database for the group 2 and 3.2 questions.
