The main idea of the project is to implement the simple logic of a scheduled batch data processing pipeline using Google Cloud services (Google Cloud Storage buckets, BigQuery, Cloud Run Jobs, Cloud Scheduler), processing/scheduling tools (Apache Airflow, Apache Spark, Pandas), visualization tools (NodeJS, ReactJS, ChartJS), and deployment tools (Terraform, Docker).
The pipeline works as follows:
- A Java-based webscraper, deployed on Cloud Run and scheduled with Cloud Scheduler, extracts descriptive data about the first 500 cryptocurrencies once an hour. An example of a single record extracted by the scraper: `{"Type":"Coin","Price":20267.15,"Volume":49138747923,"Network":"Own","Time":"14-09-2022 09:35:05","Tag":"BTC","Name":"Bitcoin","MarketCap":388119294728}`
- The collected data is stored as a .json file in the Landing Data Lake (GCS).
- The processing job is triggered as soon as a new .json file is placed in the Data Lake.
- First, the script reads the .json files stored in the Landing Data Lake and, for each cryptocurrency, calculates the price change over the last 7 days (per day) and over the last 24 hours (per hour). In other words, it shows how the price of a specific cryptocurrency changed day by day over the last week and hour by hour during the current day (a sketch of this calculation follows the list).
- Second, the script groups cryptocurrencies by the blockchain network they are deployed on (Ethereum, BNB Smartchain, Own) and calculates the average price of the cryptocurrencies under each network. It then repeats the calculations from the first step on these averages (price differences over the last week and over today's hours); the network-summary sketch after the list covers this and the next step.
- Third, the script takes the dataframe grouped by network from the second step and, for each unique network, counts the number of unique cryptocurrencies deployed on it.
- Finally, the script partitions each dataframe into 4 parts and uploads them to the "Processed" GCS bucket (see the Spark sketch after the list).
- The "Processed" GCS bucket contains the following directories ("dates_coin", "dates_net", "hours_coin", "hours_net", "summarize_net"); each of them is connected to BigQuery as an external data source.
- The visualization application uses the BigQuery REST API to access the data stored in the Processed bucket, so that it can generate the respective graphs (a minimal equivalent query is sketched below).
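
A minimal Pandas sketch of the per-coin price-change calculation described above, assuming the scraped records have been loaded into a single dataframe with the fields shown in the example record (Tag, Price, Time, Network, ...); the function and derived column names are illustrative, not the repository's actual code:

```python
import pandas as pd

def price_changes(records: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    df = records.copy()
    # Timestamps follow the scraper's "dd-MM-yyyy HH:mm:ss" format.
    df["Time"] = pd.to_datetime(df["Time"], format="%d-%m-%Y %H:%M:%S")
    now = df["Time"].max()

    # Per-day price change over the last 7 days: take the last observed price of each day.
    weekly = df[df["Time"] >= now - pd.Timedelta(days=7)]
    daily_price = (
        weekly.assign(Date=weekly["Time"].dt.date)
        .sort_values("Time")
        .groupby(["Tag", "Date"])["Price"].last()
        .reset_index()
    )
    daily_price["Change"] = daily_price.groupby("Tag")["Price"].diff()

    # Per-hour price change over the last 24 hours.
    day = df[df["Time"] >= now - pd.Timedelta(hours=24)]
    hourly_price = (
        day.assign(Hour=day["Time"].dt.floor("h"))
        .sort_values("Time")
        .groupby(["Tag", "Hour"])["Price"].last()
        .reset_index()
    )
    hourly_price["Change"] = hourly_price.groupby("Tag")["Price"].diff()

    return daily_price, hourly_price
```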
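
Similarly, the grouping by network (average price per network plus the count of unique coins per network) could look roughly like the sketch below; again, the helper name and derived column names are assumptions:

```python
import pandas as pd

def network_summary(records: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    # Average price of the coins deployed on each network at each scrape time;
    # this frame can then go through the same price-change logic as above.
    net_prices = (
        records.groupby(["Network", "Time"])["Price"].mean()
        .reset_index()
        .rename(columns={"Price": "AvgPrice"})
    )

    # Number of unique cryptocurrencies deployed on each network.
    net_counts = (
        records.groupby("Network")["Tag"].nunique()
        .reset_index()
        .rename(columns={"Tag": "CoinCount"})
    )
    return net_prices, net_counts
```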
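
For the final step, a Spark-flavoured sketch of repartitioning a dataframe into 4 parts and writing it to one of the Processed bucket directories; the bucket name and output format are assumptions:

```python
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("crypto-processing").getOrCreate()

def write_partitioned(df: DataFrame, directory: str) -> None:
    # Split the dataframe into 4 output files and write them to the
    # corresponding directory of the "Processed" bucket (name is illustrative).
    (
        df.repartition(4)
        .write.mode("overwrite")
        .csv(f"gs://processed-bucket/{directory}", header=True)
    )

# e.g. write_partitioned(daily_price_df, "dates_coin")
```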
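
The visualization application itself is Node-based, but for illustration here is a minimal Python query against one of the BigQuery external tables (using the client library rather than the raw REST API); the dataset, table, and column names are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Query one of the external tables backed by the Processed bucket.
query = """
    SELECT Tag, Date, Price, Change
    FROM `your-project-id.crypto.dates_coin`
    ORDER BY Date
"""
for row in client.query(query).result():
    print(row["Tag"], row["Date"], row["Change"])
```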
- Make sure you have Terraform, Docker, and docker-compose installed on your machine; if not, please install them.
- Store your GCP credentials .json file under the following path: ~/.google/credentials/google_credentials.json (a quick way to verify the credentials is sketched after this list).
- Open visualization_app/docker-compose.yaml and change the projectID field to your project ID. Do the same in airflow/docker-compose.yaml.
- Go to the terraform/ directory and run `terraform apply`; it will deploy all the infrastructure for you automatically.
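
Before running terraform apply, you can optionally sanity-check that the credentials file is picked up; this small Python sketch (not part of the repository) simply lists the project's buckets using the path above:

```python
import os
from google.cloud import storage

# Point the client at the credentials file placed in the step above.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.expanduser(
    "~/.google/credentials/google_credentials.json"
)

# Listing buckets is a cheap way to confirm the credentials and project work.
client = storage.Client()
for bucket in client.list_buckets():
    print(bucket.name)
```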