1. Programming Languages: Python, PySpark
2. Scripting Language: SQL
3. Databases: PostgreSQL, MongoDB, BigQuery
4. Data Warehouses (DWH): PostgreSQL, Snowflake
5. Orchestrator: Airflow
6. Data Viz: Dash-Plotly
Pipeline A extracts data from Excel files, transforms it, and loads the processed data into MongoDB collections that act as the staging database. The data is then extracted from the staging database, modeled into facts and dimensions, and loaded into a PostgreSQL data warehouse. Airflow handles the orchestration. The data dashboard runs in the cloud as a Dash-Plotly application deployed on a Heroku VM.
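A minimal sketch of how Pipeline A's steps could be wired into an Airflow DAG is shown below; the file path, connection strings, and column names are placeholders for illustration, not the project's actual values.

```python
# Illustrative Airflow DAG for Pipeline A (Excel -> MongoDB staging -> Postgres DWH).
# The task functions are simplified placeholders, not the project's real code.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from pymongo import MongoClient
from sqlalchemy import create_engine


def extract_and_stage():
    """Read an Excel source and stage the raw rows in a MongoDB collection."""
    df = pd.read_excel("/data/source.xlsx")            # assumed source path
    client = MongoClient("mongodb://localhost:27017")  # assumed connection string
    client["staging"]["raw_records"].insert_many(df.to_dict("records"))


def model_and_load():
    """Pull staged rows, model example facts/dimensions, and load them into Postgres."""
    client = MongoClient("mongodb://localhost:27017")
    df = pd.DataFrame(list(client["staging"]["raw_records"].find({}, {"_id": 0})))
    dim_customer = df[["customer_id", "customer_name"]].drop_duplicates()   # example dimension
    fact_sales = df[["customer_id", "order_date", "amount"]]                # example fact
    engine = create_engine("postgresql+psycopg2://user:pwd@localhost/dwh")  # assumed DSN
    dim_customer.to_sql("dim_customer", engine, if_exists="replace", index=False)
    fact_sales.to_sql("fact_sales", engine, if_exists="replace", index=False)


with DAG(
    dag_id="pipeline_a_excel_to_postgres",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="extract_and_stage", python_callable=extract_and_stage)
    load = PythonOperator(task_id="model_and_load", python_callable=model_and_load)
    stage >> load
```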
In Pipeline B, the data is extracted from the Excel source files, transformed, and loaded as blob files into a Google Cloud Storage bucket that serves as the data lake. A Snowflake storage integration is created and serves as the staging area in the Snowflake data warehouse. The data is extracted from the staging area, modeled into facts and dimensions, and then loaded into the Snowflake data warehouse. All of the orchestration is done with Airflow.
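The lake-to-warehouse hop of Pipeline B could look roughly like the sketch below: a transformed file is uploaded to GCS, then a storage integration and external stage are created in Snowflake and the data is copied in. The bucket, table, and credential names are assumptions, not the project's real values.

```python
# Sketch of Pipeline B: GCS data lake -> Snowflake staging via a storage integration.
from google.cloud import storage
import snowflake.connector


def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> None:
    """Upload a transformed file to the GCS bucket that acts as the data lake."""
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)


def copy_into_snowflake() -> None:
    """Create the storage integration and external stage, then COPY data into Snowflake."""
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="***",  # assumed credentials
        warehouse="ETL_WH", database="ANALYTICS", schema="STAGING",
    )
    cur = conn.cursor()
    # One-time setup: integration + external stage pointing at the GCS data lake.
    cur.execute("""
        CREATE STORAGE INTEGRATION IF NOT EXISTS gcs_int
          TYPE = EXTERNAL_STAGE
          STORAGE_PROVIDER = 'GCS'
          ENABLED = TRUE
          STORAGE_ALLOWED_LOCATIONS = ('gcs://my-data-lake/processed/')
    """)
    cur.execute("""
        CREATE STAGE IF NOT EXISTS gcs_stage
          URL = 'gcs://my-data-lake/processed/'
          STORAGE_INTEGRATION = gcs_int
          FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
    # Load the staged files into a table; facts and dimensions are modeled afterwards.
    cur.execute("COPY INTO STAGING.RAW_SALES FROM @gcs_stage")
    conn.close()
```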
Pipeline C extracts data from the Excel source files, transforms it, and loads it as blob files into a Google Cloud Storage bucket that serves as the data lake. The transformed data is then extracted from the GCS bucket, modeled, and loaded into a Google BigQuery data warehouse. Likewise, all of the orchestration is handled by Airflow, and the dashboard is built with Dash-Plotly and hosted on a Heroku virtual machine.
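A sketch of Pipeline C's load step using the google-cloud-bigquery client to copy a blob from the GCS data lake into a BigQuery table; the project, dataset, and blob names below are assumed for illustration.

```python
# Sketch of Pipeline C's load step: GCS data lake -> BigQuery data warehouse.
from google.cloud import bigquery


def load_gcs_to_bigquery(gcs_uri: str, table_id: str) -> None:
    """Load a CSV blob from GCS into a BigQuery table (schema auto-detected)."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
    load_job.result()  # block until the load job finishes


if __name__ == "__main__":
    load_gcs_to_bigquery(
        "gs://my-data-lake/processed/sales.csv",  # assumed blob path
        "my-project.analytics_dwh.fact_sales",    # assumed project.dataset.table
    )
```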
Data validation tasks are handled with great_expectations.
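As an illustration, a minimal check could use the classic Pandas dataset API from older great_expectations releases; the project's actual expectation suites and checkpoints may be configured differently.

```python
# Minimal great_expectations sketch using the classic Pandas dataset API (GE 0.x).
import great_expectations as ge
import pandas as pd

# Hypothetical sample of staged data; the real pipelines validate the staged files.
df = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [10.0, 25.5, 7.2]})

gdf = ge.from_pandas(df)
gdf.expect_column_values_to_not_be_null("customer_id")
gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)

results = gdf.validate()  # summarizes which expectations passed or failed
print(results)
```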