
GCP Data Engineering Mini Projects

This repository contains a collection of mini projects focused on data engineering using Google Cloud Platform (GCP) services. Each project demonstrates different aspects of data engineering, including data storage, processing, and analysis using various GCP tools and services.

Prerequisites

  • Google Cloud Platform account
  • Python 3.7 or higher
  • Virtual environment setup

Setting Up Google Cloud Credentials

  1. Create a Google Cloud Project:

    • In the Google Cloud Console, create a new project (or select an existing one) and note its project ID.
  2. Enable APIs:

    • Enable the necessary APIs for your project, such as BigQuery, Cloud Storage, and Dataproc.
  3. Create Service Account:

    • Navigate to the "IAM & Admin" section and select "Service Accounts".
    • Create a new service account and grant it the necessary roles (e.g., BigQuery Admin, Storage Admin).
  4. Generate Key File:

    • After creating the service account, generate a new JSON key file and download it.
  5. Set Up Credentials:

    • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to the downloaded JSON key file:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account-file.json"
```
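
To verify that the credentials are picked up, a quick sanity check with the BigQuery client library helps. This is a minimal sketch and assumes the google-cloud-bigquery package is installed:

```python
# The client library reads GOOGLE_APPLICATION_CREDENTIALS automatically
# via Application Default Credentials.
from google.cloud import bigquery

client = bigquery.Client()  # raises if credentials are missing or invalid
print(f"Authenticated against project: {client.project}")
```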

Clone the repository:

```bash
git clone https://github.com/your-username/gcp-dataengineering-mini-projects.git
cd gcp-dataengineering-mini-projects
```

Projects

🛠 Project 1/50 | Tools: Cloud Functions, GCS, BigQuery

This project deploys a Google Cloud Function that triggers when a file is uploaded to a GCS bucket and loads it into a BigQuery table automatically.

Features:

  • Automatic CSV file processing on GCS upload
  • Schema auto-detection in BigQuery
  • Event-driven serverless architecture
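
A minimal sketch of such a function, assuming a 1st-gen Python Cloud Function with a GCS finalize trigger (the table ID is a placeholder, not the repo's actual configuration):

```python
from google.cloud import bigquery

TABLE_ID = "your-project.your_dataset.uploads"  # hypothetical destination table

def gcs_to_bigquery(event, context):
    """Triggered on GCS object finalize; loads the uploaded CSV into BigQuery."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,        # schema auto-detection, as in this project
        skip_leading_rows=1,    # assumes the CSV has a header row
    )
    load_job = bigquery.Client().load_table_from_uri(
        uri, TABLE_ID, job_config=job_config
    )
    load_job.result()  # block until the load job finishes
    print(f"Loaded {uri} into {TABLE_ID}")
```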

🛠 Project 2/50 | Tools: Pub/Sub, Dataflow (Apache Beam), BigQuery

Real-time data processing pipeline using Pub/Sub for message ingestion, Dataflow for stream processing, and BigQuery for analytics.

Features:

  • Pub/Sub message streaming
  • Apache Beam pipeline processing
  • Real-time data insertion to BigQuery
  • Support for both local testing and cloud deployment
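
The shape of such a Beam pipeline, sketched with placeholder resource names and an assumed JSON message format:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder resources; the schema fields are assumed, not the repo's actual ones.
SUBSCRIPTION = "projects/your-project/subscriptions/your-sub"
TABLE = "your-project:your_dataset.events"

def run():
    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
            | "ParseJSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                TABLE,
                schema="event_id:STRING,payload:STRING,ts:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

if __name__ == "__main__":
    run()
```

The same pipeline runs locally with the DirectRunner and on Dataflow with --runner=DataflowRunner, which matches the local-testing/cloud-deployment split above.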

🛠 Project 3/50 | Tools: Cloud Functions, Cloud Scheduler, BigQuery, Open-Meteo API

Automated weather data collection system that fetches live weather data every 15 minutes and stores it in BigQuery for analysis.

Features:

  • Scheduled data collection (15-minute intervals)
  • Open-Meteo API integration (no API key required)
  • Historical weather data accumulation
  • UTC timestamp handling
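
A minimal sketch of the collector as an HTTP-triggered Cloud Function invoked by Cloud Scheduler (the coordinates, table ID, and selected fields are placeholders):

```python
from datetime import datetime, timezone

import requests
from google.cloud import bigquery

# Open-Meteo needs no API key; latitude/longitude here are illustrative.
URL = ("https://api.open-meteo.com/v1/forecast"
       "?latitude=52.52&longitude=13.41&current_weather=true")
TABLE_ID = "your-project.your_dataset.weather"  # hypothetical table

def fetch_weather(request):
    """Entry point hit by Cloud Scheduler every 15 minutes."""
    current = requests.get(URL, timeout=10).json()["current_weather"]
    row = {
        "temperature": current["temperature"],
        "windspeed": current["windspeed"],
        "fetched_at": datetime.now(timezone.utc).isoformat(),  # UTC timestamps
    }
    errors = bigquery.Client().insert_rows_json(TABLE_ID, [row])
    return (f"insert errors: {errors}", 500) if errors else ("ok", 200)
```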

🛠 Project 4/50 | Tools: Dataproc, PySpark, BigQuery External Tables, GCS

Process historical order data using PySpark on Dataproc, write partitioned Parquet files to GCS, and create BigQuery External tables for reporting.

Features:

  • PySpark data processing on Dataproc
  • Partitioned Parquet file output
  • BigQuery External table integration
  • Automated cluster management
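
The core PySpark step might look like this sketch (bucket paths and column names are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-batch").getOrCreate()

# Hypothetical input path and timestamp column.
orders = spark.read.csv("gs://your-bucket/raw/orders/*.csv",
                        header=True, inferSchema=True)

(
    orders
    .withColumn("order_date", F.to_date("order_timestamp"))
    .write
    .mode("overwrite")
    .partitionBy("order_date")   # one Parquet directory per day
    .parquet("gs://your-bucket/curated/orders/")
)
```

The partitioned output can then be registered in BigQuery as an external table over Parquet, optionally with hive-partition detection so that order_date becomes a queryable column.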

🛠 Project 5/50 | Tools: Cloud Functions, BigQuery, Cloud Scheduler, PyYAML

Automated data quality validation framework that runs configurable checks on BigQuery tables with scheduled monitoring.

Features:

  • Configurable data quality checks (null, duplicate, range validation)
  • Partition-based quality assessment
  • Automated scheduling with Cloud Scheduler
  • Quality metrics reporting and tracking
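
A simplified sketch of a YAML-driven null check (the config schema and names here are illustrative, not the framework's actual format, which would normally be loaded from a file):

```python
import yaml
from google.cloud import bigquery

CONFIG = """
checks:
  - table: your-project.your_dataset.orders
    column: order_id
    type: not_null
"""

def run_checks():
    client = bigquery.Client()
    for check in yaml.safe_load(CONFIG)["checks"]:
        if check["type"] == "not_null":
            sql = (f"SELECT COUNT(*) AS bad FROM `{check['table']}` "
                   f"WHERE {check['column']} IS NULL")
            bad = list(client.query(sql).result())[0].bad
            print(f"{check['table']}.{check['column']}: {bad} null rows")

run_checks()
```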

🛠 Project 6/50 | Tools: Dataproc, Apache Spark, Apache Iceberg, BigQuery, Dataproc Metastore

Complete medallion architecture (Bronze-Silver-Gold) implementation using Dataproc Spark, Apache Iceberg, and BigQuery external tables.

Features:

  • Three-tier medallion architecture (Bronze/Silver/Gold layers)
  • Apache Iceberg ACID transactions and time travel
  • Schema evolution capabilities
  • BigQuery external table integration
  • Dataproc Metastore for metadata management
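
A condensed sketch of the Iceberg-on-Spark setup, shown with a Hadoop catalog and placeholder paths for brevity (the project itself uses Dataproc Metastore, and the Iceberg Spark runtime jar must be available on the cluster):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("medallion")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "gs://your-bucket/iceberg/")
    .getOrCreate()
)

# Bronze: ingest raw data as-is.
raw = spark.read.json("gs://your-bucket/landing/*.json")
raw.writeTo("lake.bronze.events").createOrReplace()

# Silver: cleaned and deduplicated, with Iceberg providing ACID guarantees.
spark.sql("""
    CREATE OR REPLACE TABLE lake.silver.events AS
    SELECT DISTINCT * FROM lake.bronze.events WHERE event_id IS NOT NULL
""")
```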

🛠 Project 7/50 | Tools: Pub/Sub, Dataflow, BigQuery, Apache Iceberg, Dataproc Serverless, GCS

Advanced Lambda Architecture implementation with both real-time (speed layer) and batch processing layers for comprehensive data processing.

Features:

  • Speed Layer: Pub/Sub → Dataflow → BigQuery (real-time analytics)
  • Batch Layer: Pub/Sub → Dataflow → GCS → Spark → Iceberg (comprehensive analysis)
  • Serving Layer: BigQuery for real-time queries, Iceberg for complex analytics
  • Apache Iceberg for ACID transactions and time travel
  • Dataproc Serverless for batch processing
  • End-to-end monitoring and observability
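
A sketch of how a single Beam pipeline can feed both layers by branching one Pub/Sub read (resource names are placeholders, and the repo's actual pipelines may be structured differently):

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

SUBSCRIPTION = "projects/your-project/subscriptions/events-sub"  # placeholder

def run():
    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        raw = (p
               | "Read" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
               | "Decode" >> beam.Map(lambda b: b.decode("utf-8")))

        # Speed layer: streaming inserts for real-time analytics.
        (raw
         | "Wrap" >> beam.Map(lambda s: {"raw": s})
         | "ToBigQuery" >> beam.io.WriteToBigQuery(
               "your-project:your_dataset.events_rt", schema="raw:STRING"))

        # Batch layer: land raw messages on GCS in 5-minute windows for the
        # downstream Spark/Iceberg job on Dataproc Serverless.
        (raw
         | "Window" >> beam.WindowInto(beam.window.FixedWindows(300))
         | "ToGCS" >> fileio.WriteToFiles(path="gs://your-bucket/landing/",
                                          sink=fileio.TextSink()))

if __name__ == "__main__":
    run()
```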

Architecture Progression

This collection demonstrates a progression from simple to complex data engineering architectures:

  1. Event-driven (Project 1) → Streaming (Project 2) → Scheduled (Project 3)
  2. Batch Processing (Project 4) → Data Quality (Project 5)
  3. Medallion Architecture (Project 6) → Lambda Architecture (Project 7)

Each project builds upon concepts from previous ones while introducing new GCP services and architectural patterns.

Follow for Updates

I plan to grow this repository to 50 projects, adding new ones every week. Follow my LinkedIn for updates.
