
GitHub Data Analytics Pipeline

This project provides insights into open-source development trends by analyzing GitHub watch events. Built on GCP with Airflow and BigQuery, it lets stakeholders explore developer engagement and repository popularity as new data arrives each hour.

1. Problem Description

This project analyzes GitHub repository activity using the GitHub Archive dataset. The goal is to extract, process, and analyze watch events to uncover insights into repository trends, user engagement, and activity patterns over time.

Key analytics include:

  • Identifying the most popular repositories
  • Analyzing user engagement patterns
  • Understanding temporal trends in GitHub activities

This pipeline provides actionable insights into open-source project popularity and developer behavior.

2. Architecture

(Architecture diagram)

The architecture follows a modular design, ensuring scalability and maintainability. It integrates cloud services, workflow orchestration, and data processing tools.

3. Cloud Infrastructure

The project is deployed on Google Cloud Platform (GCP) using Infrastructure as Code (IaC) with Terraform.

Cloud Components:

  • Google Cloud Storage (GCS): Acts as a data lake for storing processed GitHub Archive data.
  • Google BigQuery: Serves as the data warehouse for analytics queries.

Infrastructure as Code:

  • Terraform provisions and manages GCP resources.
  • Key resources include:
    • GCS bucket for data lake storage
    • BigQuery dataset for data warehousing

To deploy the infrastructure:

cd terraform
terraform init
terraform plan
terraform apply
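
After terraform apply completes, a quick (hypothetical) check with the Google Cloud client libraries can confirm that the bucket and dataset were provisioned. The resource names below are placeholders, not the project's real values.

# Hypothetical sanity check: confirm the Terraform-managed bucket and dataset exist.
# Resource names are placeholders; substitute the values from your Terraform variables.
from google.cloud import bigquery, storage

GCS_BUCKET = "your-gcs-bucket-name"        # assumed placeholder
BIGQUERY_DATASET = "your_bigquery_dataset"  # assumed; dataset IDs use letters, digits, underscores

storage_client = storage.Client()
bq_client = bigquery.Client()

# Bucket.exists() issues a GET request and returns False on a 404.
print("bucket exists:", storage_client.bucket(GCS_BUCKET).exists())

# get_dataset() raises google.api_core.exceptions.NotFound if the dataset is missing.
dataset = bq_client.get_dataset(BIGQUERY_DATASET)
print("dataset location:", dataset.location)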

4. Data Ingestion - Batch Processing & Workflow Orchestration

Apache Airflow orchestrates the data pipeline with a DAG that performs the following steps:

  1. Download hourly GitHub Archive data (JSON format).
  2. Transform raw data into Parquet format.
  3. Upload processed data to Google Cloud Storage.
  4. Create external tables in BigQuery for analytics.

Key Airflow DAG components (a minimal sketch follows this list):

  • BashOperator: Downloads data.
  • PythonOperator: Transforms data.
  • GCS operators: Handle cloud storage operations.
  • BigQuery operators: Manage data warehouse operations.
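
The sketch below shows how these four steps could be wired together. It is not the project's exact DAG: operator parameters follow recent Airflow and Google provider releases, the bucket and dataset names are placeholders, and transform_to_parquet stands in for the Pandas step described in section 6.

# Minimal sketch of the four-step DAG described above (not the project's exact code).
# Bucket, dataset, and helper names are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator

BUCKET = "your-gcs-bucket-name"
DATASET = "your_bigquery_dataset"
# GitHub Archive publishes one gzipped JSON file per hour (hour is not zero-padded).
URL = "https://data.gharchive.org/{{ ds }}-{{ logical_date.hour }}.json.gz"
LOCAL_JSON = "/tmp/gharchive.json.gz"
LOCAL_PARQUET = "/tmp/gharchive.parquet"


def transform_to_parquet(src: str, dst: str) -> None:
    """Placeholder for the Pandas transformation sketched in section 6."""
    ...


with DAG(
    dag_id="cloud_gharchive_dag",
    start_date=datetime(2025, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    # 1. Download the hourly archive file.
    download = BashOperator(
        task_id="download",
        bash_command=f"curl -sSL {URL} -o {LOCAL_JSON}",
    )

    # 2. Transform raw JSON into Parquet.
    transform = PythonOperator(
        task_id="transform",
        python_callable=transform_to_parquet,
        op_kwargs={"src": LOCAL_JSON, "dst": LOCAL_PARQUET},
    )

    # 3. Upload the Parquet file to the data lake bucket.
    upload = LocalFilesystemToGCSOperator(
        task_id="upload_to_gcs",
        src=LOCAL_PARQUET,
        dst="watch_events/{{ ds }}.parquet",
        bucket=BUCKET,
    )

    # 4. Expose the Parquet files as an external table in BigQuery.
    create_table = BigQueryCreateExternalTableOperator(
        task_id="create_external_table",
        table_resource={
            "tableReference": {
                "projectId": "your-gcp-project-id",
                "datasetId": DATASET,
                "tableId": "watch_events",
            },
            "externalDataConfiguration": {
                "sourceFormat": "PARQUET",
                "sourceUris": [f"gs://{BUCKET}/watch_events/*.parquet"],
            },
        },
    )

    download >> transform >> upload >> create_table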

To run the workflow:

cd airflow
make up

Ensure the env.json file is properly configured before starting Airflow.

5. Data Warehouse

BigQuery is the primary data warehouse:

  • External tables are created over the Parquet files stored in GCS.
  • Tables are optimized for analytical queries on GitHub activity data; an example query is shown below.
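
Once the external table is in place, repository popularity questions reduce to a single query. The snippet below is a hedged sketch using the google-cloud-bigquery client; the project, dataset, table, and column names (your_bigquery_dataset.watch_events, repo_name) are placeholders, not the project's actual identifiers.

# Hypothetical analytics query against the external table; identifiers are assumptions
# and should match whatever the DAG actually creates.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT repo_name, COUNT(*) AS watch_events
    FROM `your-gcp-project-id.your_bigquery_dataset.watch_events`
    GROUP BY repo_name
    ORDER BY watch_events DESC
    LIMIT 10
"""

for row in client.query(sql).result():
    print(f"{row.repo_name}: {row.watch_events}")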

6. Transformations

Data transformations are performed in two stages:

  1. Python/Pandas: Initial ETL processing (a sketch of this step follows the list).

    • Filters the relevant GitHub events (watch events).
    • Converts the data to the columnar Parquet format.
  2. SQL: Advanced transformations and analytics in BigQuery.
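
A minimal sketch of the Python/Pandas step, assuming the hourly archive file is gzipped, newline-delimited JSON and that only a few flattened columns are kept (the exact schema is an assumption):

# Minimal sketch of the Pandas ETL step; the columns kept here are an assumption.
import pandas as pd


def transform_to_parquet(src: str, dst: str) -> None:
    # GitHub Archive files are gzipped, newline-delimited JSON: one event per line.
    events = pd.read_json(src, lines=True, compression="gzip")

    # Keep only watch events.
    watch = events[events["type"] == "WatchEvent"].copy()

    # Flatten the nested actor/repo structs into plain columns.
    watch["actor_login"] = watch["actor"].apply(lambda a: a.get("login"))
    watch["repo_name"] = watch["repo"].apply(lambda r: r.get("name"))

    # Write a columnar file that BigQuery can read as an external table.
    watch[["id", "created_at", "actor_login", "repo_name"]].to_parquet(dst, index=False)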

7. Dashboard

The project includes an interactive dashboard for data visualization.

Key dashboard components:

  • Repository popularity trends over time.
  • Top trending repositories by watch events.
  • User engagement metrics and patterns.

Streamlit Dashboard:

The dashboard is built with Streamlit and can be accessed locally at http://localhost:8501; see demo 1 and demo 2 for example views.
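
A minimal page might look like the sketch below; it assumes the external table and column names used in the earlier examples and queries BigQuery directly with the google-cloud-bigquery client.

# Hypothetical Streamlit page; table and column names are assumptions.
import pandas as pd
import streamlit as st
from google.cloud import bigquery

st.title("GitHub Watch Event Analytics")


@st.cache_data(ttl=3600)
def top_repositories(limit: int = 20) -> pd.DataFrame:
    # Placeholder project/dataset/table identifiers; replace with your own.
    client = bigquery.Client()
    sql = f"""
        SELECT repo_name, COUNT(*) AS watch_events
        FROM `your-gcp-project-id.your_bigquery_dataset.watch_events`
        GROUP BY repo_name
        ORDER BY watch_events DESC
        LIMIT {limit}
    """
    return client.query(sql).to_dataframe()


top = top_repositories()
st.subheader("Top repositories by watch events")
st.bar_chart(top.set_index("repo_name")["watch_events"])
st.dataframe(top)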

To start the Streamlit server using Docker:

docker-compose up

Ensure the docker-compose.yml file is properly configured for Streamlit visualization.

8. Reproducibility

Prerequisites:

  • Google Cloud Platform account with billing enabled.
  • Docker and Docker Compose.
  • Terraform.
  • Python 3.9+.

Setup and Deployment:

  1. Clone the repository:

    git clone https://github.com/a920604a/data-engineering-zoomcamp-2025.git
    cd project
  2. Set up GCP credentials:

    • Create a service account with appropriate permissions.
    • Download the JSON key file.
    • Place it in the project directory as service-account.json.
  3. Deploy cloud infrastructure:

    cd terraform
    terraform init
    terraform apply
  4. Start Airflow:

    cd airflow
    make up
  5. Access services:

    • Airflow web UI: http://localhost:8080 (the default port).
    • Streamlit dashboard: http://localhost:8501.

  6. Environment Variables: Airflow requires an env.json file to store sensitive variables. Place this file in the airflow directory.

    Example env.json:

    {
      "GCP_PROJECT": "your-gcp-project-id",
      "GCS_BUCKET": "your-gcs-bucket-name",
      "BIGQUERY_DATASET": "your-bigquery-dataset-name"
    }
  7. Trigger the pipeline:

    • From the Airflow UI, trigger the cloud_gharchive_dag DAG.

Follow these steps to fully reproduce the project and start analyzing GitHub data.

Technologies Used

  • Infrastructure: Terraform, Google Cloud Platform.
  • Workflow Orchestration: Apache Airflow.
  • Storage: Google Cloud Storage.
  • Data Warehouse: Google BigQuery.
  • Data Processing: Python, Pandas, PySpark.
  • Containerization: Docker, Docker Compose.
  • Visualization: Streamlit.

Project Structure

  ├── airflow/ # Airflow DAGs and configs 
  ├── terraform/ # IaC scripts for GCP 
  ├── Visual/ # Dashboard code 
  └── README.md # Project overview
