This project provides insights into open-source development trends by analyzing GitHub Watch events. Built with Google Cloud Platform (GCP), Airflow, and BigQuery, it lets stakeholders explore developer engagement and repository popularity from hourly GitHub Archive data.
This project analyzes GitHub repository activity using the GitHub Archive dataset. The goal is to extract, process, and analyze watch events to uncover insights into repository trends, user engagement, and activity patterns over time.
Key analytics include:
- Identifying the most popular repositories
- Analyzing user engagement patterns
- Understanding temporal trends in GitHub activities
This pipeline provides actionable insights into open-source project popularity and developer behavior.
The architecture follows a modular design, ensuring scalability and maintainability. It integrates cloud services, workflow orchestration, and data processing tools.
The project is deployed on Google Cloud Platform (GCP) using Infrastructure as Code (IaC) with Terraform.
- Google Cloud Storage (GCS): Acts as a data lake for storing processed GitHub Archive data.
- Google BigQuery: Serves as the data warehouse for analytics queries.
- Terraform provisions and manages GCP resources.
- Key resources include:
  - GCS bucket for data lake storage
  - BigQuery dataset for data warehousing
To deploy the infrastructure:
cd terraform
terraform init
terraform plan
terraform apply
Apache Airflow orchestrates the data pipeline with a DAG that performs the following steps:
- Download hourly GitHub Archive data (JSON format).
- Transform raw data into Parquet format.
- Upload processed data to Google Cloud Storage.
- Create external tables in BigQuery for analytics.
- BashOperator: Downloads data.
- PythonOperator: Transforms data.
- GCS Operators: Handle cloud storage operations.
- BigQuery Operators: Manage data warehouse operations.
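The download step depends on knowing which hourly file to fetch. The sketch below is a minimal illustration, not the project's actual DAG code: it assumes GH Archive serves one gzipped JSON file per hour at `https://data.gharchive.org/` with an unpadded hour in the file name, and the helper names `hourly_archive_url` and `urls_for_range` are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: one gzipped JSON file per hour, named <YYYY-MM-DD>-<H>.json.gz,
# where the hour H runs from 0 to 23 and is not zero-padded.
GHARCHIVE_URL = "https://data.gharchive.org/{day}-{hour}.json.gz"


def hourly_archive_url(ts: datetime) -> str:
    """Return the archive URL for the hour that contains `ts`."""
    return GHARCHIVE_URL.format(day=ts.strftime("%Y-%m-%d"), hour=ts.hour)


def urls_for_range(start: datetime, hours: int) -> list[str]:
    """Return one archive URL per hour, beginning at `start`."""
    return [hourly_archive_url(start + timedelta(hours=i)) for i in range(hours)]
```

A download task could pass one of these URLs to a tool such as `curl` or `wget`; a backfill could iterate over `urls_for_range`.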
To run the workflow:
cd airflow
make up
Ensure the env.json file is properly configured before starting Airflow.
BigQuery is the primary data warehouse:
- External tables are created from GCS Parquet files.
- Optimized for analytical queries on GitHub data.
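As an illustration of the external-table step, the sketch below builds a BigQuery DDL statement that points a table at Parquet files in GCS. The function name, the table name, and the GCS path are hypothetical placeholders; the project may instead create the table through an Airflow BigQuery operator.

```python
def external_table_ddl(project: str, dataset: str, table: str, gcs_uri: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for Parquet files in GCS."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{project}.{dataset}.{table}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{gcs_uri}']\n"
        ")"
    )


# Hypothetical identifiers, shown only to illustrate the call.
ddl = external_table_ddl(
    project="your-gcp-project-id",
    dataset="your-bigquery-dataset-name",
    table="watch_events_external",
    gcs_uri="gs://your-gcs-bucket-name/gharchive/*.parquet",
)
print(ddl)
```

Because the table is external, BigQuery reads the Parquet files in place at query time, so new files matching the URI pattern become queryable without a load job.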
Data transformations are performed using:
- Python/Pandas: For initial ETL processing.
  - Filters relevant GitHub events.
  - Converts data to optimized Parquet format.
- SQL: For advanced transformations and analytics in BigQuery.
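The filtering step can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's transform code: it assumes each line of an archive file is one JSON event carrying `type`, `id`, `actor.login`, `repo.name`, and `created_at`, and the output column names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def watch_event_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one flat row per WatchEvent found in newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("type") != "WatchEvent":
            continue  # keep watch events only
        yield {
            "event_id": event.get("id"),
            "actor_login": event.get("actor", {}).get("login"),
            "repo_name": event.get("repo", {}).get("name"),
            "created_at": event.get("created_at"),
        }
```

The resulting rows could then be written to Parquet, for example with `pandas.DataFrame(list(rows)).to_parquet(path)`, which requires a Parquet engine such as `pyarrow` to be installed.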
The project includes an interactive dashboard for data visualization.
Key dashboard components:
- Repository popularity trends over time.
- Top trending repositories by watch events.
- User engagement metrics and patterns.
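The first two components reduce to simple counts over watch events. The sketch below shows one way to compute them in plain Python; it assumes each row is a dict with hypothetical `repo_name` and `created_at` (ISO 8601 timestamp) keys, and it is not the dashboard's actual query logic, which may run as SQL in BigQuery.

```python
from collections import Counter


def top_repositories(rows: list[dict], n: int = 10) -> list[tuple[str, int]]:
    """Return the `n` repositories with the most watch events."""
    counts = Counter(row["repo_name"] for row in rows if row.get("repo_name"))
    return counts.most_common(n)


def watches_per_day(rows: list[dict]) -> dict[str, int]:
    """Count watch events per calendar day, sorted by date."""
    # The first 10 characters of an ISO 8601 timestamp are YYYY-MM-DD.
    counts = Counter(row["created_at"][:10] for row in rows if row.get("created_at"))
    return dict(sorted(counts.items()))
```

In a Streamlit app, results like these could be passed to a chart or table widget.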
The dashboard is built using Streamlit and can be accessed locally at http://localhost:8501, or through demo 1 and demo 2.
To start the Streamlit server using Docker:
docker-compose up
Ensure the docker-compose.yml file is properly configured for Streamlit visualization.
- Google Cloud Platform account with billing enabled.
- Docker and Docker Compose.
- Terraform.
- Python 3.9+.
- Clone the repository:
  git clone https://github.com/a920604a/data-engineering-zoomcamp-2025.git
  cd project
- Set up GCP credentials:
  - Create a service account with appropriate permissions.
  - Download the JSON key file.
  - Place it in the project directory as service-account.json.
- Deploy cloud infrastructure:
  cd terraform
  terraform init
  terraform apply
- Start Airflow:
  cd airflow
  make up
- Access services:
  - Airflow UI: http://localhost:8080
- Environment Variables: Airflow requires an env.json file to store sensitive variables. Place this file in the airflow directory. Example env.json:
  {
    "GCP_PROJECT": "your-gcp-project-id",
    "GCS_BUCKET": "your-gcs-bucket-name",
    "BIGQUERY_DATASET": "your-bigquery-dataset-name"
  }
- Trigger the pipeline:
  - From the Airflow UI, trigger the cloud_gharchive_dag DAG.
Follow these steps to fully reproduce the project and start analyzing GitHub data.
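Since a misconfigured env.json is an easy way for the pipeline to fail at startup, a small pre-flight check can help. The sketch below is a hypothetical helper, not part of the repository; it assumes the three keys shown in the example env.json are all required and that the file lives at `airflow/env.json`.

```python
import json
from pathlib import Path

# Keys taken from the example env.json; treating all three as required
# is an assumption.
REQUIRED_KEYS = ("GCP_PROJECT", "GCS_BUCKET", "BIGQUERY_DATASET")


def validate_env(config: dict) -> dict:
    """Raise ValueError if any required key is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"env.json is missing values for: {', '.join(missing)}")
    return config


def load_env(path: str = "airflow/env.json") -> dict:
    """Read env.json from `path` and validate its contents."""
    return validate_env(json.loads(Path(path).read_text(encoding="utf-8")))
```

Running `load_env()` from the project root before `make up` would surface a missing value immediately, with a message naming the affected keys.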
- Infrastructure: Terraform, Google Cloud Platform.
- Workflow Orchestration: Apache Airflow.
- Storage: Google Cloud Storage.
- Data Warehouse: Google BigQuery.
- Data Processing: Python, Pandas, PySpark.
- Containerization: Docker, Docker Compose.
- Visualization: Streamlit.
├── airflow/ # Airflow DAGs and configs
├── terraform/ # IaC scripts for GCP
├── Visual/ # Dashboard code
└── README.md # Project overview