This project implements an end-to-end data pipeline for processing e-commerce events, leveraging Google Cloud Platform (GCP) services, Confluent Kafka, Docker, and other tools.
The pipeline ingests e-commerce event data, streams it through Confluent Cloud Kafka, stores it in Google BigQuery, transforms it using dbt, and finally makes it available for analysis and visualization via Looker Studio.
The project follows these stages:
- Data Source: E-commerce event data is initially sourced from a CSV file.
- Producer: A Python application, containerized with Docker, reads the e-commerce event data from the CSV file and produces it as a real-time stream to the `ecom_events` topic in Confluent Cloud Kafka (a minimal sketch follows this list).
- Confluent Cloud Kafka: This managed Kafka service acts as the central message broker, decoupling the producer and consumer.
- Consumer: A Python application, also containerized with Docker, consumes the events from the `ecom_events` Kafka topic.
- Google BigQuery (Source Table): The consumer ingests the raw event data directly into a BigQuery table named `kafka_ecom_events`. Basic transformations are applied at this stage.
- dbt (Data Build Tool): dbt is used to perform more complex transformations on the data within BigQuery, creating staging and core tables for analytical purposes.
- Google BigQuery (Transformed Tables): The transformed data resides in various tables within BigQuery, ready for querying.
- Analytical Dashboards: Tools like Looker Studio can connect to the transformed BigQuery tables to create insightful dashboards and reports.
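For orientation, here is a minimal sketch of what the producer stage might look like using the `confluent-kafka` Python client. The CSV file name and connection values below are illustrative placeholders, not taken from this repository; the actual producer runs inside its Docker image and gets its connection settings from `confluent_cluster_api.txt`.

```python
import csv
import json
import time

from confluent_kafka import Producer

# Placeholder connection settings -- in this project they come from
# confluent_cluster_api.txt rather than being hard-coded.
conf = {
    "bootstrap.servers": "<BOOTSTRAP_SERVER>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
}
producer = Producer(conf)

with open("ecommerce_events.csv", newline="") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        # Key by user_id so all events for a user land on the same Kafka partition.
        producer.produce("ecom_events", key=row["user_id"], value=json.dumps(row))
        producer.poll(0)      # serve delivery callbacks
        time.sleep(0.01)      # throttle to simulate a real-time stream
producer.flush()
```

Presumably the `producer-fast` variant mentioned later skips the throttling and pushes all rows as quickly as possible.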
The BigQuery `kafka_ecom_events` table is designed with performance and cost-efficiency for analytical queries in mind. We've implemented the following partitioning and clustering strategy:

- Partitioning by `event_time`: The table is partitioned by the `event_time` column on a daily basis. This is a crucial decision, as most analytical queries on e-commerce data involve filtering by specific time ranges (e.g., daily, weekly, monthly reports). Partitioning allows BigQuery to scan only the relevant partitions, significantly reducing query costs and improving performance.
- Clustering by `event_type`, `product_id`, and `category_code`: We've chosen to cluster the data by these three fields, based on the expectation that common analytical queries will filter or aggregate by the type of event (e.g., purchases, views), specific products, or product categories. Clustering co-locates data with similar values for these columns within each partition, further enhancing query performance.
- Kafka Partition Key (`user_id`): While the Kafka topic is partitioned by `user_id` (which is useful for ensuring message ordering per user within Kafka), we opted not to use `user_id` as the primary partitioning or clustering key in BigQuery. Our anticipated analytical queries are more likely to focus on time-based analysis and aggregations across event types, products, and categories than on filtering by individual users. We can still query data by `user_id` efficiently in BigQuery using standard SQL filtering.
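For reference, the table layout described above could be expressed with the BigQuery Python client roughly as follows. The dataset name `ecom_dataset` and the project ID are placeholders (the real dataset name may differ), and in this project the table is provisioned elsewhere (e.g., via Terraform); this is only a sketch of the partitioning and clustering settings.

```python
from google.cloud import bigquery

client = bigquery.Client(project="<YOUR_GCP_PROJECT_ID>")

# Schema mirrors the kafka_ecom_events table described later in this README.
schema = [
    bigquery.SchemaField("event_time", "TIMESTAMP"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("product_id", "STRING"),
    bigquery.SchemaField("category_id", "STRING"),
    bigquery.SchemaField("category_code", "STRING"),
    bigquery.SchemaField("brand", "STRING"),
    bigquery.SchemaField("price", "FLOAT"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("user_session", "STRING"),
]

table = bigquery.Table("<YOUR_GCP_PROJECT_ID>.ecom_dataset.kafka_ecom_events", schema=schema)

# Daily partitioning on event_time so time-range queries scan few partitions...
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_time",
)
# ...clustered by the columns most analytical queries filter or group on.
table.clustering_fields = ["event_type", "product_id", "category_code"]

client.create_table(table, exists_ok=True)
```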
- Python: For developing the producer and consumer applications.
- Confluent Kafka: A distributed streaming platform for message queuing.
- Google Cloud Platform (GCP): The cloud environment hosting the project.
- BigQuery: For data warehousing and analytics.
- Docker: For containerizing the producer and consumer applications.
- dbt (data build tool): For performing data transformations in BigQuery.
- Make: For automating build and deployment tasks.
- Terraform: For defining and managing the infrastructure on GCP.
- Looker Studio (example): For creating analytical dashboards.
- Note: `producer-fast` is only for quickly producing all the data at once; in a real environment, data would be produced at roughly the rate of `producer`.
You can use the `Makefile` to simplify common tasks:

- `make download-data`: Download and extract the initial e-commerce events data (if needed for local testing).
- `make build-producer`: Build the Docker image for the producer.
- `make run-producer`: Run the Docker container for the producer.
- `make stop-and-remove`: Stop and remove all containers.
- `make build-producer-fast`: Build the Docker image for the producer-fast (downloads data if needed).
- `make run-producer-fast`: Run the Docker container for the producer-fast (builds the image if necessary).
- `make build-consumer`: Build the Docker image for the consumer.
- `make run-consumer`: Run the Docker container for the consumer in interactive mode.
- `make run-consumer-detached`: Run the consumer in detached mode (runs in the background).
- `make attach-consumer-logs`: Attach to the logs of the detached consumer.
- `make dbt-run`: Run the dbt transformations (prompts for GCP Project ID and passes it as a variable).
- `make dbt-build`: Run the dbt build process (prompts for GCP Project ID and passes it as a variable).
- `make dbt-test`: Run the dbt tests (prompts for GCP Project ID and passes it as a variable).
- `make dbt-clean`: Clean the dbt project (remove the `target` directory).
- `make dbt-docs-generate`: Generate the dbt documentation (prompts for GCP Project ID).
- `make clean`: Stop all containers, remove their images, and clean dbt project and data files.
- `make help`: Display a list of available `make` commands.
The `kafka_ecom_events` table in BigQuery has the following schema:

| Field Name | Type | Mode | Description |
|---|---|---|---|
| `event_time` | TIMESTAMP | NULLABLE | Timestamp of the event. |
| `event_type` | STRING | NULLABLE | Type of the e-commerce event (e.g., purchase, view). |
| `product_id` | STRING | NULLABLE | Identifier of the product. |
| `category_id` | STRING | NULLABLE | Identifier of the product category. |
| `category_code` | STRING | NULLABLE | Code representing the product category. |
| `brand` | STRING | NULLABLE | Brand of the product. |
| `price` | FLOAT | NULLABLE | Price of the product. |
| `user_id` | STRING | NULLABLE | Identifier of the user. |
| `user_session` | STRING | NULLABLE | Identifier of the user's session. |
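To make the schema concrete, a stripped-down consumer that reads from `ecom_events` and streams rows of this shape into the table might look like the sketch below. The connection settings, consumer group name, and dataset name are placeholders, and the actual consumer in this repository may handle batching, casting, and errors differently.

```python
import json

from confluent_kafka import Consumer
from google.cloud import bigquery

consumer = Consumer({
    "bootstrap.servers": "<BOOTSTRAP_SERVER>",  # placeholder Confluent settings
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
    "group.id": "ecom-consumer",                # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["ecom_events"])

bq = bigquery.Client()
table_id = "<YOUR_GCP_PROJECT_ID>.ecom_dataset.kafka_ecom_events"  # placeholder dataset

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Basic transformation at this stage: cast the price field to a float.
    event["price"] = float(event["price"]) if event.get("price") else None
    # insert_rows_json streams the row in and returns a list of per-row errors.
    errors = bq.insert_rows_json(table_id, [event])
    if errors:
        print("BigQuery insert errors:", errors)
```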
- https://lookerstudio.google.com/reporting/56d74565-dc30-4f66-9a11-c9cc01d82686
- Note: the link may stop working at some point, since I will probably delete the data in BigQuery.
- Create Confluent Cluster
- Create API key
- Create topic
- Create Project
- Enable the following APIs
- Create Service account
- Confluent Cloud: The producer and consumer applications rely on a Confluent Cloud API configuration file (`confluent_cluster_api.txt`) for connecting to your Kafka cluster. Ensure this file is placed in the project root.
- GCP Credentials: The consumer application uses a GCP service account key file (`gcp_key.json`) to authenticate with Google BigQuery. Ensure this file is placed in the project root (exercise caution when handling this file and avoid committing it to public repositories).
- Install Terraform on your machine; follow the guide at https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli
- `git clone https://github.com/IssaAlBawwab/gcp-ecommerce-end-to-end.git`
- Make sure `gcp_key.json` and `confluent_cluster_api.txt` are in the project root
- `cd terraform/`
- `terraform apply`
- Enter your GCP project ID (get it from the GCP console; copy the "ID", not the name)
- Enter "yes"
- `sudo apt update`
- `git clone https://github.com/IssaAlBawwab/gcp-ecommerce-end-to-end.git`
- Upload the key files (sometimes this glitches and prompts you to retry; if so, re-upload and it should work)
- `mv gcp_key.json gcp-ecommerce-end-to-end/`
- `mv confluent_cluster_api.txt gcp-ecommerce-end-to-end/`
- `cd gcp-ecommerce-end-to-end`
- `sudo apt install make`
- `sudo apt-get install -y unzip`
- Install Docker with the following commands:
```bash
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
- `sudo usermod -aG docker $USER`
- Close the VM SSH window, then reopen it, so that the group change takes effect
- `cd gcp-ecommerce-end-to-end` and run `make run-consumer-detached`
- `make run-producer-fast`
- `make attach-consumer-logs`
- Query your table in BigQuery (an example query is sketched after these steps)
- Back in the SSH session:
- `sudo apt install python3-venv`, then `python3 -m venv dbt_venv`, `source dbt_venv/bin/activate`, and `pip install dbt-core dbt-bigquery`
- Run `make dbt-create-profile` and enter your project ID
- Run `make dbt-run` and enter your project ID
- Congrats, you should now have these tables
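As an example of the "Query your table in BigQuery" step above, the hypothetical snippet below counts events per `event_type` for a single day; filtering on `DATE(event_time)` restricts the scan to one daily partition, and grouping by `event_type` benefits from the clustering described earlier. The dataset name and the date are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="<YOUR_GCP_PROJECT_ID>")

# Placeholder dataset name and example date -- adjust to your own setup.
query = """
    SELECT event_type, COUNT(*) AS events
    FROM `<YOUR_GCP_PROJECT_ID>.ecom_dataset.kafka_ecom_events`
    WHERE DATE(event_time) = "2019-11-01"
    GROUP BY event_type
    ORDER BY events DESC
"""

for row in client.query(query).result():
    print(row.event_type, row.events)
```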
- run "terraform destroy" in the machine that you used to run terraform
- AI was used to help write the README and Makefile.