This folder contains Jupyter notebooks designed to explore the usage of PySpark, a powerful tool for distributed data processing. Inside this repo you will find examples for both batch and streaming processing:
- For batch, we only use PySpark; you can find the notebook for this inside the folder `notebooks/batch_processing`. Just follow the instructions below to get all the data needed for this process. A minimal read sketch appears right after this list.
- For streaming, you can find the notebook inside the folder `notebooks/streaming_processing`. Under this folder you have:
  - `kafka_producer.py` - a simple Python script that will publish random messages to a Kafka topic (see the illustrative producer sketch after this list). The Kafka cluster is created with the Docker container configured by the `Dockerfile` and `docker-compose.yml` files. To run this producer, you need to:
    - run the container using the instructions below
    - create a Python virtual environment, activate it, and install the needed packages from `requirements.txt`:

      ```
      python -m venv .venv
      source .venv/bin/activate
      pip install -r requirements.txt
      ```
    - navigate to the folder `notebooks/streaming_processing` and run the script using:

      ```
      python kafka_producer.py
      ```

    - you should start seeing messages flowing to the Kafka cluster
  - `streaming_processing.ipynb` - the notebook file with all the steps needed to connect to the Kafka cluster, consume the messages, and then process them (a consumer sketch also follows this list).
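For reference, here is a minimal sketch of the batch side, assuming the parquet file has been downloaded into `spark_data/` per the steps below. The app name is illustrative, and the actual notebook in `notebooks/batch_processing` may do considerably more:

```python
from pyspark.sql import SparkSession

# Illustrative app name; the notebook may configure its own session.
spark = SparkSession.builder.appName("batch-example").getOrCreate()

# Read the trip data downloaded into spark_data/ (see the download step below).
df = spark.read.parquet("spark_data/fhvhv_tripdata_2023-01.parquet")

df.printSchema()
print(df.count())  # quick sanity check that the file loaded
```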
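Purely as an illustration of what a producer like `kafka_producer.py` might do (the broker address, topic name, and message shape below are assumptions, not the script's actual values), here is a minimal sketch using the `kafka-python` package:

```python
import json
import random
import time

from kafka import KafkaProducer  # from the kafka-python package

# Assumed broker and topic; check kafka_producer.py and docker-compose.yml
# for the values actually used in this repo.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Publish one random JSON message per second.
    message = {"id": random.randint(1, 1000), "value": random.random()}
    producer.send("example-topic", message)
    time.sleep(1)
```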
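On the consuming side, the notebook's Kafka connection could resemble this Structured Streaming sketch. The broker, topic, connector version, and console sink here are assumptions; in particular, the `spark-sql-kafka` connector version must match your Spark and Scala builds:

```python
from pyspark.sql import SparkSession

# The Kafka source requires the spark-sql-kafka connector; this version
# string is an assumption and must match your Spark/Scala versions.
spark = (
    SparkSession.builder
    .appName("streaming-example")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0",
    )
    .getOrCreate()
)

# Assumed broker address and topic; use the values from docker-compose.yml.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "example-topic")
    .load()
)

# Kafka values arrive as bytes; cast to string before processing.
messages = stream.selectExpr("CAST(value AS STRING) AS value")

# Write incoming messages to the console as they arrive.
query = (
    messages.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```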
To run these notebooks, ensure you have the necessary dependencies installed and configured on your system.
- Install Docker.
- Download required data: run the following command to get the data needed for this exercise. If you don't have `wget` installed, follow the instructions below (or use the Python alternative after them).

  ```
  wget https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet -P spark_data/
  ```
  **macOS**

  You can install `wget` using Homebrew:

  ```
  brew install wget
  ```

  **Linux**

  On Debian-based distributions (like Ubuntu), install `wget` using:

  ```
  sudo apt-get update
  sudo apt-get install wget
  ```

  On Red Hat-based systems (like Fedora):

  ```
  sudo dnf install wget
  ```

  **Windows**

  On Windows, install `wget` as part of Git Bash or through a third-party package manager like Chocolatey:

  ```
  choco install wget
  ```
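  If installing `wget` is not an option, a short Python equivalent of the download command above (same URL and destination folder) would be:

  ```python
  import pathlib
  import urllib.request

  # Same file and destination as the wget command above.
  url = "https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet"
  dest = pathlib.Path("spark_data")
  dest.mkdir(exist_ok=True)

  urllib.request.urlretrieve(url, dest / "fhvhv_tripdata_2023-01.parquet")
  ```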
- Build the Docker container:

  ```
  docker-compose build --no-cache
  ```

- Run the Docker container:

  ```
  docker-compose up -d
  ```
- Open the following links to ensure everything is working properly:
  - Jupyter Notebook
  - Spark Cluster UI
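As an optional sanity check, you can poke both UIs from Python. The ports below are assumptions based on common defaults (8888 for Jupyter, 8080 for the Spark master UI); adjust them to whatever port mappings your `docker-compose.yml` actually exposes:

```python
import urllib.request

# Assumed default ports; adjust to match docker-compose.yml.
for name, url in [
    ("Jupyter Notebook", "http://localhost:8888"),
    ("Spark Cluster UI", "http://localhost:8080"),
]:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except OSError as exc:
        print(f"{name}: not reachable ({exc})")
```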