This folder contains Jupyter notebooks designed to explore the usage of PySpark, a powerful tool for distributed data processing. Inside this repo you will find examples for both batch and streaming processing:
- For batch, we only use PySpark; you can find the notebook for this inside the folder `notebooks/batch_processing`. Just follow the instructions below to get all the data needed for this process. A minimal read sketch appears right after this list.
- For streaming, you can find the notebook inside the folder `notebooks/streaming_processing`. Under this folder you have:
  - `kafka_producer.py` - a simple Python script that will publish random messages to a Kafka topic (see the illustrative producer sketch after this list). The Kafka cluster is created with the Docker container configured by the `Dockerfile` and `docker-compose.yml` files. To run this producer, you need to:
    - run the container using the instructions below
    - create a Python virtual environment, activate it, and install the needed packages from `requirements.txt`:

      ```
      python -m venv .venv
      source .venv/bin/activate
      pip install -r requirements.txt
      ```
    - navigate to the folder `notebooks/streaming_processing` and run the script using:

      ```
      python kafka_producer.py
      ```

    - you should start seeing messages flowing to the Kafka cluster
  - `streaming_processing.ipynb` - the notebook file with all the steps needed to connect to the Kafka cluster, consume the messages, and then process them (a consumer sketch also follows this list).
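For reference, here is a minimal sketch of the batch side, assuming the parquet file has been downloaded into `spark_data/` per the steps below. The app name is illustrative, and the actual notebook in `notebooks/batch_processing` may do considerably more:

```python
from pyspark.sql import SparkSession

# Illustrative app name; the notebook may configure its own session.
spark = SparkSession.builder.appName("batch-example").getOrCreate()

# Read the trip data downloaded into spark_data/ (see the download step below).
df = spark.read.parquet("spark_data/fhvhv_tripdata_2023-01.parquet")

df.printSchema()
print(df.count())  # quick sanity check that the file loaded
```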
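Purely as an illustration of what a producer like `kafka_producer.py` might do (the broker address, topic name, and message shape below are assumptions, not the script's actual values), here is a minimal sketch using the `kafka-python` package:

```python
import json
import random
import time

from kafka import KafkaProducer  # from the kafka-python package

# Assumed broker and topic; check kafka_producer.py and docker-compose.yml
# for the values actually used in this repo.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Publish one random JSON message per second.
    message = {"id": random.randint(1, 1000), "value": random.random()}
    producer.send("example-topic", message)
    time.sleep(1)
```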
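On the consuming side, the notebook's Kafka connection could resemble this Structured Streaming sketch. The broker, topic, connector version, and console sink here are assumptions; in particular, the `spark-sql-kafka` connector version must match your Spark and Scala builds:

```python
from pyspark.sql import SparkSession

# The Kafka source requires the spark-sql-kafka connector; this version
# string is an assumption and must match your Spark/Scala versions.
spark = (
    SparkSession.builder
    .appName("streaming-example")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0",
    )
    .getOrCreate()
)

# Assumed broker address and topic; use the values from docker-compose.yml.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "example-topic")
    .load()
)

# Kafka values arrive as bytes; cast to string before processing.
messages = stream.selectExpr("CAST(value AS STRING) AS value")

# Write incoming messages to the console as they arrive.
query = (
    messages.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```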
To run these notebooks, ensure you have the necessary dependencies installed and configured on your system.
- Install Docker.
- Download required data: run the following command to get the data needed for this exercise. If you don't have `wget` installed, follow the instructions below (or use the Python alternative after them).

  ```
  wget https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet -P spark_data/
  ```
  **macOS**

  You can install `wget` using Homebrew:

  ```
  brew install wget
  ```

  **Linux**

  On Debian-based distributions (like Ubuntu), install `wget` using:

  ```
  sudo apt-get update
  sudo apt-get install wget
  ```

  On Red Hat-based systems (like Fedora):

  ```
  sudo dnf install wget
  ```

  **Windows**

  On Windows, install `wget` as part of Git Bash or through a third-party package manager like Chocolatey:

  ```
  choco install wget
  ```
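  If installing `wget` is not an option, a short Python equivalent of the download command above (same URL and destination folder) would be:

  ```python
  import pathlib
  import urllib.request

  # Same file and destination as the wget command above.
  url = "https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2023-01.parquet"
  dest = pathlib.Path("spark_data")
  dest.mkdir(exist_ok=True)

  urllib.request.urlretrieve(url, dest / "fhvhv_tripdata_2023-01.parquet")
  ```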
- Build the Docker container:

  ```
  docker-compose build --no-cache
  ```

- Run the Docker container:

  ```
  docker-compose up -d
  ```
- Open the following links to ensure everything is working properly:
  - Jupyter Notebook
  - Spark Cluster UI
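As an optional sanity check, you can poke both UIs from Python. The ports below are assumptions based on common defaults (8888 for Jupyter, 8080 for the Spark master UI); adjust them to whatever port mappings your `docker-compose.yml` actually exposes:

```python
import urllib.request

# Assumed default ports; adjust to match docker-compose.yml.
for name, url in [
    ("Jupyter Notebook", "http://localhost:8888"),
    ("Spark Cluster UI", "http://localhost:8080"),
]:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except OSError as exc:
        print(f"{name}: not reachable ({exc})")
```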