Real-Time Streaming Pipeline — Apache Kafka · Spark Structured Streaming

Overview

A real-time streaming pipeline that uses Apache Kafka producers and consumers together with Spark Structured Streaming to process event data with sub-second latency. It relies on partitioned topics for parallel consumption, and on offset checkpointing with write-ahead logs for exactly-once processing semantics and fault tolerance.

Architecture

Event Sources
(user clicks, transactions, sensors)
        │
        ▼
┌───────────────────┐
│   Kafka Producer  │  ← Publishes events to topic
│   (producer.py)   │
└────────┬──────────┘
         │  topic: events-stream
         ▼
┌───────────────────┐
│   Kafka Broker    │  ← Partitioned topic (3 partitions)
│   Consumer Group  │  ← Parallel consumption
└────────┬──────────┘
         │
         ▼
┌────────────────────────┐
│  Spark Structured      │
│  Streaming             │
│  (streaming.py)        │
│                        │
│  • Parse JSON events   │
│  • Watermark windowing │
│  • Aggregations        │
│  • Offset checkpoint   │
└────────┬───────────────┘
         │
         ▼
┌───────────────────┐
│  Output Sink      │
│  (Parquet / ADLS) │
└───────────────────┘

Tech Stack

Tool	Purpose
Apache Kafka	Event streaming, topic partitioning, consumer groups
Spark Structured Streaming	Real-time data processing, windowed aggregations
Python (kafka-python)	Kafka producer simulation
PySpark	Stream processing engine
Parquet	Output sink format

Project Structure

project3_kafka_streaming/
├── src/
│   ├── producer.py       # Kafka event producer (simulates event stream)
│   ├── consumer.py       # Basic Kafka consumer (for testing/debugging)
│   └── streaming.py      # Spark Structured Streaming pipeline
├── notebooks/
│   └── streaming_walkthrough.ipynb
├── data/
│   └── sample/
│       └── sample_events.json
├── requirements.txt
└── README.md

Setup

Prerequisites

Apache Kafka running locally (or cloud-managed)
Python 3.8+
Apache Spark 3.x with the Kafka connector JAR (spark-sql-kafka-0-10) available to Spark

Install dependencies

pip install -r requirements.txt

Start Kafka (local with Docker)

docker run -d --name kafka \
  -p 9092:9092 \
  -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \
  confluentinc/cp-kafka:latest

Create Kafka topic

kafka-topics.sh --create \
  --topic events-stream \
  --partitions 3 \
  --replication-factor 1 \
  --bootstrap-server localhost:9092

Run the producer (in one terminal)

python src/producer.py
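
The contents of producer.py are not reproduced in this README. As a rough sketch of what a simulated event producer could look like with kafka-python, the block below assumes a hypothetical event shape (event_id, event_type, user_id, event_time) and hypothetical helper names (make_event, serialize, run_producer). Only the topic name and broker address come from the setup steps above.

```python
import json
import random
import time
import uuid
from datetime import datetime, timezone

TOPIC = "events-stream"        # topic created in the setup step
BOOTSTRAP = "localhost:9092"   # broker address from the Docker command
EVENT_TYPES = ("click", "transaction", "sensor")  # assumed event types


def make_event(rng: random.Random) -> dict:
    """Build one simulated event. Field names are illustrative assumptions."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": rng.choice(EVENT_TYPES),
        "user_id": f"user-{rng.randint(1, 100)}",
        "event_time": datetime.now(timezone.utc).isoformat(),
    }


def serialize(event: dict) -> bytes:
    """Encode an event as UTF-8 JSON, the format the Spark job parses."""
    return json.dumps(event).encode("utf-8")


def run_producer(n_events: int = 100, delay_s: float = 0.1) -> None:
    """Publish n_events simulated events. Needs a reachable broker."""
    # Imported here so the helpers above work without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=BOOTSTRAP, value_serializer=serialize)
    rng = random.Random()
    try:
        for _ in range(n_events):
            event = make_event(rng)
            # Keying by user_id sends all of a user's events to one partition,
            # which preserves their relative order.
            producer.send(TOPIC, key=event["user_id"].encode("utf-8"), value=event)
            time.sleep(delay_s)
    finally:
        producer.flush()   # wait until buffered records are delivered
        producer.close()
```

Calling run_producer() with the broker from the previous step running would publish 100 events at roughly ten per second.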

Run the Spark streaming pipeline (in another terminal)

python src/streaming.py
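
streaming.py itself is not shown here. The sketch below illustrates how the read-and-parse stage listed in the architecture diagram might be written with PySpark. The schema fields and the helper names kafka_source_options and read_events are assumptions for illustration, not the project's actual code.

```python
BOOTSTRAP = "localhost:9092"   # broker address from the setup steps
TOPIC = "events-stream"        # topic from the setup steps


def kafka_source_options(bootstrap: str = BOOTSTRAP, topic: str = TOPIC) -> dict:
    """Options for Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap,
        "subscribe": topic,
        "startingOffsets": "latest",  # only used when no checkpoint exists yet
    }


def read_events(spark):
    """Return a streaming DataFrame of parsed events.

    `spark` is an existing SparkSession with the Kafka connector available.
    """
    # Imported here so this module loads even where PySpark is not installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    # Assumed event schema; adjust to the real payload.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("user_id", StringType()),
        StructField("event_time", TimestampType()),
    ])

    raw = spark.readStream.format("kafka").options(**kafka_source_options()).load()
    # Kafka delivers the payload as bytes in the `value` column.
    return (
        raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )
```

If the connector is not already on the classpath, one common way to supply it is spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_<scala version>:<spark version> src/streaming.py, with the versions matched to the local Spark installation.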

Key Features

Exactly-once Processing

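Offset checkpointing via checkpointLocation records which Kafka offsets each micro-batch covers. Combined with the idempotent file sink used for the Parquet output, this means a restarted pipeline neither skips events nor writes them twice.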
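
As a minimal sketch of how the checkpoint is attached to the output stream, the block below starts a Parquet sink with a checkpointLocation. The function names and default paths are hypothetical, and events_df is assumed to be a streaming DataFrame produced earlier in the pipeline.

```python
def sink_options(output_path: str, checkpoint_path: str) -> dict:
    """Options for the Parquet file sink."""
    return {
        "path": output_path,
        # Spark stores offsets and commit records for this query here.
        "checkpointLocation": checkpoint_path,
    }


def start_parquet_sink(
    events_df,
    output_path: str = "data/output/events",           # assumed location
    checkpoint_path: str = "data/checkpoints/events",  # assumed location
):
    """Start the streaming query and return its StreamingQuery handle."""
    return (
        events_df.writeStream
        .format("parquet")
        .options(**sink_options(output_path, checkpoint_path))
        .outputMode("append")   # the file sink supports append mode only
        .start()
    )
```

Restarting the query with the same checkpoint_path resumes from the recorded offsets. Pointing it at a new, empty path makes Spark treat it as a fresh query.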

Fault Tolerance

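Before processing each micro-batch, Spark records its Kafka offset range in a write-ahead log under the checkpoint directory. After a failure, the pipeline resumes from the last recorded offsets rather than from the start or end of the topic.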
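
The recovery idea can be illustrated with a toy model in plain Python. This is not Spark's implementation: ToyStream and its fields are invented for illustration, and the in-memory dictionaries stand in for durable files in the checkpoint directory. It shows why logging the offset range before processing, and keying output by batch ID, lets a crashed batch be replayed without loss or duplication.

```python
class ToyStream:
    """Toy model of offset write-ahead logging with an idempotent sink."""

    def __init__(self, source, batch_size):
        self.source = source        # list standing in for one Kafka partition
        self.batch_size = batch_size
        self.offset_log = {}        # batch_id -> (start, end), written BEFORE processing
        self.commit_log = set()     # batch_ids whose output is fully written
        self.sink = {}              # batch_id -> records; rewriting a key is idempotent

    def run_batch(self, crash_before_commit=False):
        batch_id = len(self.commit_log)  # first batch that has not committed
        if batch_id not in self.offset_log:
            # New batch: plan its offset range and log it first.
            start = self.offset_log[batch_id - 1][1] if batch_id else 0
            end = min(start + self.batch_size, len(self.source))
            self.offset_log[batch_id] = (start, end)
        # Planned or replayed, the batch always covers the logged range.
        start, end = self.offset_log[batch_id]
        self.sink[batch_id] = self.source[start:end]
        if crash_before_commit:
            raise RuntimeError("simulated crash after write, before commit")
        self.commit_log.add(batch_id)


stream = ToyStream(list(range(10)), batch_size=4)
stream.run_batch()                       # batch 0 commits
try:
    stream.run_batch(crash_before_commit=True)   # batch 1 written, not committed
except RuntimeError:
    pass
stream.run_batch()                       # restart: batch 1 replayed with same offsets
stream.run_batch()                       # batch 2
result = [x for b in sorted(stream.sink) for x in stream.sink[b]]
```

After the replay, result contains each of the ten source records exactly once, even though batch 1 was written twice.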

Windowed Aggregations

Events are aggregated in 1-minute tumbling windows with a 10-second watermark to handle late-arriving data.
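
A hedged sketch of this aggregation in PySpark follows. The column names event_time and event_type are assumptions about the event schema, and windowed_counts is a hypothetical function name. The pure-Python window_bounds helper is included only to show which tumbling window a given timestamp falls into.

```python
from datetime import datetime, timedelta

WINDOW = "1 minute"       # tumbling window size from the description above
WATERMARK = "10 seconds"  # allowed lateness from the description above


def window_bounds(event_time: datetime, size_s: int = 60):
    """Return the [start, end) tumbling window containing event_time.

    Windows are aligned to the Unix epoch, which is how Spark aligns
    tumbling windows when no start offset is given.
    """
    epoch = datetime(1970, 1, 1, tzinfo=event_time.tzinfo)
    elapsed = int((event_time - epoch).total_seconds())
    start = epoch + timedelta(seconds=elapsed - elapsed % size_s)
    return start, start + timedelta(seconds=size_s)


def windowed_counts(events_df):
    """Count events per type in 1-minute tumbling windows.

    `events_df` is a streaming DataFrame with a timestamp column
    `event_time` and a string column `event_type` (assumed names).
    """
    # Imported here so window_bounds works without PySpark installed.
    from pyspark.sql import functions as F

    return (
        events_df
        .withWatermark("event_time", WATERMARK)   # tolerate 10 s of lateness
        .groupBy(F.window("event_time", WINDOW), "event_type")
        .count()
    )
```

For example, an event stamped 12:00:42 belongs to the window from 12:00:00 to 12:01:00. In append mode, which the file sink requires, Spark emits a window's row only after the watermark has passed the end of that window.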

Consumer Groups

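By default, Spark maps each Kafka partition to its own task, so executors read the three partitions of events-stream in parallel. Adding partitions raises the available read parallelism, which enables horizontal scaling. The same one-partition-per-reader model applies to plain Kafka consumers that share a group ID.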
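
For the standalone consumer listed in the project structure (consumer.py), group-based parallelism could look like the following kafka-python sketch. The group name events-debug and the function names are hypothetical; the actual consumer.py may differ.

```python
import json

TOPIC = "events-stream"       # topic from the setup steps
BOOTSTRAP = "localhost:9092"  # broker address from the setup steps
GROUP_ID = "events-debug"     # assumed consumer group name


def deserialize(raw: bytes) -> dict:
    """Decode a UTF-8 JSON message value."""
    return json.loads(raw.decode("utf-8"))


def run_consumer(max_messages: int = 10) -> None:
    """Print up to max_messages records. Needs a reachable broker."""
    # Imported here so deserialize works without kafka-python installed.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BOOTSTRAP,
        group_id=GROUP_ID,             # members of one group split the partitions
        auto_offset_reset="earliest",  # start from the beginning if no offset is stored
        value_deserializer=deserialize,
    )
    try:
        for i, msg in enumerate(consumer):
            print(f"partition={msg.partition} offset={msg.offset} value={msg.value}")
            if i + 1 >= max_messages:
                break
    finally:
        consumer.close()
```

Running this function in up to three processes with the same group ID lets Kafka divide the three partitions among them. A fourth process in the same group would receive no partition and sit idle.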

Key Results

Sub-second latency on event processing using Kafka + Spark Structured Streaming
Exactly-once semantics via offset checkpointing and write-ahead logs
Horizontal scalability through Kafka topic partitioning and consumer groups
