y-preethi/kafka_streaming

Real-Time Streaming Pipeline — Apache Kafka · Spark Structured Streaming

Overview

A real-time streaming pipeline using Apache Kafka producers/consumers and Spark Structured Streaming to process event data with sub-second latency. Built with consumer group partitioning, offset checkpointing, and write-ahead logs for exactly-once processing semantics and fault tolerance.


Architecture

Event Sources
(user clicks, transactions, sensors)
        │
        ▼
┌───────────────────┐
│   Kafka Producer  │  ← Publishes events to topic
│   (producer.py)   │
└────────┬──────────┘
         │  topic: events-stream
         ▼
┌───────────────────┐
│   Kafka Broker    │  ← Partitioned topic (3 partitions)
│   Consumer Group  │  ← Parallel consumption
└────────┬──────────┘
         │
         ▼
┌────────────────────────┐
│  Spark Structured      │
│  Streaming             │
│  (streaming.py)        │
│                        │
│  • Parse JSON events   │
│  • Watermark windowing │
│  • Aggregations        │
│  • Offset checkpoint   │
└────────┬───────────────┘
         │
         ▼
┌───────────────────┐
│  Output Sink      │
│  (Parquet / ADLS) │
└───────────────────┘

Tech Stack

Tool                         Purpose
Apache Kafka                 Event streaming, topic partitioning, consumer groups
Spark Structured Streaming   Real-time data processing, windowed aggregations
Python (kafka-python)        Kafka producer simulation
PySpark                      Stream processing engine
Parquet                      Output sink format

Project Structure

project3_kafka_streaming/
├── src/
│   ├── producer.py       # Kafka event producer (simulates event stream)
│   ├── consumer.py       # Basic Kafka consumer (for testing/debugging)
│   └── streaming.py      # Spark Structured Streaming pipeline
├── notebooks/
│   └── streaming_walkthrough.ipynb
├── data/
│   └── sample/
│       └── sample_events.json
├── requirements.txt
└── README.md

Setup

Prerequisites

  • Apache Kafka running locally (or cloud-managed)
  • Python 3.8+
  • Apache Spark 3.x with the Kafka connector JAR (spark-sql-kafka-0-10) matching your Spark and Scala versions

Install dependencies

pip install -r requirements.txt

Start Kafka (local with Docker)

docker run -d --name kafka \
  -p 9092:9092 \
  -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \
  confluentinc/cp-kafka:latest

Create Kafka topic

kafka-topics.sh --create \
  --topic events-stream \
  --partitions 3 \
  --replication-factor 1 \
  --bootstrap-server localhost:9092

Run the producer (in one terminal)

python src/producer.py
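
For orientation, here is a minimal sketch of what an event producer of this kind might look like. It is not the contents of producer.py: the event fields (user_id, event_type, event_time) and the helper names are assumptions made for illustration. Only the topic name comes from this README. The producer object is passed in, so the same function works with a kafka-python KafkaProducer or a test double.

```python
import json
import random
import time
from datetime import datetime, timezone

TOPIC = "events-stream"  # topic name from this README

# Hypothetical event types; the real producer.py may use different fields.
EVENT_TYPES = ["click", "transaction", "sensor"]


def make_event(user_id: int) -> dict:
    """Build one synthetic event with an ISO-8601 UTC timestamp."""
    return {
        "user_id": user_id,
        "event_type": random.choice(EVENT_TYPES),
        "event_time": datetime.now(timezone.utc).isoformat(),
    }


def serialize(event: dict) -> bytes:
    """Encode an event as UTF-8 JSON bytes for the Kafka message value."""
    return json.dumps(event).encode("utf-8")


def send_events(producer, count: int, delay_s: float = 0.0) -> int:
    """Send `count` events keyed by user_id and return the number sent.

    `producer` is any object exposing a kafka-python style
    send(topic, key=..., value=...) method.
    """
    for i in range(count):
        event = make_event(user_id=i % 10)
        # Keying by user_id keeps each user's events in one partition.
        key = str(event["user_id"]).encode("utf-8")
        producer.send(TOPIC, key=key, value=serialize(event))
        time.sleep(delay_s)
    return count
```

With kafka-python installed and a broker on localhost:9092, this could be driven by creating `KafkaProducer(bootstrap_servers="localhost:9092")`, calling `send_events(producer, 100, delay_s=0.1)`, and then `producer.flush()`.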

Run the Spark streaming pipeline (in another terminal)

python src/streaming.py
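
As a rough guide to what a pipeline with this shape involves, the sketch below reads JSON events from Kafka, applies a watermark, counts events per 1-minute window, and writes Parquet with a checkpoint. It is an assumption-laden outline, not the contents of streaming.py: the event schema, the output and checkpoint paths, and the function name build_query are all hypothetical. Only the topic name, broker address, window size, and watermark come from this README.

```python
# Broker address and topic come from this README; startingOffsets is an assumption.
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "localhost:9092",
    "subscribe": "events-stream",
    "startingOffsets": "latest",
}
OUTPUT_PATH = "output/events"            # hypothetical path
CHECKPOINT_PATH = "output/_checkpoints"  # hypothetical path


def build_query(spark):
    """Start a streaming query: Kafka JSON -> windowed counts -> Parquet."""
    # Imported lazily so the constants above can be inspected without PySpark.
    from pyspark.sql.functions import col, from_json, window
    from pyspark.sql.types import (
        IntegerType, StringType, StructField, StructType, TimestampType,
    )

    # Hypothetical event schema; adjust to the real payload.
    schema = StructType([
        StructField("user_id", IntegerType()),
        StructField("event_type", StringType()),
        StructField("event_time", TimestampType()),
    ])

    reader = spark.readStream.format("kafka")
    for key, value in KAFKA_OPTIONS.items():
        reader = reader.option(key, value)

    # Kafka delivers the payload as bytes in the `value` column.
    events = (
        reader.load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    counts = (
        events
        .withWatermark("event_time", "10 seconds")
        .groupBy(window(col("event_time"), "1 minute"), col("event_type"))
        .count()
    )

    # The file sink supports append mode only; the watermark makes
    # append mode valid for this aggregation.
    return (
        counts.writeStream
        .format("parquet")
        .outputMode("append")
        .option("path", OUTPUT_PATH)
        .option("checkpointLocation", CHECKPOINT_PATH)
        .start()
    )
```

A caller would typically create a session with `SparkSession.builder.appName("kafka-streaming").getOrCreate()` and then run `build_query(spark).awaitTermination()`. The Kafka connector must be on the classpath for the "kafka" format to resolve.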

Key Features

Exactly-once Processing

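Offset checkpointing via checkpointLocation, combined with the replayable Kafka source and the idempotent Parquet file sink, gives exactly-once output: after a restart an interrupted micro-batch may be reprocessed, but its results are not written twice.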
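
The role of an idempotent sink can be illustrated in plain Python. This is an analogy under simplifying assumptions, not how Spark's file sink is implemented: each batch is written to a file named after its batch id, so replaying a batch overwrites the same file instead of appending duplicate rows. The function names and file layout are hypothetical.

```python
import json
from pathlib import Path


def write_batch(out_dir: Path, batch_id: int, rows: list) -> Path:
    """Write a batch to a file named after its batch id.

    Rewriting the same batch id replaces the same file, so a replayed
    batch cannot add duplicate rows to the output.
    """
    target = out_dir / f"batch-{batch_id:05d}.json"
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(rows))
    tmp.replace(target)  # rename so readers never see a partial file
    return target


def read_all(out_dir: Path) -> list:
    """Return every row currently visible in the sink, in batch order."""
    rows = []
    for path in sorted(out_dir.glob("batch-*.json")):
        rows.extend(json.loads(path.read_text()))
    return rows
```

Writing batch 1 twice, as would happen if the job crashed after writing but before recording the batch as done, leaves the sink with a single copy of its rows.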

Fault Tolerance

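Spark records the Kafka offset range of each micro-batch in a write-ahead log under the checkpoint directory before processing it, so after a failure the pipeline resumes from the last recorded offsets instead of starting over or skipping data.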
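
The recovery idea can be sketched with a small append-only offset log. This is a simplified model written for illustration, not Spark's checkpoint format: the class name OffsetLog, the record layout, and the "plan"/"commit" labels are all assumptions. A batch's offset range is logged before processing, and a planned batch without a matching commit is replayed with the same range after a restart.

```python
import json
from pathlib import Path


class OffsetLog:
    """Append-only log of planned and committed offset ranges."""

    def __init__(self, path: Path):
        self.path = path

    def append(self, kind: str, batch_id: int, start: int, end: int) -> None:
        """Append one record; `kind` is "plan" or "commit"."""
        record = {"kind": kind, "batch": batch_id, "start": start, "end": end}
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def records(self) -> list:
        if not self.path.exists():
            return []
        return [json.loads(line) for line in self.path.read_text().splitlines()]

    def next_batch(self, batch_size: int) -> tuple:
        """Return (batch_id, start, end) of the batch to run next.

        A planned batch with no matching commit is returned unchanged,
        so the same offset range is replayed after a crash.
        """
        recs = self.records()
        committed = {r["batch"] for r in recs if r["kind"] == "commit"}
        planned = [r for r in recs if r["kind"] == "plan"]
        if not planned:
            return 0, 0, batch_size
        last = planned[-1]
        if last["batch"] not in committed:
            return last["batch"], last["start"], last["end"]
        return last["batch"] + 1, last["end"], last["end"] + batch_size
```

Replaying an identical offset range is what lets an idempotent sink discard the duplicate attempt.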

Windowed Aggregations

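Events are aggregated in 1-minute tumbling windows with a 10-second watermark: events that arrive late are still counted in their window as long as the watermark (the latest event time seen minus 10 seconds) has not passed the end of that window.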
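
To make the window and watermark arithmetic concrete, here is a small pure-Python model. It is a simplification: it advances the watermark per event, whereas Spark advances it per micro-batch, and the function names are hypothetical. Event times are integer seconds.

```python
from collections import Counter

WINDOW_S = 60     # 1-minute tumbling window, as in this README
WATERMARK_S = 10  # 10-second watermark, as in this README


def window_start(event_time: int) -> int:
    """Start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_S)


def aggregate(event_times: list) -> tuple:
    """Count events per window, processing them in arrival order.

    Returns (counts, dropped). An event is dropped when its window has
    already closed, meaning window end <= max event time seen - watermark.
    """
    counts = Counter()
    dropped = []
    max_seen = 0
    for t in event_times:
        watermark = max_seen - WATERMARK_S
        start = window_start(t)
        if start + WINDOW_S <= watermark:
            dropped.append(t)  # too late: its window is finalized
        else:
            counts[start] += 1
        max_seen = max(max_seen, t)
    return dict(counts), dropped


counts, dropped = aggregate([5, 30, 65, 58, 75, 50])
print(counts)   # {0: 3, 60: 2}
print(dropped)  # [50]
```

The event at 58 arrives after 65 but is still counted, because the watermark is 55 and window [0, 60) is open. The event at 50 arrives after 75, when the watermark is 65, so window [0, 60) is closed and the event is dropped.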

Consumer Groups

Spark assigns each Kafka partition to its own task, so multiple executors consume different partitions in parallel. Read parallelism therefore scales with the partition count (3 for the events-stream topic).
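
The relationship between partition count and parallelism can be shown with a simplified round-robin assignment. This is an illustration only, not Kafka's assignor or Spark's scheduler, and the function name is hypothetical.

```python
def assign_partitions(partitions: list, consumers: list) -> dict:
    """Distribute partitions over consumers in round-robin order."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


# Three partitions, two consumers: one consumer reads two partitions.
print(assign_partitions([0, 1, 2], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1]}

# Three partitions, four consumers: the fourth consumer is idle.
print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"]))
# {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
```

The second case shows why adding consumers beyond the number of partitions does not increase throughput. Scaling further requires more partitions on the topic.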


Key Results

  • Sub-second latency on event processing using Kafka + Spark Structured Streaming
  • Exactly-once semantics via offset checkpointing and write-ahead logs
  • Horizontal scalability through Kafka topic partitioning and consumer groups
