A real-time streaming pipeline using Apache Kafka producers/consumers and Spark Structured Streaming to process event data with sub-second latency. Built with consumer group partitioning, offset checkpointing, and write-ahead logs for exactly-once processing semantics and fault tolerance.
Event Sources
(user clicks, transactions, sensors)
│
▼
┌───────────────────┐
│ Kafka Producer │ ← Publishes events to topic
│ (producer.py) │
└────────┬──────────┘
│ topic: events-stream
▼
┌───────────────────┐
│ Kafka Broker │ ← Partitioned topic (3 partitions)
│ Consumer Group │ ← Parallel consumption
└────────┬──────────┘
│
▼
┌────────────────────────┐
│ Spark Structured │
│ Streaming │
│ (streaming.py) │
│ │
│ • Parse JSON events │
│ • Watermark windowing │
│ • Aggregations │
│ • Offset checkpoint │
└────────┬───────────────┘
│
▼
┌───────────────────┐
│ Output Sink │
│ (Parquet / ADLS) │
└───────────────────┘
| Tool | Purpose |
|---|---|
| Apache Kafka | Event streaming, topic partitioning, consumer groups |
| Spark Structured Streaming | Real-time data processing, windowed aggregations |
| Python (kafka-python) | Kafka producer simulation |
| PySpark | Stream processing engine |
| Parquet | Output sink format |
project3_kafka_streaming/
├── src/
│ ├── producer.py # Kafka event producer (simulates event stream)
│ ├── consumer.py # Basic Kafka consumer (for testing/debugging)
│ └── streaming.py # Spark Structured Streaming pipeline
├── notebooks/
│ └── streaming_walkthrough.ipynb
├── data/
│ └── sample/
│ └── sample_events.json
├── requirements.txt
└── README.md
- Apache Kafka running locally (or cloud-managed)
- Python 3.8+
- Apache Spark 3.x with kafka connector jar
pip install -r requirements.txtdocker run -d --name kafka \
-p 9092:9092 \
-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \
confluentinc/cp-kafka:latestkafka-topics.sh --create \
--topic events-stream \
--partitions 3 \
--replication-factor 1 \
--bootstrap-server localhost:9092python src/producer.pypython src/streaming.pyOffset checkpointing via checkpointLocation ensures no event is processed twice, even after pipeline restarts.
Spark write-ahead logs and Kafka offset management allow the pipeline to resume from the last committed offset after a failure.
Events are aggregated in 1-minute tumbling windows with a 10-second watermark to handle late-arriving data.
Multiple Spark executors consume from different Kafka partitions in parallel, enabling horizontal scalability.
- Sub-second latency on event processing using Kafka + Spark Structured Streaming
- Exactly-once semantics via offset checkpointing and write-ahead logs
- Horizontal scalability through Kafka topic partitioning and consumer groups