# Distributed NetFlow Analyzer + DDoS Detection (Go + Apache Spark/PySpark)
A phased, detailed plan for a resume-grade distributed systems project (Cisco/Microsoft/NVIDIA-friendly).
---
## 0) Project Summary
### Goal
Build a distributed pipeline that ingests **NetFlow-like** network flow records at high volume (generated by Go services), performs **near real-time** analytics using **Apache Spark Structured Streaming (PySpark)**, and detects **DDoS-like anomalies** (fan-in bursts, scans, heavy hitters). Persist results in **Parquet** and present them in a simple dashboard.
### Why this is meaningful
- **Go** for high-throughput producers + “systems” credibility (concurrency, backpressure)
- **Spark** for distributed streaming analytics (partitioning, shuffles, scaling)
- Realistic domain: **network telemetry** and security signals (very Cisco-adjacent)
### Key deliverables
- Multi-node local deployment (Docker Compose): broker + Spark cluster + multiple Go producers
- Streaming analytics job (PySpark) with windows + alert generation
- Experiments doc proving scale + bottleneck understanding
- Clean README + design doc
---
## 1) Tech Stack
### Core
- **Go** (producers + optional alert service)
- Concurrency: goroutines + channels
- Kafka client: `segmentio/kafka-go` (simple) or `confluent-kafka-go` (librdkafka)
- **Apache Spark 3.x** + **PySpark**
- Spark Structured Streaming from Kafka
- Event-time windows + watermarks
- **Kafka-compatible broker**
- **Redpanda** (recommended for local dev) OR Kafka
- **Storage**
- **Parquet** (bronze/silver/gold layout)
- Optional: **Delta Lake** (stretch)
- **Docker + Docker Compose**
- Spark master + workers
- Broker
- Go producer replicas
### Optional polish
- **Streamlit** (Python) dashboard
- **Prometheus + Grafana** (pipeline metrics)
- **OpenTelemetry** (tracing) for producer + alert service
---
## 2) Architecture
```
+-------------------+      +---------------------+
| Go Flow Producers | ---> |   Kafka/Redpanda    |
| (router replicas) |      | topic: netflow.raw  |
+-------------------+      +----------+----------+
                                      |
                                      v
                           +----------------------+
                           |   Spark (PySpark)    |
                           | Structured Streaming |
                           | - parse & enrich     |
                           | - window aggregations|
                           | - DDoS detection     |
                           +----+-----+-----+-----+
                                |     |     |
                                v     v     v
                       Parquet: Bronze Silver Gold
                      +---------------------------+
                      | top_talkers, destinations |
                      | alerts, router_summary    |
                      +---------------------------+
                                   |
                                   v
                      +----------------------+
                      | Dashboard (Streamlit)|
                      +----------------------+
```
---
## 3) Data Model
### NetFlow-like JSON record (event)
Each event is a flow summary (not raw packets):
```json
{
  "ts": "2026-01-29T12:34:56.123Z",
  "router_id": "rtr-03",
  "src_ip": "10.10.1.25",
  "dst_ip": "172.16.4.9",
  "src_port": 52314,
  "dst_port": 443,
  "proto": "TCP",
  "bytes": 8421,
  "packets": 12,
  "tcp_flags": "S",
  "sampling_rate": 1
}
```
### Kafka partition key
- For DDoS fan-in detection (many -> one): key by `dst_ip`
- For scan detection (one -> many): optionally also publish a second topic keyed by `src_ip`
### Spark considerations
- Use event time (`ts`) + a watermark for windowed processing
- Expect data skew during DDoS scenarios (hot destination)
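As a sketch of why the key choice matters: with a stable hash, every flow toward one `dst_ip` lands on the same partition. This is illustrative only (the partition count is an assumption, and Kafka's default partitioner actually uses murmur2, but any stable hash shows the property):

```python
import hashlib

NUM_PARTITIONS = 12  # assumed partition count for the netflow.raw topic

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash -> partition mapping (illustrative; not Kafka's murmur2)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Keying by dst_ip keeps all flows toward one victim on one partition,
# preserving per-destination ordering but concentrating load during DDoS.
assert partition_for("172.16.4.9") == partition_for("172.16.4.9")
```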
---
## 4) Repository Layout
```
netflow-ddos-spark/
  README.md
  docs/
    design.md
    experiments.md
    runbook.md
  docker/
    docker-compose.yml
    spark/
      Dockerfile
    producer/
      Dockerfile
  producer-go/
    cmd/
      producer/
        main.go
    internal/
      config/
        config.go
      netflow/
        model.go
        generator.go
        scenarios.go
      kafka/
        writer.go
      metrics/
        metrics.go   (optional)
  spark-pyspark/
    src/
      streaming_job.py
      schema.py
      enrich.py
      aggregates.py
      detection.py
      sinks.py
    tests/
      test_detection.py
      test_enrich.py
  dashboard/
    app.py
  data/
    bronze/
    silver/
    gold/
    checkpoints/
```
---
## Phase 1: End-to-End (Go -> Kafka -> Spark -> Bronze)
Outcome: Go producer -> broker -> Spark -> Parquet (bronze).
### Tasks
- **Docker Compose**
  - Services:
    - redpanda (or kafka)
    - spark-master
    - spark-worker-1, spark-worker-2
    - producer-go (scale replicas)
  - Create topic `netflow.raw` with N partitions (start with 6–12)
- **Go producer (baseline traffic)**
  - Implement:
    - config: brokers, topic, events/sec, router_id, seed
    - generator: produces realistic IPs/ports/protos
    - Kafka writer: batching + acks
  - Support `--scenario=baseline` initially
- **PySpark streaming ingestion**
  - Read from Kafka topic `netflow.raw`
  - Parse JSON with an explicit schema
  - Convert `ts` to a Spark `timestamp` column `event_time`
  - Write to `data/bronze/netflow/` in Parquet
  - Configure checkpointing under `data/checkpoints/bronze/`
### Acceptance criteria
- Running `docker-compose up` produces continuous Parquet output in `bronze`
- The Spark job survives small restarts thanks to checkpointing
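A pure-Python sketch of the explicit-schema parse step, using the field names from the data model above (the real job would declare a PySpark `StructType` in `schema.py` and use `from_json` + `to_timestamp`):

```python
import json
from datetime import datetime
from typing import Optional

# Field -> type map mirroring what spark-pyspark/src/schema.py would declare
# as a StructType (names taken from the data model section).
NETFLOW_SCHEMA = {
    "ts": str, "router_id": str, "src_ip": str, "dst_ip": str,
    "src_port": int, "dst_port": int, "proto": str,
    "bytes": int, "packets": int, "tcp_flags": str, "sampling_rate": int,
}

def parse_record(raw: bytes) -> Optional[dict]:
    """Parse one Kafka message; reject records that do not match the schema."""
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for field, typ in NETFLOW_SCHEMA.items():
        if not isinstance(rec.get(field), typ):
            return None  # malformed record -> would go to a dead-letter path
    # ts -> event_time, analogous to what the Spark job does with to_timestamp()
    rec["event_time"] = datetime.fromisoformat(rec["ts"].replace("Z", "+00:00"))
    return rec
```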
---
## Phase 2: Silver Enrichment
Outcome: cleaned/enriched dataset ready for analytics.
### Tasks
- Add enrichment in Spark:
  - `src_subnet_24`, `dst_subnet_24`
  - `is_private_src`, `is_private_dst` (optional)
  - `dst_service` (map well-known ports like 80/443/22/53)
  - `pps = packets / flow_duration` (if you simulate duration; optional)
- Write Silver Parquet:
  - `data/silver/netflow_enriched/`
  - partition by `date` and optionally `router_id`
### Acceptance criteria
- Silver data is consistently updated
- Schema is stable and documented in `docs/design.md`
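The enrichment columns can be prototyped in plain Python before porting them to `withColumn` expressions. This sketch assumes the column names listed above; the port-to-service map is an illustrative subset:

```python
import ipaddress

# dst_service mapping for a few well-known ports (illustrative subset)
SERVICES = {80: "http", 443: "https", 22: "ssh", 53: "dns"}

def enrich(rec: dict) -> dict:
    """Compute the Silver columns for one flow record; the PySpark job
    would express the same logic with withColumn expressions."""
    out = dict(rec)
    for side in ("src", "dst"):
        ip = ipaddress.ip_address(rec[f"{side}_ip"])
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        out[f"{side}_subnet_24"] = str(net)
        out[f"is_private_{side}"] = ip.is_private
    out["dst_service"] = SERVICES.get(rec["dst_port"], "other")
    return out

row = enrich({"src_ip": "10.10.1.25", "dst_ip": "172.16.4.9", "dst_port": 443})
assert row["src_subnet_24"] == "10.10.1.0/24"
assert row["dst_service"] == "https"
```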
---
## Phase 3: Gold Metrics
Outcome: real-time analytics tables (top talkers, destinations, router summaries).

Use tumbling windows (e.g., 1 minute) and add watermarks.
### Tasks
- **Top talkers (by `src_ip`)**
  - per window: `sum(bytes)`, `sum(packets)`, `count(*)`
- **Top destinations (by `dst_ip`)**
  - per window: totals + `approx_count_distinct(src_ip)` (unique sources)
- **Router summary**
  - per router/window: totals + top ports
- Write to:
  - `data/gold/top_talkers/`
  - `data/gold/top_destinations/`
  - `data/gold/router_summary/`
- Partition by `date` and `window_start_hour` for query performance
### Acceptance criteria
- Gold outputs update every window
- You can query the latest window to see obvious "top" entities
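A minimal, non-distributed sketch of the tumbling-window aggregation behind the top-destinations table (Spark would use `approx_count_distinct` for unique sources; the exact set here is for illustration only, and the `epoch` field is an assumed epoch-seconds timestamp):

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling 1-minute windows, as suggested above

def window_start(epoch_s: float) -> int:
    """Floor the event time to its tumbling-window boundary."""
    return int(epoch_s // WINDOW_SECONDS) * WINDOW_SECONDS

def top_destinations(events):
    """Per (window_start, dst_ip): byte/packet/flow totals + distinct sources."""
    agg = defaultdict(lambda: {"bytes": 0, "packets": 0, "flows": 0, "srcs": set()})
    for ev in events:
        a = agg[(window_start(ev["epoch"]), ev["dst_ip"])]
        a["bytes"] += ev["bytes"]
        a["packets"] += ev["packets"]
        a["flows"] += 1
        a["srcs"].add(ev["src_ip"])
    return {
        key: {"bytes": v["bytes"], "packets": v["packets"], "flows": v["flows"],
              "unique_sources": len(v["srcs"])}
        for key, v in agg.items()
    }
```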
---
## Phase 4: Alerts Table
Outcome: alert generation from explainable heuristics.
### Detection rules
- **Fan-in DDoS candidate (many sources -> one destination)**
  - For each `(dst_ip, window)`:
    - `unique_sources > S`
    - AND (`packets_total > P` OR `bytes_total > B`)
- **Fan-out scan candidate (one source -> many destinations)**
  - For each `(src_ip, window)`:
    - `unique_destinations > D`
    - AND (optional) `distinct_ports > K`
- **SYN burst heuristic (optional)**
  - If you simulate flags: `syn_only_count / flow_count > R`
### Tasks
- Write `data/gold/alerts/` with:
  - `window_start`, `window_end`
  - `alert_type` (FAN_IN_DDOS, FAN_OUT_SCAN, SYN_BURST)
  - `entity` (dst_ip or src_ip)
  - `severity_score`
  - evidence fields (`unique_sources`, `packets_total`, etc.)
### Acceptance criteria
- Triggering an attack scenario produces alerts within 1–2 windows
- Alerts include clear evidence (counts) for explainability
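The fan-in rule translates directly into a threshold check over the windowed aggregates. The concrete values for S, P, and B below are placeholders for illustration, not tuned recommendations:

```python
# Thresholds from the fan-in rule: S unique sources, P packets, B bytes.
# These concrete values are illustrative placeholders, not tuned settings.
S, P, B = 100, 50_000, 50_000_000

def fan_in_alerts(dest_aggs):
    """dest_aggs: {(window_start, dst_ip): {unique_sources, packets, bytes}}."""
    alerts = []
    for (win, dst_ip), a in dest_aggs.items():
        if a["unique_sources"] > S and (a["packets"] > P or a["bytes"] > B):
            alerts.append({
                "window_start": win,
                "alert_type": "FAN_IN_DDOS",
                "entity": dst_ip,
                # naive severity: how far past the source threshold we are
                "severity_score": a["unique_sources"] / S,
                "evidence": {"unique_sources": a["unique_sources"],
                             "packets_total": a["packets"],
                             "bytes_total": a["bytes"]},
            })
    return alerts
```

The evidence dict is what makes the alert explainable: every field in it is a count the dashboard can display verbatim.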
---
## Phase 5: Scenarios (ddos/scan/flash crowd)
Outcome: controllable traffic patterns (baseline, ddos, scan, flash crowd).
### Scenarios
- **baseline**
- **ddos_fan_in**: many random sources -> one target `dst_ip`
- **scan_fan_out**: one source -> many destinations + ports
- **flash_crowd**: many sources -> a popular `dst_ip:443`, but with a "legit-like" distribution; use this to test false positives
### Tunables (flags/env)
- `EVENTS_PER_SEC`
- `ATTACK_INTENSITY`
- `TARGET_IP`
- `BOTNET_SIZE`
- `DURATION_SECONDS`
### Acceptance criteria
- You can run producer replicas with different scenarios
- The system remains stable under increased event rates
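A hedged sketch of what the `ddos_fan_in` generator could look like, shown in Python for brevity (the actual producer is Go, in `scenarios.go`); the knob names mirror `TARGET_IP` and `BOTNET_SIZE` above, and the field values are arbitrary illustrative choices:

```python
import random

def ddos_fan_in(target_ip: str, botnet_size: int, n_events: int, seed: int = 7):
    """Many random sources -> one target dst_ip (the fan-in attack shape)."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    botnet = [
        f"10.{rng.randrange(256)}.{rng.randrange(256)}.{rng.randrange(1, 255)}"
        for _ in range(botnet_size)
    ]
    for _ in range(n_events):
        yield {
            "src_ip": rng.choice(botnet),
            "dst_ip": target_ip,        # hot key: every flow hits the victim
            "dst_port": 443,
            "proto": "TCP",
            "bytes": rng.randrange(60, 1500),
            "packets": rng.randrange(1, 10),
            "tcp_flags": "S",           # SYN-flood flavour
        }
```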
---
## Phase 6: Skew Mitigation
Outcome: pipeline remains performant under hot keys (critical for DDoS).

During fan-in attacks, `dst_ip = target` becomes a hot key, causing one partition to do most of the work.
### Approaches
- **Key salting + two-stage aggregation**
  - Add `salt = hash(src_ip) % N` for hot destination keys
  - Aggregate by `(dst_ip, salt, window)`, then re-aggregate by `(dst_ip, window)`
- **Two-level aggregation without explicit salt (less control)**
  - Use `repartition` and partial aggregates carefully
### Acceptance criteria
- Under the DDoS scenario, the job does not stall despite massive skew
- Document the before/after effect in `docs/experiments.md`
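The salting idea in miniature: stage 1 spreads the hot key over salt buckets, stage 2 folds the partials back, and the totals must equal a direct single-stage aggregation. The salt count `N_SALTS` is an arbitrary choice for this sketch:

```python
from collections import Counter

N_SALTS = 8  # number of salt buckets (arbitrary for illustration)

def two_stage_bytes(events):
    """Two-stage byte totals per dst_ip: salt, partially aggregate, then fold."""
    stage1 = Counter()
    for ev in events:
        salt = hash(ev["src_ip"]) % N_SALTS      # spreads the hot dst_ip
        stage1[(ev["dst_ip"], salt)] += ev["bytes"]
    stage2 = Counter()
    for (dst_ip, _salt), total in stage1.items():
        stage2[dst_ip] += total                  # fold partials back together
    return stage2

events = [{"src_ip": f"10.0.0.{i}", "dst_ip": "172.16.4.9", "bytes": 100}
          for i in range(1000)]
direct = sum(e["bytes"] for e in events)
assert two_stage_bytes(events)["172.16.4.9"] == direct  # same answer, less skew
```

In Spark the two stages become two `groupBy` passes; the point of the sketch is that correctness is preserved while no single task ever sees the whole hot key.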
---
## Phase 7: Recovery Tests
Outcome: demonstrate fault tolerance and stable outputs.
### Failure drills
- Kill/restart the Spark driver container (job resumes)
- Kill one Spark worker (job continues)
- Restart the broker briefly (job recovers)
### Practices
- Checkpoint directories per sink
- Deterministic output paths
- Discuss idempotency and possible duplicates (honestly) + mitigations
### Acceptance criteria
- The system resumes without manual cleanup
- Alerts do not explode uncontrollably after a restart
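One honest mitigation for duplicate alerts after a restart (at-least-once replay can re-emit the same window) is an idempotent sink keyed on `(window_start, alert_type, entity)`. A minimal sketch:

```python
def dedup_alerts(alerts, seen=None):
    """Drop alerts whose (window_start, alert_type, entity) key was already
    emitted; `seen` persists across calls to survive replayed micro-batches."""
    seen = set() if seen is None else seen
    out = []
    for a in alerts:
        key = (a["window_start"], a["alert_type"], a["entity"])
        if key not in seen:
            seen.add(key)
            out.append(a)
    return out
```

In the real pipeline the `seen` set would live in durable state (or the sink would upsert on the same key) rather than in memory.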
---
## Phase 8: Experiments Report
Outcome: a short but strong experimental report.
### Experiments
- **Scale-out**: 1 vs 2 vs 4 workers; find the maximum sustainable events/sec
- **Window size trade-offs**: 10s vs 60s windows; detection speed vs compute cost
- **Skew impact**: ddos on/off; show processing lag and throughput changes
### Metrics to collect
- events/sec ingested (producer logs)
- micro-batch duration or streaming query progress
- end-to-end detection latency (event_time -> alert_time)
- shuffle-heavy stages (Spark UI notes)
### Acceptance criteria
- `docs/experiments.md` includes tables/plots + a "bottlenecks" section
---
## Phase 9: Dashboard + Demo
Outcome: a 60–90s demo you can show recruiters.
### Dashboard panels
- Live top destinations (last N windows)
- Top talkers
- Alerts feed (sorted by severity)
- Scenario notes (what you turned on)
### Demo script
- Baseline (no alerts)
- Enable ddos_fan_in -> show the spike -> an alert appears
- Switch to scan_fan_out -> scan alerts appear
- Recovery after the scenario ends
### Acceptance criteria
- A clean demo recording + screenshots in the README
---
## Implementation Notes
### Go producer
- Batch writes (e.g., flush every X ms or Y messages)
- Control the rate via a ticker (events/sec) + allow a burst mode
- Give each producer a unique `router_id`
- Log:
  - produced events/sec
  - average batch size
  - publish errors/retries
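The batching-plus-rate-control loop, sketched in Python for brevity (the Go producer would use a `time.Ticker` and a buffered channel; `send` here is a stand-in for the Kafka writer):

```python
import time

def produce(events, rate_per_sec: int, batch_size: int, send):
    """Pace events at roughly rate_per_sec and flush every batch_size
    messages; returns simple stats matching the logging suggested above."""
    interval = 1.0 / rate_per_sec
    batch, stats = [], {"sent": 0, "batches": 0}
    for ev in events:
        batch.append(ev)
        if len(batch) >= batch_size:
            send(batch)                       # one broker write per batch
            stats["sent"] += len(batch)
            stats["batches"] += 1
            batch = []
        time.sleep(interval)                  # crude per-event rate control
    if batch:                                 # flush the final partial batch
        send(batch)
        stats["sent"] += len(batch)
        stats["batches"] += 1
    return stats
```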
### PySpark
- Use event time + a watermark: `withWatermark("event_time", "2 minutes")`
- Choose output modes carefully: aggregations often use `append` with a watermark
- Be explicit about shuffles: `groupBy(window, dst_ip)` causes a shuffle
- Keep the schema explicit and versioned in code
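Watermark semantics in miniature: track the maximum event time seen, and treat events older than that maximum minus the delay as too late. This pure-Python sketch mirrors the intent of `withWatermark("event_time", "2 minutes")`, though Spark applies the watermark per query for state cleanup, not per record as here:

```python
WATERMARK_SECONDS = 120  # mirrors withWatermark("event_time", "2 minutes")

class WatermarkFilter:
    """Admits events within the watermark delay of the max event time seen;
    older events would be dropped from windowed aggregations."""
    def __init__(self, delay_s: int = WATERMARK_SECONDS):
        self.delay_s = delay_s
        self.max_event_time = float("-inf")

    def admit(self, event_time_s: float) -> bool:
        self.max_event_time = max(self.max_event_time, event_time_s)
        return event_time_s >= self.max_event_time - self.delay_s

wm = WatermarkFilter()
assert wm.admit(1000.0)      # first event always admitted
assert wm.admit(900.0)       # 100s late, inside the 120s watermark
assert not wm.admit(500.0)   # 500s late, dropped
```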
---
## Milestone Checklist
- Phase 1: End-to-end (Go -> Kafka -> Spark -> Bronze)
- Phase 2: Silver enrichment
- Phase 3: Gold metrics
- Phase 4: Alerts table
- Phase 5: Scenarios (ddos/scan/flash crowd)
- Phase 6: Skew mitigation
- Phase 7: Recovery tests
- Phase 8: Experiments report
- Phase 9: Dashboard + demo
---
## Resume Bullets
- Built a distributed NetFlow-style analytics pipeline using Go producers and Apache Spark Structured Streaming to process high-volume network telemetry in near real time
- Implemented event-time windowed aggregations and DDoS/scan detection with fault-tolerant checkpointing and partitioned Parquet sinks
- Mitigated hot-key data skew during fan-in attacks using two-stage aggregation (key salting), improving pipeline stability under adversarial traffic patterns
- Benchmarked scaling across cluster sizes and event rates; analyzed shuffle bottlenecks and end-to-end alert latency using Spark streaming metrics
---
## Stretch Goals
- Publish alerts back to a Kafka topic `netflow.alerts`, and build a small Go alert service that deduplicates, rate-limits, and exposes REST/gRPC
- Add approximate heavy-hitter detection (Count-Min Sketch)
- Add pipeline health metrics (Prometheus) + dashboard
- Add “false positive suppression” using rolling baseline (flash crowd vs attack)
---
## Suggested Build Order
- Get Phase 1 end-to-end working (don't optimize early)
- Add Phase 3 aggregations (you’ll learn Spark windows)
- Add one scenario + one detector (Phase 4–5)
- Only then do skew mitigation + experiments
End of plan.