# Distributed NetFlow Analyzer + DDoS Detection (Go + Apache Spark/PySpark)
A phased, detailed plan for a resume-grade distributed systems project (Cisco/Microsoft/NVIDIA-friendly).

---

## 0) Project Summary

### Goal
Build a distributed pipeline that ingests **NetFlow-like** network flow records at high volume (generated by Go services), performs **near real-time** analytics using **Apache Spark Structured Streaming (PySpark)**, and detects **DDoS-like anomalies** (fan-in bursts, scans, heavy hitters). Persist results in **Parquet** and present them in a simple dashboard.

### Why this is meaningful
- **Go** for high-throughput producers + “systems” credibility (concurrency, backpressure)
- **Spark** for distributed streaming analytics (partitioning, shuffles, scaling)
- Realistic domain: **network telemetry** and security signals (very Cisco-adjacent)

### Key deliverables
- Multi-node local deployment (Docker Compose): broker + Spark cluster + multiple Go producers
- Streaming analytics job (PySpark) with windows + alert generation
- Experiments doc proving scale + bottleneck understanding
- Clean README + design doc

---

## 1) Tech Stack

### Core
- **Go** (producers + optional alert service)
  - Concurrency: goroutines + channels
  - Kafka client: `segmentio/kafka-go` (simple) or `confluent-kafka-go` (librdkafka)
- **Apache Spark 3.x** + **PySpark**
  - Spark Structured Streaming from Kafka
  - Event-time windows + watermarks
- **Kafka-compatible broker**
  - **Redpanda** (recommended for local dev) OR Kafka
- **Storage**
  - **Parquet** (bronze/silver/gold layout)
  - Optional: **Delta Lake** (stretch)
- **Docker + Docker Compose**
  - Spark master + workers
  - Broker
  - Go producer replicas

### Optional polish
- **Streamlit** (Python) dashboard
- **Prometheus + Grafana** (pipeline metrics)
- **OpenTelemetry** (tracing) for producer + alert service

---

## 2) Architecture

```
+-------------------+      +----------------------+
| Go Flow Producers | ---> |    Kafka/Redpanda    |
| (router replicas) |      |  topic: netflow.raw  |
+-------------------+      +----------+-----------+
                                      |
                                      v
                          +----------------------+
                          |   Spark (PySpark)    |
                          | Structured Streaming |
                          | - parse & enrich     |
                          | - window aggregations|
                          | - DDoS detection     |
                          +----+------+------+---+
                               |      |      |
                               v      v      v
                       Parquet: Bronze / Silver / Gold
                      +---------------------------+
                      | top_talkers, destinations |
                      | alerts, router_summary    |
                      +---------------------------+
                                    |
                                    v
                         +----------------------+
                         | Dashboard (Streamlit)|
                         +----------------------+
```


---

## 3) Data Model

### NetFlow-like JSON record (event)
Each event is a flow summary (not raw packets):

```json
{
  "ts": "2026-01-29T12:34:56.123Z",
  "router_id": "rtr-03",
  "src_ip": "10.10.1.25",
  "dst_ip": "172.16.4.9",
  "src_port": 52314,
  "dst_port": 443,
  "proto": "TCP",
  "bytes": 8421,
  "packets": 12,
  "tcp_flags": "S",
  "sampling_rate": 1
}
```

### Key design choices (for distributed performance)

- **Kafka partition key**
  - For DDoS fan-in detection (many -> one): key by `dst_ip`
  - For scan detection (one -> many): optionally also publish a second topic keyed by `src_ip`
- **Spark**
  - Use event time (`ts`) plus a watermark for windowed processing
  - Expect data skew during DDoS scenarios (the target destination becomes a hot key)

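For reference, a matching explicit schema as it might appear in `spark-pyspark/src/schema.py`; the field types are a reasonable guess from the record above, so adjust them if the producer changes:

```python
# schema.py -- explicit schema for netflow.raw JSON records (sketch)
from pyspark.sql.types import (
    IntegerType, LongType, StringType, StructField, StructType
)

NETFLOW_SCHEMA = StructType([
    StructField("ts", StringType()),        # ISO-8601 string; converted to event_time later
    StructField("router_id", StringType()),
    StructField("src_ip", StringType()),
    StructField("dst_ip", StringType()),
    StructField("src_port", IntegerType()),
    StructField("dst_port", IntegerType()),
    StructField("proto", StringType()),
    StructField("bytes", LongType()),
    StructField("packets", LongType()),
    StructField("tcp_flags", StringType()),
    StructField("sampling_rate", IntegerType()),
])
```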
## 4) Repo Structure (suggested)

```
netflow-ddos-spark/
  README.md
  docs/
    design.md
    experiments.md
    runbook.md
  docker/
    docker-compose.yml
    spark/
      Dockerfile
    producer/
      Dockerfile
  producer-go/
    cmd/
      producer/
        main.go
    internal/
      config/
        config.go
      netflow/
        model.go
        generator.go
        scenarios.go
      kafka/
        writer.go
      metrics/
        metrics.go   (optional)
  spark-pyspark/
    src/
      streaming_job.py
      schema.py
      enrich.py
      aggregates.py
      detection.py
      sinks.py
    tests/
      test_detection.py
      test_enrich.py
  dashboard/
    app.py
  data/
    bronze/
    silver/
    gold/
    checkpoints/
```

## 5) Phases (Step-by-step Plan)

### Phase 1 — Infrastructure + End-to-End MVP

**Outcome:** Go producer -> broker -> Spark -> Parquet (bronze).

**Tasks**

1. **Docker Compose**
   - Services:
     - `redpanda` (or `kafka`)
     - `spark-master`
     - `spark-worker-1`, `spark-worker-2`
     - `producer-go` (scaled replicas)
   - Create the topic `netflow.raw` with N partitions (start with 6–12)
2. **Go producer (baseline traffic)**
   - Implement:
     - config: brokers, topic, events/sec, `router_id`, seed
     - generator: produces realistic IPs/ports/protocols
     - Kafka writer: batching + acks
   - Support `--scenario=baseline` initially
3. **PySpark streaming ingestion** (a sketch follows this list)
   - Read from the Kafka topic `netflow.raw`
   - Parse the JSON with an explicit schema
   - Convert `ts` into a Spark timestamp column `event_time`
   - Write to `data/bronze/netflow/` in Parquet
   - Configure checkpointing: `data/checkpoints/bronze/`

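A minimal sketch of the bronze ingestion job, assuming the `NETFLOW_SCHEMA` from the data-model section lives in `schema.py`, the broker is reachable as `redpanda:9092`, and the `spark-sql-kafka-0-10` package is on the Spark classpath (adjust names to your Compose setup):

```python
# streaming_job.py -- Kafka -> bronze Parquet (sketch)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_timestamp

from schema import NETFLOW_SCHEMA  # assumed module, see the data-model sketch

spark = SparkSession.builder.appName("netflow-bronze").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "redpanda:9092")  # assumed broker address
       .option("subscribe", "netflow.raw")
       .option("startingOffsets", "latest")
       .load())

flows = (raw
         .select(from_json(col("value").cast("string"), NETFLOW_SCHEMA).alias("f"))
         .select("f.*")
         # ISO-8601 timestamps usually parse with the default format;
         # pass an explicit format string if your Spark version needs one.
         .withColumn("event_time", to_timestamp(col("ts"))))

query = (flows.writeStream
         .format("parquet")
         .option("path", "data/bronze/netflow/")
         .option("checkpointLocation", "data/checkpoints/bronze/")
         .outputMode("append")
         .start())

query.awaitTermination()
```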
**Acceptance criteria**

- Running `docker-compose up` produces continuous Parquet output in bronze
- The Spark job survives brief restarts thanks to checkpointing

### Phase 2 — Enrichment + Silver Table

**Outcome:** a cleaned, enriched dataset ready for analytics.

**Tasks**

1. Add enrichment in Spark (a sketch follows this list):
   - `src_subnet_24`, `dst_subnet_24`
   - `is_private_src`, `is_private_dst` (optional)
   - `dst_service` (map well-known ports such as 80/443/22/53)
   - `pps = packets / flow_duration` (only if you simulate duration; optional)
2. Write the Silver Parquet table:
   - `data/silver/netflow_enriched/`
   - partition by date and optionally `router_id`

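One possible shape for `enrich.py`; the port-to-service map and the column derivations are illustrative, not prescriptive:

```python
# enrich.py -- silver-level enrichment (sketch)
from pyspark.sql import Column, DataFrame
from pyspark.sql.functions import col, concat, lit, regexp_extract, when

# Illustrative well-known ports; extend as needed.
SERVICE_BY_PORT = {80: "http", 443: "https", 22: "ssh", 53: "dns"}

def _subnet_24(ip_col: str) -> Column:
    # "10.10.1.25" -> "10.10.1.0/24" (IPv4 only; IPv6 is out of scope here)
    return concat(regexp_extract(col(ip_col), r"^(\d+\.\d+\.\d+)\.", 1), lit(".0/24"))

def _dst_service() -> Column:
    # Fold the port map into a CASE WHEN chain; unknown ports become "other".
    expr: Column = lit("other")
    for port, name in SERVICE_BY_PORT.items():
        expr = when(col("dst_port") == port, lit(name)).otherwise(expr)
    return expr

def enrich(flows: DataFrame) -> DataFrame:
    return (flows
            .withColumn("src_subnet_24", _subnet_24("src_ip"))
            .withColumn("dst_subnet_24", _subnet_24("dst_ip"))
            .withColumn("dst_service", _dst_service()))
```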
**Acceptance criteria**

- Silver data is consistently updated
- The schema is stable and documented in `docs/design.md`

### Phase 3 — Windowed Aggregations (Gold Metrics)

**Outcome:** real-time analytics tables (top talkers, top destinations, router summaries).

**Metrics (Gold)**

Use tumbling windows (e.g., 1 minute) and add watermarks. A sketch of one aggregation follows the layout list below.

1. **Top Talkers** (by `src_ip`)
   - per window: `sum(bytes)`, `sum(packets)`, `count(*)`
2. **Top Destinations** (by `dst_ip`)
   - per window: totals + `approx_count_distinct(src_ip)` (unique sources)
3. **Router Summary**
   - per router/window: totals + top ports

**Data layout**

- Write to:
  - `data/gold/top_talkers/`
  - `data/gold/top_destinations/`
  - `data/gold/router_summary/`
- Partition by date and `window_start_hour` for query performance

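A sketch of the top-destinations aggregate in `aggregates.py`, using the 1-minute tumbling window and the 2-minute watermark suggested in section 6; the write path and checkpoint location are placeholders:

```python
# aggregates.py -- gold: top destinations per 1-minute window (sketch)
from pyspark.sql import DataFrame
from pyspark.sql.functions import (
    approx_count_distinct, col, count, sum as sum_, window
)

def top_destinations(enriched: DataFrame) -> DataFrame:
    return (enriched
            .withWatermark("event_time", "2 minutes")
            .groupBy(window(col("event_time"), "1 minute"), col("dst_ip"))
            .agg(sum_("bytes").alias("bytes_total"),
                 sum_("packets").alias("packets_total"),
                 count("*").alias("flow_count"),
                 approx_count_distinct("src_ip").alias("unique_sources")))

# Writing works in append mode because of the watermark, e.g.:
#   top_destinations(enriched).writeStream.format("parquet")
#       .option("path", "data/gold/top_destinations/")
#       .option("checkpointLocation", "data/checkpoints/gold_top_destinations/")
#       .outputMode("append").start()
```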
**Acceptance criteria**

- Gold outputs update every window
- You can query the latest window and see the obvious “top” entities

### Phase 4 — Detection v1 (DDoS + Scans)

**Outcome:** alert generation from explainable heuristics.

**Detection rules (start with these three)**

1. **Fan-in DDoS candidate** (many sources -> one destination)
   - For each `(dst_ip, window)`:
     - `unique_sources > S`
     - AND (`packets_total > P` OR `bytes_total > B`)
2. **Fan-out scan candidate** (one source -> many destinations)
   - For each `(src_ip, window)`:
     - `unique_destinations > D`
     - AND optionally `distinct_ports > K`
3. **SYN burst heuristic** (optional)
   - If you simulate flags:
     - `syn_only_count / flow_count > R`

**Alerts output (Gold)**

Write `data/gold/alerts/` with the following fields (a sketch of rule 1 follows this list):

- `window_start`, `window_end`
- `alert_type` (`FAN_IN_DDOS`, `FAN_OUT_SCAN`, `SYN_BURST`)
- `entity` (`dst_ip` or `src_ip`)
- `severity_score`
- evidence fields (`unique_sources`, `packets_total`, etc.)

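One way to express rule 1 over the Phase 3 top-destinations aggregate; the thresholds `S`, `P`, `B` and the severity formula are placeholders that should live in config:

```python
# detection.py -- fan-in DDoS candidate rule (sketch)
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, lit

# Illustrative thresholds; tune them per experiment.
S, P, B = 200, 50_000, 50_000_000

def fan_in_alerts(top_destinations: DataFrame) -> DataFrame:
    return (top_destinations
            .where((col("unique_sources") > S) &
                   ((col("packets_total") > P) | (col("bytes_total") > B)))
            .select(col("window.start").alias("window_start"),
                    col("window.end").alias("window_end"),
                    lit("FAN_IN_DDOS").alias("alert_type"),
                    col("dst_ip").alias("entity"),
                    # crude severity: how far past the unique-source threshold we are
                    (col("unique_sources") / lit(S)).alias("severity_score"),
                    col("unique_sources"),
                    col("packets_total"),
                    col("bytes_total")))
```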
**Acceptance criteria**

- Triggering an attack scenario produces alerts within 1–2 windows
- Alerts include clear evidence (counts) for explainability

### Phase 5 — Go Producer Scenarios (Realistic Workloads)

**Outcome:** controllable traffic patterns (baseline, DDoS, scan, flash crowd).

**Implement in Go (`scenarios.go`)**

1. `baseline`
2. `ddos_fan_in`
   - many random sources -> one target `dst_ip`
3. `scan_fan_out`
   - one source -> many destinations + ports
4. `flash_crowd`
   - many sources -> a popular `dst_ip:443`, but with a “legit-like” distribution
   - use this to test false positives

**Control knobs**

- `EVENTS_PER_SEC`
- `ATTACK_INTENSITY`
- `TARGET_IP`
- `BOTNET_SIZE`
- `DURATION_SECONDS`

**Acceptance criteria**

- You can run producer replicas with different scenarios
- The system remains stable under increased event rates

### Phase 6 — Data Skew + Heavy Hitters (Optimization Phase)

**Outcome:** the pipeline remains performant under hot keys (critical for DDoS).

**Problem**

During fan-in attacks, `dst_ip = target` becomes a hot key, so one partition does most of the work.

**Mitigation options (implement at least one; a sketch of option 1 follows)**

1. **Key salting + two-stage aggregation**
   - Add `salt = hash(src_ip) % N` for hot destination keys
   - Aggregate by `(dst_ip, salt, window)`, then re-aggregate by `(dst_ip, window)`
2. **Two-level aggregation without an explicit salt** (less control)
   - Use `repartition` and partial aggregates carefully

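A sketch of option 1, assuming `enriched` is the streaming DataFrame from Phase 2. The first (streaming) stage spreads each destination across `N_SALTS` sub-keys; the second stage collapses them back per micro-batch via `foreachBatch`, which avoids chaining two stateful streaming aggregations in one query:

```python
# skew mitigation: salt + two-stage aggregation (sketch)
from pyspark.sql import DataFrame
from pyspark.sql.functions import (
    abs as abs_, approx_count_distinct, col, hash as hash_, sum as sum_, window
)

N_SALTS = 16  # placeholder fan-out for hot destination keys

def start_salted_top_destinations(enriched: DataFrame):
    # Stage 1 (streaming): spread each dst_ip over N_SALTS sub-keys so the
    # shuffle for a hot destination lands on many tasks instead of one.
    partial = (enriched
               .withWatermark("event_time", "2 minutes")
               .withColumn("salt", abs_(hash_(col("src_ip"))) % N_SALTS)
               .groupBy(window(col("event_time"), "1 minute"),
                        col("dst_ip"), col("salt"))
               .agg(sum_("bytes").alias("bytes_partial"),
                    sum_("packets").alias("packets_partial"),
                    approx_count_distinct("src_ip").alias("sources_partial")))

    # Stage 2 (per micro-batch): collapse the salts back to one row per (window, dst_ip).
    # Summing sources_partial is valid because each src_ip hashes to exactly one salt
    # bucket, and append mode only emits rows once their window is closed.
    def collapse_and_write(batch_df, batch_id):
        (batch_df.groupBy("window", "dst_ip")
         .agg(sum_("bytes_partial").alias("bytes_total"),
              sum_("packets_partial").alias("packets_total"),
              sum_("sources_partial").alias("unique_sources"))
         .write.mode("append").parquet("data/gold/top_destinations/"))

    return (partial.writeStream
            .foreachBatch(collapse_and_write)
            .option("checkpointLocation", "data/checkpoints/gold_top_destinations_salted/")
            .outputMode("append")
            .start())
```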
**Acceptance criteria**

- Under the DDoS scenario, the job does not stall despite the heavy skew
- Document the before/after effect in `docs/experiments.md`

### Phase 7 — Reliability & Recovery

**Outcome:** demonstrate fault tolerance and stable outputs.

**Tests to run**

- Kill/restart the Spark driver container (the job resumes)
- Kill one Spark worker (the job continues)
- Restart the broker briefly (the job recovers)

**Ensure**

- One checkpoint directory per sink (see the sketch below)
- Deterministic output paths
- An honest discussion of idempotency and possible duplicates, plus mitigations

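What “one checkpoint directory per sink” looks like in practice, assuming `flows` and `alerts` are the streaming DataFrames built in earlier phases; paths and query names are suggestions:

```python
# sinks.py -- one checkpoint directory per streaming sink (sketch)
from pyspark.sql import DataFrame, SparkSession

def start_sinks(spark: SparkSession, flows: DataFrame, alerts: DataFrame) -> None:
    (flows.writeStream
     .queryName("bronze_netflow")
     .format("parquet")
     .option("path", "data/bronze/netflow/")
     .option("checkpointLocation", "data/checkpoints/bronze/")
     .outputMode("append")
     .start())

    (alerts.writeStream
     .queryName("gold_alerts")
     .format("parquet")
     .option("path", "data/gold/alerts/")
     # Never share a checkpoint directory between queries.
     .option("checkpointLocation", "data/checkpoints/gold_alerts/")
     .outputMode("append")
     .start())

    # Block until any query stops; on restart, each query recovers from its own checkpoint.
    spark.streams.awaitAnyTermination()
```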
**Acceptance criteria**

- The system resumes without manual cleanup
- Alerts do not explode uncontrollably after a restart

### Phase 8 — Evaluation & Experiments (Make it Resume-Grade)

**Outcome:** a short but strong experimental report.

**Experiments**

1. **Scale-out**
   - 1 vs 2 vs 4 workers
   - find the maximum sustainable events/sec
2. **Window size trade-offs**
   - 10s vs 60s windows
   - detection speed vs compute cost
3. **Skew impact**
   - DDoS scenario on vs off
   - show the change in processing lag and throughput

**Metrics to collect**

- events/sec ingested (producer logs)
- micro-batch duration / streaming query progress
- end-to-end detection latency (event_time -> alert_time)
- shuffle-heavy stages (Spark UI notes)

**Acceptance criteria**

- `docs/experiments.md` includes tables/plots and a “bottlenecks” section

### Phase 9 — Dashboard + Demo

**Outcome:** a 60–90 second demo you can show recruiters.

**Streamlit pages** (a minimal sketch follows this list)

- Live top destinations (last N windows)
- Top talkers
- Alerts feed (sorted by severity)
- Scenario notes (what you turned on)

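A minimal `dashboard/app.py` sketch for the alerts page, assuming the alerts gold table is written with flat `window_start` and `severity_score` columns as in Phase 4, and that Streamlit and pyarrow are installed:

```python
# dashboard/app.py -- minimal alerts feed over the gold Parquet (sketch)
import pandas as pd
import streamlit as st

st.title("NetFlow DDoS Analyzer")

# pandas + pyarrow can read a directory of Parquet files directly.
alerts = pd.read_parquet("data/gold/alerts/")

st.subheader("Alerts feed (newest and most severe first)")
st.dataframe(
    alerts.sort_values(["window_start", "severity_score"], ascending=False).head(100)
)
```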
**Demo flow**

1. Baseline (no alerts)
2. Enable `ddos_fan_in` -> show the spike -> an alert appears
3. Switch to `scan_fan_out` -> scan alerts appear
4. Recovery after the scenario ends

**Acceptance criteria**

- A clean demo recording + screenshots in the README

## 6) Implementation Guidelines (Important Details)

### Go Producer (throughput + correctness)

- Batch writes (e.g., flush every X ms or every Y messages)
- Control the rate via a ticker (events/sec) and allow a burst mode
- Give each producer a unique `router_id`
- Log:
  - produced events/sec
  - average batch size
  - publish errors/retries

### Spark Streaming (correctness + performance)

- Use event time + a watermark:
  - `withWatermark("event_time", "2 minutes")`
- Choose output modes carefully:
  - aggregations typically use `append` with a watermark
- Be explicit about shuffles:
  - `groupBy(window, dst_ip)` causes a shuffle
- Keep the schema explicit and versioned in code


## 7) Milestones Checklist

- Phase 1: End-to-end (Go -> Kafka -> Spark -> Bronze)
- Phase 2: Silver enrichment
- Phase 3: Gold metrics
- Phase 4: Alerts table
- Phase 5: Scenarios (DDoS / scan / flash crowd)
- Phase 6: Skew mitigation
- Phase 7: Recovery tests
- Phase 8: Experiments report
- Phase 9: Dashboard + demo

## 8) Resume Bullet Templates (Go + Spark)

- Built a distributed NetFlow-style analytics pipeline using Go producers and Apache Spark Structured Streaming to process high-volume network telemetry in near real time
- Implemented event-time windowed aggregations and DDoS/scan detection with fault-tolerant checkpointing and partitioned Parquet sinks
- Mitigated hot-key data skew during fan-in attacks using two-stage aggregation (key salting), improving pipeline stability under adversarial traffic patterns
- Benchmarked scaling across cluster sizes and event rates; analyzed shuffle bottlenecks and end-to-end alert latency using Spark streaming metrics

## 9) Stretch Goals (pick 1–2 max)

- Publish alerts back to a Kafka topic `netflow.alerts`, and build a small Go alert service that deduplicates, rate-limits, and exposes REST/gRPC
- Add approximate heavy-hitter detection (Count-Min Sketch)
- Add pipeline health metrics (Prometheus) + dashboard
- Add “false positive suppression” using a rolling baseline (flash crowd vs attack)

## 10) When You Start Later (Recommended Order)

1. Get Phase 1 working end-to-end (don’t optimize early)
2. Add the Phase 3 aggregations (you’ll learn Spark windows)
3. Add one scenario + one detector (Phases 4–5)
4. Only then do skew mitigation + experiments

End of plan.

