# Distributed NetFlow Analyzer + DDoS Detection (Go + Apache Spark/PySpark)
A phased, detailed plan for a resume-grade distributed systems project (Cisco/Microsoft/NVIDIA-friendly).
---
## 0) Project Summary
### Goal
Build a distributed pipeline that ingests **NetFlow-like** network flow records at high volume (generated by Go services), performs **near real-time** analytics using **Apache Spark Structured Streaming (PySpark)**, and detects **DDoS-like anomalies** (fan-in bursts, scans, heavy hitters). Persist results in **Parquet** and present them in a simple dashboard.
### Why this is meaningful
- **Go** for high-throughput producers + “systems” credibility (concurrency, backpressure)
- **Spark** for distributed streaming analytics (partitioning, shuffles, scaling)
- Realistic domain: **network telemetry** and security signals (very Cisco-adjacent)
### Key deliverables
- Multi-node local deployment (Docker Compose): broker + Spark cluster + multiple Go producers
- Streaming analytics job (PySpark) with windows + alert generation
- Experiments doc proving scale + bottleneck understanding
- Clean README + design doc
---
## 1) Tech Stack
### Core
- **Go** (producers + optional alert service)
- Concurrency: goroutines + channels
- Kafka client: `segmentio/kafka-go` (simple) or `confluent-kafka-go` (librdkafka)
- **Apache Spark 3.x** + **PySpark**
- Spark Structured Streaming from Kafka
- Event-time windows + watermarks
- **Kafka-compatible broker**
- **Redpanda** (recommended for local dev) OR Kafka
- **Storage**
- **Parquet** (bronze/silver/gold layout)
- Optional: **Delta Lake** (stretch)
- **Docker + Docker Compose**
- Spark master + workers
- Broker
- Go producer replicas
### Optional polish
- **Streamlit** (Python) dashboard
- **Prometheus + Grafana** (pipeline metrics)
- **OpenTelemetry** (tracing) for producer + alert service
---
## 2) Architecture
```
+-------------------+      +---------------------+
| Go Flow Producers | ---> |   Kafka/Redpanda    |
| (router replicas) |      | topic: netflow.raw  |
+-------------------+      +----------+----------+
                                      |
                                      v
                           +----------------------+
                           |   Spark (PySpark)    |
                           | Structured Streaming |
                           | - parse & enrich     |
                           | - window aggregations|
                           | - DDoS detection     |
                           +----+-----+-----+-----+
                                |     |     |
                                v     v     v
                       Parquet: Bronze Silver Gold
                      +---------------------------+
                      | top_talkers, destinations |
                      | alerts, router_summary    |
                      +---------------------------+
                                   |
                                   v
                      +----------------------+
                      | Dashboard (Streamlit)|
                      +----------------------+
```
---
## 3) Data Model
### NetFlow-like JSON record (event)
Each event is a flow summary (not raw packets):
```json
{
  "ts": "2026-01-29T12:34:56.123Z",
  "router_id": "rtr-03",
  "src_ip": "10.10.1.25",
  "dst_ip": "172.16.4.9",
  "src_port": 52314,
  "dst_port": 443,
  "proto": "TCP",
  "bytes": 8421,
  "packets": 12,
  "tcp_flags": "S",
  "sampling_rate": 1
}
```
### Kafka partition key
- For DDoS fan-in detection (many -> one): key by `dst_ip`
- For scan detection (one -> many): optionally also publish a second topic keyed by `src_ip`
### Spark considerations
- Use event time (`ts`) + a watermark for windowed processing
- Expect data skew during DDoS scenarios (hot destination)
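As a sketch of why the key choice matters: with a stable hash, every flow toward one `dst_ip` lands on the same partition. This is illustrative only (the partition count is an assumption, and Kafka's default partitioner actually uses murmur2, but any stable hash shows the property):

```python
import hashlib

NUM_PARTITIONS = 12  # assumed partition count for the netflow.raw topic

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash -> partition mapping (illustrative; not Kafka's murmur2)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Keying by dst_ip keeps all flows toward one victim on one partition,
# preserving per-destination ordering but concentrating load during DDoS.
assert partition_for("172.16.4.9") == partition_for("172.16.4.9")
```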
---
## 4) Repository Layout
```
netflow-ddos-spark/
  README.md
  docs/
    design.md
    experiments.md
    runbook.md
  docker/
    docker-compose.yml
    spark/
      Dockerfile
    producer/
      Dockerfile
  producer-go/
    cmd/
      producer/
        main.go
    internal/
      config/
        config.go
      netflow/
        model.go
        generator.go
        scenarios.go
      kafka/
        writer.go
      metrics/
        metrics.go   (optional)
  spark-pyspark/
    src/
      streaming_job.py
      schema.py
      enrich.py
      aggregates.py
      detection.py
      sinks.py
    tests/
      test_detection.py
      test_enrich.py
  dashboard/
    app.py
  data/
    bronze/
    silver/
    gold/
    checkpoints/
```
---
## Phase 1: End-to-End (Go -> Kafka -> Spark -> Bronze)
Outcome: Go producer -> broker -> Spark -> Parquet (bronze).
### Tasks
- **Docker Compose**
  - Services:
    - redpanda (or kafka)
    - spark-master
    - spark-worker-1, spark-worker-2
    - producer-go (scale replicas)
  - Create topic `netflow.raw` with N partitions (start with 6–12)
- **Go producer (baseline traffic)**
  - Implement:
    - config: brokers, topic, events/sec, router_id, seed
    - generator: produces realistic IPs/ports/protos
    - Kafka writer: batching + acks
  - Support `--scenario=baseline` initially
- **PySpark streaming ingestion**
  - Read from Kafka topic `netflow.raw`
  - Parse JSON with an explicit schema
  - Convert `ts` to a Spark `timestamp` column `event_time`
  - Write to `data/bronze/netflow/` in Parquet
  - Configure checkpointing under `data/checkpoints/bronze/`
### Acceptance criteria
- Running `docker-compose up` produces continuous Parquet output in `bronze`
- The Spark job survives small restarts thanks to checkpointing
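A pure-Python sketch of the explicit-schema parse step, using the field names from the data model above (the real job would declare a PySpark `StructType` in `schema.py` and use `from_json` + `to_timestamp`):

```python
import json
from datetime import datetime
from typing import Optional

# Field -> type map mirroring what spark-pyspark/src/schema.py would declare
# as a StructType (names taken from the data model section).
NETFLOW_SCHEMA = {
    "ts": str, "router_id": str, "src_ip": str, "dst_ip": str,
    "src_port": int, "dst_port": int, "proto": str,
    "bytes": int, "packets": int, "tcp_flags": str, "sampling_rate": int,
}

def parse_record(raw: bytes) -> Optional[dict]:
    """Parse one Kafka message; reject records that do not match the schema."""
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError:
        return None
    for field, typ in NETFLOW_SCHEMA.items():
        if not isinstance(rec.get(field), typ):
            return None  # malformed record -> would go to a dead-letter path
    # ts -> event_time, analogous to what the Spark job does with to_timestamp()
    rec["event_time"] = datetime.fromisoformat(rec["ts"].replace("Z", "+00:00"))
    return rec
```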
---
## Phase 2: Silver Enrichment
Outcome: cleaned/enriched dataset ready for analytics.
### Tasks
- Add enrichment in Spark:
  - `src_subnet_24`, `dst_subnet_24`
  - `is_private_src`, `is_private_dst` (optional)
  - `dst_service` (map well-known ports like 80/443/22/53)
  - `pps = packets / flow_duration` (if you simulate duration; optional)
- Write Silver Parquet:
  - `data/silver/netflow_enriched/`
  - partition by `date` and optionally `router_id`
### Acceptance criteria
- Silver data is consistently updated
- Schema is stable and documented in `docs/design.md`
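The enrichment columns can be prototyped in plain Python before porting them to `withColumn` expressions. This sketch assumes the column names listed above; the port-to-service map is an illustrative subset:

```python
import ipaddress

# dst_service mapping for a few well-known ports (illustrative subset)
SERVICES = {80: "http", 443: "https", 22: "ssh", 53: "dns"}

def enrich(rec: dict) -> dict:
    """Compute the Silver columns for one flow record; the PySpark job
    would express the same logic with withColumn expressions."""
    out = dict(rec)
    for side in ("src", "dst"):
        ip = ipaddress.ip_address(rec[f"{side}_ip"])
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        out[f"{side}_subnet_24"] = str(net)
        out[f"is_private_{side}"] = ip.is_private
    out["dst_service"] = SERVICES.get(rec["dst_port"], "other")
    return out

row = enrich({"src_ip": "10.10.1.25", "dst_ip": "172.16.4.9", "dst_port": 443})
assert row["src_subnet_24"] == "10.10.1.0/24"
assert row["dst_service"] == "https"
```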
---
## Phase 3: Gold Metrics
Outcome: real-time analytics tables (top talkers, destinations, router summaries).

Use tumbling windows (e.g., 1 minute) and add watermarks.
### Tasks
- **Top talkers (by `src_ip`)**
  - per window: `sum(bytes)`, `sum(packets)`, `count(*)`
- **Top destinations (by `dst_ip`)**
  - per window: totals + `approx_count_distinct(src_ip)` (unique sources)
- **Router summary**
  - per router/window: totals + top ports
- Write to:
  - `data/gold/top_talkers/`
  - `data/gold/top_destinations/`
  - `data/gold/router_summary/`
- Partition by `date` and `window_start_hour` for query performance
### Acceptance criteria
- Gold outputs update every window
- You can query the latest window to see obvious "top" entities
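A minimal, non-distributed sketch of the tumbling-window aggregation behind the top-destinations table (Spark would use `approx_count_distinct` for unique sources; the exact set here is for illustration only, and the `epoch` field is an assumed epoch-seconds timestamp):

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling 1-minute windows, as suggested above

def window_start(epoch_s: float) -> int:
    """Floor the event time to its tumbling-window boundary."""
    return int(epoch_s // WINDOW_SECONDS) * WINDOW_SECONDS

def top_destinations(events):
    """Per (window_start, dst_ip): byte/packet/flow totals + distinct sources."""
    agg = defaultdict(lambda: {"bytes": 0, "packets": 0, "flows": 0, "srcs": set()})
    for ev in events:
        a = agg[(window_start(ev["epoch"]), ev["dst_ip"])]
        a["bytes"] += ev["bytes"]
        a["packets"] += ev["packets"]
        a["flows"] += 1
        a["srcs"].add(ev["src_ip"])
    return {
        key: {"bytes": v["bytes"], "packets": v["packets"], "flows": v["flows"],
              "unique_sources": len(v["srcs"])}
        for key, v in agg.items()
    }
```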
---
## Phase 4: Alerts Table
Outcome: alert generation from explainable heuristics.
### Detection rules
- **Fan-in DDoS candidate (many sources -> one destination)**
  - For each `(dst_ip, window)`:
    - `unique_sources > S`
    - AND (`packets_total > P` OR `bytes_total > B`)
- **Fan-out scan candidate (one source -> many destinations)**
  - For each `(src_ip, window)`:
    - `unique_destinations > D`
    - AND (optional) `distinct_ports > K`
- **SYN burst heuristic (optional)**
  - If you simulate flags: `syn_only_count / flow_count > R`
### Tasks
- Write `data/gold/alerts/` with:
  - `window_start`, `window_end`
  - `alert_type` (FAN_IN_DDOS, FAN_OUT_SCAN, SYN_BURST)
  - `entity` (dst_ip or src_ip)
  - `severity_score`
  - evidence fields (`unique_sources`, `packets_total`, etc.)
### Acceptance criteria
- Triggering an attack scenario produces alerts within 1–2 windows
- Alerts include clear evidence (counts) for explainability
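The fan-in rule translates directly into a threshold check over the windowed aggregates. The concrete values for S, P, and B below are placeholders for illustration, not tuned recommendations:

```python
# Thresholds from the fan-in rule: S unique sources, P packets, B bytes.
# These concrete values are illustrative placeholders, not tuned settings.
S, P, B = 100, 50_000, 50_000_000

def fan_in_alerts(dest_aggs):
    """dest_aggs: {(window_start, dst_ip): {unique_sources, packets, bytes}}."""
    alerts = []
    for (win, dst_ip), a in dest_aggs.items():
        if a["unique_sources"] > S and (a["packets"] > P or a["bytes"] > B):
            alerts.append({
                "window_start": win,
                "alert_type": "FAN_IN_DDOS",
                "entity": dst_ip,
                # naive severity: how far past the source threshold we are
                "severity_score": a["unique_sources"] / S,
                "evidence": {"unique_sources": a["unique_sources"],
                             "packets_total": a["packets"],
                             "bytes_total": a["bytes"]},
            })
    return alerts
```

The evidence dict is what makes the alert explainable: every field in it is a count the dashboard can display verbatim.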
---
## Phase 5: Scenarios (ddos/scan/flash crowd)
Outcome: controllable traffic patterns (baseline, ddos, scan, flash crowd).
### Scenarios
- **baseline**
- **ddos_fan_in**: many random sources -> one target `dst_ip`
- **scan_fan_out**: one source -> many destinations + ports
- **flash_crowd**: many sources -> a popular `dst_ip:443`, but with a "legit-like" distribution; use this to test false positives
### Tunables (flags/env)
- `EVENTS_PER_SEC`
- `ATTACK_INTENSITY`
- `TARGET_IP`
- `BOTNET_SIZE`
- `DURATION_SECONDS`
### Acceptance criteria
- You can run producer replicas with different scenarios
- The system remains stable under increased event rates
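A hedged sketch of what the `ddos_fan_in` generator could look like, shown in Python for brevity (the actual producer is Go, in `scenarios.go`); the knob names mirror `TARGET_IP` and `BOTNET_SIZE` above, and the field values are arbitrary illustrative choices:

```python
import random

def ddos_fan_in(target_ip: str, botnet_size: int, n_events: int, seed: int = 7):
    """Many random sources -> one target dst_ip (the fan-in attack shape)."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    botnet = [
        f"10.{rng.randrange(256)}.{rng.randrange(256)}.{rng.randrange(1, 255)}"
        for _ in range(botnet_size)
    ]
    for _ in range(n_events):
        yield {
            "src_ip": rng.choice(botnet),
            "dst_ip": target_ip,        # hot key: every flow hits the victim
            "dst_port": 443,
            "proto": "TCP",
            "bytes": rng.randrange(60, 1500),
            "packets": rng.randrange(1, 10),
            "tcp_flags": "S",           # SYN-flood flavour
        }
```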
---
## Phase 6: Skew Mitigation
Outcome: pipeline remains performant under hot keys (critical for DDoS).

During fan-in attacks, `dst_ip = target` becomes a hot key, causing one partition to do most of the work.
### Approaches
- **Key salting + two-stage aggregation**
  - Add `salt = hash(src_ip) % N` for hot destination keys
  - Aggregate by `(dst_ip, salt, window)`, then re-aggregate by `(dst_ip, window)`
- **Two-level aggregation without explicit salt (less control)**
  - Use `repartition` and partial aggregates carefully
### Acceptance criteria
- Under the DDoS scenario, the job does not stall despite massive skew
- Document the before/after effect in `docs/experiments.md`
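The salting idea in miniature: stage 1 spreads the hot key over salt buckets, stage 2 folds the partials back, and the totals must equal a direct single-stage aggregation. The salt count `N_SALTS` is an arbitrary choice for this sketch:

```python
from collections import Counter

N_SALTS = 8  # number of salt buckets (arbitrary for illustration)

def two_stage_bytes(events):
    """Two-stage byte totals per dst_ip: salt, partially aggregate, then fold."""
    stage1 = Counter()
    for ev in events:
        salt = hash(ev["src_ip"]) % N_SALTS      # spreads the hot dst_ip
        stage1[(ev["dst_ip"], salt)] += ev["bytes"]
    stage2 = Counter()
    for (dst_ip, _salt), total in stage1.items():
        stage2[dst_ip] += total                  # fold partials back together
    return stage2

events = [{"src_ip": f"10.0.0.{i}", "dst_ip": "172.16.4.9", "bytes": 100}
          for i in range(1000)]
direct = sum(e["bytes"] for e in events)
assert two_stage_bytes(events)["172.16.4.9"] == direct  # same answer, less skew
```

In Spark the two stages become two `groupBy` passes; the point of the sketch is that correctness is preserved while no single task ever sees the whole hot key.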
---
## Phase 7: Recovery Tests
Outcome: demonstrate fault tolerance and stable outputs.
### Failure drills
- Kill/restart the Spark driver container (job resumes)
- Kill one Spark worker (job continues)
- Restart the broker briefly (job recovers)
### Practices
- Checkpoint directories per sink
- Deterministic output paths
- Discuss idempotency and possible duplicates (honestly) + mitigations
### Acceptance criteria
- The system resumes without manual cleanup
- Alerts do not explode uncontrollably after a restart
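One honest mitigation for duplicate alerts after a restart (at-least-once replay can re-emit the same window) is an idempotent sink keyed on `(window_start, alert_type, entity)`. A minimal sketch:

```python
def dedup_alerts(alerts, seen=None):
    """Drop alerts whose (window_start, alert_type, entity) key was already
    emitted; `seen` persists across calls to survive replayed micro-batches."""
    seen = set() if seen is None else seen
    out = []
    for a in alerts:
        key = (a["window_start"], a["alert_type"], a["entity"])
        if key not in seen:
            seen.add(key)
            out.append(a)
    return out
```

In the real pipeline the `seen` set would live in durable state (or the sink would upsert on the same key) rather than in memory.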
---
## Phase 8: Experiments Report
Outcome: a short but strong experimental report.
### Experiments
- **Scale-out**: 1 vs 2 vs 4 workers; find the maximum sustainable events/sec
- **Window size trade-offs**: 10s vs 60s windows; detection speed vs compute cost
- **Skew impact**: ddos on/off; show processing lag and throughput changes
### Metrics to collect
- events/sec ingested (producer logs)
- micro-batch duration or streaming query progress
- end-to-end detection latency (event_time -> alert_time)
- shuffle-heavy stages (Spark UI notes)
### Acceptance criteria
- `docs/experiments.md` includes tables/plots + a "bottlenecks" section
---
## Phase 9: Dashboard + Demo
Outcome: a 60–90s demo you can show recruiters.
### Dashboard panels
- Live top destinations (last N windows)
- Top talkers
- Alerts feed (sorted by severity)
- Scenario notes (what you turned on)
### Demo script
- Baseline (no alerts)
- Enable ddos_fan_in -> show the spike -> an alert appears
- Switch to scan_fan_out -> scan alerts appear
- Recovery after the scenario ends
### Acceptance criteria
- A clean demo recording + screenshots in the README
---
## Implementation Notes
### Go producer
- Batch writes (e.g., flush every X ms or Y messages)
- Control the rate via a ticker (events/sec) + allow a burst mode
- Give each producer a unique `router_id`
- Log:
  - produced events/sec
  - average batch size
  - publish errors/retries
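The batching-plus-rate-control loop, sketched in Python for brevity (the Go producer would use a `time.Ticker` and a buffered channel; `send` here is a stand-in for the Kafka writer):

```python
import time

def produce(events, rate_per_sec: int, batch_size: int, send):
    """Pace events at roughly rate_per_sec and flush every batch_size
    messages; returns simple stats matching the logging suggested above."""
    interval = 1.0 / rate_per_sec
    batch, stats = [], {"sent": 0, "batches": 0}
    for ev in events:
        batch.append(ev)
        if len(batch) >= batch_size:
            send(batch)                       # one broker write per batch
            stats["sent"] += len(batch)
            stats["batches"] += 1
            batch = []
        time.sleep(interval)                  # crude per-event rate control
    if batch:                                 # flush the final partial batch
        send(batch)
        stats["sent"] += len(batch)
        stats["batches"] += 1
    return stats
```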
### PySpark
- Use event time + a watermark: `withWatermark("event_time", "2 minutes")`
- Choose output modes carefully: aggregations often use `append` with a watermark
- Be explicit about shuffles: `groupBy(window, dst_ip)` causes a shuffle
- Keep the schema explicit and versioned in code
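Watermark semantics in miniature: track the maximum event time seen, and treat events older than that maximum minus the delay as too late. This pure-Python sketch mirrors the intent of `withWatermark("event_time", "2 minutes")`, though Spark applies the watermark per query for state cleanup, not per record as here:

```python
WATERMARK_SECONDS = 120  # mirrors withWatermark("event_time", "2 minutes")

class WatermarkFilter:
    """Admits events within the watermark delay of the max event time seen;
    older events would be dropped from windowed aggregations."""
    def __init__(self, delay_s: int = WATERMARK_SECONDS):
        self.delay_s = delay_s
        self.max_event_time = float("-inf")

    def admit(self, event_time_s: float) -> bool:
        self.max_event_time = max(self.max_event_time, event_time_s)
        return event_time_s >= self.max_event_time - self.delay_s

wm = WatermarkFilter()
assert wm.admit(1000.0)      # first event always admitted
assert wm.admit(900.0)       # 100s late, inside the 120s watermark
assert not wm.admit(500.0)   # 500s late, dropped
```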
---
## Milestone Checklist
- Phase 1: End-to-end (Go -> Kafka -> Spark -> Bronze)
- Phase 2: Silver enrichment
- Phase 3: Gold metrics
- Phase 4: Alerts table
- Phase 5: Scenarios (ddos/scan/flash crowd)
- Phase 6: Skew mitigation
- Phase 7: Recovery tests
- Phase 8: Experiments report
- Phase 9: Dashboard + demo
---
## Resume Bullets
- Built a distributed NetFlow-style analytics pipeline using Go producers and Apache Spark Structured Streaming to process high-volume network telemetry in near real time
- Implemented event-time windowed aggregations and DDoS/scan detection with fault-tolerant checkpointing and partitioned Parquet sinks
- Mitigated hot-key data skew during fan-in attacks using two-stage aggregation (key salting), improving pipeline stability under adversarial traffic patterns
- Benchmarked scaling across cluster sizes and event rates; analyzed shuffle bottlenecks and end-to-end alert latency using Spark streaming metrics
---
## Stretch Goals
- Publish alerts back to a Kafka topic `netflow.alerts`, and build a small Go alert service that deduplicates, rate-limits, and exposes REST/gRPC
- Add approximate heavy-hitter detection (Count-Min Sketch)
- Add pipeline health metrics (Prometheus) + dashboard
- Add “false positive suppression” using rolling baseline (flash crowd vs attack)
---
## Suggested Build Order
- Get Phase 1 end-to-end working (don't optimize early)
- Add Phase 3 aggregations (you’ll learn Spark windows)
- Add one scenario + one detector (Phase 4–5)
- Only then do skew mitigation + experiments
End of plan.