Various utilities for benchmarking HarmonicIO, Apache Spark, and the HASTE pipeline:
- Python tool for benchmarking the HASTE pipeline using a simulator
- Python streaming server: can stream messages either over TCP or to disk
- PySpark application which processes messages, either over TCP or from disk (deprecated for file streaming)
- Scala Spark application which processes messages (only supports disk)
- Python-based throttling app: queries the running application and throttles the stream source application to determine the maximum throughput
- Install the dependencies:
$ pip3 install -e .
- Clone https://github.com/HASTE-project/HarmonicIOSetup (into a sibling directory)
- Use it to set up a HarmonicIO cluster
- Set the relative path of 'HarmonicIOSetup', the number of nodes, etc.
- To do the benchmarking (saves results as a text file):
$ python3 -m haste.benchmarking.benchmark
- To plot results:
$ python3 -m haste.benchmarking.plot_benchmark
- To pull image and start containers (for production state):
$ python3 -m haste.benchmarking.harmonic_io
Streaming server: deploy on a 'stream-src' host. As a prerequisite for disk-based streaming, configure an NFS share and mount it on all Spark nodes.
$ python3 -m haste.benchmarking.streaming_server
The control server will listen on :8080.
PySpark application: deploy onto the driver node. Works OK for TCP streaming. There is an issue when files are deleted during disk-based streaming (use the Scala app instead).
Throttling app (run this locally):
- Deploy the Benchmarking Spark App remotely
- Deploy the Streaming Source App remotely
- Start the throttling app locally.
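
The idea behind the throttling app (raise or lower the source rate until the processing application only just keeps up) can be illustrated with a small search routine. This is only a sketch under stated assumptions, not the code in this repository: `find_max_throughput` and `keeps_up` are hypothetical names, the use of a binary search is an assumption, and in a real run `keeps_up` would have to set the rate on the stream source and query the running Spark application rather than evaluate a local function.

```python
from typing import Callable


def find_max_throughput(
    keeps_up: Callable[[int], bool],
    low: int = 1,
    high: int = 1024,
) -> int:
    """Return the highest rate in [low, high] for which keeps_up(rate) is True.

    Hypothetical helper, not part of haste.benchmarking.
    Assumes keeps_up is monotonic: if the application keeps up at some rate,
    it also keeps up at every lower rate. Returns 0 if even `low` fails.
    """
    if not keeps_up(low):
        return 0
    # Invariant: keeps_up(low) is True, and no rate above `high` is accepted.
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if keeps_up(mid):
            low = mid
        else:
            high = mid - 1
    return low


if __name__ == "__main__":
    # Stand-in for a real deployment: pretend the application copes with
    # at most 300 messages per second (an invented figure for illustration).
    simulated_capacity = 300
    rate = find_max_throughput(lambda r: r <= simulated_capacity)
    print(f"max sustainable rate: {rate} messages/s")
```

The search needs roughly log2(high) probes, which matters when each probe means waiting for the streaming application to settle at a new rate. Note that `high` caps the result, so it should be set above any rate the cluster could plausibly sustain.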
Contributors: Ben Blamey