Initial idea and planned outline:
- Stream CPU and RAM usage measurements with one or more Python scripts to a Kafka topic.
- Multiple measurements per second.
- Process the stream with Flink to get aggregates per second (tumbling windows).
- Stream the window aggregate data from Flink to a Postgres table/Timescale hypertable.
- Visualize live and historical data with Streamlit.
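
To illustrate the windowing step of the outline: the logic of a 1-second tumbling-window aggregation can be sketched in plain Python (this is only an illustration of the concept, not the actual PyFlink job; the sample values and the `tumbling_averages` helper are made up):

```python
from collections import defaultdict

def tumbling_averages(measurements):
    """Group (timestamp, cpu, ram) samples into 1-second tumbling
    windows and return the per-window average of cpu and ram."""
    windows = defaultdict(list)
    for ts, cpu, ram in measurements:
        windows[int(ts)].append((cpu, ram))  # window key = whole second
    return {
        second: (
            sum(c for c, _ in samples) / len(samples),
            sum(r for _, r in samples) / len(samples),
        )
        for second, samples in windows.items()
    }

# Two samples fall into second 10, one into second 11:
samples = [(10.1, 20.0, 50.0), (10.6, 40.0, 70.0), (11.2, 30.0, 60.0)]
print(tumbling_averages(samples))  # {10: (30.0, 60.0), 11: (30.0, 60.0)}
```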
Setup before running the project:
- `poetry install` (add `--with dev` to include dev dependencies, which are just `mypy` and stubs)
- Download the JARs needed by Flink. The list of URLs to download from Maven Repository is in the file `flink/needed_jar_urls.txt`. The JARs should be placed in the directory `jars/`, but if you want to place them in a different directory, you can pass the directory path as the parameter `--jar-path` when running the PyFlink script `flink/process_stream.py`
- (optional) `docker compose --profile cc pull`
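
If you prefer to script the JAR download step, a minimal sketch with only the standard library could look like this (the `download_jars` and `jar_filename` helpers are hypothetical, assuming the URL file contains one direct download URL per line):

```python
import pathlib
import urllib.request

def jar_filename(url: str) -> str:
    """Derive the local file name from a direct Maven Repository URL."""
    return url.rstrip("/").rsplit("/", 1)[-1]

def download_jars(url_file: str, jar_dir: str = "jars") -> None:
    """Download every JAR listed in url_file into jar_dir."""
    target = pathlib.Path(jar_dir)
    target.mkdir(exist_ok=True)
    for line in pathlib.Path(url_file).read_text().splitlines():
        url = line.strip()
        if not url:
            continue
        urllib.request.urlretrieve(url, target / jar_filename(url))

if __name__ == "__main__":
    download_jars("flink/needed_jar_urls.txt")
```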
Running the project (note: the instructions assume you're running the commands from the root of this repository):
- If you DON'T want/need Control Center running: `docker compose up -d`
- If you DO want Control Center: `docker compose --profile cc up -d`
- Create the `measurements` topic (alternatively you can use the flag `--allow-new-topic` when starting the producer script to create the topic):
  `docker exec kafka1 /bin/bash -c "kafka-topics --create --topic measurements --bootstrap-server localhost:9092"`
- Open two shells and activate the virtual environment with the dependencies of this project in both of them: `poetry shell`
- Start producing measurements in one of the shells: `python measurement_spammer/spam_measurements.py` (add `--help` to see info on parameters).
- In the other shell, start the Flink job that processes the stream and sinks to Postgres/Timescale: `python flink/process_stream.py` (add `--help` to see info on parameters).
- To run and view the Streamlit app, open yet another shell, activate the virtual environment with `poetry shell`, and run `python -m streamlit run app/main.py` to start the app. Then you can access the app at `localhost:8501` in your browser.
- You can pass parameters to the app by running `python -m streamlit run app/main.py -- --param`; note the extra `--`. Currently the supported parameters are:
  - `--refresh-interval` (float): Interval for how often data on the page is refreshed. Default: `1.0`
  - `--pg-conn` (text): Postgres connection string. Default: `postgresql://postgres:postgres@localhost:5432/`
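
Streamlit forwards everything after the extra `--` to the script's own `sys.argv`, so the app can read these parameters with ordinary `argparse`. A sketch of how such parsing might look (the `parse_app_args` helper is hypothetical, not necessarily the app's actual code):

```python
import argparse

def parse_app_args(argv=None):
    """Parse the parameters Streamlit forwards after the extra `--`."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--refresh-interval", type=float, default=1.0,
        help="Interval (seconds) for how often data on the page is refreshed.")
    parser.add_argument(
        "--pg-conn", type=str,
        default="postgresql://postgres:postgres@localhost:5432/",
        help="Postgres connection string.")
    return parser.parse_args(argv)

args = parse_app_args(["--refresh-interval", "2.5"])
print(args.refresh_interval, args.pg_conn)
```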
Running static type checks with mypy:
- `poetry install --with dev` if you didn't install dev dependencies already.
- `poetry shell`
- To check the Streamlit app code: `python -m mypy app/`
- To check the PyFlink script: `python -m mypy flink/`
- To check the script that sends measurements to Kafka: `python -m mypy measurement_spammer/`
Note: `python -m mypy .` fails with `Duplicate module named "utils"` since there are `utils` modules under both `app/` and `measurement_spammer/`.
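
One possible workaround (untested against this repo) is to let mypy treat each directory as a package root, so the two files are seen as `app.utils` and `measurement_spammer.utils` rather than two modules both named `utils`. Both options below exist in mypy's config format:

```ini
# mypy.ini — sketch of a possible workaround, not verified for this project
[mypy]
namespace_packages = True
explicit_package_bases = True
```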
