❯ Backtest popular trading strategies with real-time ingestion
Backtesting is the most important aspect of trading strategy development: past data is the best evidence we have for how a strategy might behave in the future. Using historical data, backtesting engines like this one give a retroactive view of how well a particular strategy would have performed had it actually been executed in the past. As an extension of this idea, it would be neat to see how strategies profit in real time, watching how each new daily tick affects the overall position and ending cash. This project does that: we provide a mock stream of real daily OHLC (open/high/low/close) data to the strategy, and we can view the results after every tick.
The Python app, built with Flask, takes a security and a date range as input. A separate Producer process fetches the historical data from Polygon.io (note that the API has data limits that depend on the subscription tier). The Producer then publishes this data to an Apache Kafka topic, one day at a time, at a user-defined rate (1 Hz works best). An Apache Spark Structured Streaming job subscribes to this Kafka topic and reads new entries periodically, which simulates real-time data arrival. The new data is then processed according to the selected strategy and its parameters. Finally, the results are displayed, with strategy indicators (e.g. moving averages) overlaid on the ticks, to show visually how the strategy works and performs.
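The Producer's day-by-day replay can be sketched roughly as follows. This is a minimal illustration, not the project's actual Producer: `replay_bars`, the `send` callback (a stand-in for whatever Kafka client call publishes to the topic), and the bar field names are all hypothetical.

```python
import json
import time
from typing import Callable, Dict, Iterable


def replay_bars(
    bars: Iterable[Dict],
    send: Callable[[bytes], None],
    rate_hz: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Publish one daily bar per tick at `rate_hz`; return the number sent.

    `send` is a hypothetical callback wrapping a Kafka producer;
    `sleep` is injectable so the pacing can be tested without waiting.
    """
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    interval = 1.0 / rate_hz
    sent = 0
    for bar in bars:
        # Serialize each bar as UTF-8 JSON, one message per trading day.
        send(json.dumps(bar).encode("utf-8"))
        sent += 1
        sleep(interval)
    return sent
```

Passing `sleep` in as a parameter keeps the pacing logic separate from the wall clock, which makes the replay easy to exercise in a unit test.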
In the simple moving average (SMA) crossover strategy, when the fast moving average crosses above the slow one, we go long and buy as many shares as possible with the current cash. When the fast average crosses below the slow one, we go short, shorting as many shares as possible with the current cash. At each crossover, we close any existing long or short position.
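As a rough sketch of these crossover rules (not the project's actual code), the snippet below runs them over a plain list of closing prices. The function names, the whole-share sizing, and the choice to ignore ticks where the two averages are exactly equal are assumptions made for illustration.

```python
from typing import List, Optional


def sma(values: List[float], period: int) -> List[Optional[float]]:
    """Simple moving average; None until `period` values are available."""
    out: List[Optional[float]] = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= period:
            running -= values[i - period]  # drop the value leaving the window
        out.append(running / period if i >= period - 1 else None)
    return out


def backtest_crossover(closes: List[float], fast: int, slow: int,
                       cash: float = 10_000.0) -> float:
    """Return final equity after trading every fast/slow crossover."""
    fast_ma, slow_ma = sma(closes, fast), sma(closes, slow)
    shares = 0  # positive = long, negative = short
    side = 0    # sign of the last non-zero (fast - slow) difference
    for price, f, s in zip(closes, fast_ma, slow_ma):
        if f is None or s is None:
            continue
        diff = f - s
        new_side = (diff > 0) - (diff < 0)
        if new_side and side and new_side != side:
            cash += shares * price      # close any open position
            qty = int(cash // price)    # whole shares only (assumption)
            shares = qty if new_side > 0 else -qty
            cash -= shares * price      # a short sale credits cash
        if new_side:
            side = new_side
    return cash + shares * closes[-1]   # mark the open position to market
```

For example, `backtest_crossover([10, 10, 10, 12, 14, 16, 12, 8, 6], 2, 3)` shorts 1250 shares at 8 on the downward cross and ends with an equity of 12500.0.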
The exponential moving average (EMA) strategy follows the same rules as the simple moving average strategy, but the moving averages are weighted exponentially to favour more recent data over older data points.
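A minimal sketch of the exponential weighting, assuming the common smoothing factor alpha = 2 / (period + 1) and seeding the average with the first value. Both are conventions chosen for illustration and may differ from what this project uses.

```python
from typing import List


def ema(values: List[float], period: int) -> List[float]:
    """Exponential moving average with alpha = 2 / (period + 1).

    The first output equals the first input (seeding convention assumed here).
    """
    alpha = 2.0 / (period + 1)
    out: List[float] = []
    prev = None
    for v in values:
        # Each new value gets weight alpha; older history decays by (1 - alpha).
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        out.append(prev)
    return out
```

With `period=3` the factor is 0.5, so `ema([1.0, 2.0, 3.0], 3)` gives `[1.0, 1.5, 2.25]`: each step moves halfway toward the newest price.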
The Relative Strength Index (RSI) is a value between 0 and 100 that reflects overbought or oversold conditions. Given an N-tick window, we first compute the average gain and average loss over that window (a tick in the opposite direction counts as 0 rather than a negative value), then fold every subsequent tick into the averages (Wilder's smoothing). The relative strength (RS) is simply gainAvg / lossAvg, and RSI is 100 - 100 / (1 + RS). An RSI value above 70 indicates the price may be "too high" and should be shorted, whereas an RSI value below 30 indicates the price may be "too low" and should be bought. Positions are opened and closed according to the same rules as in the SMA strategy above.
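The RSI computation can be sketched as below, using the standard form RSI = 100 - 100 / (1 + RS) with Wilder's smoothing. The function names are hypothetical, and returning 100 when the average loss is zero is a convention assumed here, not something taken from this project.

```python
from typing import List


def _to_rsi(avg_gain: float, avg_loss: float) -> float:
    if avg_loss == 0:
        return 100.0  # no losses in the window (convention assumed here)
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


def rsi(closes: List[float], period: int = 14) -> List[float]:
    """RSI values starting at the first tick with a full `period` of changes."""
    if len(closes) <= period:
        return []
    changes = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(c, 0.0) for c in changes]    # losses count as 0 here
    losses = [max(-c, 0.0) for c in changes]  # gains count as 0 here
    # Seed with a plain average over the first `period` changes.
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    out = [_to_rsi(avg_gain, avg_loss)]
    # Wilder's smoothing: fold each later change into the running averages.
    for g, l in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + g) / period
        avg_loss = (avg_loss * (period - 1) + l) / period
        out.append(_to_rsi(avg_gain, avg_loss))
    return out
```

For instance, a series that only rises yields 100, one that only falls yields 0, and a window with equal average gain and loss yields 50.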
Backend built with:
- Python 3.10
- Apache Kafka
- Apache Spark (Structured Streaming)
- Polygon.io
Frontend built with:
- Flask (web framework)
- Dash (Plotly)
Below is a working demo of the frontend and backend.
2025-05-04.20-45-53.mp4
Before getting started with this project, ensure your runtime environment meets the following requirements:
- Programming Language: Python 3.10+
- Package Manager: pip
Build from source:
- Clone the Backtest repository:
git clone https://github.com/twilkhoo/Backtest
- Navigate to the project directory:
cd Backtest
- Install the project dependencies:
sudo apt update
sudo apt install -y openjdk-11-jdk python3 python3-venv python3-pip wget tar
Start ZooKeeper and Kafka (each in its own terminal)
cd kafka && bin/zookeeper-server-start.sh config/zookeeper.properties
cd kafka && bin/kafka-server-start.sh config/server.properties
Create a new Kafka topic
cd kafka
bin/kafka-topics.sh --create --topic ohlcv --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
Create and activate venv
cd ~
python3 -m venv spark-kafka-env
source spark-kafka-env/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Run the Flask UI (which will automatically start up the consumer, and invoke the producer when details are submitted)
python app3.py
Deleting a topic (for testing)
bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic <topic_name>
bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
Stopping ZooKeeper
bin/zookeeper-server-stop.sh
- Right now, the way strategies are created isn't very extensible. I'm essentially writing each one from scratch, loosely following the general pattern that each strategy operates on the entire time-series dataframe and returns a Dash figure (chart). Backtesting.py has a much more extensible API: define a class with a constructor and a next function (like an iterator) that consumes one piece of data (like a tick) and adjusts its state accordingly. We can use an MVC architecture to decouple the actual strategy from the display.
- The moving average should produce a value for the first data point in the range too, so we should fetch OHLC data X days before the start date for an X-day moving average.
- Optimizations: instead of having Spark return the entire df after each new tick, just return the update. With this, we can modify our moving average/RSI functions to update their state using only the next tick, instead of requiring the entire df again. This isn't much of an issue for the moving average strategies, as they only scan the data required for the moving average computations, but the RSI ends up doing a full linear scan.
- Also, for what it's worth, Spark Structured Streaming may not have been the best choice for inherently real-time data; Apache Flink may have been a better fit for such an event-driven architecture. My Databricks internship was going to be on Spark Streaming infra so I wanted to get used to the platform, but otherwise Spark Streaming's micro-batching mechanism isn't the best choice.
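Two of the ideas above, a per-tick `next` interface and indicator state that updates from only the newest tick, can be sketched together. This is a hypothetical design, not existing code in this repository and not Backtesting.py's actual API: the `Strategy` base class, `RsiStrategy`, and the signal strings are all invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Strategy(ABC):
    """Hypothetical per-tick interface: one next() call per new bar."""

    @abstractmethod
    def next(self, close: float) -> str:
        """Consume one closing price and return 'buy', 'short' or 'hold'."""


class RsiStrategy(Strategy):
    """RSI with Wilder's smoothing, updated in O(1) per tick (no rescan)."""

    def __init__(self, period: int = 14, overbought: float = 70.0,
                 oversold: float = 30.0) -> None:
        self.period, self.overbought, self.oversold = period, overbought, oversold
        self.prev_close: Optional[float] = None
        self.avg_gain = 0.0
        self.avg_loss = 0.0
        self.changes_seen = 0
        self.rsi: Optional[float] = None

    def _update(self, close: float) -> None:
        if self.prev_close is None:  # first tick: nothing to compare against
            self.prev_close = close
            return
        change = close - self.prev_close
        self.prev_close = close
        gain, loss = max(change, 0.0), max(-change, 0.0)
        self.changes_seen += 1
        if self.changes_seen <= self.period:
            # Seed window: build up a plain average of gains and losses.
            self.avg_gain += gain / self.period
            self.avg_loss += loss / self.period
            if self.changes_seen < self.period:
                return
        else:
            # Wilder's smoothing needs only the previous averages.
            self.avg_gain = (self.avg_gain * (self.period - 1) + gain) / self.period
            self.avg_loss = (self.avg_loss * (self.period - 1) + loss) / self.period
        if self.avg_loss == 0:
            self.rsi = 100.0  # convention assumed when there are no losses
        else:
            self.rsi = 100.0 - 100.0 / (1.0 + self.avg_gain / self.avg_loss)

    def next(self, close: float) -> str:
        self._update(close)
        if self.rsi is None:
            return "hold"
        if self.rsi > self.overbought:
            return "short"
        if self.rsi < self.oversold:
            return "buy"
        return "hold"
```

Because the strategy only returns a signal and exposes its indicator value as an attribute, a separate display layer could read `rsi` and draw the chart without the strategy knowing anything about Dash.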
This project is licensed under the MIT License. For more details, refer to the LICENSE file.
