Alright, we do inference, but we do it schwifty.
A simple model-serving stack for lightweight ML models. It batches requests, handles back pressure, and has circuit-breaker scaffolding (not active yet). Keeps things from blowing up when too many requests come in.
- Anything pickled with joblib (including many sklearn models); see the sketch after this list
- Pure Python mode (for models that aren't picklable, or some numba-related stuff)
- ONNX Runtime, this is where the whole stack shines schwiftly!
- Torch mode (CPU-only models) --> JIT compile and run!
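For the joblib path, here's a minimal sketch of pickling a sklearn model into the models/ layout. The folder name my_model and file name model.joblib are placeholders, not fixed conventions of this repo; check models/ for the real naming.

```python
# Train a tiny sklearn model and pickle it with joblib for serving.
# "my_model" and "model.joblib" are placeholders; check models/ in this
# repo for the actual naming conventions.
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X, y)

model_dir = Path("models/my_model")
model_dir.mkdir(parents=True, exist_ok=True)
joblib.dump(clf, model_dir / "model.joblib")
```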
- Clone this repo
- Put your model into models/model_name/{model_file + model_config.yaml}
- Run with docker:
```bash
docker compose up --build
```
Check the models/ folder for config files. You can tweak batch sizes and stuff there.
First check if your model is alive and kicking:
```bash
curl -f http://localhost:8005/available_models
```
There are two endpoints: /predict (immediate) and /predict_batched (batched, with a ~10ms delay for better throughput).
Request Format:
```bash
curl -X POST http://localhost:8005/predict/{model_name} \
  -H "Content-Type: application/json" \
  -d '{"features": [[1.0, 2.0, 3.0]]}'
```
Request Body:
```json
{
  "features": [[1.0, 2.0, 3.0]]
}
```
Response:
```json
{
  "model": "model_name",
  "predictions": [[0.95, 0.05]]
}
```
Example (sklearn model):
```bash
curl -X POST http://localhost:8005/predict/my_model \
  -H "Content-Type: application/json" \
  -d '{"features": [[5.1, 3.5, 1.4, 0.2], [7.2, 3.2, 6.0, 1.8]]}'
```
/predict_batched uses the same input format as /predict, but requests are batched together server-side for better efficiency. A request may wait a little (<= 10ms) for other requests to arrive so they can be batched together. This endpoint does not accept requests that are already batched!
Request Format:
```bash
curl -X POST http://localhost:8005/predict_batched/{model_name} \
  -H "Content-Type: application/json" \
  -d '{"features": [[1.0, 2.0, 3.0]]}'
```
Response:
```json
{
  "model": "model_name",
  "predictions": [[0.95, 0.05]]
}
```
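Since /predict_batched only pays off when several requests land inside the batching window, here's a sketch that fires single-row requests concurrently so the server gets a chance to batch them. It assumes the requests library; my_model is a placeholder.

```python
# Fire several single-row requests at /predict_batched concurrently so the
# server can batch them together. "my_model" is a placeholder name.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8005/predict_batched/my_model"
rows = [[5.1, 3.5, 1.4, 0.2], [7.2, 3.2, 6.0, 1.8], [6.3, 2.9, 5.6, 1.8]]


def predict_one(row):
    resp = requests.post(URL, json={"features": [row]}, timeout=5)
    resp.raise_for_status()
    return resp.json()["predictions"]


with ThreadPoolExecutor(max_workers=len(rows)) as pool:
    for preds in pool.map(predict_one, rows):
        print(preds)
```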
Check server health:
```bash
curl http://localhost:8005/health
```
View Prometheus metrics:
```bash
curl http://localhost:8005/metrics
```
- Either set DEBUG_MODE=True in .env and watch the logs in /logs
- Or check /metrics and hook up Prometheus and Grafana
Run tests with pytest. There's a load_test.py for stress testing.
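A couple of smoke tests as a sketch of what a pytest run against a live server might look like; they assume the server is already up on localhost:8005, a loaded model named my_model (a placeholder), and the requests library.

```python
# test_smoke.py -- smoke tests against a running server on localhost:8005.
# "my_model" is a placeholder; point it at whatever model you dropped into models/.
import requests

BASE_URL = "http://localhost:8005"


def test_health():
    resp = requests.get(f"{BASE_URL}/health", timeout=5)
    assert resp.status_code == 200


def test_predict_returns_predictions():
    payload = {"features": [[5.1, 3.5, 1.4, 0.2]]}
    resp = requests.post(f"{BASE_URL}/predict/my_model", json=payload, timeout=5)
    assert resp.status_code == 200
    assert "predictions" in resp.json()
```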
- During rush hours: implement the circuit breaker properly, and activate double queueing (a rough sketch follows this list)
- During idle hours: make sure that small batch sizes are handled properly
- Implement auth if needed
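As a starting point for the circuit-breaker TODO, here's a minimal sketch. This is an assumption of how it could look, not this repo's design: trip open after a number of consecutive failures, then let a trial request through once a cooldown passes.

```python
# A minimal circuit-breaker sketch (an assumption, not this repo's design):
# open after max_failures consecutive failures, then allow a trial request
# once reset_after seconds have passed.
import time
from typing import Optional


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        # Closed, or open long enough that a trial request may go through.
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```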
That's it, keep it simple!