Alright, we do inference, but we do it schwifty.
A simple model-serving stack for lightweight ML models. It batches requests, handles back pressure, and has circuit-breaker scaffolding (not active yet). Keeps things from blowing up when too many requests come in.
- Anything pickled with joblib (including many sklearn models); see the sketch after this list
- Pure Python mode (for models that aren't picklable, or some numba-related stuff)
- ONNX Runtime, this is where the whole stack shines schwiftly!
- Torch mode (CPU-only models) --> JIT compile and run!
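For the joblib path, here's a minimal sketch of pickling a sklearn model into the models/ layout. The folder name my_model and file name model.joblib are placeholders, not fixed conventions of this repo; check models/ for the real naming.

```python
# Train a tiny sklearn model and pickle it with joblib for serving.
# "my_model" and "model.joblib" are placeholders; check models/ in this
# repo for the actual naming conventions.
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X, y)

model_dir = Path("models/my_model")
model_dir.mkdir(parents=True, exist_ok=True)
joblib.dump(clf, model_dir / "model.joblib")
```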
- Clone this repo
- Put your model into models/model_name/{model_file + model_config.yaml}
- Run with docker:
```bash
docker compose up --build
```
Check the models/ folder for config files. You can tweak batch sizes and stuff there.
First check if your model is alive and kicking:
```bash
curl -f http://localhost:8005/available_models
```
There are two endpoints: /predict (immediate) and /predict_batched (batched, with a ~10ms delay for better throughput).
Request Format:
```bash
curl -X POST http://localhost:8005/predict/{model_name} \
  -H "Content-Type: application/json" \
  -d '{"features": [[1.0, 2.0, 3.0]]}'
```
Request Body:
```json
{
  "features": [[1.0, 2.0, 3.0]]
}
```
Response:
```json
{
  "model": "model_name",
  "predictions": [[0.95, 0.05]]
}
```
Example (sklearn model):
```bash
curl -X POST http://localhost:8005/predict/my_model \
  -H "Content-Type: application/json" \
  -d '{"features": [[5.1, 3.5, 1.4, 0.2], [7.2, 3.2, 6.0, 1.8]]}'
```
/predict_batched uses the same input format as /predict, but requests are batched together server-side for better efficiency. A request may wait a little (<= 10ms) for other requests to arrive so they can be batched together. This endpoint does not accept requests that are already batched!
Request Format:
```bash
curl -X POST http://localhost:8005/predict_batched/{model_name} \
  -H "Content-Type: application/json" \
  -d '{"features": [[1.0, 2.0, 3.0]]}'
```
Response:
```json
{
  "model": "model_name",
  "predictions": [[0.95, 0.05]]
}
```
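Since /predict_batched only pays off when several requests land inside the batching window, here's a sketch that fires single-row requests concurrently so the server gets a chance to batch them. It assumes the requests library; my_model is a placeholder.

```python
# Fire several single-row requests at /predict_batched concurrently so the
# server can batch them together. "my_model" is a placeholder name.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8005/predict_batched/my_model"
rows = [[5.1, 3.5, 1.4, 0.2], [7.2, 3.2, 6.0, 1.8], [6.3, 2.9, 5.6, 1.8]]


def predict_one(row):
    resp = requests.post(URL, json={"features": [row]}, timeout=5)
    resp.raise_for_status()
    return resp.json()["predictions"]


with ThreadPoolExecutor(max_workers=len(rows)) as pool:
    for preds in pool.map(predict_one, rows):
        print(preds)
```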
Check server health:
```bash
curl http://localhost:8005/health
```
View Prometheus metrics:
```bash
curl http://localhost:8005/metrics
```
- Either set DEBUG_MODE=True in .env and watch the logs in /logs
- Or check /metrics and hook up Prometheus and Grafana
Run tests with pytest. There's a load_test.py for stress testing.
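A couple of smoke tests as a sketch of what a pytest run against a live server might look like; they assume the server is already up on localhost:8005, a loaded model named my_model (a placeholder), and the requests library.

```python
# test_smoke.py -- smoke tests against a running server on localhost:8005.
# "my_model" is a placeholder; point it at whatever model you dropped into models/.
import requests

BASE_URL = "http://localhost:8005"


def test_health():
    resp = requests.get(f"{BASE_URL}/health", timeout=5)
    assert resp.status_code == 200


def test_predict_returns_predictions():
    payload = {"features": [[5.1, 3.5, 1.4, 0.2]]}
    resp = requests.post(f"{BASE_URL}/predict/my_model", json=payload, timeout=5)
    assert resp.status_code == 200
    assert "predictions" in resp.json()
```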
- During rush hours: implement the circuit breaker properly, and activate double queueing (a rough sketch follows this list)
- During idle hours: make sure that small batch sizes are handled properly
- Implement auth if needed
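As a starting point for the circuit-breaker TODO, here's a minimal sketch. This is an assumption of how it could look, not this repo's design: trip open after a number of consecutive failures, then let a trial request through once a cooldown passes.

```python
# A minimal circuit-breaker sketch (an assumption, not this repo's design):
# open after max_failures consecutive failures, then allow a trial request
# once reset_after seconds have passed.
import time
from typing import Optional


class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        # Closed, or open long enough that a trial request may go through.
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()
```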
That's it, keep it simple!