(not an acronym)
This is SLaB (Smart Load Balancer) 2.0, and I am calling it Magnetar. SLaB 1.0 was very close to being a viable MVP, but it never got there. Now I am trying again with a refined architecture; hopefully this one succeeds.
The first draft vs. where we landed (old on the left, new on the right):
TL;DR: We started with a simple CF Worker router spraying requests, collecting 5xx responses via a "collector" worker, and dumping them into Postgres. The collector is now just an abstraction: all workers publish telemetry directly to a `telemetry` topic in Kafka and communicate via the Kafka HTTP bridge.
Now telemetry goes straight from the router → Kafka REST proxy → Go learner (Thompson Sampling) → router. Redis stays only for round-robin state, and the learner consumes Kafka instead of relying on a fragile Redis buffer.
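For a sense of what flows through that pipe, here is a hypothetical shape for one telemetry record; the field names are my assumptions, not the actual schema:

```ts
// Hypothetical telemetry record on the `telemetry` topic.
// Field names are illustrative assumptions, not the router's actual schema.
interface TelemetryEvent {
  workerUrl: string; // which CF worker served the request
  status: number;    // HTTP status; 5xx is the failure signal the learner cares about
  latencyMs: number; // request duration
  ts: number;        // epoch millis
}
```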
Note: ngrok works for exposing the router, but it rate-limits at ~400 req/min, so I mostly run the router locally for stress tests. (The test bed still has ngrok URLs baked in if you want to try it.)
- `cd` into the `worker` directory
- Rename `wrangler-example.jsonc` to `wrangler.jsonc`
- Change the default config as per your requirements
- Run `npm run deploy` for whatever number of workers you need
- Copy the deployed URL and paste it in `config/workers.json` (example shape below)

Note: Please change the name in `wrangler.jsonc` before deploying another worker, as it will just override your previous worker.
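For reference, `config/workers.json` is just the list of deployed worker URLs. The exact shape here is a guess from how the router consumes it, so match whatever the example file in the repo actually looks like:

```json
[
  "https://magnetar-worker-1.example.workers.dev",
  "https://magnetar-worker-2.example.workers.dev",
  "https://magnetar-worker-3.example.workers.dev"
]
```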
- `cd` into `router`
- Copy `wrangler-example.jsonc` to `wrangler.jsonc` if needed
- Update the `vars` block for your own Kafka REST bridge / Redis / learner URL (for local dev we just point to `http://localhost:8082` + `http://localhost:8090`; sketch below)
- Run `pnpm i` (or whatever you use), then `pnpm dev` / `pnpm deploy`
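The key names in this `vars` sketch are assumptions; copy the real ones from `wrangler-example.jsonc`:

```jsonc
// router/wrangler.jsonc: key names are illustrative, values match the local dev defaults above
"vars": {
  "KAFKA_REST_URL": "http://localhost:8082",
  "LEARNER_URL": "http://localhost:8090"
}
```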
Router now publishes telemetry for every request to the `telemetry` topic in Kafka.
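Under the hood that's one POST per request to the REST proxy's v2 produce endpoint. A minimal sketch, assuming the local dev URL from above and an illustrative event payload:

```ts
// Minimal sketch: produce one record to the `telemetry` topic via the
// Kafka REST proxy's v2 API. The event fields are illustrative, not the
// router's exact schema; kafkaRestUrl is http://localhost:8082 in local dev.
async function publishTelemetry(kafkaRestUrl: string, event: Record<string, unknown>): Promise<void> {
  await fetch(`${kafkaRestUrl}/topics/telemetry`, {
    method: "POST",
    headers: { "Content-Type": "application/vnd.kafka.json.v2+json" },
    body: JSON.stringify({ records: [{ value: event }] }),
  });
}
```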
- Run `cd kafka && docker compose up -d`
- Please actually run that compose file; everything assumes those containers are up
This spins up:
- Kafka broker (exposes `localhost:29092` for host clients, `magnetar-kafka:9092` for containers)
- Kafka-rest on `http://localhost:8082`
- Kafka-ui on `http://localhost:8080`
- Open Kafka UI (port 8080) and create the `telemetry` topic manually (router publishes there, learner consumes it)
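If you'd rather verify from code than click around, the REST proxy's `GET /topics` endpoint lists topic names:

```ts
// Sanity check: confirm the `telemetry` topic exists via the Kafka REST proxy (v2).
const topics: string[] = await (await fetch("http://localhost:8082/topics")).json();
if (!topics.includes("telemetry")) {
  console.error("telemetry topic missing: create it in Kafka UI (port 8080) first");
}
```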
- `cd` into `learner`
- Run `go run .` (default env already points to `localhost:29092` + `:8090`)
Learner exposes:
- `GET /recommendation` (router hits this when `algo=proprietary`)
If the learner is down, the router falls back to round-robin/random so nothing explodes.
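A rough sketch of that selection path on the router side; the learner's response shape (`{ worker: string }`) and the timeout are assumptions, not the actual contract:

```ts
// Sketch of the router's pick when algo=proprietary: ask the learner, fall back
// to a random choice if it's down. Response shape and 250 ms timeout are assumed.
async function pickWorker(workers: string[], learnerUrl: string): Promise<string> {
  try {
    const res = await fetch(`${learnerUrl}/recommendation`, {
      signal: AbortSignal.timeout(250),
    });
    if (res.ok) {
      const { worker } = (await res.json()) as { worker: string };
      if (workers.includes(worker)) return worker;
    }
  } catch {
    // Learner unreachable or timed out: fall through to the fallback below.
  }
  // Fallback: random pick (round-robin state lives in Redis in the real router).
  return workers[Math.floor(Math.random() * workers.length)];
}
```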
- `cd` into `test-bed`
- Run `pnpm i`
- Run `pnpm dev`
This UI lets you set failure sliders per worker and run stress tests to compare algorithms (see screenshots in docs/images).
Round-robin with two workers at 100% fail + one at 90% fail lands around 3% success:
Thompson Sampling shifts traffic toward the least-bad worker and sits closer to 10% success under the same setup:
Workers in `percentage_fail` mode literally flip a coin per request (`Math.random() * 100 < PERCENTAGE_FAIL`), so the proprietary line wiggles a few points because it keeps hammering the same "least awful" worker. Round-robin stays glued to its theoretical limit because it splits traffic evenly: two workers contribute 0% and the third contributes whatever its slider says, i.e. (0% + 0% + 10%) / 3 ≈ 3.3%.
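For anyone curious why Thompson Sampling converges on the least-bad worker: each worker gets a Beta(successes + 1, failures + 1) posterior over its success rate, the learner draws one sample per worker, and traffic goes to the highest draw. The real learner is the Go service; this TypeScript sketch just shows the mechanic:

```ts
// Thompson Sampling sketch (the actual learner is Go; this only illustrates the idea).
// Each worker's success rate gets a Beta(successes + 1, failures + 1) posterior;
// draw one sample per worker and route to the arm with the highest draw.

type Stats = { successes: number; failures: number };

// Standard normal via Box-Muller.
function gaussian(): number {
  let u = 0, v = 0;
  while (u === 0) u = Math.random();
  while (v === 0) v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Gamma(shape, 1) via Marsaglia-Tsang; shape < 1 uses the boost trick.
function sampleGamma(shape: number): number {
  if (shape < 1) return sampleGamma(shape + 1) * Math.pow(Math.random(), 1 / shape);
  const d = shape - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    let x: number, v: number;
    do { x = gaussian(); v = 1 + c * x; } while (v <= 0);
    v = v * v * v;
    const u = Math.random();
    if (u < 1 - 0.0331 * x ** 4) return d * v;
    if (Math.log(u) < 0.5 * x * x + d * (1 - v + Math.log(v))) return d * v;
  }
}

// Beta(a, b) as a ratio of two Gamma draws.
function sampleBeta(a: number, b: number): number {
  const x = sampleGamma(a);
  return x / (x + sampleGamma(b));
}

// Pick the worker whose sampled success rate is highest.
function thompsonPick(stats: Map<string, Stats>): string {
  let best = "";
  let bestDraw = -1;
  for (const [worker, { successes, failures }] of stats) {
    const draw = sampleBeta(successes + 1, failures + 1);
    if (draw > bestDraw) { bestDraw = draw; best = worker; }
  }
  return best;
}
```

With two workers pinned at 100% fail, their posteriors collapse toward zero after enough requests, so almost every draw favors the 90%-fail worker; that is exactly the ~10% ceiling the proprietary line hovers under.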



