API-Gateway

One endpoint. Any provider. Every free tier. ~1.7B tokens per month.

Aggregate free tiers from Google, Groq, Cerebras, NVIDIA, Mistral, OpenRouter, GitHub Models, Cohere, Cloudflare, HuggingFace, Z.ai (Zhipu), Ollama, Kilo, Pollinations, LLM7, OVH AI Endpoints, and OpenCode Zen — plus any OpenAI-compatible endpoint you bring (llama.cpp, LM Studio, vLLM, local Ollama, or any remote service with an API key) — behind a single /v1/chat/completions endpoint. Every provider gets the same fallback chain, the same intelligent routing, the same dashboard. Keys are stored encrypted. A Thompson-sampling bandit picks the best available model for each request, falls over to the next provider when one is rate-limited, and tracks per-key usage so you stay under every cap. If you can reach it over HTTP and it speaks OpenAI, API-Gateway routes to it.

This is a fork of tashfeenahmed/freellmapi. We add custom provider CRUD, per-provider rate limiting & parallel gating, model auto-discovery, editing of all models (including built-ins), and per-key exhaustion recovery — all detailed in What this fork adds. We track upstream weekly and merge cleanly.

What this fork adds

This fork tracks tashfeenahmed/freellmapi (upstream) and merges every release. On top of that, we ship features that turn API-Gateway from a weekend hack into a daily driver.

Head-to-head comparison: what the original lacks and how this fork solves it

Capability	freellmapi (original)	api-gateway (this fork)	What this means for you
Add your own providers	❌ Stuck with the built-in list. No way to add a custom endpoint.	✅ Add any OpenAI-compatible endpoint — a cloud service, a local Ollama on your LAN, a vLLM server on your homelab. Give it a name, a URL, and optional rate limits.	You can bring your paid APIs, your local models, and any free-tier service you discover — all routed through the same smart fallback chain, no separate tools needed.
Per-provider rate limits & parallel gating	❌ No per-provider caps. All providers share a single generic rate limiter, so a slow endpoint can hog slots and a fragile free tier can get hammered.	✅ Set RPM, RPD, TPM, TPD caps per provider plus a `maxParallelRequests` ceiling. Each provider gets its own guardrails.	A local Raspberry Pi model won't starve Groq. A free-tier endpoint with a 4-RPM cap won't get flooded into uselessness. You define the limits; the router enforces them.
Model auto-discovery	❌ Manual. You copy-paste model IDs one by one, or edit JSON by hand.	✅ Register a provider and it automatically pulls `/v1/models`, seeds every model with sensible defaults, and slots them into your fallback chain.	No JSON editing. No guessing model IDs. One click and all models from your endpoint are ready to use.
Edit any model's properties	❌ Built-in models are read-only. You can't adjust intelligence rank, speed, context window, or capabilities.	✅ Every model — built-in or custom — is editable from the dashboard. Intelligence rank, speed, size, context window, max tokens, tools/vision flags, rate limits — all changeable.	If you know your local Mixtral runs smarter than its default score, bump it. If a provider silently shrinks a context window, fix it. Changes take effect immediately, no restart needed.
Model visibility matches reality	❌ `/v1/models` returns everything — even models with no keys, dead endpoints, or ones you disabled weeks ago.	✅ `/v1/models` only returns models that are in your fallback chain and have at least one active, healthy key.	Your IDE's model picker only shows models you can actually call. No more errors from picking a model you can't reach.
Key exhaustion recovery	❌ Retries once, then gives up. When every key hits its daily quota, the endpoint is dead until you manually intervene.	✅ Three retries per key on transient failures. When all keys are exhausted, it drops to 1-RPM recovery mode — probing each key once per minute until one comes back. Normal operation resumes automatically.	Your IDE never sees an error. Even when every key hits its daily cap, the system patiently waits and recovers on its own. You might not even notice.
How it routes differently	❌ Static priority chain — always tries provider A, then B, then C.	✅ Thompson-sampling bandit that learns which providers actually deliver. Balanced, smartest-first, fastest-first, and reliability-first presets. Or fall back to classic priority ordering.	The router gets smarter over time. It favors providers that are fast and reliable, and avoids ones that are failing. Your requests hit the best model available right now, not just whatever is top of a handwritten list.
How it handles rate limits differently	❌ Generic cooldowns per model — a single 429 can bench a provider for everyone.	✅ Per-key rate tracking (RPM/RPD/TPM/TPD) before the request goes out, not after. Keys with capacity get used; keys at their limit get skipped. Cooldowns are per-key, not per-model.	You can have 5 Gemini keys and the router will cycle through them intelligently instead of panicking when the first one hits a limit. Other keys for the same model stay available.

Also different from the original

Difference	freellmapi (original)	api-gateway (this fork)	Why it's better
Encryption	AES-256-GCM for API keys at rest (same).	Same encryption, plus sensitive error messages from providers are automatically redacted before they reach your client.	Provider error responses sometimes leak key fragments, account IDs, or internal URLs. The fork strips those before you ever see them.
Tool call handling	Passes tool calls through.	Passes tool calls through plus includes automatic repair for common JSON Schema mismatches and a rescue system for inline tool-call dialects some providers emit.	Fewer broken tool loops. Models that emit slightly malformed tool calls get automatically corrected instead of failing.
Context handoff on model switch	❌ Not present.	✅ Optional. When a session falls over to a different model, injects a compact system message so the new model knows it's continuing an existing task instead of starting fresh.	No more "let me start over" when the router switches models mid-conversation. The new model picks up where the last one left off.
Custom provider management	❌ Not available.	✅ Full CRUD from the dashboard. Add, edit, delete custom providers. Auto-discover models on creation. Delete cascades cleanly (removes models, keys, and fallback entries).	Your entire setup — built-in and custom — is managed from one place, with one UI, using one workflow.
Dashboard capability	Keys, Fallback Chain, Playground, Analytics.	All of the above plus model editing, custom provider management, embeddings family configuration, and settings page.	Everything you need is in the dashboard. You rarely touch config files or a terminal.

We chose this path because we use API-Gateway as a daily driver — not as a weekend project. These features make it faster to set up, harder to break, and more resilient when providers inevitably flake. If you're evaluating API-Gateway for actual daily use, you want this fork.

Why this exists

Every serious AI lab now offers a free tier — millions of tokens a month, thousands of requests a day. On its own each tier is a toy. Stacked together, they add up to roughly 1.7 billion tokens per month of working inference capacity, across 100+ models from small-and-fast to frontier-class reasoning.

The problem is that stacking them by hand is painful: a dozen different SDKs, a dozen different rate limits, a dozen places a request can fail. API-Gateway collapses all of that — every free tier, every custom endpoint, every local model — into one OpenAI-compatible endpoint. Point any OpenAI client library at your local server, and it routes transparently across whichever providers you've added keys for.

And when the built-in list isn't enough? You add your own. Any OpenAI-compatible HTTP endpoint becomes a provider in under a minute.

Supported providers

Google Gemini 2.5 Flash · 3.x previews	Groq Llama 3.3, Llama 4, GPT-OSS, Qwen3	Cerebras Qwen3 235B	OpenCode Zen DeepSeek V4 Flash · Nemotron (promo)
Mistral Large 3 · Medium 3.5 · Codestral · Devstral	OpenRouter 21 free-tier models	GitHub Models GPT-4.1 · GPT-4o	Cloudflare Kimi K2 · GLM-4.7 · GPT-OSS · Granite 4
Cohere Command R+ · Command-A (trial)	Z.ai (Zhipu) GLM-4.5 · GLM-4.7 Flash	NVIDIA NIM · 40 RPM free (eval-only ToS)	HuggingFace Router → DeepSeek V4 · Kimi K2.6 · Qwen3
Ollama Cloud GLM-4.7 · Kimi K2 · gpt-oss · Qwen3	Kilo Gateway :free routes (anon ok)	Pollinations GPT-OSS 20B (anon ok)	LLM7 GPT-OSS · Llama 3.1 · GLM (anon ok)
OVH AI Endpoints Qwen3.5 397B · GPT-OSS · Llama 3.3 (anon ok)

Don't see yours? Add it. Any OpenAI-compatible endpoint — cloud service, local server, homelab GPU — becomes a provider in under a minute. It gets the same fallback chain, the same intelligent routing, the same rate-limit protection as every built-in. See how →

Features

OpenAI-compatible — POST /v1/chat/completions and GET /v1/models work with the official OpenAI SDKs and any OpenAI-compatible client (LangChain, LlamaIndex, Continue, Cline, Codex, Claude Code via CC Switch, etc.). Just change base_url.
Responses API — POST /v1/responses (the wire format current Codex CLI versions require) is implemented as a translating shim over the same router, with full streaming events and tool calls.
Streaming and non-streaming — Server-Sent Events for stream: true, JSON response otherwise. Every provider adapter implements both.
Tool calling — OpenAI-style tools / tool_choice requests are passed through, and assistant tool_calls + tool role follow-up messages round-trip across providers.
Embeddings — /v1/embeddings with family-based routing: failover only ever happens between providers serving the same model (vectors from different models are incompatible), never across models. See Embeddings.
Intelligent automatic fallover — If the chosen provider returns a 429, 5xx, or times out, the router skips it, puts the key on a short cooldown, and retries on the next model in your fallback chain. No manual intervention, no dropped requests.
Per-key rate tracking — RPM, RPD, TPM, and TPD counters per (platform, model, key) so the router always picks a key that's under its caps. You never accidentally blow through a free tier's daily limit at 9 AM.
Per-key exhaustion recovery — Three retries per key on transient errors. When every key for a model is exhausted, the router doesn't give up — it drops to 1-RPM recovery mode, cycling through keys until one succeeds, then resumes normal operation. Your IDE never sees an error.
Provider-level parallel request gating — Cap concurrent requests per provider so a slow endpoint never starves faster ones of connection slots. Your Groq connection stays fast even when a local model is chewing on a long prompt.
Thompson-sampling bandit routing — The router learns which providers actually deliver and routes accordingly. Balanced, smartest-first, fastest-first, and reliability-first presets. Or fall back to classic manual priority ordering anytime.
Sticky sessions — Multi-turn conversations keep talking to the same model for 30 minutes to avoid the hallucination spike that comes from mid-conversation model switches.
Encrypted key storage — API keys are encrypted with AES-256-GCM before hitting SQLite; decryption happens in-memory just before a request.
Unified API key — Clients authenticate to your proxy with a single api-gateway-… bearer token. You never expose upstream provider keys to your apps.
LAN auto-trust — API-Gateway is a single-user tool, so the admin UI and /api/* routes skip the login form whenever the request comes from the local machine (loopback, RFC1918, link-local, IPv6 ULA / link-local). Remote callers still need an email + password account (scrypt-hashed, session-token auth), set on first run. The /v1 proxy keeps its own unified-key auth for apps.
Health checks — Periodic probes mark keys as healthy, rate_limited, invalid, or error so the router skips dead ones automatically.
Admin dashboard — React + Vite UI to manage keys, reorder the fallback chain, edit any model's properties, register custom providers (models are auto-discovered on creation), inspect analytics, and run prompts in a playground. Dark mode included.
Analytics — Per-request logging with latency, token counts, success rate, and per-provider breakdowns over 24h / 7d / 30d windows.
Context handoff on model switch — Optional. When a session falls over to a different model, injects one compact system message so the new model knows it is continuing an existing task. Disabled by default; enable with API_GATEWAY_CONTEXT_HANDOFF=on_model_switch. See Context Handoff.
Full configuration export / import — One JSON file carries the entire gateway configuration: every model and its ranks/limits/capabilities, the fallback chain order, all custom providers and their base URLs, every API key (with optional passphrase encryption under PBKDF2-SHA256-310k + AES-256-GCM), embedding family routing, routing strategy / weights / retry limits, and your quirks. The Settings page lets you pick sections, optionally protect keys with a passphrase, preview the import as a dry-run diff, then apply it. Three modes: skip-existing (safest default), overwrite, and replace. Envelopes are versioned, Zod-validated, and the entire import commits atomically (rollback on any error).

Not yet supported

The scope is deliberately narrow. If a feature isn't on this list and isn't below, assume it isn't there yet.

Image generation (/v1/images/*)
Audio / speech (/v1/audio/*)
Legacy completions (/v1/completions) — only the chat endpoint is implemented
Moderation (/v1/moderations)
n > 1 (multiple completions per request)
Per-user billing / multi-tenant auth — single-user by design

PRs that add any of these are very welcome. See Contributing.

Quick start

Prerequisites: Node.js 20+, npm.

git clone https://github.com/MLuqmanBR/api-gateway.git
cd api-gateway
npm install
cp .env.example .env

# Generate an encryption key for at-rest key storage
ENCRYPTION_KEY="$(node -e 'console.log(require("crypto").randomBytes(32).toString("hex"))')"
printf "ENCRYPTION_KEY=%s\nPORT=3001\n" "$ENCRYPTION_KEY" > .env

npm run dev

Open http://localhost:5173 (the Vite dev UI), add your provider keys on the Keys page, reorder the Fallback Chain to taste, and grab your unified API key from the Keys page header. That unified key is what you point your OpenAI SDK at.

Reaching the dev UI from another device on your LAN? Use npm run dev:lan — it passes --host through to Vite, which then prints a Network: http://<your-ip>:5173 URL you can open from a phone or another machine. (Plain npm run dev -- --host does not work here: the root dev script is a concurrently wrapper, so the flag never reaches Vite.) API calls go through Vite's dev proxy, so no extra server config is needed.

For a production build:

npm run build
node server/dist/index.js     # server + dashboard both served on :3001

ENCRYPTION_KEY is required for startup. The server only falls back to a database-stored development key when DEV_MODE=true and NODE_ENV is not production; do not use that fallback with real provider keys.

Request analytics are retained for 90 days or 100000 request rows by default, whichever limit prunes first. Set REQUEST_ANALYTICS_RETENTION_DAYS=0 or REQUEST_ANALYTICS_MAX_ROWS=0 in .env to disable either retention limit.

Using the API

Any OpenAI-compatible client works. Examples:

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="api-gateway-your-unified-key",
)

resp = client.chat.completions.create(
    model="auto",  # let the router pick; or specify e.g. "gemini-2.5-flash"
    messages=[{"role": "user", "content": "Summarise the fall of Rome in one sentence."}],
)
print(resp.choices[0].message.content)
print("Routed via:", resp.headers.get("x-routed-via"))

curl

curl http://localhost:3001/v1/chat/completions \
  -H "Authorization: Bearer api-gateway-your-unified-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "hi"}]
  }'

Streaming

stream = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Stream me a haiku about SQLite."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Tool calling

Pass OpenAI-style tools and tool_choice; the assistant response round-trips back through the proxy exactly like the OpenAI API. Multi-step flows (assistant tool_calls → tool role follow-up → final answer) work across every provider the router can reach.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# 1. Model asks for a tool call
first = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the weather in Karachi?"}],
    tools=tools,
    tool_choice="required",
)
call = first.choices[0].message.tool_calls[0]

# 2. You execute the tool, feed the result back
final = client.chat.completions.create(
    model="auto",
    messages=[
        {"role": "user", "content": "What's the weather in Karachi?"},
        first.choices[0].message,
        {"role": "tool", "tool_call_id": call.id, "content": '{"temp_c": 32, "cond": "sunny"}'},
    ],
    tools=tools,
)
print(final.choices[0].message.content)

Vision / image input

Send images with the standard OpenAI image_url content blocks (base64 data: URLs or http(s) URLs). When a request contains an image, the router restricts itself to vision-capable models and ignores text-only ones. Vision models are tagged with a Vision badge on the Fallback Chain page.

resp = client.chat.completions.create(
    model="auto",  # auto-routes to a vision model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,<...>"}},
        ],
    }],
)
print(resp.choices[0].message.content)

If no vision-capable model is enabled in your Fallback Chain, an image request returns a clear 422 (code: "no_vision_model") rather than silently dropping the image.

Every response carries an X-Routed-Via: <platform>/<model> header so you can see which provider actually served each call. If a request fell over between providers, you'll also see X-Fallback-Attempts: N.

Embeddings

/v1/embeddings is OpenAI-compatible, with one deliberate difference from chat routing: failover never crosses models. Vectors from different models live in incompatible spaces — silently switching models would corrupt any vector store built on top of the proxy. So embeddings route by family (one model identity + dimension), and failover only walks the providers serving that same family.

resp = client.embeddings.create(
    model="auto",          # default family; or a family name like "bge-m3"
    input=["the quick brown fox", "pack my box with five dozen liquor jugs"],
)
print(len(resp.data), "vectors of", len(resp.data[0].embedding), "dims")

curl http://localhost:3001/v1/embeddings \
  -H "Authorization: Bearer api-gateway-your-unified-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "input": "hello world"}'

model accepts auto (the configured default family), a family name, or a provider-specific model id (which resolves to its family). Available families:

Family (`model`)	Dims	Providers (failover order)
`gemini-embedding-001` (default)	3072	Google
`text-embedding-3-large`	3072	GitHub Models
`text-embedding-3-small`	1536	GitHub Models
`embed-v4.0`	1536	Cohere
`bge-m3`	1024	Cloudflare → Hugging Face
`qwen3-embedding-0.6b`	1024	Cloudflare
`nv-embedqa-e5-v5`	1024	NVIDIA
`llama-nemotron-embed-1b-v2`	2048	NVIDIA
`llama-nemotron-embed-vl-1b-v2`	2048	NVIDIA → OpenRouter
`embeddinggemma-300m`	768	Cloudflare

The default family, per-provider toggles, and priorities live on the dashboard's Models → Embeddings page.

Custom platforms and models

The built-in provider list is a starting point, not a boundary. From the Keys page, the Platforms grid is the unified catalog — every built-in platform you've added a key for, alongside every custom platform you've registered. The grid ends with an Add New Platform tile that opens a modal for:

Slug — a short identifier like my-ollama (lowercase letters, digits, dashes; 2-32 chars; cannot collide with a built-in).
Display name — shown in the dashboard.
Base URL — the OpenAI-compatible endpoint, e.g. http://192.168.1.10:11434/v1.
Rate limits (optional) — RPM, RPD, TPM, TPD caps enforced per-provider. A local Ollama that can handle 200 RPM gets different treatment than a free-tier cloud endpoint limited to 4.
Max parallel requests (optional) — concurrency ceiling so this provider never hogs all connection slots.

Once a platform exists, its models are auto-discovered from the endpoint's /v1/models during creation. You can also re-run discovery at any time via POST /api/custom-providers/:slug/sync-models. Or use the Add a model form to register models manually:

Model ID and Display name — required.
Context window, Supports tools, Supports vision — basic flags.
Advanced toggle exposes intelligence rank, speed rank, size label, per-model rate limits, and max output tokens.

The model joins the fallback chain at the lowest priority and shows up everywhere built-in models do — /v1/models, the Fallback page, the Analytics page. You can edit any model (built-in or custom) later: adjust ranks, toggle tools/vision, cap output tokens, change rate limits — all from the dashboard.

Adding an API key for a custom platform works the same as for a built-in: pick the custom slug in the Add a provider key form, paste the bearer (or leave blank for local servers that don't need one), and the key routes to your endpoint.

Removing a custom platform cascades — it drops every model on that platform, every key, and every fallback entry. There's no "leaving a model orphaned" state. Removing a custom platform cascades — it drops every model on that platform, every key, and every fallback entry. There's no "leaving a model orphaned" state.

Settings & backup

The Settings page is where you back up the entire gateway configuration and restore it elsewhere — useful when you promote a curated setup from your laptop to a server, sync tweaks between staging and production, or wipe and recover from a disaster. One versioned JSON envelope carries:

Every model in the catalog, with its ranks, capabilities, per-model rate limits, context window, max output tokens, and vision / tools flags.
The full fallback chain (ordered list of (platform, model_id) pairs with their enabled state).
All custom providers and their base URLs, rate limits, max-parallel ceiling, keyless / api-format flags.
Every API key — encrypted under a passphrase you supply at export time, so the file is safe to commit to a private repo, drop in a chat, or copy via scp.
Embedding family configuration (per-family provider order, dimensions, default family).
Routing strategy, custom weights, and global retry limit.
Your authored quirks (title / body / severity / per-target model globs).

Three endpoints sit behind /api/config (gated by the dashboard session):

Endpoint	Purpose
`GET /api/config/inventory`	Row counts per exportable section. The Settings page calls this on load to populate the "Current configuration" card.
`POST /api/config/export`	Build an envelope. JSON body may include `sections`, `passphrase`, `label`, `download`. When `download` is true, the response carries a `Content-Disposition: attachment` header so the browser saves the file.
`POST /api/config/preview`	Parse + validate an envelope without committing. Returns the parsed section counts, schema version, generator, and whether the keys blob is encrypted.
`POST /api/config/import`	Apply an envelope. Body: `{ envelope, options: { mode, dryRun, passphrase, sections? } }`. The whole import runs inside a single SQLite transaction and rolls back atomically on any structural error.

Merge modes

skip-existing (default) — never touch rows that already exist. New rows are inserted. Safest when restoring into a populated database.
overwrite — update existing rows in place, insert the rest.
replace — wipe the destination section, then insert from the envelope.

Passphrase protection

If you supply a passphrase at export time, the envelope's keysCipher blob is a self-describing PBKDF2-SHA256-310,000-derived AES-256-GCM ciphertext holding the plaintext API keys. The api_keys section in the same envelope strips the plaintext key field — only the ciphertext fields (encryptedKey, iv, authTag) remain, so even a malicious actor with the file but no passphrase cannot recover the keys. On import, supply the same passphrase at the import side to decrypt. A wrong passphrase returns 401 with a clear "could not decrypt keysCipher" message.

If you skip the passphrase, the api_keys section includes the plaintext key directly. Fine for an offline backup to your own machine; not safe to share. The UI nudges you toward providing one.

Built-in models

Catalog rows for built-in platforms (google, groq, cerebras, …) are normally read-only on import — they're seeded by the server and shouldn't drift between instances. If you intentionally want to carry rank tweaks across installs, set overwriteBuiltin: true on the affected model record. Custom-provider models are never locked.

Dry-run

Every import can be sent with dryRun: true. The server runs the exact same code path inside a SQLite SAVEPOINT, then rolls back. The response includes a diff summary — for each section: how many records were added, updated, skipped, or removed, plus any per-record errors. Always dry-run first when applying against a populated database; the UI's Run dry-run button is one click away from Apply import.

Screenshots

Keys

Manage provider credentials and grab the unified API key your apps connect with. Each key shows a status dot and when it was last health-checked. Custom platforms appear alongside built-in ones in the unified grid.

Playground

Send a chat completion through the router and see which provider served it, with the model ID and latency printed right on the message.

Analytics

Request volume, success rate, tokens in and out, average latency, and per-provider breakdowns over 24h / 7d / 30d windows.

How it works

┌──────────────────┐   Bearer api-gateway-…   ┌─────────────────────────┐
│  OpenAI SDK /    │ ──────────────────────▶ │  Express proxy (:3001)  │
│  curl / any      │ ◀────────────────────── │  /v1/chat/completions   │
│  OpenAI client   │      streamed tokens    └────────────┬────────────┘
└──────────────────┘                                      │
                                                          ▼
                             ┌────────────────────────────────────────────────┐
                             │  Router                                        │
                             │   1. Pick highest-priority model that          │
                             │      (a) has a healthy key and                 │
                             │      (b) is under all its rate limits.         │
                             │   2. Decrypt key, call provider SDK.           │
                             │   3. On 429/5xx → cooldown + retry next model. │
                             │   4. On key exhaustion → cycle keys, 1-RPM.    │
                             └────────────────────────────────────────────────┘
                                          │
   ┌──────────────┬────────────┬──────────┴─────────┬─────────────┬──────────┐
   ▼              ▼            ▼                    ▼             ▼          ▼
 Google         Groq        Cerebras           OpenRouter        HF       …and more
                                +
                        Custom Providers
                    (any OpenAI-compatible endpoint)

Router (server/src/services/router.ts) — picks a model per request using Thompson-sampling bandit scoring or manual priority chain.
Scoring (server/src/services/scoring.ts) — Thompson-sampling bandit: reliability × speed × intelligence, with headroom and rate-limit guardrails.
Rate-limit ledger (server/src/services/ratelimit.ts) — in-memory RPM/RPD/TPM/TPD counters backed by SQLite, with cooldowns on 429s.
Key exhaustion (server/src/services/key-exhaustion.ts) — per-key 3-retry → key cycling → 1-RPM recovery mode.
Provider adapters (server/src/providers/*.ts) — one file per provider, implementing the Provider base class: chatCompletion() and streamChatCompletion(). Custom providers are resolved from the custom_providers table at request time.
Health service (server/src/services/health.ts) — periodic probe keeps key status fresh.
Dashboard (client/) — React + Vite + shadcn/ui admin surface.
Storage — SQLite (better-sqlite3) with AES-256-GCM envelope encryption for keys.

Context Handoff

When API-Gateway falls over to a different model mid-conversation (quota, rate limit, cooldown), the new model has no idea it is picking up someone else's task. Context handoff adds a single compact system message to the outbound request that tells the new model exactly that:

API-Gateway context handoff:
You are taking over an ongoing conversation from another model (groq:llama-3 → google:gemini-flash).
Continue the user's task using the conversation context already provided in this request.
Do not restart the task, re-ask already answered setup questions, or discard prior tool results.
Respect the user's latest message as the highest-priority instruction.

Recent session summary:
User: …
Assistant: …

Enable it in .env:

API_GATEWAY_CONTEXT_HANDOFF=on_model_switch

How it works:

Messages per session are stored in memory (TTL: 3 hours).
Only injected when the selected model changes for a given session key.
Not injected on the first request, on same-model continuations, or if a handoff message is already present.
Session key: X-Session-Id header if present, otherwise SHA-1 of the first user message (same as sticky sessions).
Storage is in-memory only. Nothing is written to disk or logged.

Important: Context Handoff improves continuity for conversations routed through API-Gateway. It cannot recover provider-internal hidden state or messages that were never sent to the proxy.

Limitations

Stacking free tiers — even with custom providers in the mix — has real trade-offs. Be honest with yourself about them:

No frontier models out of the box. The free-tier catalog tops out around Llama 3.3 70B, GLM-4.5, Qwen 3 Coder, and Gemini 2.5 Pro. You will not get GPT-5 or Claude Opus class reasoning through the built-in providers. For hard problems, pay for a real API — or bring your own paid provider as a custom platform.
Intelligence degrades as the day progresses. Your top-ranked models (usually Gemini 2.5 Pro, GPT-4o via GitHub Models) have the lowest daily caps. Once they hit their limits, the router falls down your priority chain to smaller/weaker models. Expect the effective intelligence of the endpoint to drop in the late hours of each day — then reset at UTC midnight.
Latency is highly variable. Cerebras and Groq are extremely fast; others are not. You get whichever one is available at the moment.
Free tiers can change without notice. Providers regularly tighten, loosen, or remove free tiers. When that happens you'll see 429s or auth errors until the catalog catches up. Re-seed scripts live in server/src/scripts/.
No SLA, by definition. If you need reliability, use a paid provider with a contract — either directly or plugged into API-Gateway as a custom platform.
Local-first. There's no multi-tenant auth. Run this for yourself; don't expose it to the internet.

Contributing

Contributors very welcome! Good first PRs:

Add a provider — copy server/src/providers/openai-compat.ts as a template, wire it into server/src/providers/index.ts, seed its models in server/src/db/migrations.ts, add a test in server/src/__tests__/providers/.
Add an endpoint — images, moderations, audio. The provider base class can grow new methods; adapters declare which they support.
Improve the router — cost-aware routing (cheapest-healthy-fastest tradeoffs), better latency-weighted priority, regional pinning.
Dashboard polish — charts on the Analytics page, key rotation UX, batch import of keys from .env.
Docs — more examples, client library snippets for Go/Rust/etc.

Development loop:

npm install
npm run dev      # server on :3001, dashboard on :5173, both with HMR
npm test         # server vitest; also runs client tests if the workspace adds them
npm run build    # compile server and dashboard

PRs should include a test, keep the existing test suite green, and match the .editorconfig / tsconfig defaults already in the repo. Issues and discussions are open.

Contributors

Terms of Service review

A self-hosted, single-user, personal-use setup was re-reviewed against each provider's ToS (May 2026). Summary:

Provider	Verdict	Notes
Google Gemini	⚠️ Caution	March 2026 ToS narrows scope to "professional or business purposes, not for consumer use" — a self-hosted developer proxy is still defensible, but the clause is new.
Groq	✅ Likely OK	GroqCloud Services Agreement permits Customer Application integration.
Cerebras	✅ Likely OK	Permitted; explicitly forbids selling/transferring API keys.
Mistral	✅ Likely OK	APIs allowed for personal/internal business use.
OpenRouter	✅ Likely OK	April 2026 ToS sharpens the no-resale / no-competing-service clause; private single-user proxy still fine.
Cloudflare Workers AI	⚠️ Ambiguous	No anti-proxy clause; covered by general Self-Serve Subscription Agreement.
NVIDIA NIM	⚠️ Caution	Trial ToS §1.2 / §1.4: "evaluation only, not production." Free access is a recurring 40 RPM rate limit (the 2025 credit system was discontinued), but the evaluation-only scope stands.
GitHub Models	⚠️ Caution	Free tier explicitly scoped to "experimentation" and "prototyping."
Cohere	❌ Avoid	Terms §14 still forbids "personal, family or household purposes."
Zhipu (open.bigmodel.cn)	✅ Likely OK	Personal/non-commercial research carve-out still in the platform docs.
Z.ai (api.z.ai)	⚠️ Caution	Singapore entity (distinct from Zhipu CN). §III.3(l) anti-traffic-redirect clause could plausibly be read against a proxy; no explicit personal-use carve-out.
Ollama Cloud	✅ Likely OK	Free plan permits cloud-model access (1 concurrent, 5-hour session caps). No anti-proxy / anti-resale clauses found.
OVH AI Endpoints	✅ Likely OK	Anonymous access is officially documented (2 req/min per IP per model). OVH reserves the right to introduce token/consumption caps.

Rules of thumb that keep most providers happy: one account per provider, no reselling, no sharing your endpoint with other humans, don't hammer a free tier as a paid production backend. This is informational, not legal advice — read each provider's ToS and make your own call.

Disclaimer

This project is for personal experimentation and learning, not production. Free tiers exist so developers can prototype against them; they aren't a stable, supported inference substrate and shouldn't be treated as one. If you build something real on top of API-Gateway, swap in a paid API before you ship. Your relationship with each upstream provider is governed by the terms you accepted when you created your account — those terms still apply when the traffic is proxied through this project, and you're responsible for complying with them.

Built on tashfeenahmed/freellmapi. Fork maintained by MLuqmanBR.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 313 Commits
.github/workflows		.github/workflows
client		client
docs		docs
repo-assets		repo-assets
scripts		scripts
server		server
shared		shared
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
RULES.md		RULES.md
package-lock.json		package-lock.json
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

API-Gateway

Contents

What this fork adds

Head-to-head comparison: what the original lacks and how this fork solves it

Also different from the original

Why this exists

Supported providers

Features

Not yet supported

Quick start

Using the API

Embeddings

Custom platforms and models

Settings & backup

Merge modes

Passphrase protection

Built-in models

Dry-run

Screenshots

Keys

Playground

Analytics

How it works

Context Handoff

Limitations

Contributing

Contributors

Terms of Service review

Disclaimer

License

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

API-Gateway

Contents

What this fork adds

Head-to-head comparison: what the original lacks and how this fork solves it

Also different from the original

Why this exists

Supported providers

Features

Not yet supported

Quick start

Using the API

Embeddings

Custom platforms and models

Settings & backup

Merge modes

Passphrase protection

Built-in models

Dry-run

Screenshots

Keys

Playground

Analytics

How it works

Context Handoff

Limitations

Contributing

Contributors

Terms of Service review

Disclaimer

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages