Releases: Devansh-365/freellm
v1.5.1: Gemini reasoning fix, NIM json_schema, truncation warning
Fixes a user-reported bug where Gemini 2.5 Flash returned ~30 tokens
regardless of how high max_tokens was set, hardens the response cache
against poisoning by truncated responses, and adds JSON mode
reliability improvements for NVIDIA NIM and all providers.
Fixed
Gemini 2.5 reasoning budget no longer eats max_tokens
Gemini 2.5 Flash and 2.5 Pro are reasoning models. With their default
thinking budget they burned 90-98% of max_tokens on internal reasoning
before producing visible text. A caller asking for max_tokens=1000
routinely got back 30-40 tokens with finish_reason=length.
The Gemini provider adapter now injects a per-model default
reasoning_effort when the caller does not set one:
- `gemini-2.5-flash` defaults to `"none"` (accepts a zero thinking budget, returns the full requested output)
- `gemini-2.5-pro` defaults to `"low"` (the minimum Google accepts for this model, which requires a non-zero thinking budget)
Clients that want full reasoning can pass reasoning_effort: "high"
explicitly. The adapter also normalizes the output budget onto
max_completion_tokens only and deletes max_tokens from the
outgoing request, because Gemini returns 400 when both are present.
Verified against the live Gemini API: the same prompt that produced
37 tokens before now returns 670+ tokens with finish_reason=stop.
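For reference, a minimal sketch of the default-injection logic; the real adapter's mapRequest hook may be shaped differently, so treat field handling here as an assumption:

```ts
type ReasoningEffort = "none" | "low" | "medium" | "high";

const DEFAULT_EFFORT: Record<string, ReasoningEffort> = {
  "gemini-2.5-flash": "none",
  "gemini-2.5-pro": "low",
};

function mapGeminiRequest(body: Record<string, unknown> & { model: string }) {
  const out = { ...body };

  // Inject the per-model default only when the caller did not choose an effort.
  if (out.reasoning_effort === undefined && DEFAULT_EFFORT[out.model]) {
    out.reasoning_effort = DEFAULT_EFFORT[out.model];
  }

  // Gemini rejects requests carrying both budget fields with a 400,
  // so normalize onto max_completion_tokens and drop max_tokens.
  if (out.max_tokens !== undefined && out.max_completion_tokens === undefined) {
    out.max_completion_tokens = out.max_tokens;
  }
  delete out.max_tokens;

  return out;
}
```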
Deprecated Gemini 2.0 models removed from catalog
gemini-2.0-flash and gemini-2.0-flash-lite both returned 404 "no
longer available to new users" from the live API. Removed from the
model list so callers cannot pick a dead model.
Response cache no longer stores truncated responses
When any choice in the upstream response carries
finish_reason=length, the cache now refuses to store it. Previously
a single truncated response from a reasoning model would pin the
bad answer for the entire TTL window (default 1 hour), causing the
"sometimes fails" pattern the user reported.
Cache key expanded to prevent cross-shape collisions
The cache key now includes tools, tool_choice, parallel_tool_calls,
response_format, reasoning_effort, seed, max_completion_tokens,
presence_penalty, and frequency_penalty. Previously two requests
with the same prompt but different tool definitions or response
formats would share a cache entry.
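A sketch of how such a key can be derived, assuming the cache hashes a canonical JSON of those fields (field order and helper name are illustrative):

```ts
import { createHash } from "node:crypto";

function cacheKey(req: Record<string, unknown>): string {
  // Every field that changes the shape of the answer goes into the key.
  const shape = {
    model: req.model,
    messages: req.messages,
    temperature: req.temperature,
    max_tokens: req.max_tokens,
    max_completion_tokens: req.max_completion_tokens,
    top_p: req.top_p,
    stop: req.stop,
    tools: req.tools,
    tool_choice: req.tool_choice,
    parallel_tool_calls: req.parallel_tool_calls,
    response_format: req.response_format,
    reasoning_effort: req.reasoning_effort,
    seed: req.seed,
    presence_penalty: req.presence_penalty,
    frequency_penalty: req.frequency_penalty,
  };
  return createHash("sha256").update(JSON.stringify(shape)).digest("hex");
}
```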
Added
NVIDIA NIM json_schema translation
NIM's OpenAI-compat endpoint does not support
response_format: { type: "json_schema" }. It requires the schema
in a vendor-specific nvext.guided_json field. The NIM provider
adapter now translates the standard parameter into the NIM-native
format automatically and removes the unsupported response_format
field from the outgoing request. json_object mode and requests
without response_format pass through untouched.
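A sketch of the translation; the adapter structure is assumed rather than taken from the source:

```ts
function mapNimRequest(body: any) {
  const rf = body.response_format;
  if (rf?.type !== "json_schema") return body; // json_object and plain requests pass through

  const { response_format, ...rest } = body;
  return {
    ...rest,
    // NIM expects the raw schema under its vendor extension field.
    nvext: { ...(rest.nvext ?? {}), guided_json: rf.json_schema?.schema },
  };
}
```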
JSON truncation warning header
When a JSON-mode request (json_object or json_schema) hits
max_tokens and the output is almost certainly broken mid-token,
the response now carries:
X-FreeLLM-Warning: json-possibly-truncated
The caller knows immediately that the JSON is likely incomplete
without needing to attempt a parse.
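A client-side sketch (gateway URL and model are illustrative):

```ts
async function jsonCompletion(messages: unknown[]) {
  const res = await fetch("https://your-gateway/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.FREELLM_API_KEY}`,
    },
    body: JSON.stringify({
      model: "free-fast",
      messages,
      response_format: { type: "json_object" },
    }),
  });

  // Bail out before parsing when the gateway flags likely truncation.
  if (res.headers.get("X-FreeLLM-Warning") === "json-possibly-truncated") {
    throw new Error("JSON likely truncated; retry with a larger max_tokens");
  }
  return res.json();
}
```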
reasoning_effort accepted in the request schema
The Zod schema now accepts reasoning_effort: "none" | "low" | "medium" | "high" so clients can override the per-model default.
Matches Gemini's OpenAI-compat knob and OpenAI's o-series
reasoning parameter.
finish_reason surfaced in the request log
The request log entry now carries a finishReason field populated
from the upstream response. When the reason is "length" the router
also emits a pino warning tagged with provider, model, max_tokens,
and reasoning_effort so operators can see which requests hit the
token cap.
Configuration
New configuration surface (a request field rather than an environment variable):
| Field | Default | What it does |
|---|---|---|
| `reasoning_effort` | Per-model | `"none"` for `gemini-2.5-flash`, `"low"` for `gemini-2.5-pro`. Clients can override per request. |
Tests
262 passing tests across 22 files (up from 215 at v1.5.0):
- `gemini-provider.test.ts`: 14 tests for per-model defaults, mapRequest normalization, and catalog verification
- `cache.test.ts`: 17 tests for isCacheable, truncation skip, and key discrimination across all request shape fields
- `nim-provider.test.ts`: 6 tests for json_schema-to-nvext translation, passthrough, and field preservation
- `json-truncation.test.ts`: 4 e2e tests for the warning header
- `schema-tools.test.ts`: 2 new tests for reasoning_effort
- `router.test.ts`: 4 new tests for finish_reason handling and cache anti-poisoning
v1.5.0: Browser-safe tokens and streaming correctness
The browser-safe release. FreeLLM can now be safely exposed to an app's
end users via short-lived HMAC-signed tokens that are bound to an origin
and an identifier. Streaming tool calls from Gemini and Ollama are fixed
at the gateway, the Zod schema finally accepts the full OpenAI request
shape, the dashboard surfaces the new trust layer, and every layer was
verified end-to-end with the real openai npm SDK hitting real provider
APIs.
Added
Browser-safe short-lived tokens
Operators can now mint stateless HMAC-signed bearer tokens that are
safe to ship to a browser. Integration pattern: a one-file serverless
function calls POST /v1/tokens/issue with the master or virtual key,
returns the minted token to the browser, and the browser uses the
token directly with any OpenAI SDK. No auth backend, no session
store, no database.
- New module `src/gateway/browser-token.ts` with pure `signBrowserToken` and `verifyBrowserToken` helpers. HMAC-SHA256 over a JSON payload, constant-time signature comparison, base64url encoding.
- Token format: `flt.<base64url(payload)>.<hex(hmac)>`.
- Payload v1 carries `v`, `iat`, `exp`, `origin`, and optional `identifier` and `vk` (virtual key id).
- Max TTL 900 seconds (15 minutes), clamped at issue time.
- Origin is embedded in the token and compared against the browser's `Origin` header on every verify. Mismatch = reject.
- `FREELLM_TOKEN_SECRET` must be at least 32 bytes. Enforced both at boot and on every sign/verify operation. A short secret is a fatal boot failure so no path can produce a weak token.
- Constant-time signature comparison via `timingSafeEqual`.
- Secret rotation immediately invalidates all outstanding tokens (intentional kill switch for compromised deployments).
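A minimal sketch of the scheme described above; the real helpers in `src/gateway/browser-token.ts` also clamp TTL and enforce the secret length at boot:

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

interface BrowserTokenPayload {
  v: 1;
  iat: number;            // issued-at, unix seconds
  exp: number;            // expiry, unix seconds
  origin: string;
  identifier?: string;
  vk?: string;            // virtual key id, when a virtual key minted the token
}

function sign(payload: BrowserTokenPayload, secret: string): string {
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  const mac = createHmac("sha256", secret).update(body).digest("hex");
  return `flt.${body}.${mac}`;
}

function verify(token: string, secret: string, requestOrigin: string): BrowserTokenPayload | null {
  const [prefix, body, mac] = token.split(".");
  if (prefix !== "flt" || !body || !mac) return null;

  const expected = createHmac("sha256", secret).update(body).digest("hex");
  const a = Buffer.from(mac, "hex");
  const b = Buffer.from(expected, "hex");
  if (a.length !== b.length || !timingSafeEqual(a, b)) return null; // bad_signature

  const payload = JSON.parse(Buffer.from(body, "base64url").toString()) as BrowserTokenPayload;
  if (payload.exp * 1000 < Date.now()) return null;                 // expired
  if (payload.origin !== requestOrigin) return null;                // origin_mismatch
  return payload;
}
```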
POST /v1/tokens/issue endpoint
Admin-auth-equivalent mint endpoint that any existing master key,
admin key, or virtual key can call. Browser tokens themselves cannot
mint new browser tokens (chain guard). When a virtual key is the
issuer, the resulting token inherits its id so the existing Phase 2
cap enforcement flows through untouched.
curl https://your-gateway/v1/tokens/issue \
-H "Authorization: Bearer $FREELLM_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"origin": "https://yoursite.com",
"identifier": "session-abc",
"ttlSeconds": 900
}'
# Response:
# {
# "token": "flt.eyJ2IjoxLCJpYXQiOi...",
# "expiresAt": "2026-04-09T07:15:00.000Z",
# "origin": "https://yoursite.com",
# "identifier": "session-abc"
# }

Origin must be `https://*` or `http://localhost[:port]`. Identifier
must match `^[A-Za-z0-9_.:-]{1,128}$`, same pattern as the per-user
rate limiter. TTL clamped to [1, 900] seconds.
Browser token authentication
The auth middleware now recognizes flt.* bearers alongside the
master key, admin key, and virtual keys. Verification checks the
signature, expiry, and Origin header atomically. On success:
- `req.browserToken` carries the verified payload
- `req.virtualKey` is hydrated if the token carries a virtual key id
- The token's identifier is copied into `X-FreeLLM-Identifier` so the existing per-user rate limiter picks it up with zero extra code
- All other middleware (privacy routing, strict mode, streaming normalizer, cap enforcement) works unchanged
On failure the middleware returns 401 invalid_api_key with the
specific rejection reason in the message (origin_mismatch, expired,
bad_signature, etc.).
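For illustration, a browser-side sketch of the intended flow; the `/api/token` mint endpoint and its response shape are assumptions for this example, not part of the gateway:

```ts
import OpenAI from "openai";

// Fetch a short-lived flt.* token from your own serverless mint function,
// then point the stock openai SDK at the gateway.
const { token } = await fetch("/api/token").then((r) => r.json());

const client = new OpenAI({
  baseURL: "https://your-gateway/v1",
  apiKey: token,                  // the browser token acts as the bearer credential
  dangerouslyAllowBrowser: true,  // acceptable here: the token is origin-bound and expires in minutes
});

const completion = await client.chat.completions.create({
  model: "free-fast",
  messages: [{ role: "user", content: "Hello!" }],
});
```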
Streaming tool_call normalization
Every chat completion stream now flows through a per-provider
normalizer that sits between the upstream fetch and the downstream
response. Fixes three widely-reported bugs without touching the
upstreams or the clients:
- Gemini `index` field missing on streaming tool_call deltas: the normalizer assigns an index per function name so argument fragments across chunks land on the same logical tool call, and stamps `type: "function"` where it's missing. Multi-tool parallel calls get distinct indices tracked by name.
- Ollama flat-argument hoisting: Ollama sometimes emits `arguments` at the top level of a tool_call rather than inside `function`. The normalizer hoists it into the expected shape and reuses the last-seen index for subsequent fragments.
- Malformed chunk resilience: every transform is wrapped in try/catch so one bad byte from upstream logs a warning and forwards the original event verbatim instead of crashing the whole response.
New module tree under `src/gateway/streaming/`:
- `sse.ts`: tolerant SSE parser and serializer, handles CRLF, partial chunks, `[DONE]`, and comment heartbeats.
- `types.ts`: minimal OpenAI chunk shape plus the `Normalizer` interface.
- `normalizer.ts`: per-provider dispatcher with passthrough default.
- `passthrough.ts`: no-op for Groq, Cerebras, NIM, and Mistral.
- `gemini.ts` and `ollama.ts`: provider-specific fixes.
- `pipeline.ts`: ties parser, dispatcher, and serializer together with defensive try/catch at every stage.
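A sketch of the Gemini index-assignment idea (not the actual `gemini.ts` source): track one index per function name so later argument fragments attach to the same logical tool call.

```ts
interface ToolCallDelta {
  index?: number;
  type?: "function";
  function?: { name?: string; arguments?: string };
}

function makeGeminiToolCallNormalizer() {
  const indexByName = new Map<string, number>();
  let lastIndex = -1;

  return (call: ToolCallDelta): ToolCallDelta => {
    const name = call.function?.name;
    if (name !== undefined) {
      if (!indexByName.has(name)) indexByName.set(name, indexByName.size);
      lastIndex = indexByName.get(name)!;
    }
    return {
      ...call,
      index: call.index ?? (lastIndex >= 0 ? lastIndex : 0), // reuse last-seen index for fragments
      type: call.type ?? "function",                          // stamp the missing type field
    };
  };
}
```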
A heartbeat comment `\n: keep-alive\n\n` is written every
`STREAM_IDLE_TIMEOUT_MS` (default 30s) so proxies on the path like
Railway and Cloudflare do not drop slow streams. Client disconnect
is detected via res.on('close') and cancels the upstream reader
so we do not burn provider quota into a dead socket.
Full OpenAI request shape accepted
The Zod schema and matching TypeScript types now accept the full
OpenAI Chat Completions surface, including:
- `tools` array with nested `function` definition, name, description, parameters, and strict flag
- `tool_choice` as `"none" | "auto" | "required"` or a specific function reference
- `parallel_tool_calls`
- `response_format` text / json_object / json_schema envelope
- `stream_options.include_usage` for per-chunk usage accounting
- `max_completion_tokens` alongside `max_tokens`
- `presence_penalty` and `frequency_penalty` validated in `[-2, 2]`
- `seed` and `user` for observability
- `tool_call_id` on tool-role messages
- `tool_calls` array on assistant-role messages
- `developer` role (used by newer OpenAI models)
Strict mode is preserved so genuinely unknown fields still throw.
Only the known OpenAI surface is accepted.
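For reference, a tool-calling request shape that now validates (values are illustrative, not taken from the project's tests):

```ts
const request = {
  model: "free-smart",
  messages: [{ role: "user", content: "What's the weather in Pune?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Look up current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
        strict: true,
      },
    },
  ],
  tool_choice: "auto",
  parallel_tool_calls: true,
  stream: true,
  stream_options: { include_usage: true },
  seed: 42,
};
```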
Caught while verifying the streaming normalizer with a real openai
npm SDK client: the previous schema rejected every tool-calling
request with 400 before it ever reached the streaming pipeline, so
the normalizer was unreachable in practice for the exact audience it
was built for. Fixed and regression-tested against the shapes the
real SDK sends.
Dashboard v1.5 surface
- Provider cards carry a color-coded Trust badge: `NO-TRAIN` (emerald), `LOCAL` (sky), `CONFIG` (amber), or `TRAINS` (rose). Click opens the provider's actual terms of service URL in a new tab; hover shows the last-verified date.
- Virtual Keys panel on the main dashboard page lists every loaded virtual sub-key with per-key progress bars for daily request cap and daily token cap (rose tint when usage crosses 90%), allowed models surfaced as monospace badges, expired keys flagged, and a persistent amber reminder that caps are soft.
- Browser Tokens card shows a live enabled/disabled status with an emerald pulsing dot when `FREELLM_TOKEN_SECRET` is set, max TTL and min secret length visible at a glance, a link out to the `/browser-integration` docs page, and an amber callout on the disabled state telling operators exactly which env var to set.
- The `GET /v1/status` response grew a new `browserTokens` field `{ enabled, minSecretBytes, maxTtlSeconds }` plus a `privacy` block on every `ProviderStatus`, sourced from the `PROVIDER_PRIVACY` catalog.
- New admin-only endpoint `GET /v1/status/virtual-keys` returns a masked inventory of loaded virtual sub-keys (first 12 chars plus last 4) with cap limits, remaining usage, allowed models, expiry, and a `softCapWarning` string reminding the caller that counters reset on restart.
Browser integration docs and runnable example
- `packages/website/src/content/docs/browser-integration.mdx` walks through the full flow: mint from a Node or Flask backend, use the token in the browser via the official openai SDK (with `dangerouslyAllowBrowser: true`), handle expiry, verify origin binding, secret rotation, common pitfalls.
- New Integration sidebar group on the Starlight website. 17 pages total, up from 16.
- `examples/browser-chatbot/` copy-paste runnable demo: a 108-line self-contained HTML chatbot using the openai npm package via esm.sh, a 37-line Vercel serverless function that mints tokens server-side from env vars, and a 58-line README covering setup, local development, Vercel deploy, and security notes. No frameworks, no build step.
Changed
- `auth` middleware now accepts `FREELLM_ADMIN_KEY` as a valid base credential alongside the master key and virtual keys, so the `admin_required` 403 path is actually reachable when an operator runs with a distinct admin token. Previously the admin-only routes were unreachable without also setting `FREELLM_API_KEY`.
- `ProviderStatusInfo` gained an optional `privacy` block populated from the `PROVIDER_PRIVACY` catalog with `policy`, `sourceUrl`, and `lastVerified` fields.
- `GatewayStatus` gained a `browserTokens` block surfacing the boot state of the feature.
Configuration
New environment variables:
| Variable | Default | What it does |
|---|---|---|
| `FREELLM_TOKEN_SECRET` | unset | HMAC secret for browser tokens, minimum 32 bytes. Short = fatal boot failure. Unset = browser tokens disabled, the rest of the gateway runs unchanged. |
| `STREAM_IDLE_TIMEOUT_MS` | `30000` | Heartbeat cadence for the SSE keep-alive comment. |
Previously shipped in v1.4.0 and still supported:
FREELLM_IDENTIFIER_LIMIT, FREELLM_IDENTIFIER_MAX_BUCKETS,
FREELLM_VIRTUAL_KEYS_PATH.
Tests
232 passing tests across 19 files (up from 146 at v1.4.0):
- `tests/browser-token.test...
v1.4.0: Honest Gateway
The honest-gateway release. FreeLLM now tells you exactly which provider
answered your request, lets you refuse silent downgrades, routes around
providers that train on your prompts, returns enriched retry hints on 429s,
and can be safely exposed to an app's end users via virtual sub-keys and
per-identifier rate limiting. Everything ships with a real test suite and
no new runtime dependencies.
Added
Transparent routing headers
Every chat completion response now carries observability headers so
clients can see exactly how the request was handled:
- `X-FreeLLM-Provider`— the concrete provider id that served the response
- `X-FreeLLM-Model`— the resolved concrete model id
- `X-FreeLLM-Requested-Model`— the original model asked for
- `X-FreeLLM-Cached`— `true` when the response came from the cache
- `X-FreeLLM-Route-Reason`— one of `direct`, `meta`, `cache`, `failover`
- `X-Request-Id`— a unique trace id that also appears in logs and error bodies
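A client-side sketch of reading those headers; it relies on the openai SDK's withResponse() helper, and the gateway URL is illustrative:

```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://your-gateway/v1",
  apiKey: process.env.FREELLM_API_KEY!,
});

// withResponse() exposes the raw fetch Response alongside the parsed completion,
// so the routing headers can be logged next to the answer.
const { data, response } = await client.chat.completions
  .create({ model: "free", messages: [{ role: "user", content: "hi" }] })
  .withResponse();

console.log(data.choices[0].message.content, {
  provider: response.headers.get("X-FreeLLM-Provider"),
  cached: response.headers.get("X-FreeLLM-Cached"),
  routeReason: response.headers.get("X-FreeLLM-Route-Reason"),
  requestId: response.headers.get("X-Request-Id"),
});
```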
Strict mode
Opt-in via X-FreeLLM-Strict: true. In strict mode the router refuses
to substitute models. Meta-models (free, free-fast, free-smart)
are rejected with a clear 400. Concrete models are tried against
exactly one provider and the upstream error surfaces verbatim if that
provider fails. No silent failover, no cache hit masquerading as fresh.
Actionable 429 bodies
When all providers are exhausted, the gateway now returns a structured
body instead of a generic error:
{
"error": {
"type": "rate_limit_error",
"code": "all_providers_exhausted",
"message": "...",
"retry_after_ms": 12000,
"providers": [
{ "id": "groq", "retry_after_ms": 12000, "keys_available": 0, "keys_total": 1, "circuit_state": "closed" },
{ "id": "gemini", "retry_after_ms": 5000, "keys_available": 0, "keys_total": 1, "circuit_state": "closed" }
],
"suggestions": [
{ "model": "free-fast", "available_in_ms": 5000 },
{ "model": "free-smart", "available_in_ms": 5000 }
],
"request_id": "..."
}
}

The response also carries an HTTP Retry-After header in seconds.
Unified error SDK
New src/errors/ module defines the one and only error taxonomy the
gateway emits. Fifteen concrete error codes grouped into seven types
that match OpenAI's shape (invalid_request_error, authentication_error,
permission_error, not_found_error, rate_limit_error, provider_error,
internal_error). Every middleware delegates via next(freellmError(...))
instead of writing response bodies directly, and the central error
handler funnels everything through a single toBody() serializer.
- `freellmError({ code, message, ...context })` factory
- `httpStatusFor(code)` and `typeFor(code)` lookup tables
- `toBody(err, requestId)` never throws, falls back to an `internal_server_error` envelope for unknown input
- `redactSecrets(message)` strips Bearer tokens, API-key-looking values, and long hex sequences from error messages before they go on the wire
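A sketch of how a middleware and the central handler fit together; import paths and exact signatures are assumptions, not from the source:

```ts
import type { ErrorRequestHandler } from "express";
import { freellmError, httpStatusFor, toBody } from "./errors";

// Central handler: everything funnels through toBody(), which never throws.
const errorHandler: ErrorRequestHandler = (err, _req, res, _next) => {
  const requestId = res.getHeader("X-Request-Id") as string | undefined;
  res.status(httpStatusFor(err.code) ?? 500).json(toBody(err, requestId));
};

// A middleware rejects a request by delegating instead of writing the body itself:
// next(freellmError({ code: "model_not_supported", message: `Unknown model ${req.body.model}` }));
```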
Request id propagation
New request-id middleware mounts first in the pipeline and assigns
every request a UUID (honors an inbound X-Request-Id matching
^[A-Za-z0-9_.:-]{1,128}$ so distributed traces can thread through).
The same id flows into the response header, the error body, and every
pino log line via genReqId so a single grep correlates access logs,
error logs, and bug reports.
Privacy and training-policy routing
New X-FreeLLM-Privacy: no-training header filters the router's
candidate list to providers that contractually exclude free-tier data
from training. Backed by a new PROVIDER_PRIVACY catalog with source
URLs and last-verified dates for every shipped provider:
| Provider | Policy |
|---|---|
| Groq | no-training |
| Cerebras | no-training |
| NVIDIA NIM | no-training |
| Ollama | local |
| Mistral | configurable |
| Gemini | free-tier trains |
When no provider can satisfy the posture for the requested model, the
gateway returns a 400 model_not_supported up front instead of
pointlessly cycling through the exclusion list. Server logs a warning
at boot for any catalog entry older than 90 days so operators re-verify
against the provider's current ToS.
Robust Retry-After handling
Upstream Retry-After headers are now parsed in both integer-seconds
and HTTP-date formats, clamped into [1s, 10min], and honored on 5xx
responses as well as 429s. Absurd values like 99999999 can no longer
lock a key out for years, and past HTTP dates floor to one second.
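A sketch of the parse-and-clamp idea (not the project's actual helper): accept integer or fractional seconds or an HTTP date, then clamp into [1s, 10min].

```ts
const MIN_MS = 1_000;
const MAX_MS = 10 * 60 * 1_000;

function parseRetryAfterMs(header: string | null, now = Date.now()): number | undefined {
  if (!header) return undefined;

  let ms: number;
  const seconds = Number(header);
  if (Number.isFinite(seconds)) {
    ms = seconds * 1_000;                 // integer or fractional seconds form
  } else {
    const date = Date.parse(header);      // HTTP-date form
    if (Number.isNaN(date)) return undefined;
    ms = date - now;                      // past dates go negative and floor to MIN_MS below
  }
  return Math.min(MAX_MS, Math.max(MIN_MS, ms));
}
```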
Per-identifier rate limiting
Every request can now carry an X-FreeLLM-Identifier header tagging it
with a logical identity (app user id, session token, anything that
fits ^[A-Za-z0-9_.:-]{1,128}$). The gateway tracks requests per
identifier in an independent sliding-window bucket. One noisy user
hitting their cap doesn't affect anyone else.
- Configurable via `FREELLM_IDENTIFIER_LIMIT=<max>/<windowMs>`, default 60/60000
- Hard ceiling of `FREELLM_IDENTIFIER_MAX_BUCKETS` distinct identifiers (default 10000) with LRU eviction on overflow
- Idle buckets garbage-collected after 2x the window
- Synchronous check-and-increment so concurrent requests cannot race
- Missing header falls back to `ip:<client-ip>`
- Literal `"undefined"` or `"null"` strings are treated as missing
- Tainted values (control chars, spaces, too long) are rejected with a clear 400 instead of silently entering logs
- Responses carry `X-FreeLLM-Identifier`, `X-FreeLLM-Identifier-Remaining`, and `X-FreeLLM-Identifier-Reset` so clients can self-throttle
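A minimal sketch of the sliding-window-plus-LRU idea under the defaults listed above; illustrative only, the real limiter also garbage-collects idle buckets and parses the env knobs:

```ts
class IdentifierLimiter {
  private buckets = new Map<string, number[]>(); // identifier -> request timestamps

  constructor(
    private max = 60,
    private windowMs = 60_000,
    private maxBuckets = 10_000,
  ) {}

  /** Synchronous check-and-increment: returns true if the request is allowed. */
  check(identifier: string, now = Date.now()): boolean {
    const hits = (this.buckets.get(identifier) ?? []).filter((t) => now - t < this.windowMs);
    if (hits.length >= this.max) {
      this.buckets.set(identifier, hits);
      return false;
    }
    hits.push(now);

    // Map re-insertion keeps recently used identifiers at the end; evict the oldest on overflow.
    this.buckets.delete(identifier);
    this.buckets.set(identifier, hits);
    if (this.buckets.size > this.maxBuckets) {
      const oldest = this.buckets.keys().next().value as string;
      this.buckets.delete(oldest);
    }
    return true;
  }
}
```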
Virtual sub-keys with soft caps
Operators can now declare virtual sub-keys in a JSON file pointed at by
FREELLM_VIRTUAL_KEYS_PATH. Each key can carry its own request cap,
token cap, model allowlist, and expiry:
{
"keys": [
{
"id": "sk-freellm-portfolio-abc123",
"label": "My portfolio site",
"dailyRequestCap": 500,
"dailyTokenCap": 200000,
"allowedModels": ["free-fast", "free"],
"expiresAt": "2026-07-01T00:00:00Z"
}
]
}

The store is loaded at boot, Zod-validated, and rejects duplicate ids
and files larger than 1 MB. Virtual keys authenticate via
Authorization: Bearer sk-freellm-... alongside the existing
FREELLM_API_KEY master key. The chat route guards each request via
assertCanServe BEFORE routing to a provider (expiry, model allowlist,
request cap, token cap) and records usage AFTER a successful upstream
response, so failed routes never burn quota. Each cap hit returns its
own typed error: virtual_key_cap_reached, model_not_supported,
invalid_api_key.
Counters are in-memory, rolling 24 hours, reset on restart. This
is explicitly a soft cap (runaway-loop and abuse protection, not a
billing system). The server logs a loud warning at boot when any
virtual keys are loaded so operators cannot mistake it for billing.
Security, privacy, and benchmarks pages
Three new pages on the documentation website grouped under a new
Trust section:
- `/security` lists the six direct production dependencies, what is deliberately not in the codebase (no telemetry, no runtime code generation, no plugin loaders, no install-time scripts), how to verify a deployed Docker image, and where to report vulnerabilities
- `/privacy` renders the provider training-policy catalog with links to each provider's own terms of service
- `/benchmarks` publishes cold-start and per-request overhead numbers rendered from `docs/benchmarks.json` with a methodology section
Reproducible benchmark script
New scripts/bench.mjs spawns the built server against a fake
in-process upstream, measures boot time to first /healthz 200, then
runs cache-miss and cache-hit passes and writes docs/benchmarks.json.
Run it locally with node scripts/bench.mjs --print.
Reference numbers on a developer laptop:
- Cold start: ~127 ms (spawn to first `/healthz` 200)
- Cache-miss overhead: p50 0.69 ms, p99 1.37 ms
- Cache-hit overhead: p50 0.34 ms, p99 0.92 ms
Continuous integration
New .github/workflows/ci.yml runs pnpm -r typecheck, the api-server
test suite, and pnpm audit --prod --audit-level=moderate on every
push and every pull request. Audit failures are tracked through
.github/audit-allowlist.json (a process contract, not an automated
bypass). Supports future badge wiring.
Test suite
FreeLLM now ships with 141 passing tests (up from 0) across eleven
test files, all green on every commit via CI:
- `errors.test.ts`— exhaustive code-to-status and code-to-type coverage, factory, guard, serializer, and redact helpers
- `errors-integration.test.ts`— X-Request-Id propagation, canonical shape on 400/401 paths
- `strict.test.ts`— header parser, meta-model rejection
- `retry-advice.test.ts`— per-provider and global earliest-retry math, hint ordering, suggestions
- `retry-after.test.ts`— integer, fractional, HTTP-date, clamping at both ends, every invalid input
- `privacy.test.ts`— header parsing, catalog exhaustiveness, satisfaction (including unknown-id fail-closed), staleness math
- `identifier-limiter.test.ts`— sliding window, LRU, TTL, isolation, env parser
- `virtual-keys.test.ts`— construction, duplicate rejection, expiry, allowedModels, rolling-window cap enforcement, file loading edge cases
- `router.test.ts`— direct, failover, strict mode, privacy routing, Retry-After plumbing
- `e2e.test.ts`— real Express app against a fake upstream, full header assertions
- `multi-tenant-e2e.test.ts`— virtual key auth, cap enforcement, identifier middleware end-to-end
Changed
- `GatewayRouter.complete()` now returns `{ data, meta }` with full route metadata (provider, resolvedModel, requestedModel, cached, reason, attempted providers) instead of just the chat completion
- The central `errorHandler` runs every error class through `normalizeEr...
v1.3.0 — Response caching
Response caching — same prompt twice returns the cached response in ~23ms
with zero provider quota burn. Verified end-to-end at 9× faster than
the cold path (200ms → 23ms).
Added
In-memory LRU response cache
- New `ResponseCache` class with sha256-keyed exact-match lookup
- Cache key built from `(model, messages, temperature, max_tokens, top_p, stop)`
- LRU eviction via Map re-insertion (recently-used entries stay at the end)
- Per-entry TTL expiry (default 1 hour, configurable)
- Default capacity 1000 entries (configurable)
- Cache hits short-circuit the entire routing flow: no provider call, no token quota burn, no rate limiter increment
- Streaming requests are never cached (the SSE protocol is incompatible)
- Errors are never cached (only successful 2xx responses)
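A minimal sketch of the Map-re-insertion LRU idea (illustrative, not the project's ResponseCache source): re-inserting on read keeps hot keys at the end of the Map, so the first key in iteration order is always the least recently used.

```ts
class LruCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxEntries = 1000, private ttlMs = 3_600_000) {}

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt < Date.now()) {          // per-entry TTL expiry
      this.entries.delete(key);
      return undefined;
    }
    this.entries.delete(key);                   // re-insert to mark as recently used
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      this.entries.delete(this.entries.keys().next().value as string); // evict LRU entry
    }
  }
}
```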
Response markers
- Cached responses include `x_freellm_cached: true` (alongside `x_freellm_provider`)
- `RequestLogEntry` gained a `cached?: boolean` field
- Token usage tracker is not incremented on cache hits (real cost = 0)
Cache stats on /v1/status
- New `cache` field with full counters: `{ "enabled": true, "ttlMs": 3600000, "maxEntries": 1000, "currentSize": 12, "hits": 47, "misses": 8, "sets": 8, "evictions": 0, "hitRate": 0.8545 }`
Configuration
- `CACHE_ENABLED` (default `true`) — set to `false` to disable
- `CACHE_TTL_MS` (default `3600000` = 1 hour)
- `CACHE_MAX_ENTRIES` (default `1000`)
Dashboard
- New 5th metrics card "Cache Hits" (cyan, Database icon) with hit-rate sub-line
- Metrics row layout updated to 2/3/5 cols across mobile/medium/large breakpoints
- Recent requests table shows a `CACHE` badge next to `OK` for cached rows
Why in-memory instead of SQLite
The original plan called for better-sqlite3, but it was rejected because:
- Native compilation risk — `better-sqlite3` needs `node-gyp` + Python + build tools at install time. Railway's slim image likely lacks them, which would break the published Railway template's build.
- Ephemeral filesystem on free tiers — Railway and Render free tiers don't have persistent disk. A SQLite cache file would be wiped on every restart anyway, requiring a paid persistent volume.
- Architectural consistency — every other observability piece in FreeLLM (`RequestLog`, `RateLimiter`, `CircuitBreaker`, `UsageTracker`) is in-memory. Adding DB-backed storage for one feature would break the pattern.
Cold cache warms up in seconds, restart loss is acceptable for a free-tier
gateway, and the entire feature ships with zero new dependencies (uses
Node's built-in crypto.createHash). The ResponseCache class lives behind
a clean interface, so swapping the storage to SQLite later is a one-file
change if persistence becomes a priority.
Verified end-to-end
Call A (cold) cached=false latency=200ms tokens=43+2 provider=groq
Call B (same) cached=true latency=23ms tokens=0 no upstream
Call C (same) cached=true latency=23ms tokens=0 no upstream
Call D (different prompt) cached=false latency=~200ms tokens=new provider=groq
9× speedup, 50% hit rate after 4 calls, all 18 gateway tests still passing.
v1.0.0 — First stable release
First stable release. Production-ready OpenAI-compatible gateway aggregating
6 free LLM providers with automatic failover, circuit breakers, and a
real-time dashboard.
Added
Gateway
- OpenAI-compatible `/v1/chat/completions` endpoint with streaming and non-streaming support
- 6 LLM providers: Groq, Gemini, Mistral, Cerebras, NVIDIA NIM, and Ollama
- 25+ models across providers including Llama 3.3 70B, Gemini 2.5 Flash/Pro, Llama 4 Scout, Qwen3, Nemotron 70B, DeepSeek R1, GPT-OSS 120B
- Three meta-models: `free` (round-robin), `free-fast` (latency-optimized), `free-smart` (capability-optimized)
- Automatic failover across providers with configurable routing strategies (round-robin, random)
- Per-provider circuit breakers with three states (closed → open → half-open) and configurable thresholds
- Per-provider sliding-window rate limiting with conservative free-tier defaults
- Per-client (per-IP) rate limiting via `express-rate-limit`
- In-memory request log (last 500 requests) with stats and recent history
- Routing deadline (`ROUTE_TIMEOUT_MS`) to prevent hung requests during cascading failures
Security
- Optional API key authentication (`FREELLM_API_KEY`) using timing-safe SHA-256 comparison
- Separate admin key (`FREELLM_ADMIN_KEY`) protecting circuit breaker reset and routing strategy mutations
- Configurable CORS origins (`ALLOWED_ORIGINS`)
- Body size limits on JSON and URL-encoded payloads
- Zod schema validation with strict mode and bounded `messages.max(256)` / `max_tokens.max(32768)`
- Upstream error sanitization (only the safe `message` field forwarded, never raw upstream JSON)
- Production warning when running without API key auth
Dashboard
- React 18 + Vite + Tailwind SPA served by the same Express process in production
- Real-time provider health cards (circuit breaker state, success/failure counts, last error)
- Live request log with latency, status, model, and selected provider
- Routing strategy toggle (round-robin / random)
- Manual circuit breaker reset
- Models page with search and grouping by provider
- Mobile-responsive layout with slide-over menu
- New FreeLLM logo as favicon and Open Graph image
Deployment
- Multi-stage Dockerfile (Node 22 LTS, non-root `appuser`, healthcheck baked in)
- `docker-compose.yml` for one-command local deployment
- `railway.json` for Railway auto-detection with healthcheck and restart policy
- Graceful shutdown on SIGTERM/SIGINT (drains in-flight requests, 8s deadline)
- `app.set("trust proxy", 1)` for correct client IP behind reverse proxies
- Static dashboard serving with SPA fallback for client-side routing
- Production-ready logging via Pino with structured JSON output
Developer Experience
- pnpm workspace monorepo with shared dependency catalog
- TypeScript 5.9 across all packages with `bundler` module resolution
- esbuild bundle for the API server with CJS shim for Pino compatibility
- OpenAPI 3.1 spec as the single source of truth for the API client
- Auto-generated React Query hooks via Orval (`@workspace/api-client-react`)
- Knip configuration for unused export detection
- `scripts/test-gateway.sh` end-to-end test suite with 18 checks (health, models, status, completions, streaming, NIM direct, validation)
Documentation
- Comprehensive README with quickstart (Docker + local), provider table, API reference, security guide, and tech stack
- Mermaid diagrams for request lifecycle, circuit breaker state machine, routing strategies, and high-level architecture
- MIT license
- Architecture refactor plan in `docs/superpowers/plans/`