v1.4.0: Honest Gateway
The honest-gateway release. FreeLLM now tells you exactly which provider
answered your request, lets you refuse silent downgrades, routes around
providers that train on your prompts, returns enriched retry hints on 429s,
and can be safely exposed to an app's end users via virtual sub-keys and
per-identifier rate limiting. Everything ships with a real test suite and
no new runtime dependencies.
Added
Transparent routing headers
Every chat completion response now carries observability headers so
clients can see exactly how the request was handled:
- X-FreeLLM-Provider: the concrete provider id that served the response
- X-FreeLLM-Model: the resolved concrete model id
- X-FreeLLM-Requested-Model: the original model asked for
- X-FreeLLM-Cached: true when the response came from the cache
- X-FreeLLM-Route-Reason: one of direct, meta, cache, failover
- X-Request-Id: a unique trace id that also appears in logs and error bodies
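A client could collect these headers into one typed object roughly as follows. This is a minimal sketch: RouteInfo, HeaderSource, and readRouteInfo are hypothetical client-side names, and only the header names and the four route reasons come from the list above.

```typescript
// Minimal header accessor; the Headers object returned by fetch() satisfies it.
interface HeaderSource {
  get(name: string): string | null;
}

type RouteReason = "direct" | "meta" | "cache" | "failover";

// Hypothetical client-side shape, not something FreeLLM exports.
interface RouteInfo {
  provider: string | null;
  model: string | null;
  requestedModel: string | null;
  cached: boolean;
  reason: RouteReason | null;
  requestId: string | null;
}

const ROUTE_REASONS: readonly string[] = ["direct", "meta", "cache", "failover"];

function readRouteInfo(headers: HeaderSource): RouteInfo {
  const reason = headers.get("X-FreeLLM-Route-Reason");
  return {
    provider: headers.get("X-FreeLLM-Provider"),
    model: headers.get("X-FreeLLM-Model"),
    requestedModel: headers.get("X-FreeLLM-Requested-Model"),
    // The header carries the literal string "true" on cache hits.
    cached: headers.get("X-FreeLLM-Cached") === "true",
    // Unknown reasons are mapped to null rather than trusted blindly.
    reason:
      reason !== null && ROUTE_REASONS.includes(reason)
        ? (reason as RouteReason)
        : null,
    requestId: headers.get("X-Request-Id"),
  };
}
```

With fetch, the call would look like readRouteInfo(response.headers); logging the result next to the request id makes failovers visible on the client side.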
Strict mode
Opt-in via X-FreeLLM-Strict: true. In strict mode the router refuses
to substitute models. Meta-models (free, free-fast, free-smart)
are rejected with a clear 400. Concrete models are tried against
exactly one provider and the upstream error surfaces verbatim if that
provider fails. No silent failover, no cache hit masquerading as fresh.
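The strict-mode rules above can be pictured as a small decision function. This is an illustrative sketch, not the router's actual code: planRoute and RoutePlan are hypothetical names, and treating strict mode as a cache bypass is one reading of "no cache hit masquerading as fresh". Only the meta-model names, the 400, and the single-provider rule come from the description.

```typescript
// Meta-model names as listed in the release notes.
const META_MODELS = new Set(["free", "free-fast", "free-smart"]);

// Hypothetical result type for this sketch.
type RoutePlan =
  | { kind: "reject"; status: 400; reason: string }
  | { kind: "route"; maxProviders: number | null; allowCache: boolean };

function planRoute(model: string, strict: boolean): RoutePlan {
  if (!strict) {
    // Default behavior: failover across providers and cache hits are allowed.
    return { kind: "route", maxProviders: null, allowCache: true };
  }
  if (META_MODELS.has(model)) {
    // Meta-models imply substitution, which strict mode refuses.
    return {
      kind: "reject",
      status: 400,
      reason: `meta-model "${model}" is not allowed in strict mode`,
    };
  }
  // Concrete model: exactly one provider attempt, no cache (assumption).
  return { kind: "route", maxProviders: 1, allowCache: false };
}
```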
Actionable 429 bodies
When all providers are exhausted, the gateway now returns a structured
body instead of a generic error:
{
  "error": {
    "type": "rate_limit_error",
    "code": "all_providers_exhausted",
    "message": "...",
    "retry_after_ms": 12000,
    "providers": [
      { "id": "groq", "retry_after_ms": 12000, "keys_available": 0, "keys_total": 1, "circuit_state": "closed" },
      { "id": "gemini", "retry_after_ms": 5000, "keys_available": 0, "keys_total": 1, "circuit_state": "closed" }
    ],
    "suggestions": [
      { "model": "free-fast", "available_in_ms": 5000 },
      { "model": "free-smart", "available_in_ms": 5000 }
    ],
    "request_id": "..."
  }
}
The response also carries an HTTP Retry-After header in seconds.
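On the client side, the structured body makes it possible to choose between waiting and switching models. The sketch below assumes the field names shown in the example body; planRetry and RetryPlan are hypothetical names, and preferring a suggested model whenever it becomes available sooner is one possible policy, not something the gateway prescribes.

```typescript
interface ProviderHint {
  id: string;
  retry_after_ms: number;
  keys_available: number;
  keys_total: number;
  circuit_state: string;
}

interface Suggestion {
  model: string;
  available_in_ms: number;
}

// Mirrors the "error" object of the example 429 body.
interface ExhaustedError {
  type: string;
  code: string;
  message: string;
  retry_after_ms: number;
  providers: ProviderHint[];
  suggestions: Suggestion[];
  request_id: string;
}

// Hypothetical client decision: which model to retry with, and after how long.
interface RetryPlan {
  model: string;
  waitMs: number;
}

function planRetry(currentModel: string, err: ExhaustedError): RetryPlan {
  // Start from "wait for the current model", then take any strictly faster suggestion.
  let best: RetryPlan = { model: currentModel, waitMs: err.retry_after_ms };
  for (const s of err.suggestions) {
    if (s.available_in_ms < best.waitMs) {
      best = { model: s.model, waitMs: s.available_in_ms };
    }
  }
  return best;
}
```

For the example body, a client that asked for some other model would be told to retry with free-fast after 5000 ms, because it is the first suggestion that beats the 12000 ms global wait.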
Unified error SDK
New src/errors/ module defines the one and only error taxonomy the
gateway emits. Fifteen concrete error codes grouped into seven types
that match OpenAI's shape (invalid_request_error, authentication_error,
permission_error, not_found_error, rate_limit_error, provider_error,
internal_error). Every middleware delegates via next(freellmError(...))
instead of writing response bodies directly, and the central error
handler funnels everything through a single toBody() serializer.
- freellmError({ code, message, ...context }) factory
- httpStatusFor(code) and typeFor(code) lookup tables
- toBody(err, requestId) never throws, falls back to
  internal_server_error envelope for unknown input
- redactSecrets(message) strips Bearer tokens, API-key-looking values,
  and long hex sequences from error messages before they go on the wire
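To make the redaction step concrete, here is a rough sketch of what such a helper might look like. The regular expressions, the sk- prefix heuristic, the 32-character hex threshold, and the [REDACTED] placeholder are all assumptions for illustration; the actual patterns used by redactSecrets are not stated in these notes.

```typescript
const REDACTED = "[REDACTED]";

// Illustrative only: the gateway's real patterns may differ.
function redactSecretsSketch(message: string): string {
  return (
    message
      // Bearer tokens, e.g. "Bearer abc.def-123".
      .replace(/Bearer\s+[A-Za-z0-9._~+\/=-]+/gi, `Bearer ${REDACTED}`)
      // API-key-looking values; the "sk-" prefix is an assumed heuristic.
      .replace(/\bsk-[A-Za-z0-9_-]{8,}/g, REDACTED)
      // Long hex runs (32+ characters, an assumed threshold).
      .replace(/\b[0-9a-fA-F]{32,}\b/g, REDACTED)
  );
}
```

Running the Bearer rule first keeps the word "Bearer" in the output, so a reader can still tell what kind of value was removed.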
Request id propagation
New request-id middleware mounts first in the pipeline and assigns
every request a UUID (honors an inbound X-Request-Id matching
^[A-Za-z0-9_.:-]{1,128}$ so distributed traces can thread through).
The same id flows into the response header, the error body, and every
pino log line via genReqId so a single grep correlates access logs,
error logs, and bug reports.
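The accept-or-generate rule can be sketched as below. resolveRequestId is a hypothetical name; the pattern is the one quoted above, and falling back to a fresh UUID for any non-matching inbound value is an assumption about how rejected ids are handled.

```typescript
import { randomUUID } from "node:crypto";

// Pattern quoted in the release notes for acceptable inbound ids.
const REQUEST_ID_PATTERN = /^[A-Za-z0-9_.:-]{1,128}$/;

function resolveRequestId(inbound: string | undefined): string {
  // Honor a well-formed inbound id so distributed traces stay connected.
  if (inbound !== undefined && REQUEST_ID_PATTERN.test(inbound)) {
    return inbound;
  }
  // Otherwise (missing or malformed) assign a fresh UUID.
  return randomUUID();
}
```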
Privacy and training-policy routing
New X-FreeLLM-Privacy: no-training header filters the router's
candidate list to providers that contractually exclude free-tier data
from training. Backed by a new PROVIDER_PRIVACY catalog with source
URLs and last-verified dates for every shipped provider:
| Provider | Policy |
|---|---|
| Groq | no-training |
| Cerebras | no-training |
| NVIDIA NIM | no-training |
| Ollama | local |
| Mistral | configurable |
| Gemini | free-tier trains |
When no provider can satisfy the posture for the requested model, the
gateway returns a 400 model_not_supported up front instead of
pointlessly cycling through the exclusion list. The server logs a warning
at boot for any catalog entry older than 90 days so operators re-verify
against the provider's current ToS.
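The filtering and staleness checks might look roughly like this. The entry shape, the Policy labels, and the function names are assumptions made for the sketch, and which policies satisfy no-training is left to the caller because the notes do not spell it out. Failing closed on unknown provider ids and the 90-day threshold follow the text.

```typescript
// Policy labels modeled on the table above; the exact strings are assumed.
type Policy = "no-training" | "local" | "configurable" | "free-tier-trains";

// Hypothetical catalog entry shape.
interface PrivacyEntry {
  policy: Policy;
  lastVerified: string; // ISO date, e.g. "2025-01-01"
}

function filterByPolicy(
  candidates: string[],
  catalog: Map<string, PrivacyEntry>,
  accepted: ReadonlySet<Policy>,
): string[] {
  return candidates.filter((id) => {
    const entry = catalog.get(id);
    // Fail closed: a provider missing from the catalog never qualifies.
    return entry !== undefined && accepted.has(entry.policy);
  });
}

function isStale(entry: PrivacyEntry, now: Date, maxAgeDays = 90): boolean {
  const ageMs = now.getTime() - Date.parse(entry.lastVerified);
  return ageMs > maxAgeDays * 24 * 60 * 60 * 1000;
}
```

An empty result from filterByPolicy corresponds to the up-front 400 model_not_supported case described above.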
Robust Retry-After handling
Upstream Retry-After headers are now parsed in both integer-seconds
and HTTP-date formats, clamped into [1s, 10min], and honored on 5xx
responses as well as 429s. Absurd values like 99999999 can no longer
lock a key out for years, and past HTTP dates floor to one second.
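A parser with these properties could be sketched as follows. parseRetryAfterMs and clampRetryMs are hypothetical names, and accepting fractional seconds and returning null for unparseable values are assumptions; the two accepted formats and the [1s, 10min] clamp come from the description above.

```typescript
const MIN_RETRY_MS = 1000; // 1 second floor
const MAX_RETRY_MS = 10 * 60 * 1000; // 10 minute ceiling

function clampRetryMs(ms: number): number {
  return Math.min(MAX_RETRY_MS, Math.max(MIN_RETRY_MS, ms));
}

// Returns the delay in milliseconds, or null when the header is unparseable.
function parseRetryAfterMs(header: string, nowMs: number): number | null {
  const value = header.trim();
  // Delta-seconds form, e.g. "120" (fractional input is an assumed extension).
  if (/^\d+(\.\d+)?$/.test(value)) {
    return clampRetryMs(Math.round(Number(value) * 1000));
  }
  // HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) {
    return null;
  }
  // Dates in the past produce a negative delta, which clamps up to one second.
  return clampRetryMs(dateMs - nowMs);
}
```

Passing the current time in as nowMs keeps the function deterministic, which makes the clamping edges easy to test.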
Per-identifier rate limiting
Every request can now carry an X-FreeLLM-Identifier header tagging it
with a logical identity (app user id, session token, anything that
fits ^[A-Za-z0-9_.:-]{1,128}$). The gateway tracks requests per
identifier in an independent sliding-window bucket. One noisy user
hitting their cap doesn't affect anyone else.
- Configurable via FREELLM_IDENTIFIER_LIMIT=<max>/<windowMs>, default 60/60000
- Hard ceiling of FREELLM_IDENTIFIER_MAX_BUCKETS distinct identifiers (default 10000) with LRU eviction on overflow
- Idle buckets garbage-collected after 2x the window
- Synchronous check-and-increment so concurrent requests cannot race
- Missing header falls back to ip:<client-ip>
- Literal "undefined" or "null" strings are treated as missing
- Tainted values (control chars, spaces, too long) are rejected with a clear 400 instead of silently entering logs
- Responses carry X-FreeLLM-Identifier, X-FreeLLM-Identifier-Remaining, and X-FreeLLM-Identifier-Reset so clients can self-throttle
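The core of such a limiter fits in a few lines. The sketch below is an assumption-laden illustration, not the shipped middleware: SlidingWindowLimiter, parseLimit, and the LimitDecision shape are hypothetical, timestamps are kept in plain arrays, and the idle-bucket garbage collection is left out. It does show the sliding window, the synchronous check-and-increment, and LRU eviction at the bucket ceiling.

```typescript
interface LimitDecision {
  allowed: boolean;
  remaining: number;
  resetAtMs: number;
}

// Parses the "<max>/<windowMs>" format, e.g. "60/60000".
function parseLimit(spec: string): { max: number; windowMs: number } | null {
  const match = /^(\d+)\/(\d+)$/.exec(spec.trim());
  if (match === null) return null;
  const max = Number(match[1]);
  const windowMs = Number(match[2]);
  return max > 0 && windowMs > 0 ? { max, windowMs } : null;
}

class SlidingWindowLimiter {
  // Map iterates in insertion order, so the first key is the least recently used.
  private readonly buckets = new Map<string, number[]>();
  private readonly max: number;
  private readonly windowMs: number;
  private readonly maxBuckets: number;

  constructor(max: number, windowMs: number, maxBuckets: number) {
    this.max = max;
    this.windowMs = windowMs;
    this.maxBuckets = maxBuckets;
  }

  // Check and increment in one synchronous step, so callers cannot interleave.
  check(id: string, nowMs: number): LimitDecision {
    const cutoff = nowMs - this.windowMs;
    const hits = (this.buckets.get(id) ?? []).filter((t) => t > cutoff);
    const allowed = hits.length < this.max;
    if (allowed) hits.push(nowMs);
    // Delete then set to mark this identifier as most recently used.
    this.buckets.delete(id);
    this.buckets.set(id, hits);
    if (this.buckets.size > this.maxBuckets) {
      const oldest = this.buckets.keys().next().value;
      if (oldest !== undefined) this.buckets.delete(oldest);
    }
    const first = hits[0];
    return {
      allowed,
      remaining: Math.max(0, this.max - hits.length),
      resetAtMs: first === undefined ? nowMs : first + this.windowMs,
    };
  }

  size(): number {
    return this.buckets.size;
  }
}
```

The remaining and resetAtMs fields correspond to what the X-FreeLLM-Identifier-Remaining and X-FreeLLM-Identifier-Reset headers would need; the exact units of the real headers are not stated here.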
Virtual sub-keys with soft caps
Operators can now declare virtual sub-keys in a JSON file pointed at by
FREELLM_VIRTUAL_KEYS_PATH. Each key can carry its own request cap,
token cap, model allowlist, and expiry:
{
  "keys": [
    {
      "id": "sk-freellm-portfolio-abc123",
      "label": "My portfolio site",
      "dailyRequestCap": 500,
      "dailyTokenCap": 200000,
      "allowedModels": ["free-fast", "free"],
      "expiresAt": "2026-07-01T00:00:00Z"
    }
  ]
}
The store is loaded at boot, Zod-validated, and rejects duplicate ids
and files larger than 1 MB. Virtual keys authenticate via
Authorization: Bearer sk-freellm-... alongside the existing
FREELLM_API_KEY master key. The chat route guards each request via
assertCanServe BEFORE routing to a provider (expiry, model allowlist,
request cap, token cap) and records usage AFTER a successful upstream
response, so failed routes never burn quota. Each cap hit returns its
own typed error: virtual_key_cap_reached, model_not_supported,
invalid_api_key.
Counters are in-memory, rolling 24 hours, reset on restart. This
is explicitly a soft cap (runaway-loop and abuse protection, not a
billing system). The server logs a loud warning at boot when any
virtual keys are loaded so operators cannot mistake it for billing.
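The pre-routing guard can be sketched as an ordered series of checks. This is a hypothetical stand-in for assertCanServe, not its real signature: checkCanServe, Usage, and the optional-field treatment are assumptions, as is the mapping of expiry to invalid_api_key and of both caps to virtual_key_cap_reached. The check order and the key fields follow the description and the example file.

```typescript
interface VirtualKey {
  id: string;
  label: string;
  dailyRequestCap?: number;
  dailyTokenCap?: number;
  allowedModels?: string[];
  expiresAt?: string; // ISO timestamp
}

// Hypothetical rolling-24h usage snapshot for one key.
interface Usage {
  requests: number;
  tokens: number;
}

type GuardFailure =
  | "invalid_api_key"
  | "model_not_supported"
  | "virtual_key_cap_reached";

// Returns null when the request may proceed, otherwise the error code to emit.
function checkCanServe(
  key: VirtualKey,
  usage: Usage,
  model: string,
  now: Date,
): GuardFailure | null {
  // 1. Expiry (assumed to surface as invalid_api_key).
  if (key.expiresAt !== undefined && Date.parse(key.expiresAt) <= now.getTime()) {
    return "invalid_api_key";
  }
  // 2. Model allowlist.
  if (key.allowedModels !== undefined && !key.allowedModels.includes(model)) {
    return "model_not_supported";
  }
  // 3. Request cap, then 4. token cap.
  if (key.dailyRequestCap !== undefined && usage.requests >= key.dailyRequestCap) {
    return "virtual_key_cap_reached";
  }
  if (key.dailyTokenCap !== undefined && usage.tokens >= key.dailyTokenCap) {
    return "virtual_key_cap_reached";
  }
  return null;
}
```

Usage would only be incremented after the upstream call succeeds, which is what keeps failed routes from consuming quota.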
Security, privacy, and benchmarks pages
Three new pages on the documentation website grouped under a new
Trust section:
- /security lists the six direct production dependencies, what is
  deliberately not in the codebase (no telemetry, no runtime code
  generation, no plugin loaders, no install-time scripts), how to
  verify a deployed Docker image, and where to report vulnerabilities
- /privacy renders the provider training-policy catalog with links
  to each provider's own terms of service
- /benchmarks publishes cold-start and per-request overhead numbers
  rendered from docs/benchmarks.json with a methodology section
Reproducible benchmark script
New scripts/bench.mjs spawns the built server against a fake
in-process upstream, measures boot time to first /healthz 200, then
runs cache-miss and cache-hit passes and writes docs/benchmarks.json.
Run it locally with node scripts/bench.mjs --print.
Reference numbers on a developer laptop:
- Cold start: ~127 ms (spawn to first /healthz 200)
- Cache-miss overhead: p50 0.69 ms, p99 1.37 ms
- Cache-hit overhead: p50 0.34 ms, p99 0.92 ms
Continuous integration
New .github/workflows/ci.yml runs pnpm -r typecheck, the api-server
test suite, and pnpm audit --prod --audit-level=moderate on every
push and every pull request. Audit failures are tracked through
.github/audit-allowlist.json (a process contract, not an automated
bypass). The workflow also supports future badge wiring.
Test suite
FreeLLM now ships with 141 passing tests (up from 0) across eleven
test files, all green on every commit via CI:
- errors.test.ts: exhaustive code-to-status and code-to-type coverage, factory, guard, serializer, and redact helpers
- errors-integration.test.ts: X-Request-Id propagation, canonical shape on 400/401 paths
- strict.test.ts: header parser, meta-model rejection
- retry-advice.test.ts: per-provider and global earliest-retry math, hint ordering, suggestions
- retry-after.test.ts: integer, fractional, HTTP-date, clamping at both ends, every invalid input
- privacy.test.ts: header parsing, catalog exhaustiveness, satisfaction (including unknown-id fail-closed), staleness math
- identifier-limiter.test.ts: sliding window, LRU, TTL, isolation, env parser
- virtual-keys.test.ts: construction, duplicate rejection, expiry, allowedModels, rolling-window cap enforcement, file loading edge cases
- router.test.ts: direct, failover, strict mode, privacy routing, Retry-After plumbing
- e2e.test.ts: real Express app against a fake upstream, full header assertions
- multi-tenant-e2e.test.ts: virtual key auth, cap enforcement, identifier middleware end-to-end
Changed
- GatewayRouter.complete() now returns { data, meta } with full
  route metadata (provider, resolvedModel, requestedModel, cached,
  reason, attempted providers) instead of just the chat completion
- The central errorHandler runs every error class through
  normalizeError + toBody, replacing the previous per-class
  instanceof branching and ad hoc JSON shaping
- AllProvidersExhaustedError and ProviderClientError are now
  internal signals only; callers never see the class name on the wire
- auth middleware accepts either the master key or a virtual key, and
  populates req.virtualKey on virtual-key matches
- validate middleware sanitizes Zod error messages via
  redactSecrets so prompts containing leaked API keys cannot echo
  back in the error
- rate-limit middleware swapped to express-rate-limit's handler
  hook so its 429 body matches every other 429 the gateway produces
- Website build toolchain (astro, @astrojs/starlight,
  @astrojs/check, sharp, typescript) moved from dependencies
  to devDependencies so pnpm audit --prod does not traverse
  build-only packages
Fixed
- Patched three GitHub Security Advisories flagged by the new CI audit
  step: GHSA-p9ff-h696-f583 (high) and GHSA-4w7w-66w2-5vf9
  (moderate) for Vite below 6.4.2, and GHSA-48c2-rrv3-qjmp
  (moderate) for yaml below 2.8.3 buried under @astrojs/check.
  Fixed via pnpm.overrides in the root package.json forcing the
  patched versions regardless of transitive chain, a workspace
  catalog bump to vite ^6.4.2, and moving the website build
  toolchain to devDependencies.
Configuration
New environment variables:
| Variable | Default | What it does |
|---|---|---|
| FREELLM_IDENTIFIER_LIMIT | 60/60000 | Per-identifier rate limit, format <max>/<windowMs> |
| FREELLM_IDENTIFIER_MAX_BUCKETS | 10000 | Hard ceiling on distinct identifiers tracked |
| FREELLM_VIRTUAL_KEYS_PATH | unset | Path to a JSON file declaring virtual sub-keys |
Migration
Fully backwards compatible. Every new capability is opt-in via a
request header (X-FreeLLM-Strict, X-FreeLLM-Privacy,
X-FreeLLM-Identifier) or an environment variable. Existing clients
see richer response headers and enriched 429 bodies automatically,
but no behavioral change unless they opt into strict mode.
Error response shapes are slightly different: the type field now
uses OpenAI's taxonomy (invalid_request_error,
authentication_error, etc.) and every response carries a code
field plus a request_id. Clients that were pattern-matching on
message strings should move to code-based dispatch.
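Code-based dispatch on the client might look like the sketch below. The codes are ones named elsewhere in these notes; the ClientAction type, the actionFor helper, the mapping from code to action, and the 1000 ms fallback are illustrative assumptions rather than recommended behavior.

```typescript
// Envelope fields as described in these notes; retry_after_ms is only
// present on exhausted-provider errors.
interface GatewayErrorBody {
  error: {
    type: string;
    code: string;
    message: string;
    request_id: string;
    retry_after_ms?: number;
  };
}

// Hypothetical client-side actions.
type ClientAction =
  | { kind: "retry"; afterMs: number }
  | { kind: "reauthenticate" }
  | { kind: "fix_request" }
  | { kind: "give_up" };

function actionFor(body: GatewayErrorBody): ClientAction {
  // Branch on the stable machine-readable code, never on message text.
  switch (body.error.code) {
    case "all_providers_exhausted":
      return { kind: "retry", afterMs: body.error.retry_after_ms ?? 1000 };
    case "invalid_api_key":
      return { kind: "reauthenticate" };
    case "model_not_supported":
      return { kind: "fix_request" };
    case "virtual_key_cap_reached":
      return { kind: "give_up" };
    default:
      // Unknown codes are treated conservatively.
      return { kind: "give_up" };
  }
}
```

Logging body.error.request_id alongside the chosen action gives operators the same id that appears in the gateway's logs.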