Release v1.4.0: Honest Gateway · Devansh-365/freellm

The honest-gateway release. FreeLLM now tells you exactly which provider
answered your request, lets you refuse silent downgrades, routes around
providers that train on your prompts, returns enriched retry hints on 429s,
and can be safely exposed to an app's end users via virtual sub-keys and
per-identifier rate limiting. Everything ships with a real test suite and
no new runtime dependencies.

Added

Transparent routing headers

Every chat completion response now carries observability headers so
clients can see exactly how the request was handled:

X-FreeLLM-Provider — the concrete provider id that served the response
X-FreeLLM-Model — the resolved concrete model id
X-FreeLLM-Requested-Model — the original model asked for
X-FreeLLM-Cached — true when the response came from the cache
X-FreeLLM-Route-Reason — one of direct, meta, cache, failover
X-Request-Id — a unique trace id that also appears in logs and error bodies

Strict mode

Opt-in via X-FreeLLM-Strict: true. In strict mode the router refuses
to substitute models. Meta-models (free, free-fast, free-smart)
are rejected with a clear 400. Concrete models are tried against
exactly one provider and the upstream error surfaces verbatim if that
provider fails. No silent failover, no cache hit masquerading as fresh.

Actionable 429 bodies

When all providers are exhausted, the gateway now returns a structured
body instead of a generic error:

{
  "error": {
    "type": "rate_limit_error",
    "code": "all_providers_exhausted",
    "message": "...",
    "retry_after_ms": 12000,
    "providers": [
      { "id": "groq",   "retry_after_ms": 12000, "keys_available": 0, "keys_total": 1, "circuit_state": "closed" },
      { "id": "gemini", "retry_after_ms": 5000,  "keys_available": 0, "keys_total": 1, "circuit_state": "closed" }
    ],
    "suggestions": [
      { "model": "free-fast",  "available_in_ms": 5000 },
      { "model": "free-smart", "available_in_ms": 5000 }
    ],
    "request_id": "..."
  }
}

The response also carries an HTTP Retry-After header in seconds.

Unified error SDK

New src/errors/ module defines the one and only error taxonomy the
gateway emits. Fifteen concrete error codes grouped into seven types
that match OpenAI's shape (invalid_request_error, authentication_error,
permission_error, not_found_error, rate_limit_error, provider_error,
internal_error). Every middleware delegates via next(freellmError(...))
instead of writing response bodies directly, and the central error
handler funnels everything through a single toBody() serializer.

freellmError({ code, message, ...context }) factory
httpStatusFor(code) and typeFor(code) lookup tables
toBody(err, requestId) never throws, falls back to
internal_server_error envelope for unknown input
redactSecrets(message) strips Bearer tokens, API-key-looking values,
and long hex sequences from error messages before they go on the wire

Request id propagation

New request-id middleware mounts first in the pipeline and assigns
every request a UUID (honors an inbound X-Request-Id matching
^[A-Za-z0-9_.:-]{1,128}$ so distributed traces can thread through).
The same id flows into the response header, the error body, and every
pino log line via genReqId so a single grep correlates access logs,
error logs, and bug reports.

Privacy and training-policy routing

New X-FreeLLM-Privacy: no-training header filters the router's
candidate list to providers that contractually exclude free-tier data
from training. Backed by a new PROVIDER_PRIVACY catalog with source
URLs and last-verified dates for every shipped provider:

Provider	Policy
Groq	no-training
Cerebras	no-training
NVIDIA NIM	no-training
Ollama	local
Mistral	configurable
Gemini	free-tier trains

When no provider can satisfy the posture for the requested model, the
gateway returns a 400 model_not_supported up front instead of
pointlessly cycling through the exclusion list. Server logs a warning
at boot for any catalog entry older than 90 days so operators re-verify
against the provider's current ToS.

Robust Retry-After handling

Upstream Retry-After headers are now parsed in both integer-seconds
and HTTP-date formats, clamped into [1s, 10min], and honored on 5xx
responses as well as 429s. Absurd values like 99999999 can no longer
lock a key out for years, and past HTTP dates floor to one second.

Per-identifier rate limiting

Every request can now carry an X-FreeLLM-Identifier header tagging it
with a logical identity (app user id, session token, anything that
fits ^[A-Za-z0-9_.:-]{1,128}$). The gateway tracks requests per
identifier in an independent sliding-window bucket. One noisy user
hitting their cap doesn't affect anyone else.

Configurable via FREELLM_IDENTIFIER_LIMIT=<max>/<windowMs>, default 60/60000
Hard ceiling of FREELLM_IDENTIFIER_MAX_BUCKETS distinct identifiers (default 10000) with LRU eviction on overflow
Idle buckets garbage-collected after 2x the window
Synchronous check-and-increment so concurrent requests cannot race
Missing header falls back to ip:<client-ip>
Literal "undefined" or "null" strings are treated as missing
Tainted values (control chars, spaces, too long) are rejected with a clear 400 instead of silently entering logs
Responses carry X-FreeLLM-Identifier, X-FreeLLM-Identifier-Remaining, and X-FreeLLM-Identifier-Reset so clients can self-throttle

Virtual sub-keys with soft caps

Operators can now declare virtual sub-keys in a JSON file pointed at by
FREELLM_VIRTUAL_KEYS_PATH. Each key can carry its own request cap,
token cap, model allowlist, and expiry:

{
  "keys": [
    {
      "id": "sk-freellm-portfolio-abc123",
      "label": "My portfolio site",
      "dailyRequestCap": 500,
      "dailyTokenCap": 200000,
      "allowedModels": ["free-fast", "free"],
      "expiresAt": "2026-07-01T00:00:00Z"
    }
  ]
}

The store is loaded at boot, Zod-validated, and rejects duplicate ids
and files larger than 1 MB. Virtual keys authenticate via
Authorization: Bearer sk-freellm-... alongside the existing
FREELLM_API_KEY master key. The chat route guards each request via
assertCanServe BEFORE routing to a provider (expiry, model allowlist,
request cap, token cap) and records usage AFTER a successful upstream
response, so failed routes never burn quota. Each cap hit returns its
own typed error: virtual_key_cap_reached, model_not_supported,
invalid_api_key.

Counters are in-memory, rolling 24 hours, reset on restart. This
is explicitly a soft cap (runaway-loop and abuse protection, not a
billing system). The server logs a loud warning at boot when any
virtual keys are loaded so operators cannot mistake it for billing.

Security, privacy, and benchmarks pages

Three new pages on the documentation website grouped under a new
Trust section:

/security lists the six direct production dependencies, what is
deliberately not in the codebase (no telemetry, no runtime code
generation, no plugin loaders, no install-time scripts), how to
verify a deployed Docker image, and where to report vulnerabilities
/privacy renders the provider training-policy catalog with links
to each provider's own terms of service
/benchmarks publishes cold-start and per-request overhead numbers
rendered from docs/benchmarks.json with a methodology section

Reproducible benchmark script

New scripts/bench.mjs spawns the built server against a fake
in-process upstream, measures boot time to first /healthz 200, then
runs cache-miss and cache-hit passes and writes docs/benchmarks.json.
Run it locally with node scripts/bench.mjs --print.

Reference numbers on a developer laptop:

Cold start: ~127 ms (spawn to first /healthz 200)
Cache-miss overhead: p50 0.69 ms, p99 1.37 ms
Cache-hit overhead: p50 0.34 ms, p99 0.92 ms

Continuous integration

New .github/workflows/ci.yml runs pnpm -r typecheck, the api-server
test suite, and pnpm audit --prod --audit-level=moderate on every
push and every pull request. Audit failures are tracked through
.github/audit-allowlist.json (a process contract, not an automated
bypass). Supports future badge wiring.

Test suite

FreeLLM now ships with 141 passing tests (up from 0) across eleven
test files, all green on every commit via CI:

errors.test.ts — exhaustive code-to-status and code-to-type coverage, factory, guard, serializer, and redact helpers
errors-integration.test.ts — X-Request-Id propagation, canonical shape on 400/401 paths
strict.test.ts — header parser, meta-model rejection
retry-advice.test.ts — per-provider and global earliest-retry math, hint ordering, suggestions
retry-after.test.ts — integer, fractional, HTTP-date, clamping at both ends, every invalid input
privacy.test.ts — header parsing, catalog exhaustiveness, satisfaction (including unknown-id fail-closed), staleness math
identifier-limiter.test.ts — sliding window, LRU, TTL, isolation, env parser
virtual-keys.test.ts — construction, duplicate rejection, expiry, allowedModels, rolling-window cap enforcement, file loading edge cases
router.test.ts — direct, failover, strict mode, privacy routing, Retry-After plumbing
e2e.test.ts — real Express app against a fake upstream, full header assertions
multi-tenant-e2e.test.ts — virtual key auth, cap enforcement, identifier middleware end-to-end

Changed

GatewayRouter.complete() now returns { data, meta } with full
route metadata (provider, resolvedModel, requestedModel, cached,
reason, attempted providers) instead of just the chat completion
The central errorHandler runs every error class through
normalizeError + toBody, replacing the previous per-class
instanceof branching and ad hoc JSON shaping
AllProvidersExhaustedError and ProviderClientError are now
internal signals only; callers never see the class name on the wire
auth middleware accepts either the master key or a virtual key,
populates req.virtualKey on virtual-key matches
validate middleware sanitizes Zod error messages via
redactSecrets so prompts containing leaked API keys cannot echo
back in the error
rate-limit middleware swapped to express-rate-limit's handler
hook so its 429 body matches every other 429 the gateway produces
Website build toolchain (astro, @astrojs/starlight,
@astrojs/check, sharp, typescript) moved from dependencies
to devDependencies so pnpm audit --prod does not traverse
build-only packages

Fixed

Patched three GitHub Security Advisories flagged by the new CI audit
step: GHSA-p9ff-h696-f583 (high) and GHSA-4w7w-66w2-5vf9
(moderate) for Vite below 6.4.2, and GHSA-48c2-rrv3-qjmp
(moderate) for yaml below 2.8.3 buried under @astrojs/check.
Fixed via pnpm.overrides in the root package.json forcing the
patched versions regardless of transitive chain, a workspace
catalog bump to vite ^6.4.2, and moving the website build
toolchain to devDependencies.

Configuration

New environment variables:

Variable	Default	What it does
`FREELLM_IDENTIFIER_LIMIT`	`60/60000`	Per-identifier rate limit, format `<max>/<windowMs>`
`FREELLM_IDENTIFIER_MAX_BUCKETS`	`10000`	Hard ceiling on distinct identifiers tracked
`FREELLM_VIRTUAL_KEYS_PATH`	unset	Path to a JSON file declaring virtual sub-keys

Migration

Fully backwards compatible. Every new capability is opt-in via a
request header (X-FreeLLM-Strict, X-FreeLLM-Privacy,
X-FreeLLM-Identifier) or an environment variable. Existing clients
see richer response headers and enriched 429 bodies automatically,
but no behavioral change unless they opt into strict mode.

Error response shapes are slightly different: the type field now
uses OpenAI's taxonomy (invalid_request_error,
authentication_error, etc.) and every response carries a code
field plus a request_id. Clients that were pattern-matching on
message strings should move to code-based dispatch.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.4.0: Honest Gateway

Choose a tag to compare

Sorry, something went wrong.