Releases: Devansh-365/freellm
v1.5.1: Gemini reasoning fix, NIM json_schema, truncation warning
Fixes a user-reported bug where Gemini 2.5 Flash returned ~30 tokens
regardless of how high max_tokens was set, hardens the response cache
against poisoning by truncated responses, and adds JSON mode
reliability improvements for NVIDIA NIM and all providers.
Fixed
Gemini 2.5 reasoning budget no longer eats max_tokens
Gemini 2.5 Flash and 2.5 Pro are reasoning models. With their default
thinking budget they burned 90-98% of max_tokens on internal reasoning
before producing visible text. A caller asking for max_tokens=1000
routinely got back 30-40 tokens with finish_reason=length.
The Gemini provider adapter now injects a per-model default
reasoning_effort when the caller does not set one:
- `gemini-2.5-flash` defaults to `"none"` (accepts a zero thinking budget, returns the full requested output)
- `gemini-2.5-pro` defaults to `"low"` (the minimum Google accepts for this model, which requires a non-zero thinking budget)
Clients that want full reasoning can pass reasoning_effort: "high"
explicitly. The adapter also normalizes the output budget onto
max_completion_tokens only and deletes max_tokens from the
outgoing request, because Gemini returns 400 when both are present.
Verified against the live Gemini API: the same prompt that produced
37 tokens before now returns 670+ tokens with finish_reason=stop.
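For reference, a minimal sketch of the default-injection logic; the real adapter's mapRequest hook may be shaped differently, so treat field handling here as an assumption:

```ts
type ReasoningEffort = "none" | "low" | "medium" | "high";

const DEFAULT_EFFORT: Record<string, ReasoningEffort> = {
  "gemini-2.5-flash": "none",
  "gemini-2.5-pro": "low",
};

function mapGeminiRequest(body: Record<string, unknown> & { model: string }) {
  const out = { ...body };

  // Inject the per-model default only when the caller did not choose an effort.
  if (out.reasoning_effort === undefined && DEFAULT_EFFORT[out.model]) {
    out.reasoning_effort = DEFAULT_EFFORT[out.model];
  }

  // Gemini rejects requests carrying both budget fields with a 400,
  // so normalize onto max_completion_tokens and drop max_tokens.
  if (out.max_tokens !== undefined && out.max_completion_tokens === undefined) {
    out.max_completion_tokens = out.max_tokens;
  }
  delete out.max_tokens;

  return out;
}
```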
Deprecated Gemini 2.0 models removed from catalog
gemini-2.0-flash and gemini-2.0-flash-lite both returned 404 "no
longer available to new users" from the live API. Removed from the
model list so callers cannot pick a dead model.
Response cache no longer stores truncated responses
When any choice in the upstream response carries
finish_reason=length, the cache now refuses to store it. Previously
a single truncated response from a reasoning model would pin the
bad answer for the entire TTL window (default 1 hour), causing the
"sometimes fails" pattern the user reported.
Cache key expanded to prevent cross-shape collisions
The cache key now includes tools, tool_choice, parallel_tool_calls,
response_format, reasoning_effort, seed, max_completion_tokens,
presence_penalty, and frequency_penalty. Previously two requests
with the same prompt but different tool definitions or response
formats would share a cache entry.
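A sketch of how such a key can be derived, assuming the cache hashes a canonical JSON of those fields (field order and helper name are illustrative):

```ts
import { createHash } from "node:crypto";

function cacheKey(req: Record<string, unknown>): string {
  // Every field that changes the shape of the answer goes into the key.
  const shape = {
    model: req.model,
    messages: req.messages,
    temperature: req.temperature,
    max_tokens: req.max_tokens,
    max_completion_tokens: req.max_completion_tokens,
    top_p: req.top_p,
    stop: req.stop,
    tools: req.tools,
    tool_choice: req.tool_choice,
    parallel_tool_calls: req.parallel_tool_calls,
    response_format: req.response_format,
    reasoning_effort: req.reasoning_effort,
    seed: req.seed,
    presence_penalty: req.presence_penalty,
    frequency_penalty: req.frequency_penalty,
  };
  return createHash("sha256").update(JSON.stringify(shape)).digest("hex");
}
```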
Added
NVIDIA NIM json_schema translation
NIM's OpenAI-compat endpoint does not support
response_format: { type: "json_schema" }. It requires the schema
in a vendor-specific nvext.guided_json field. The NIM provider
adapter now translates the standard parameter into the NIM-native
format automatically and removes the unsupported response_format
field from the outgoing request. json_object mode and requests
without response_format pass through untouched.
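A sketch of the translation; the adapter structure is assumed rather than taken from the source:

```ts
function mapNimRequest(body: any) {
  const rf = body.response_format;
  if (rf?.type !== "json_schema") return body; // json_object and plain requests pass through

  const { response_format, ...rest } = body;
  return {
    ...rest,
    // NIM expects the raw schema under its vendor extension field.
    nvext: { ...(rest.nvext ?? {}), guided_json: rf.json_schema?.schema },
  };
}
```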
JSON truncation warning header
When a JSON-mode request (json_object or json_schema) hits
max_tokens and the output is almost certainly broken mid-token,
the response now carries:
X-FreeLLM-Warning: json-possibly-truncated
The caller knows immediately that the JSON is likely incomplete
without needing to attempt a parse.
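A client-side sketch (gateway URL and model are illustrative):

```ts
async function jsonCompletion(messages: unknown[]) {
  const res = await fetch("https://your-gateway/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.FREELLM_API_KEY}`,
    },
    body: JSON.stringify({
      model: "free-fast",
      messages,
      response_format: { type: "json_object" },
    }),
  });

  // Bail out before parsing when the gateway flags likely truncation.
  if (res.headers.get("X-FreeLLM-Warning") === "json-possibly-truncated") {
    throw new Error("JSON likely truncated; retry with a larger max_tokens");
  }
  return res.json();
}
```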
reasoning_effort accepted in the request schema
The Zod schema now accepts reasoning_effort: "none" | "low" | "medium" | "high" so clients can override the per-model default.
Matches Gemini's OpenAI-compat knob and OpenAI's o-series
reasoning parameter.
finish_reason surfaced in the request log
The request log entry now carries a finishReason field populated
from the upstream response. When the reason is "length" the router
also emits a pino warning tagged with provider, model, max_tokens,
and reasoning_effort so operators can see which requests hit the
token cap.
Configuration
New configuration surface (a request field rather than an environment variable):
| Field | Default | What it does |
|---|---|---|
| `reasoning_effort` | Per-model | `"none"` for `gemini-2.5-flash`, `"low"` for `gemini-2.5-pro`. Clients can override per request. |
Tests
262 passing tests across 22 files (up from 215 at v1.5.0):
- `gemini-provider.test.ts`: 14 tests for per-model defaults, mapRequest normalization, and catalog verification
- `cache.test.ts`: 17 tests for isCacheable, truncation skip, and key discrimination across all request shape fields
- `nim-provider.test.ts`: 6 tests for json_schema-to-nvext translation, passthrough, and field preservation
- `json-truncation.test.ts`: 4 e2e tests for the warning header
- `schema-tools.test.ts`: 2 new tests for reasoning_effort
- `router.test.ts`: 4 new tests for finish_reason handling and cache anti-poisoning
v1.5.0: Browser-safe tokens and streaming correctness
The browser-safe release. FreeLLM can now be safely exposed to an app's
end users via short-lived HMAC-signed tokens that are bound to an origin
and an identifier. Streaming tool calls from Gemini and Ollama are fixed
at the gateway, the Zod schema finally accepts the full OpenAI request
shape, the dashboard surfaces the new trust layer, and every layer was
verified end-to-end with the real openai npm SDK hitting real provider
APIs.
Added
Browser-safe short-lived tokens
Operators can now mint stateless HMAC-signed bearer tokens that are
safe to ship to a browser. Integration pattern: a one-file serverless
function calls POST /v1/tokens/issue with the master or virtual key,
returns the minted token to the browser, and the browser uses the
token directly with any OpenAI SDK. No auth backend, no session
store, no database.
- New module `src/gateway/browser-token.ts` with pure `signBrowserToken` and `verifyBrowserToken` helpers. HMAC-SHA256 over a JSON payload, constant-time signature comparison, base64url encoding.
- Token format: `flt.<base64url(payload)>.<hex(hmac)>`.
- Payload v1 carries `v`, `iat`, `exp`, `origin`, and optional `identifier` and `vk` (virtual key id).
- Max TTL 900 seconds (15 minutes), clamped at issue time.
- Origin is embedded in the token and compared against the browser's `Origin` header on every verify. Mismatch = reject.
- `FREELLM_TOKEN_SECRET` must be at least 32 bytes. Enforced both at boot and on every sign/verify operation. A short secret is a fatal boot failure so no path can produce a weak token.
- Constant-time signature comparison via `timingSafeEqual`.
- Secret rotation immediately invalidates all outstanding tokens (intentional kill switch for compromised deployments).
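A minimal sketch of the scheme described above; the real helpers in `src/gateway/browser-token.ts` also clamp TTL and enforce the secret length at boot:

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

interface BrowserTokenPayload {
  v: 1;
  iat: number;            // issued-at, unix seconds
  exp: number;            // expiry, unix seconds
  origin: string;
  identifier?: string;
  vk?: string;            // virtual key id, when a virtual key minted the token
}

function sign(payload: BrowserTokenPayload, secret: string): string {
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  const mac = createHmac("sha256", secret).update(body).digest("hex");
  return `flt.${body}.${mac}`;
}

function verify(token: string, secret: string, requestOrigin: string): BrowserTokenPayload | null {
  const [prefix, body, mac] = token.split(".");
  if (prefix !== "flt" || !body || !mac) return null;

  const expected = createHmac("sha256", secret).update(body).digest("hex");
  const a = Buffer.from(mac, "hex");
  const b = Buffer.from(expected, "hex");
  if (a.length !== b.length || !timingSafeEqual(a, b)) return null; // bad_signature

  const payload = JSON.parse(Buffer.from(body, "base64url").toString()) as BrowserTokenPayload;
  if (payload.exp * 1000 < Date.now()) return null;                 // expired
  if (payload.origin !== requestOrigin) return null;                // origin_mismatch
  return payload;
}
```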
POST /v1/tokens/issue endpoint
Admin-auth-equivalent mint endpoint that any existing master key,
admin key, or virtual key can call. Browser tokens themselves cannot
mint new browser tokens (chain guard). When a virtual key is the
issuer, the resulting token inherits its id so the existing Phase 2
cap enforcement flows through untouched.
curl https://your-gateway/v1/tokens/issue \
-H "Authorization: Bearer $FREELLM_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"origin": "https://yoursite.com",
"identifier": "session-abc",
"ttlSeconds": 900
}'
# Response:
# {
# "token": "flt.eyJ2IjoxLCJpYXQiOi...",
# "expiresAt": "2026-04-09T07:15:00.000Z",
# "origin": "https://yoursite.com",
# "identifier": "session-abc"
# }

Origin must be `https://*` or `http://localhost[:port]`. Identifier
must match `^[A-Za-z0-9_.:-]{1,128}$`, same pattern as the per-user
rate limiter. TTL clamped to [1, 900] seconds.
Browser token authentication
The auth middleware now recognizes flt.* bearers alongside the
master key, admin key, and virtual keys. Verification checks the
signature, expiry, and Origin header atomically. On success:
- `req.browserToken` carries the verified payload
- `req.virtualKey` is hydrated if the token carries a virtual key id
- The token's identifier is copied into `X-FreeLLM-Identifier` so the existing per-user rate limiter picks it up with zero extra code
- All other middleware (privacy routing, strict mode, streaming normalizer, cap enforcement) works unchanged
On failure the middleware returns 401 invalid_api_key with the
specific rejection reason in the message (origin_mismatch, expired,
bad_signature, etc.).
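For illustration, a browser-side sketch of the intended flow; the `/api/token` mint endpoint and its response shape are assumptions for this example, not part of the gateway:

```ts
import OpenAI from "openai";

// Fetch a short-lived flt.* token from your own serverless mint function,
// then point the stock openai SDK at the gateway.
const { token } = await fetch("/api/token").then((r) => r.json());

const client = new OpenAI({
  baseURL: "https://your-gateway/v1",
  apiKey: token,                  // the browser token acts as the bearer credential
  dangerouslyAllowBrowser: true,  // acceptable here: the token is origin-bound and expires in minutes
});

const completion = await client.chat.completions.create({
  model: "free-fast",
  messages: [{ role: "user", content: "Hello!" }],
});
```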
Streaming tool_call normalization
Every chat completion stream now flows through a per-provider
normalizer that sits between the upstream fetch and the downstream
response. Fixes three widely-reported bugs without touching the
upstreams or the clients:
- Gemini `index` field missing on streaming tool_call deltas: the normalizer assigns an index per function name so argument fragments across chunks land on the same logical tool call, and stamps `type: "function"` where it's missing. Multi-tool parallel calls get distinct indices tracked by name.
- Ollama flat-argument hoisting: Ollama sometimes emits `arguments` at the top level of a tool_call rather than inside `function`. The normalizer hoists it into the expected shape and reuses the last-seen index for subsequent fragments.
- Malformed chunk resilience: every transform is wrapped in try/catch so one bad byte from upstream logs a warning and forwards the original event verbatim instead of crashing the whole response.
New module tree under `src/gateway/streaming/`:
- `sse.ts`: tolerant SSE parser and serializer, handles CRLF, partial chunks, `[DONE]`, and comment heartbeats.
- `types.ts`: minimal OpenAI chunk shape plus the `Normalizer` interface.
- `normalizer.ts`: per-provider dispatcher with passthrough default.
- `passthrough.ts`: no-op for Groq, Cerebras, NIM, and Mistral.
- `gemini.ts` and `ollama.ts`: provider-specific fixes.
- `pipeline.ts`: ties parser, dispatcher, and serializer together with defensive try/catch at every stage.
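A sketch of the Gemini index-assignment idea (not the actual `gemini.ts` source): track one index per function name so later argument fragments attach to the same logical tool call.

```ts
interface ToolCallDelta {
  index?: number;
  type?: "function";
  function?: { name?: string; arguments?: string };
}

function makeGeminiToolCallNormalizer() {
  const indexByName = new Map<string, number>();
  let lastIndex = -1;

  return (call: ToolCallDelta): ToolCallDelta => {
    const name = call.function?.name;
    if (name !== undefined) {
      if (!indexByName.has(name)) indexByName.set(name, indexByName.size);
      lastIndex = indexByName.get(name)!;
    }
    return {
      ...call,
      index: call.index ?? (lastIndex >= 0 ? lastIndex : 0), // reuse last-seen index for fragments
      type: call.type ?? "function",                          // stamp the missing type field
    };
  };
}
```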
A heartbeat comment `\n: keep-alive\n\n` is written every
`STREAM_IDLE_TIMEOUT_MS` (default 30s) so proxies on the path like
Railway and Cloudflare do not drop slow streams. Client disconnect
is detected via res.on('close') and cancels the upstream reader
so we do not burn provider quota into a dead socket.
Full OpenAI request shape accepted
The Zod schema and matching TypeScript types now accept the full
OpenAI Chat Completions surface, including:
- `tools` array with nested `function` definition, name, description, parameters, and strict flag
- `tool_choice` as `"none" | "auto" | "required"` or a specific function reference
- `parallel_tool_calls`
- `response_format` text / json_object / json_schema envelope
- `stream_options.include_usage` for per-chunk usage accounting
- `max_completion_tokens` alongside `max_tokens`
- `presence_penalty` and `frequency_penalty` validated in `[-2, 2]`
- `seed` and `user` for observability
- `tool_call_id` on tool-role messages
- `tool_calls` array on assistant-role messages
- `developer` role (used by newer OpenAI models)
Strict mode is preserved so genuinely unknown fields still throw.
Only the known OpenAI surface is accepted.
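For reference, a tool-calling request shape that now validates (values are illustrative, not taken from the project's tests):

```ts
const request = {
  model: "free-smart",
  messages: [{ role: "user", content: "What's the weather in Pune?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "get_weather",
        description: "Look up current weather for a city",
        parameters: {
          type: "object",
          properties: { city: { type: "string" } },
          required: ["city"],
        },
        strict: true,
      },
    },
  ],
  tool_choice: "auto",
  parallel_tool_calls: true,
  stream: true,
  stream_options: { include_usage: true },
  seed: 42,
};
```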
Caught while verifying the streaming normalizer with a real openai
npm SDK client: the previous schema rejected every tool-calling
request with 400 before it ever reached the streaming pipeline, so
the normalizer was unreachable in practice for the exact audience it
was built for. Fixed and regression-tested against the shapes the
real SDK sends.
Dashboard v1.5 surface
- Provider cards carry a color-coded Trust badge: `NO-TRAIN` (emerald), `LOCAL` (sky), `CONFIG` (amber), or `TRAINS` (rose). Click opens the provider's actual terms of service URL in a new tab; hover shows the last-verified date.
- Virtual Keys panel on the main dashboard page lists every loaded virtual sub-key with per-key progress bars for daily request cap and daily token cap (rose tint when usage crosses 90%), allowed models surfaced as monospace badges, expired keys flagged, and a persistent amber reminder that caps are soft.
- Browser Tokens card shows a live enabled/disabled status with an emerald pulsing dot when `FREELLM_TOKEN_SECRET` is set, max TTL and min secret length visible at a glance, a link out to the `/browser-integration` docs page, and an amber callout on the disabled state telling operators exactly which env var to set.
- The `GET /v1/status` response grew a new `browserTokens` field `{ enabled, minSecretBytes, maxTtlSeconds }` plus a `privacy` block on every `ProviderStatus`, sourced from the `PROVIDER_PRIVACY` catalog.
- New admin-only endpoint `GET /v1/status/virtual-keys` returns a masked inventory of loaded virtual sub-keys (first 12 chars plus last 4) with cap limits, remaining usage, allowed models, expiry, and a `softCapWarning` string reminding the caller that counters reset on restart.
Browser integration docs and runnable example
- `packages/website/src/content/docs/browser-integration.mdx` walks through the full flow: mint from a Node or Flask backend, use the token in the browser via the official openai SDK (with `dangerouslyAllowBrowser: true`), handle expiry, verify origin binding, secret rotation, common pitfalls.
- New Integration sidebar group on the Starlight website. 17 pages total, up from 16.
- `examples/browser-chatbot/` copy-paste runnable demo: a 108-line self-contained HTML chatbot using the openai npm package via esm.sh, a 37-line Vercel serverless function that mints tokens server-side from env vars, and a 58-line README covering setup, local development, Vercel deploy, and security notes. No frameworks, no build step.
Changed
- `auth` middleware now accepts `FREELLM_ADMIN_KEY` as a valid base credential alongside the master key and virtual keys, so the `admin_required` 403 path is actually reachable when an operator runs with a distinct admin token. Previously the admin-only routes were unreachable without also setting `FREELLM_API_KEY`.
- `ProviderStatusInfo` gained an optional `privacy` block populated from the `PROVIDER_PRIVACY` catalog with `policy`, `sourceUrl`, and `lastVerified` fields.
- `GatewayStatus` gained a `browserTokens` block surfacing the boot state of the feature.
Configuration
New environment variables:
| Variable | Default | What it does |
|---|---|---|
| `FREELLM_TOKEN_SECRET` | unset | HMAC secret for browser tokens, minimum 32 bytes. Short = fatal boot failure. Unset = browser tokens disabled, the rest of the gateway runs unchanged. |
| `STREAM_IDLE_TIMEOUT_MS` | `30000` | Heartbeat cadence for the SSE keep-alive comment. |
Previously shipped in v1.4.0 and still supported:
FREELLM_IDENTIFIER_LIMIT, FREELLM_IDENTIFIER_MAX_BUCKETS,
FREELLM_VIRTUAL_KEYS_PATH.
Tests
232 passing tests across 19 files (up from 146 at v1.4.0):
- `tests/browser-token.test...
v1.4.0: Honest Gateway
The honest-gateway release. FreeLLM now tells you exactly which provider
answered your request, lets you refuse silent downgrades, routes around
providers that train on your prompts, returns enriched retry hints on 429s,
and can be safely exposed to an app's end users via virtual sub-keys and
per-identifier rate limiting. Everything ships with a real test suite and
no new runtime dependencies.
Added
Transparent routing headers
Every chat completion response now carries observability headers so
clients can see exactly how the request was handled:
- `X-FreeLLM-Provider`— the concrete provider id that served the response
- `X-FreeLLM-Model`— the resolved concrete model id
- `X-FreeLLM-Requested-Model`— the original model asked for
- `X-FreeLLM-Cached`— `true` when the response came from the cache
- `X-FreeLLM-Route-Reason`— one of `direct`, `meta`, `cache`, `failover`
- `X-Request-Id`— a unique trace id that also appears in logs and error bodies
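A client-side sketch of reading those headers; it relies on the openai SDK's withResponse() helper, and the gateway URL is illustrative:

```ts
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://your-gateway/v1",
  apiKey: process.env.FREELLM_API_KEY!,
});

// withResponse() exposes the raw fetch Response alongside the parsed completion,
// so the routing headers can be logged next to the answer.
const { data, response } = await client.chat.completions
  .create({ model: "free", messages: [{ role: "user", content: "hi" }] })
  .withResponse();

console.log(data.choices[0].message.content, {
  provider: response.headers.get("X-FreeLLM-Provider"),
  cached: response.headers.get("X-FreeLLM-Cached"),
  routeReason: response.headers.get("X-FreeLLM-Route-Reason"),
  requestId: response.headers.get("X-Request-Id"),
});
```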
Strict mode
Opt-in via X-FreeLLM-Strict: true. In strict mode the router refuses
to substitute models. Meta-models (free, free-fast, free-smart)
are rejected with a clear 400. Concrete models are tried against
exactly one provider and the upstream error surfaces verbatim if that
provider fails. No silent failover, no cache hit masquerading as fresh.
Actionable 429 bodies
When all providers are exhausted, the gateway now returns a structured
body instead of a generic error:
{
"error": {
"type": "rate_limit_error",
"code": "all_providers_exhausted",
"message": "...",
"retry_after_ms": 12000,
"providers": [
{ "id": "groq", "retry_after_ms": 12000, "keys_available": 0, "keys_total": 1, "circuit_state": "closed" },
{ "id": "gemini", "retry_after_ms": 5000, "keys_available": 0, "keys_total": 1, "circuit_state": "closed" }
],
"suggestions": [
{ "model": "free-fast", "available_in_ms": 5000 },
{ "model": "free-smart", "available_in_ms": 5000 }
],
"request_id": "..."
}
}

The response also carries an HTTP Retry-After header in seconds.
Unified error SDK
New src/errors/ module defines the one and only error taxonomy the
gateway emits. Fifteen concrete error codes grouped into seven types
that match OpenAI's shape (invalid_request_error, authentication_error,
permission_error, not_found_error, rate_limit_error, provider_error,
internal_error). Every middleware delegates via next(freellmError(...))
instead of writing response bodies directly, and the central error
handler funnels everything through a single toBody() serializer.
- `freellmError({ code, message, ...context })` factory
- `httpStatusFor(code)` and `typeFor(code)` lookup tables
- `toBody(err, requestId)` never throws, falls back to an `internal_server_error` envelope for unknown input
- `redactSecrets(message)` strips Bearer tokens, API-key-looking values, and long hex sequences from error messages before they go on the wire
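A sketch of how a middleware and the central handler fit together; import paths and exact signatures are assumptions, not from the source:

```ts
import type { ErrorRequestHandler } from "express";
import { freellmError, httpStatusFor, toBody } from "./errors";

// Central handler: everything funnels through toBody(), which never throws.
const errorHandler: ErrorRequestHandler = (err, _req, res, _next) => {
  const requestId = res.getHeader("X-Request-Id") as string | undefined;
  res.status(httpStatusFor(err.code) ?? 500).json(toBody(err, requestId));
};

// A middleware rejects a request by delegating instead of writing the body itself:
// next(freellmError({ code: "model_not_supported", message: `Unknown model ${req.body.model}` }));
```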
Request id propagation
New request-id middleware mounts first in the pipeline and assigns
every request a UUID (honors an inbound X-Request-Id matching
^[A-Za-z0-9_.:-]{1,128}$ so distributed traces can thread through).
The same id flows into the response header, the error body, and every
pino log line via genReqId so a single grep correlates access logs,
error logs, and bug reports.
Privacy and training-policy routing
New X-FreeLLM-Privacy: no-training header filters the router's
candidate list to providers that contractually exclude free-tier data
from training. Backed by a new PROVIDER_PRIVACY catalog with source
URLs and last-verified dates for every shipped provider:
| Provider | Policy |
|---|---|
| Groq | no-training |
| Cerebras | no-training |
| NVIDIA NIM | no-training |
| Ollama | local |
| Mistral | configurable |
| Gemini | free-tier trains |
When no provider can satisfy the posture for the requested model, the
gateway returns a 400 model_not_supported up front instead of
pointlessly cycling through the exclusion list. Server logs a warning
at boot for any catalog entry older than 90 days so operators re-verify
against the provider's current ToS.
Robust Retry-After handling
Upstream Retry-After headers are now parsed in both integer-seconds
and HTTP-date formats, clamped into [1s, 10min], and honored on 5xx
responses as well as 429s. Absurd values like 99999999 can no longer
lock a key out for years, and past HTTP dates floor to one second.
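A sketch of the parse-and-clamp idea (not the project's actual helper): accept integer or fractional seconds or an HTTP date, then clamp into [1s, 10min].

```ts
const MIN_MS = 1_000;
const MAX_MS = 10 * 60 * 1_000;

function parseRetryAfterMs(header: string | null, now = Date.now()): number | undefined {
  if (!header) return undefined;

  let ms: number;
  const seconds = Number(header);
  if (Number.isFinite(seconds)) {
    ms = seconds * 1_000;                 // integer or fractional seconds form
  } else {
    const date = Date.parse(header);      // HTTP-date form
    if (Number.isNaN(date)) return undefined;
    ms = date - now;                      // past dates go negative and floor to MIN_MS below
  }
  return Math.min(MAX_MS, Math.max(MIN_MS, ms));
}
```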
Per-identifier rate limiting
Every request can now carry an X-FreeLLM-Identifier header tagging it
with a logical identity (app user id, session token, anything that
fits ^[A-Za-z0-9_.:-]{1,128}$). The gateway tracks requests per
identifier in an independent sliding-window bucket. One noisy user
hitting their cap doesn't affect anyone else.
- Configurable via `FREELLM_IDENTIFIER_LIMIT=<max>/<windowMs>`, default 60/60000
- Hard ceiling of `FREELLM_IDENTIFIER_MAX_BUCKETS` distinct identifiers (default 10000) with LRU eviction on overflow
- Idle buckets garbage-collected after 2x the window
- Synchronous check-and-increment so concurrent requests cannot race
- Missing header falls back to `ip:<client-ip>`
- Literal `"undefined"` or `"null"` strings are treated as missing
- Tainted values (control chars, spaces, too long) are rejected with a clear 400 instead of silently entering logs
- Responses carry `X-FreeLLM-Identifier`, `X-FreeLLM-Identifier-Remaining`, and `X-FreeLLM-Identifier-Reset` so clients can self-throttle
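A minimal sketch of the sliding-window-plus-LRU idea under the defaults listed above; illustrative only, the real limiter also garbage-collects idle buckets and parses the env knobs:

```ts
class IdentifierLimiter {
  private buckets = new Map<string, number[]>(); // identifier -> request timestamps

  constructor(
    private max = 60,
    private windowMs = 60_000,
    private maxBuckets = 10_000,
  ) {}

  /** Synchronous check-and-increment: returns true if the request is allowed. */
  check(identifier: string, now = Date.now()): boolean {
    const hits = (this.buckets.get(identifier) ?? []).filter((t) => now - t < this.windowMs);
    if (hits.length >= this.max) {
      this.buckets.set(identifier, hits);
      return false;
    }
    hits.push(now);

    // Map re-insertion keeps recently used identifiers at the end; evict the oldest on overflow.
    this.buckets.delete(identifier);
    this.buckets.set(identifier, hits);
    if (this.buckets.size > this.maxBuckets) {
      const oldest = this.buckets.keys().next().value as string;
      this.buckets.delete(oldest);
    }
    return true;
  }
}
```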
Virtual sub-keys with soft caps
Operators can now declare virtual sub-keys in a JSON file pointed at by
FREELLM_VIRTUAL_KEYS_PATH. Each key can carry its own request cap,
token cap, model allowlist, and expiry:
{
"keys": [
{
"id": "sk-freellm-portfolio-abc123",
"label": "My portfolio site",
"dailyRequestCap": 500,
"dailyTokenCap": 200000,
"allowedModels": ["free-fast", "free"],
"expiresAt": "2026-07-01T00:00:00Z"
}
]
}

The store is loaded at boot, Zod-validated, and rejects duplicate ids
and files larger than 1 MB. Virtual keys authenticate via
Authorization: Bearer sk-freellm-... alongside the existing
FREELLM_API_KEY master key. The chat route guards each request via
assertCanServe BEFORE routing to a provider (expiry, model allowlist,
request cap, token cap) and records usage AFTER a successful upstream
response, so failed routes never burn quota. Each cap hit returns its
own typed error: virtual_key_cap_reached, model_not_supported,
invalid_api_key.
Counters are in-memory, rolling 24 hours, reset on restart. This
is explicitly a soft cap (runaway-loop and abuse protection, not a
billing system). The server logs a loud warning at boot when any
virtual keys are loaded so operators cannot mistake it for billing.
Security, privacy, and benchmarks pages
Three new pages on the documentation website grouped under a new
Trust section:
- `/security` lists the six direct production dependencies, what is deliberately not in the codebase (no telemetry, no runtime code generation, no plugin loaders, no install-time scripts), how to verify a deployed Docker image, and where to report vulnerabilities
- `/privacy` renders the provider training-policy catalog with links to each provider's own terms of service
- `/benchmarks` publishes cold-start and per-request overhead numbers rendered from `docs/benchmarks.json` with a methodology section
Reproducible benchmark script
New scripts/bench.mjs spawns the built server against a fake
in-process upstream, measures boot time to first /healthz 200, then
runs cache-miss and cache-hit passes and writes docs/benchmarks.json.
Run it locally with node scripts/bench.mjs --print.
Reference numbers on a developer laptop:
- Cold start: ~127 ms (spawn to first `/healthz` 200)
- Cache-miss overhead: p50 0.69 ms, p99 1.37 ms
- Cache-hit overhead: p50 0.34 ms, p99 0.92 ms
Continuous integration
New .github/workflows/ci.yml runs pnpm -r typecheck, the api-server
test suite, and pnpm audit --prod --audit-level=moderate on every
push and every pull request. Audit failures are tracked through
.github/audit-allowlist.json (a process contract, not an automated
bypass). Supports future badge wiring.
Test suite
FreeLLM now ships with 141 passing tests (up from 0) across eleven
test files, all green on every commit via CI:
- `errors.test.ts`— exhaustive code-to-status and code-to-type coverage, factory, guard, serializer, and redact helpers
- `errors-integration.test.ts`— X-Request-Id propagation, canonical shape on 400/401 paths
- `strict.test.ts`— header parser, meta-model rejection
- `retry-advice.test.ts`— per-provider and global earliest-retry math, hint ordering, suggestions
- `retry-after.test.ts`— integer, fractional, HTTP-date, clamping at both ends, every invalid input
- `privacy.test.ts`— header parsing, catalog exhaustiveness, satisfaction (including unknown-id fail-closed), staleness math
- `identifier-limiter.test.ts`— sliding window, LRU, TTL, isolation, env parser
- `virtual-keys.test.ts`— construction, duplicate rejection, expiry, allowedModels, rolling-window cap enforcement, file loading edge cases
- `router.test.ts`— direct, failover, strict mode, privacy routing, Retry-After plumbing
- `e2e.test.ts`— real Express app against a fake upstream, full header assertions
- `multi-tenant-e2e.test.ts`— virtual key auth, cap enforcement, identifier middleware end-to-end
Changed
- `GatewayRouter.complete()` now returns `{ data, meta }` with full route metadata (provider, resolvedModel, requestedModel, cached, reason, attempted providers) instead of just the chat completion
- The central `errorHandler` runs every error class through `normalizeEr...
v1.3.0 — Response caching
Response caching — same prompt twice returns the cached response in ~23ms
with zero provider quota burn. Verified end-to-end at 9× faster than
the cold path (200ms → 23ms).
Added
In-memory LRU response cache
- New `ResponseCache` class with sha256-keyed exact-match lookup
- Cache key built from `(model, messages, temperature, max_tokens, top_p, stop)`
- LRU eviction via Map re-insertion (recently-used entries stay at the end)
- Per-entry TTL expiry (default 1 hour, configurable)
- Default capacity 1000 entries (configurable)
- Cache hits short-circuit the entire routing flow: no provider call, no token quota burn, no rate limiter increment
- Streaming requests are never cached (the SSE protocol is incompatible)
- Errors are never cached (only successful 2xx responses)
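A minimal sketch of the Map-re-insertion LRU idea (illustrative, not the project's ResponseCache source): re-inserting on read keeps hot keys at the end of the Map, so the first key in iteration order is always the least recently used.

```ts
class LruCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxEntries = 1000, private ttlMs = 3_600_000) {}

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt < Date.now()) {          // per-entry TTL expiry
      this.entries.delete(key);
      return undefined;
    }
    this.entries.delete(key);                   // re-insert to mark as recently used
    this.entries.set(key, hit);
    return hit.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      this.entries.delete(this.entries.keys().next().value as string); // evict LRU entry
    }
  }
}
```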
Response markers
- Cached responses include `x_freellm_cached: true` (alongside `x_freellm_provider`)
- `RequestLogEntry` gained a `cached?: boolean` field
- Token usage tracker is not incremented on cache hits (real cost = 0)
Cache stats on /v1/status
- New `cache` field with full counters: `{ "enabled": true, "ttlMs": 3600000, "maxEntries": 1000, "currentSize": 12, "hits": 47, "misses": 8, "sets": 8, "evictions": 0, "hitRate": 0.8545 }`
Configuration
- `CACHE_ENABLED` (default `true`) — set to `false` to disable
- `CACHE_TTL_MS` (default `3600000` = 1 hour)
- `CACHE_MAX_ENTRIES` (default `1000`)
Dashboard
- New 5th metrics card "Cache Hits" (cyan, Database icon) with hit-rate sub-line
- Metrics row layout updated to 2/3/5 cols across mobile/medium/large breakpoints
- Recent requests table shows a `CACHE` badge next to `OK` for cached rows
Why in-memory instead of SQLite
The original plan called for better-sqlite3, but it was rejected because:
- Native compilation risk — `better-sqlite3` needs `node-gyp` + Python + build tools at install time. Railway's slim image likely lacks them, which would break the published Railway template's build.
- Ephemeral filesystem on free tiers — Railway and Render free tiers don't have persistent disk. A SQLite cache file would be wiped on every restart anyway, requiring a paid persistent volume.
- Architectural consistency — every other observability piece in FreeLLM (`RequestLog`, `RateLimiter`, `CircuitBreaker`, `UsageTracker`) is in-memory. Adding DB-backed storage for one feature would break the pattern.
Cold cache warms up in seconds, restart loss is acceptable for a free-tier
gateway, and the entire feature ships with zero new dependencies (uses
Node's built-in crypto.createHash). The ResponseCache class lives behind
a clean interface, so swapping the storage to SQLite later is a one-file
change if persistence becomes a priority.
Verified end-to-end
Call A (cold) cached=false latency=200ms tokens=43+2 provider=groq
Call B (same) cached=true latency=23ms tokens=0 no upstream
Call C (same) cached=true latency=23ms tokens=0 no upstream
Call D (different prompt) cached=false latency=~200ms tokens=new provider=groq
9× speedup, 50% hit rate after 4 calls, all 18 gateway tests still passing.
v1.0.0 — First stable release
First stable release. Production-ready OpenAI-compatible gateway aggregating
6 free LLM providers with automatic failover, circuit breakers, and a
real-time dashboard.
Added
Gateway
- OpenAI-compatible `/v1/chat/completions` endpoint with streaming and non-streaming support
- 6 LLM providers: Groq, Gemini, Mistral, Cerebras, NVIDIA NIM, and Ollama
- 25+ models across providers including Llama 3.3 70B, Gemini 2.5 Flash/Pro, Llama 4 Scout, Qwen3, Nemotron 70B, DeepSeek R1, GPT-OSS 120B
- Three meta-models: `free` (round-robin), `free-fast` (latency-optimized), `free-smart` (capability-optimized)
- Automatic failover across providers with configurable routing strategies (round-robin, random)
- Per-provider circuit breakers with three states (closed → open → half-open) and configurable thresholds
- Per-provider sliding-window rate limiting with conservative free-tier defaults
- Per-client (per-IP) rate limiting via `express-rate-limit`
- In-memory request log (last 500 requests) with stats and recent history
- Routing deadline (`ROUTE_TIMEOUT_MS`) to prevent hung requests during cascading failures
Security
- Optional API key authentication (`FREELLM_API_KEY`) using timing-safe SHA-256 comparison
- Separate admin key (`FREELLM_ADMIN_KEY`) protecting circuit breaker reset and routing strategy mutations
- Configurable CORS origins (`ALLOWED_ORIGINS`)
- Body size limits on JSON and URL-encoded payloads
- Zod schema validation with strict mode and bounded `messages.max(256)` / `max_tokens.max(32768)`
- Upstream error sanitization (only the safe `message` field forwarded, never raw upstream JSON)
- Production warning when running without API key auth
Dashboard
- React 18 + Vite + Tailwind SPA served by the same Express process in production
- Real-time provider health cards (circuit breaker state, success/failure counts, last error)
- Live request log with latency, status, model, and selected provider
- Routing strategy toggle (round-robin / random)
- Manual circuit breaker reset
- Models page with search and grouping by provider
- Mobile-responsive layout with slide-over menu
- New FreeLLM logo as favicon and Open Graph image
Deployment
- Multi-stage Dockerfile (Node 22 LTS, non-root `appuser`, healthcheck baked in)
- `docker-compose.yml` for one-command local deployment
- `railway.json` for Railway auto-detection with healthcheck and restart policy
- Graceful shutdown on SIGTERM/SIGINT (drains in-flight requests, 8s deadline)
- `app.set("trust proxy", 1)` for correct client IP behind reverse proxies
- Static dashboard serving with SPA fallback for client-side routing
- Production-ready logging via Pino with structured JSON output
Developer Experience
- pnpm workspace monorepo with shared dependency catalog
- TypeScript 5.9 across all packages with `bundler` module resolution
- esbuild bundle for the API server with CJS shim for Pino compatibility
- OpenAPI 3.1 spec as the single source of truth for the API client
- Auto-generated React Query hooks via Orval (`@workspace/api-client-react`)
- Knip configuration for unused export detection
- `scripts/test-gateway.sh` end-to-end test suite with 18 checks (health, models, status, completions, streaming, NIM direct, validation)
Documentation
- Comprehensive README with quickstart (Docker + local), provider table, API reference, security guide, and tech stack
- Mermaid diagrams for request lifecycle, circuit breaker state machine, routing strategies, and high-level architecture
- MIT license
- Architecture refactor plan in `docs/superpowers/plans/`