| layout | default |
|---|---|
| title | Model Gateway Primitives RFC |
Status: implemented for the core runtime. Alloy context-window
safety, named [[cascades]], named [[dispatchers]], and the shared
estimator trait are in code. char_ratio, byte_ratio, and optional
tiktoken-rs estimators are implemented through [proxy.token_estimator].
capacity_fraction and per-model/per-primitive estimator overrides remain
future work.
| # | Question | Resolution |
|---|---|---|
| 1 | Name for the size-routing primitive | dispatcher ("router" too generic) |
| 2 | Cascade as a named primitive | Yes — own [[cascades]] table |
| 3 | Safety margin default | Two knobs — estimator safety_margin (default 1.10) AND per-model capacity_fraction (default 1.0; users lower to e.g. 0.85 when a model degrades near its ceiling). Composition formula and rationale in the updated section below. |
| 4 | Per-primitive tokenizer override + second tokenizer impl | Global estimator config ships first. CharRatioEstimator, ByteRatioEstimator, and optional TiktokenEstimator are wired into routing. SentencePieceEstimator and per-model/per-primitive overrides are deferred. |
| 5 | Re-evaluation default for dispatchers | per_turn (re-evaluate each message — never dies from size). sticky as opt-in for flows where model-voice continuity matters. sticky_escalate as a middle-ground convenience (sticky, permit one auto-promotion on ceiling, then sticky at the new tier). worst_case advanced opt-in with required growth prior. |
| 6 | Back-compat: allow missing context_window on alloy constituents | No. Required field. Prototype phase, all installations owned in-house. Forcing size declaration at config load prevents silent truncation forever; trivial one-time config edit. |
| 7 | Dispatcher rule semantics + capacity_fraction interaction | Default: "first target whose effective ceiling fits the request." No max_input_tokens thresholds needed for the common case. capacity_fraction lives on each model individually and feeds the effective-ceiling computation. Explicit when.max_input_tokens rules remain available for non-size routing (cost tier, agent-id, etc.). |
Today calciforge has one model-blending primitive — alloy — and an implicit on-error fallback behavior. That's not enough:
- Alloys assume constituents are interchangeable (any can serve any request). That assumption breaks when constituents have different context windows — e.g., local Qwen 3.5 (32K) + Kimi K2.6 (262K).
- Without size-awareness, a 100K-token request can be silently routed to a 32K model and truncated, losing data with no error.
This RFC proposes:
- Three clearly-scoped primitives, each with a distinct purpose:
  - [[alloys]]: blend between equivalent models (implemented, with context-window safety)
  - [[cascades]]: try in order, fall through on error (implemented as a named primitive)
  - [[dispatchers]]: pick by request shape (implemented for size-first routing)
- A shared TokenEstimator trait used by all three primitives to reason about whether a request "fits" a model. Default: configurable chars-per-token heuristic. Pluggable: real tokenizers (tiktoken, sentencepiece).
- A min_context_window safety assertion at alloy-build time, so silent-truncation footguns fail loudly.
The user wrote recently:
I think we also want an alloy that hybridizes local and kimi 2.6 models to see if we can best of both and avoid hitting limits on kimi while still leveraging local compute and hopefully getting results nearly as good as kimi… though not sure how context window sizes will work in alloys, maybe we never addressed that?
This is a real use case — "most requests are small and fit local; occasional large ones need Kimi" — but forcing it into an alloy is a dead-end. Alloys do random weighted sampling between constituents. That's meaningful when constituents are ~equivalent and the blend expresses a cost/quality preference. It breaks when one constituent can't serve certain requests at all.
The real abstraction the user wants is: "choose smallest-sufficient model per request." That's not sampling — it's routing.
A cascade picks the primary and falls through to the secondary on failure. If the primary is 200K-context Claude and the secondary is 32K Qwen, a 100K request that fails on the primary (say Claude is rate-limited on the second call) falls through to Qwen and silently loses roughly 70% of the context.
All three primitives need to respect context-window math.
Purpose: cost/quality blending via sampling. 80% fast + 20% smart = blended average.
Assumption: constituents are interchangeable. New requirement: they must have compatible context windows.
New config fields:
[[alloys]]
id = "fast-smart-blend"
name = "Fast + Smart Blend"
strategy = "weighted"
# Effective ceiling for the alloy. If not set, auto-computed as min(constituents.context_window).
# Requests above this ceiling are rejected at alloy level (loudly, not silently).
min_context_window = 200000
[[alloys.constituents]]
model = "gemini-2.5-flash"
context_window = 1048576
weight = 80
[[alloys.constituents]]
model = "claude-haiku-4-6"
context_window = 200000
weight = 20

Validation: AlloyProvider::from_config() returns an error if any constituent's declared context_window < min_context_window. This catches the "I didn't mean to put a 32K and a 262K in the same alloy" footgun at config-load time.
Runtime check: when request arrives, estimate tokens (via TokenEstimator) and reject with clear error if estimate > min_context_window. Never truncate silently.
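The two checks above can be sketched as follows. This is a minimal illustration, not the actual AlloyProvider code: `Constituent`, `AlloyError`, `validate_alloy`, and `check_request` are hypothetical names, and the token estimate is assumed to come from a TokenEstimator elsewhere.

```rust
/// Hypothetical, simplified stand-in for an alloy constituent entry.
#[derive(Debug, Clone)]
pub struct Constituent {
    pub model: String,
    pub context_window: usize,
    pub weight: u32,
}

/// Hypothetical error type for the two alloy checks.
#[derive(Debug, PartialEq)]
pub enum AlloyError {
    Empty,
    ZeroWindow { model: String },
    WindowBelowMinimum { model: String, window: usize, min: usize },
    RequestTooLarge { estimate: usize, ceiling: usize },
}

/// Config-load check. Returns the alloy's ceiling: the explicit
/// `min_context_window`, or the smallest constituent window when unset.
pub fn validate_alloy(
    constituents: &[Constituent],
    min_context_window: Option<usize>,
) -> Result<usize, AlloyError> {
    let smallest = constituents
        .iter()
        .map(|c| c.context_window)
        .min()
        .ok_or(AlloyError::Empty)?;
    let min = min_context_window.unwrap_or(smallest);
    for c in constituents {
        if c.context_window == 0 {
            return Err(AlloyError::ZeroWindow { model: c.model.clone() });
        }
        if c.context_window < min {
            return Err(AlloyError::WindowBelowMinimum {
                model: c.model.clone(),
                window: c.context_window,
                min,
            });
        }
    }
    Ok(min)
}

/// Runtime check: reject loudly instead of truncating silently.
pub fn check_request(estimate: usize, ceiling: usize) -> Result<(), AlloyError> {
    if estimate > ceiling {
        Err(AlloyError::RequestTooLarge { estimate, ceiling })
    } else {
        Ok(())
    }
}
```

With the fast-smart-blend config above, `validate_alloy` yields a ceiling of 200000 whether or not min_context_window is set explicitly, and declaring min_context_window = 262144 would fail at load time because both constituents' roles would no longer be covered by claude-haiku-4-6's 200000 window.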
Purpose: reliability. Try primary, on timeout/5xx/429 try secondary.
Today's behavior: there is no fallbacks field on AlloyConfig. Fallback
is implicit — AlloyProvider::select_plan() returns an ordered_models: Vec<String>
listing every constituent as a potential fallback, and the proxy iterates them
in route_with_fallback() until one succeeds. Order within that list is
deterministic for round_robin (rotating from the last selected index)
but varies per request for weighted (weighted sampling without
replacement). That "all constituents of the alloy are also fallbacks" pattern
is what the cascade primitive promotes to its own named construct, with
the important distinction that cascade ordering is always deterministic
(declaration order):
[[cascades]]
id = "kimi-with-fallback"
# First success wins.
#
# Cascade is TRIGGERED by errors (timeout, 5xx, 429) — it does not treat
# "request too large" as a retry condition. But before each step is
# attempted, the runtime pre-checks that the request fits that step's
# context_window and SKIPS unfit steps (with a warning log) rather than
# letting the model return an error. Think of it as: ineligibility is
# cheap to detect up front, so we do; actual errors are what cascade
# retries exist for.
[[cascades.steps]]
model = "opencode-go/kimi-k2.6"
context_window = 262144
[[cascades.steps]]
model = "kimi-for-coding" # Moonshot
context_window = 128000
[[cascades.steps]]
model = "local/qwen3.5-35b" # last resort, much smaller
context_window = 32768

Runtime behavior: before trying step N, estimate request tokens; skip to step N+1 if the request doesn't fit. Track which steps were attempted for telemetry. Fail with a clear error if no step can serve.
Discussion open: should cascade skip-on-size be silent, or emit a warning per downgrade? Recommendation: warning log at INFO level per skipped step; final error if everything skipped.
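A minimal sketch of the pre-check, assuming a simplified `Step` type and a hypothetical `plan_cascade` helper (neither is the real runtime API). It only partitions steps into eligible and skipped; the actual retry loop over eligible steps is left out.

```rust
/// Hypothetical, simplified cascade step.
#[derive(Debug, Clone)]
pub struct Step {
    pub model: &'static str,
    /// `None` models an unsized step, which is always treated as eligible.
    pub context_window: Option<usize>,
}

/// Split steps into (eligible, skipped), both in declaration order.
/// Eligible steps are the ones the cascade may attempt; skipped steps are
/// the ones a caller would log before moving on. An empty `eligible` list
/// means the cascade must fail with a clear error.
pub fn plan_cascade<'a>(
    steps: &'a [Step],
    estimate: usize,
) -> (Vec<&'a Step>, Vec<&'a Step>) {
    steps
        .iter()
        .partition(|s| s.context_window.map_or(true, |w| estimate <= w))
}
```

For the kimi-with-fallback example, a 150,000-token estimate leaves only opencode-go/kimi-k2.6 eligible, so an error on that step ends the cascade rather than falling through to a model that would truncate.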
Purpose: route requests to the smallest-sufficient model (or, future: by other properties).
Default behavior: ordered list of targets; first target whose effective ceiling can hold the request wins. No thresholds to maintain — the size check uses each target's own declared context_window × capacity_fraction.
[[dispatchers]]
id = "kimi-smart"
reevaluate = "per_turn"
# Try in order. First target whose effective ceiling fits the request wins.
# Error if no target fits.
targets = [
"local/qwen3.5-35b", # effective ≈ 24,576
"opencode-go/kimi-k2.6", # effective ≈ 222,822
"gemini-2.5-flash", # effective ≈ 996,147
]

The effective ceiling is context_window × capacity_fraction, computed per-model. Adding or removing a model from the list doesn't require re-computing thresholds; the model's own declaration drives the fit check.
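The first-fit behavior can be sketched like this. `Target` and `dispatch` are hypothetical names, rounding the ceiling down is an assumption, and the 0.95 fraction for gemini-2.5-flash is inferred from the "≈ 996,147" comment rather than stated anywhere in the config.

```rust
/// Hypothetical, simplified dispatcher target.
pub struct Target {
    pub id: &'static str,
    pub context_window: usize,
    pub capacity_fraction: f64,
}

impl Target {
    /// context_window × capacity_fraction, rounded down (assumption).
    pub fn effective_ceiling(&self) -> usize {
        (self.context_window as f64 * self.capacity_fraction).floor() as usize
    }
}

/// First target, in declaration order, whose effective ceiling holds the
/// (already margin-applied) estimate. `None` means no target fits and the
/// dispatcher should return an error.
pub fn dispatch<'a>(targets: &'a [Target], estimate: usize) -> Option<&'a Target> {
    targets.iter().find(|t| estimate <= t.effective_ceiling())
}
```

Because each target carries its own window and fraction, inserting a new model into the list changes nothing else: the `find` simply sees one more candidate.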
Targets can be models OR other primitives:
targets = [
"local/qwen3.5-35b",
"alloy/claude-gemini-200k", # for requests that fit the alloy's effective ceiling
"gemini-2.5-flash",
]

Explicit rules for non-size decisions (advanced: cost tier, agent id, request content, time of day):
[[dispatchers]]
id = "cost-aware"
# When explicit rules are present, they override the default fit-first-target behavior.
# Rules evaluated in declared order; first match wins.
[[dispatchers.rules]]
when.max_input_tokens = 10000 # hand-set floor, tighter than capacity_fraction would imply
target = "cheap-model"
[[dispatchers.rules]]
fits_target = true # fall back to implicit fit check against the target's effective ceiling
target = "expensive-model"

This composition is how the "kimi + local hybrid" goal is expressed safely:
- Most requests fit local (free, fast, ≈24K effective). Above, Kimi.
- No silent truncation, no wasted Kimi calls on small requests.
"Router" is overloaded in software (content routing, URL routing, network routing). "Dispatcher" is more specific: picks which backend handles this request.
Alternatives considered:
| Name | Pro | Con |
|---|---|---|
| router | Familiar | Too generic; "routing" overloaded |
| dispatcher ⭐ | Clear "pick target per request" semantics | A little programmer-jargony |
| tier / tiers | Captures size-laddering | Presumes size is the only axis |
| fit / fitter | Short, clear for size case | Obscure for non-size rules |
| selector | Generic | Also used in k8s/CSS; overloaded |
| picker | Folksy, clear | Informal; unconventional |
| bucket-router | Explicit | Compound, awkward |
Going with dispatcher pending feedback.
All three primitives need to answer: "does this request's input fit in model X's context window?" That requires a token estimate. We want:
- A sane default that works out-of-the-box
- Configurable tuning for power users
- An extension point for real tokenizers when accuracy matters
/// Estimate the token count of a prompt for the purpose of fit-checking.
///
/// Implementations SHOULD be conservative (over-estimate slightly) so that
/// fit-checks have headroom — under-estimation is a silent-truncation risk,
/// over-estimation at worst forces a fallthrough to a bigger model.
pub trait TokenEstimator: Send + Sync {
    /// Estimate tokens for a plain-text prompt (excludes tool definitions).
    fn estimate_text(&self, text: &str) -> usize;

    /// Estimate tokens for a chat request (messages + optional tool definitions).
    /// Default impl sums per-message + fixed overhead; override for accuracy.
    fn estimate_chat(&self, messages: &[Message], tools: &[ToolDef]) -> usize {
        // naive default: sum of text estimates + per-message framing overhead
        let msg_tokens: usize = messages.iter().map(|m| self.estimate_text(&m.content)).sum();
        let tool_tokens: usize = tools.iter().map(|t| self.estimate_text(&t.schema_json)).sum();
        let framing = messages.len() * 4; // role markers, separators
        msg_tokens + tool_tokens + framing
    }
}

pub struct CharRatioEstimator {
    pub chars_per_token: f32, // default 3.5 (English-prose-biased)
    pub safety_margin: f32,   // default 1.10 (overstate by 10%)
}

impl Default for CharRatioEstimator {
    fn default() -> Self {
        Self { chars_per_token: 3.5, safety_margin: 1.10 }
    }
}

impl TokenEstimator for CharRatioEstimator {
    fn estimate_text(&self, text: &str) -> usize {
        let chars = text.chars().count() as f32;
        (chars / self.chars_per_token * self.safety_margin).ceil() as usize
    }
}

Rationale for default values:
- 3.5 chars/token: common average for English prose with GPT-family tokenizers. Code and Chinese text are denser (~2.5 and ~1.5 chars/token respectively). Under-estimating for code/non-English is the risk case, hence the safety margin.
- 10% safety margin: covers most under-estimates for prose and moderate code. Not sufficient on its own for CJK-heavy prompts, where actual tokens can run >2× a 3.5-chars/token estimate even after the margin. For those workloads, either (a) override chars_per_token to a denser ratio (e.g., 1.8), (b) raise safety_margin substantially (2.0+), or (c) use the Tiktoken estimator (feature flag), where an exact BPE count eliminates the guesswork. The CharRatio defaults are safe for the English-first deployments this RFC targets; anything heavier should tune.
Both fields are configurable, see "Config surface" below.
The original draft conflated two things. They're separate concerns with different defaults and scopes:
Knob A — estimator safety_margin (multiplier on the estimate):
"I might under-count tokens because my heuristic is approximate."
- Multiplicative factor applied to the estimate itself
- Default 1.10 for char-ratio; a real tokenizer like tiktoken can use 1.02 since it's accurate to ~1%
- Belongs to the estimator; each estimator implementation carries its own default
- Tighter → risks silent truncation; looser → wastes headroom
Knob B — model capacity_fraction (multiplier on the declared window):
"Even if I knew the exact count, some models degrade near their ceiling. Don't push them there."
- Multiplicative factor applied to the model's declared context_window
- Default 1.0 (use the full declared window). Users lower it per-model when they see quality drop-off.
- Per-model because behavior-near-ceiling varies (e.g., Claude reportedly holds up to ~95% of ceiling; some models noticeably degrade past ~70%)
- Users who say "I want a much higher buffer because I've noticed dumbing" set capacity_fraction = 0.7, which keeps this cleanly separate from the estimator concern.
Fit-check composition:
Convention: TokenEstimator::estimate_* returns a conservative count
with safety_margin already applied (see the CharRatioEstimator::estimate_text
impl above — the * self.safety_margin happens inside). Callers never
multiply by safety_margin a second time. The per-model capacity_fraction
is applied once, on the declared context_window, to derive an "effective
ceiling". The fit check is then a direct comparison:
estimate = TokenEstimator::estimate_*(...) // already margin-applied
ceiling = model.context_window * model.capacity_fraction
Rejected if: estimate > ceiling
Worked example (prose so the formula stays the single source of truth):
- Raw input measured at roughly 180k tokens. Estimator applies its internal 10% margin, so the returned estimate is about 198k.
- Model has a declared context window of 262,144 tokens; the operator has set capacity_fraction to 0.85, giving an effective ceiling near 222,822.
- 198k < 222k, so the request is accepted.
- If we had double-applied the margin (caller multiplying again after the estimator already did), we would compare roughly 217,800 against the same 222,822: still under, but needlessly close.
Dispatcher rule language uses "effective ceiling" to mean
context_window × capacity_fraction consistently.
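One way to make the "margin is applied exactly once" convention hard to violate is a newtype for margin-applied estimates. This is only a sketch of that idea, not something the RFC specifies: `MarginedEstimate`, `ModelLimits`, and `apply_margin` are hypothetical names, and rounding the ceiling down is an assumption.

```rust
/// A token count that already includes the estimator's safety margin.
/// Wrapping it in a newtype signals to callers that they must not
/// multiply by safety_margin again.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct MarginedEstimate(pub usize);

/// Applied inside the estimator only (Knob A).
pub fn apply_margin(raw_tokens: usize, safety_margin: f64) -> MarginedEstimate {
    MarginedEstimate((raw_tokens as f64 * safety_margin).ceil() as usize)
}

/// Per-model limits (Knob B lives here).
pub struct ModelLimits {
    pub context_window: usize,
    pub capacity_fraction: f64,
}

impl ModelLimits {
    /// context_window × capacity_fraction, rounded down (assumption).
    pub fn effective_ceiling(&self) -> usize {
        (self.context_window as f64 * self.capacity_fraction).floor() as usize
    }

    /// Direct comparison: rejected if estimate > ceiling.
    pub fn accepts(&self, estimate: MarginedEstimate) -> bool {
        estimate.0 <= self.effective_ceiling()
    }
}
```

Feeding the worked example through it, `apply_margin(180_000, 1.10)` gives roughly 198k, and a model with a 262,144 window and capacity_fraction 0.85 has an effective ceiling of 222,822, so `accepts` returns true.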
Config:
[proxy.token_estimator]
strategy = "auto" # auto, char_ratio, byte_ratio, or tiktoken
# tokenizer = "o200k_base" # optional tiktoken base override
chars_per_token = 3.5
safety_margin = 1.10 # estimator knob
[[models]]
id = "kimi-k2.6"
context_window = 262144
capacity_fraction = 0.85 # avoid top 15% where Kimi reportedly degrades
[[models]]
id = "claude-sonnet-4-6"
context_window = 200000
capacity_fraction = 0.95 # Claude holds up closer to ceiling
[[models]]
id = "local/qwen3.5-35b"
context_window = 32768
capacity_fraction = 0.75 # user has observed noticeable drop past 24K

// OpenAI-compatible BPE count via optional `tiktoken-rs` feature.
pub struct TiktokenEstimator {
bpe: &'static tiktoken_rs::CoreBPE,
}
// SentencePiece for Llama-family models
pub struct SentencePieceEstimator { /* ... */ }

Users opt in by configuring a non-default estimator. We ship
CharRatioEstimator and ByteRatioEstimator in the default build; the
OpenAI-compatible tokenizer is gated behind
--features tiktoken-estimator to keep default build dependencies light.
Different models have wildly different token/char ratios:
| Model family | Rough chars/token |
|---|---|
| GPT-4 (English prose) | ~4 |
| GPT-4 (code) | ~2.5 |
| Claude | ~3.7 |
| Llama/Qwen | ~3.0 |
| Chinese text | ~1.5 |
So chars_per_token should be overridable per-model:
[model_defaults]
chars_per_token = 3.5
safety_margin = 1.10
[[models]]
id = "qwen3.5-35b"
context_window = 32768
chars_per_token = 3.0 # Qwen tokenizer tends denser
[[models]]
id = "kimi-k2.6"
context_window = 262144
chars_per_token = 2.8 # Chinese-English mixed, code-heavy

Note: the [tokenizer], [model_defaults], and [[models]] sections below are proposed additions to CalciforgeConfig. They do not exist in the current schema and will be added as part of the implementation of this RFC. A schema version bump is expected; existing configs stay valid without them (resolution falls through to built-in defaults).
Global default in top-level config:
[tokenizer]
kind = "char_ratio" # "char_ratio" | "tiktoken" | "sentencepiece"
chars_per_token = 3.5
safety_margin = 1.10

Per-primitive override (in an alloy / cascade / dispatcher):
[[dispatchers]]
id = "smart"
[dispatchers.tokenizer]
kind = "tiktoken"
encoding = "cl100k_base"

Per-model override: as shown above. Takes precedence over per-primitive and global.
Resolution order: per-model > per-primitive > global > built-in default.
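The resolution order maps naturally onto `Option` chaining. The sketch below assumes each layer may set either field independently and that resolution happens per field; the RFC does not say whether overrides are per-field or all-or-nothing, and `EstimatorOverride`, `EstimatorParams`, and `resolve` are hypothetical names.

```rust
/// Fully resolved estimator parameters.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct EstimatorParams {
    pub chars_per_token: f32,
    pub safety_margin: f32,
}

/// One configuration layer; `None` means "not set at this level".
#[derive(Debug, Clone, Copy, Default)]
pub struct EstimatorOverride {
    pub chars_per_token: Option<f32>,
    pub safety_margin: Option<f32>,
}

/// Built-in defaults, matching the CharRatioEstimator defaults.
pub const BUILT_IN: EstimatorParams = EstimatorParams {
    chars_per_token: 3.5,
    safety_margin: 1.10,
};

/// per-model > per-primitive > global > built-in default, per field.
pub fn resolve(
    per_model: EstimatorOverride,
    per_primitive: EstimatorOverride,
    global: EstimatorOverride,
) -> EstimatorParams {
    EstimatorParams {
        chars_per_token: per_model
            .chars_per_token
            .or(per_primitive.chars_per_token)
            .or(global.chars_per_token)
            .unwrap_or(BUILT_IN.chars_per_token),
        safety_margin: per_model
            .safety_margin
            .or(per_primitive.safety_margin)
            .or(global.safety_margin)
            .unwrap_or(BUILT_IN.safety_margin),
    }
}
```

A config that sets nothing at any level resolves to the built-in 3.5 / 1.10 pair, which is what keeps existing configs valid.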
The primitives are composable. Common patterns:
Pattern 1: size-tier that blends within tiers
[[alloys]]
id = "claude-gemini-200k"
min_context_window = 200000
# … 200K-safe blend
[[dispatchers]]
id = "smart"
[[dispatchers.rules]]
when.max_input_tokens = 30000
target = "local/qwen3.5-35b"
[[dispatchers.rules]]
when.max_input_tokens = 180000
target = "alloy/claude-gemini-200k" # small-enough for our 200K blend
[[dispatchers.rules]]
when.max_input_tokens = 900000
target = "gemini-2.5-flash"

Pattern 2: dispatcher in front of cascade
[[cascades]]
id = "kimi-or-fallback"
# Assumes caller fits within narrowest member — paired with a dispatcher for safety
[[cascades.steps]]
model = "opencode-go/kimi-k2.6"
[[cascades.steps]]
model = "kimi-for-coding"
[[dispatchers]]
id = "with-safety"
[[dispatchers.rules]]
when.max_input_tokens = 125000 # narrowest cascade member is 128K Moonshot
target = "cascade/kimi-or-fallback"
[[dispatchers.rules]]
when.max_input_tokens = 250000
target = "opencode-go/kimi-k2.6" # direct, Moonshot can't fit

Rule: cascades are not size-safe on their own. They must be used at a level where all members can serve the incoming request, OR wrapped in a dispatcher.
A dispatcher picks at request time. By message 20, the cumulative context may have grown past the initially-chosen model's ceiling.
Options:
- (a) Sticky: once picked, stay. Errors late if conversation grows past ceiling.
- (b) Re-evaluate per turn: dispatcher runs every turn. Session continuity breaks when the tier changes (different model, different "memory"). Bad UX.
- (c) Always pick for session worst-case: dispatcher uses an estimate of "this session's max context," picks a ceiling that covers it. Conservative, wastes smaller-model opportunities.
Decision: default to per_turn. Chat APIs are stateless; re-picking per message mechanically works. For task-completion flows (calciforge's main use case) the cost of an occasional model swap is lower than the cost of a session that dies at a ceiling.
[[dispatchers]]
id = "smart"
reevaluate = "per_turn" # default — re-pick each message
# reevaluate = "sticky" # pick once, error on ceiling (for voice-continuity flows)
# reevaluate = "sticky_escalate" # sticky, auto-promote once on ceiling, then sticky at new tier
# reevaluate = "worst_case" # advanced — requires growth prior below
# assume_session_max_tokens = 100000

Pragmatic default for most flows: sticky_escalate is arguably the sweet spot (stability most of the session, graceful upgrade when needed). Listed as an opt-in for now since per_turn is simplest and always works.
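The difference between the modes is easiest to see in code. The sketch below covers per_turn, sticky, and sticky_escalate only (worst_case is omitted because its growth prior is not specified here). `Reevaluate`, `Session`, and `pick` are hypothetical names, targets are represented only by their effective ceilings, and `None` stands in for the dispatcher's ceiling error.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Reevaluate {
    PerTurn,
    Sticky,
    StickyEscalate,
}

/// Hypothetical per-session dispatcher state.
#[derive(Debug, Default)]
pub struct Session {
    /// Index of the target chosen on an earlier turn, if any.
    pub current: Option<usize>,
    /// Whether the single allowed auto-promotion has been used.
    pub escalated: bool,
}

/// Returns the index of the chosen target, or None on a ceiling error.
/// `ceilings` holds each target's effective ceiling in declaration order.
pub fn pick(
    mode: Reevaluate,
    ceilings: &[usize],
    session: &mut Session,
    estimate: usize,
) -> Option<usize> {
    let first_fit = |from: usize| (from..ceilings.len()).find(|&i| estimate <= ceilings[i]);
    let choice = match (mode, session.current) {
        // per_turn always re-picks; every mode picks fresh on the first turn.
        (Reevaluate::PerTurn, _) | (_, None) => first_fit(0),
        // Sticky modes keep the current target while the request still fits.
        (_, Some(i)) if estimate <= ceilings[i] => Some(i),
        // sticky_escalate: one promotion to a later target, then sticky again.
        (Reevaluate::StickyEscalate, Some(i)) if !session.escalated => {
            let next = first_fit(i + 1);
            session.escalated = next.is_some();
            next
        }
        // sticky, or sticky_escalate after its promotion: error on ceiling.
        _ => None,
    };
    if choice.is_some() {
        session.current = choice;
    }
    choice
}
```

With the kimi-smart ceilings (24,576 / 222,822 / 996,147), a session that starts at 10,000 tokens and grows to 100,000 moves to Kimi under per_turn and sticky_escalate but errors under sticky; a further jump to 500,000 errors under sticky_escalate because its one promotion is already spent.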
Function definitions + tool results add tokens not present in the user's message. A list_files tool definition is ~100 tokens; a directory listing result can be thousands.
Addressed by the estimate_chat method taking tools explicitly, and by the default safety margin. Power users with tool-heavy flows should raise safety_margin (a multiplier: try 1.15 or 1.20, i.e. +15–20%, rather than the default 1.10).
Correction to an earlier draft: for most providers the context window bounds input_tokens + output_tokens, not just input. A request that fits on input can still overflow at generation time if max_tokens is large. Our TokenEstimator measures input only, so the fit-check must reserve headroom for the output budget. Implementation detail: primitive runtime compares estimate(input) + max_tokens_for_request against effective_ceiling, not just estimate(input). Callers who don't set max_tokens explicitly must supply a default output budget (e.g., 4K) so the check isn't silently bypassed.
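A minimal sketch of the corrected check, assuming a hypothetical `fits_with_output` helper and reading the "4K" default as 4 × 1024 tokens (consistent with the K convention used for sizes elsewhere in this document, but still an assumption):

```rust
/// Default output budget when the caller does not set max_tokens.
/// Assumption: "4K" read as 4 * 1024.
pub const DEFAULT_OUTPUT_BUDGET: usize = 4 * 1024;

/// Compare estimate(input) + output budget against the effective ceiling.
/// `input_estimate` is the margin-applied estimate from the TokenEstimator.
pub fn fits_with_output(
    input_estimate: usize,
    max_tokens: Option<usize>,
    effective_ceiling: usize,
) -> bool {
    let output_budget = max_tokens.unwrap_or(DEFAULT_OUTPUT_BUDGET);
    // saturating_add avoids overflow on absurd max_tokens values.
    input_estimate.saturating_add(output_budget) <= effective_ceiling
}
```

Against local Qwen's 24,576 effective ceiling, a 20,000-token input fits with the default budget (24,096 total) but not with max_tokens = 8192 (28,192 total), which is exactly the case an input-only check would miss.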
Reasoning tokens add output cost but are usually not in the input context. Estimator doesn't need to account for them.
All sizes are in tokens. Not characters, not bytes. Config authors can use K suffix for readability; K means * 1024 (binary convention):
context_window = 262144 # tokens (2^18)
# equivalently:
context_window = "256K" # parsed as 256 * 1024 = 262144

A literal "262K" would parse as 262 * 1024 = 268288, not 262144; use "256K" if you want the 262144 value.
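The suffix rule is small enough to sketch directly. `parse_token_size` is a hypothetical helper, not the actual config parser; accepting only an uppercase K and rejecting everything else is an assumption.

```rust
/// Parse a token size such as "262144" or "256K".
/// K means * 1024 (binary convention). Returns None for anything else,
/// including values that would overflow usize.
pub fn parse_token_size(s: &str) -> Option<usize> {
    let s = s.trim();
    match s.strip_suffix('K') {
        Some(n) => n.trim().parse::<usize>().ok()?.checked_mul(1024),
        None => s.parse::<usize>().ok(),
    }
}
```

Here `parse_token_size("256K")` and `parse_token_size("262144")` both yield 262144, while `parse_token_size("262K")` yields 268288.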
- Alloy: not applicable. Alloy constituents REQUIRE context_window (validated at AlloyProvider::from_config, and 0 is rejected explicitly). No silent-truncation path.
- Dispatcher: rule targets are either sized (matchable) or "any" (catch-all, always matches last). Unsized rules only make sense as catch-alls.
- Cascade: unsized steps are always considered eligible (no pre-skip); the step is attempted and errors surface as normal cascade failures.
Alloy constituents now REQUIRE context_window. No back-compat for missing fields. Prototype phase, all installations owned in-house — fixing existing config files is a one-time edit; the upside (no silent truncation, ever) is worth breaking the schema. min_context_window on the alloy stays optional and auto-computes as min(constituent.context_window) when not specified.
Existing config files needing updates:
- Any deployed /etc/calciforge/config.toml using [[alloys]]
- Any user-level Calciforge config using [[alloys]]
Migration done in the same PR that introduces the required field.
Cascades today are implicit inside alloy's fallbacks behavior. This RFC promotes them to a named [[cascades]] primitive. Transition: existing alloy-with-fallbacks behavior becomes sugar for alloy-wrapped-in-cascade; we keep the sugar so existing configs don't break.
Dispatchers are new. Opt-in.
TokenEstimator is implemented. Added as a global config under
[proxy.token_estimator]. If unconfigured, strategy = "auto" uses
tiktoken-rs when compiled in and recognized, otherwise the char-ratio
fallback.
In-scope:
- Design of the three primitives and their config shape
- TokenEstimator trait + default implementation
- Safety rules (alloy min_context_window, cascade skip-on-size, dispatcher exhaustion error)
- Composition rules
- Migration guarantees
Out of scope (follow-up work):
- Per-model/per-primitive estimator overrides
- SentencePiece or provider-specific tokenizers beyond tiktoken
- Plugging tiktoken-rs or other real tokenizers (separate PR; trait is there, default is enough for now)
- Per-session routing memory (noted in edge case #1; follow-up if needed)
- Docs: this will be captured in README + a dedicated docs/model-gateway.md when implementation lands
All initial review questions have been resolved. Remaining open items surface during implementation:
- capacity_fraction defaults per known model family: need empirical data. Initial defaults: 1.0 (no derate) until a model proves it needs less. Claude ~0.95, Kimi ~0.85 recommended starting points in docs, not hardcoded.
- Tiktoken encoding selection: does the config need to specify cl100k_base vs o200k_base per model, or pick a sensible default? Open; likely per-model override.
- Dispatcher fit-check evaluation cost: for very long histories, char-counting on every turn is cheap but non-zero. Measure once implementation lands; add a caching layer only if it shows up in profiles.
- Small safety PR: add context_window to AlloyConstituentConfig, min_context_window to AlloyConfig, validation in AlloyProvider::new(). No runtime behavior change beyond rejecting bad configs at startup. (Already scoped as task #23.)
- Add TokenEstimator trait + CharRatioEstimator impl. Wire into alloy for runtime request-fit rejection.
- Add [[cascades]] named primitive with same fit-check semantics.
- Add [[dispatchers]] with max_input_tokens rules.
- README updates + docs/model-gateway.md authoritative reference.
Each as a focused PR. RFC is the long-form design; PRs execute the plan.