Summary
Refactor the inference-provider stack so the three orthogonal concerns — wire protocol, attestation, and endpoint selection — become composable instead of welded together inside VLlmProvider.
Today VLlmProvider (crates/inference_providers/src/vllm/mod.rs, ~3,400 lines) bundles HTTP transport, TEE attestation, TLS-fingerprint pinning, 64-bucket prefix-cache routing, rotation-SNI fallback, sticky-chat connection pinning, and signature fetching. ExternalProvider is a thin facade that returns "not supported" for the TEE methods. Both implement a 14-method InferenceProvider god-trait. The pool (crates/services/src/inference_provider_pool/mod.rs, ~4,900 lines) duplicates routing logic on top.
This RFC proposes flattening that into:
- Protocol (data-plane, today's ExternalBackend)
- AttestationScheme (pluggable; NearAI today, room for AWS-Nitro / Azure-TEE / etc.)
- Endpoint (URL + auth + protocol + optional attestation)
- Model with Vec<Endpoint>: explicit n1/n2/n3 fan-out
- Router that picks an endpoint and runs a protocol against it
Principles
- Endpoints are first-class. A model has Vec<Endpoint> (n1, n2, …). The router picks one; there is no in-provider rotation.
- Attestation is a property of an endpoint, not a provider class. Most paths don't care; the few that do (get_signature, get_attestation_report) ask endpoint.attestation.
- Wire protocol is orthogonal. OpenAI-compat / Anthropic / Gemini know how to convert params and stream chunks. They don't know about TEE, rotation, sticky routing, or LB.
- Selection lives in one place. Round-robin, sticky-by-chat-id, healthy-first — one router does all of it.
- The hot path is a small function: pick endpoint → run protocol → on error advance via retry policy.
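As a rough illustration of "selection lives in one place", the sketch below folds sticky-by-chat-id, healthy-first, and round-robin into a single `pick` function. `Candidate`, `Selector`, and `pick` are hypothetical names chosen for this example, and the precedence order (sticky, then healthy round-robin, then any endpoint) is an assumption, not something the RFC fixes.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical, simplified view of an endpoint for selection purposes.
pub struct Candidate {
    pub id: &'static str,
    pub healthy: bool,
}

/// Hypothetical selector: one place that applies every selection rule.
pub struct Selector {
    next: AtomicUsize, // round-robin cursor
}

impl Selector {
    pub fn new() -> Self {
        Selector { next: AtomicUsize::new(0) }
    }

    /// Returns the index of the chosen endpoint, or None if there are none.
    pub fn pick(&self, eps: &[Candidate], sticky: Option<&str>) -> Option<usize> {
        if eps.is_empty() {
            return None;
        }
        // 1. Sticky wins when the pinned endpoint exists and is healthy.
        if let Some(id) = sticky {
            if let Some(i) = eps.iter().position(|e| e.id == id && e.healthy) {
                return Some(i);
            }
        }
        // 2. Round-robin over the list, preferring the first healthy endpoint
        //    at or after the cursor.
        let start = self.next.fetch_add(1, Ordering::Relaxed) % eps.len();
        (0..eps.len())
            .map(|k| (start + k) % eps.len())
            .find(|&i| eps[i].healthy)
            // 3. Everything is unhealthy: still hand back the cursor position
            //    so the caller can attempt the request.
            .or(Some(start))
    }
}
```

With endpoints n1 (healthy), n2 (unhealthy), n3 (healthy), a chat pinned to n3 stays on n3, while a chat pinned to n2 falls through to the round-robin step.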
New core types

    // 1. WIRE PROTOCOL (rename of today's ExternalBackend, used by all paths).
    // Held as Arc<dyn Protocol>, so the async methods need #[async_trait] or boxed futures.
    trait Protocol: Send + Sync {
        fn name(&self) -> &'static str;
        async fn chat_completion_stream(&self, ep: &Endpoint, p: ChatCompletionParams, ctx: &ReqCtx)
            -> Result<StreamingResult, CompletionError>;
        async fn chat_completion(...) -> Result<..., CompletionError>;
        async fn image_generation(...) -> Result<..., ImageGenerationError> { default-error }
        async fn audio_transcription(...) -> Result<..., AudioTranscriptionError> { default-error }
        // etc.
    }
    // Impls: OpenAiCompatibleProtocol (with optional vllm-flavor toggle for
    // X-Request-Hash + signing headers), AnthropicProtocol, GeminiProtocol.

    // 2. ATTESTATION: pluggable.
    trait AttestationScheme: Send + Sync {
        fn name(&self) -> &'static str;
        /// Connect, verify, return a TLS-pinned client. Cached by endpoint.
        async fn verify(&self, base_url: &str) -> Result<reqwest::Client, AttestationError>;
        async fn fetch_report(&self, client: &Client, ep: &Endpoint, q: ReportQuery)
            -> Option<Result<AttestationReport, AttestationError>> { None }
        async fn fetch_signature(&self, client: &Client, ep: &Endpoint, chat_id: &str, algo: &str)
            -> Option<Result<ChatSignature, CompletionError>> { None }
    }
    // Impls: NearAiAttestation (TDX + GPU evidence + image hash + SPKI pin).
    // Open for: AwsNitroAttestation, AzureTeeAttestation, GcpConfidentialVm, etc.

    // 3. ENDPOINT: one URL, one protocol, optionally attested.
    struct Endpoint {
        id: EndpointId,
        url: String,
        auth: Auth,
        upstream_model_name: Option<String>,
        timeouts: Timeouts,
        protocol: Arc<dyn Protocol>,
        attestation: Option<Arc<dyn AttestationScheme>>,
        client: OnceCell<reqwest::Client>,
    }
    impl Endpoint {
        async fn client(&self) -> Result<&Client, CompletionError> {
            self.client.get_or_try_init(|| async {
                match &self.attestation {
                    Some(scheme) => scheme.verify(&self.url).await.map_err(into_completion),
                    None => Ok(plain_http_client(&self.timeouts)),
                }
            }).await
        }
    }

    // 4. MODEL: canonical id + endpoints + selection policy.
    struct Model {
        canonical_name: String,
        endpoints: Vec<Arc<Endpoint>>,
        selection: SelectionPolicy,
    }

    // 5. ROUTER: what InferenceProviderPool shrinks into.
    struct Router {
        models: RwLock<HashMap<String, Arc<Model>>>,
        sticky: RwLock<HashMap<ChatId, Arc<Endpoint>>>,
        health: HealthTracker,
        retry: RetryPolicy,
    }
    impl Router {
        async fn run<R>(&self, model: &str, sticky: Option<&str>, op: Op<R>)
            -> Result<(R, Arc<Endpoint>), CompletionError>;
    }
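
To make the intended behaviour of `Router::run` more concrete, here is a minimal synchronous, std-only sketch of the "try an endpoint, advance on error" loop. It is a simplification under stated assumptions: the proposed method is async and returns `Arc<Endpoint>`, whereas this version returns the endpoint id; `Ep`, `CallError`, and the retryable/fatal split are hypothetical stand-ins for `Endpoint`, `CompletionError`, and `RetryPolicy`.

```rust
/// Hypothetical error classification; the real split lives in RetryPolicy.
#[derive(Debug, Clone, PartialEq)]
pub enum CallError {
    Retryable(String),
    Fatal(String),
}

/// Hypothetical stand-in for Endpoint.
pub struct Ep {
    pub id: u32,
    pub url: String,
}

/// Try endpoints in order. Advance to the next one on a retryable error,
/// stop immediately on a fatal one, and give up after `max_attempts`.
/// On success, return the result together with the id of the endpoint
/// that served it, so the caller can pin the chat to it.
pub fn run<R>(
    endpoints: &[Ep],
    max_attempts: usize,
    mut op: impl FnMut(&Ep) -> Result<R, CallError>,
) -> Result<(R, u32), CallError> {
    let mut last = CallError::Fatal("no endpoints configured".to_string());
    for ep in endpoints.iter().take(max_attempts) {
        match op(ep) {
            Ok(r) => return Ok((r, ep.id)),
            Err(e @ CallError::Fatal(_)) => return Err(e),
            Err(e) => last = e, // retryable: remember it and move on
        }
    }
    Err(last)
}
```

Returning the serving endpoint alongside the result is what lets the caller do the sticky pin without the router needing to understand response bodies.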
Hot path

    let (stream, ep) = router.run(&model, sticky_chat_id, |ep, ctx|
        ep.protocol.chat_completion_stream(ep, params.clone(), ctx)
    ).await?;
    // peek first chunk to learn chat_id, then router.sticky_pin(chat_id, ep)
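
The "peek first chunk, then pin" step could look roughly like the sketch below. It uses a plain `Iterator` with `Peekable` as a synchronous stand-in for the real SSE stream, so the first chunk is inspected without being consumed. `Chunk`, `StickyPins`, and `pin_from_first_chunk` are hypothetical names; only `sticky_pin` and `sticky_get` echo the router methods named in this RFC.

```rust
use std::collections::HashMap;
use std::iter::Peekable;

/// Hypothetical streamed chunk; only the fields needed for pinning.
#[derive(Debug, Clone, PartialEq)]
pub struct Chunk {
    pub chat_id: String,
    pub delta: String,
}

/// Hypothetical sticky table: chat_id -> endpoint id.
#[derive(Default)]
pub struct StickyPins {
    pins: HashMap<String, u32>,
}

impl StickyPins {
    pub fn sticky_pin(&mut self, chat_id: &str, endpoint_id: u32) {
        self.pins.insert(chat_id.to_string(), endpoint_id);
    }
    pub fn sticky_get(&self, chat_id: &str) -> Option<u32> {
        self.pins.get(chat_id).copied()
    }
}

/// Look at the first chunk to learn the chat id, record which endpoint
/// served it, and hand the stream back with that chunk still in place.
pub fn pin_from_first_chunk<I>(
    stream: I,
    endpoint_id: u32,
    sticky: &mut StickyPins,
) -> Peekable<I>
where
    I: Iterator<Item = Chunk>,
{
    let mut stream = stream.peekable();
    if let Some(first) = stream.peek() {
        sticky.sticky_pin(&first.chat_id, endpoint_id);
    }
    stream
}
```

Because the chunk is peeked rather than taken, the downstream consumer still sees the complete stream.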
Signature / attestation path

    let ep = router.sticky_get(chat_id).ok_or(NotFound)?;
    let scheme = ep.attestation.as_ref().ok_or(NotAttested)?;
    scheme.fetch_signature(ep.client().await?, ep, chat_id, algo).await
        .ok_or(NotSupportedByScheme)??
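
The double `?` above relies on the `Option<Result<_, _>>` return shape: `None` means the scheme does not offer the operation at all, while `Some(Err(_))` means it tried and failed. A small synchronous sketch of that flattening follows; `SigError`, `Scheme`, and the three example schemes are hypothetical and exist only to exercise each branch.

```rust
/// Hypothetical error type covering the cases in the snippet above.
#[derive(Debug, PartialEq)]
pub enum SigError {
    NotAttested,
    NotSupportedByScheme,
    Upstream(String),
}

/// Hypothetical, synchronous stand-in for AttestationScheme.
pub trait Scheme {
    /// Default: this scheme has no notion of chat signatures.
    fn fetch_signature(&self, _chat_id: &str) -> Option<Result<String, SigError>> {
        None
    }
}

pub struct NoSignatures;
impl Scheme for NoSignatures {}

pub struct FixedSignature(pub &'static str);
impl Scheme for FixedSignature {
    fn fetch_signature(&self, chat_id: &str) -> Option<Result<String, SigError>> {
        Some(Ok(format!("{}:{}", self.0, chat_id)))
    }
}

pub struct BrokenBackend;
impl Scheme for BrokenBackend {
    fn fetch_signature(&self, _chat_id: &str) -> Option<Result<String, SigError>> {
        Some(Err(SigError::Upstream("backend unreachable".to_string())))
    }
}

pub fn signature_for(scheme: Option<&dyn Scheme>, chat_id: &str) -> Result<String, SigError> {
    // No scheme on the endpoint at all.
    let scheme = scheme.ok_or(SigError::NotAttested)?;
    // ok_or turns Option<Result<T, E>> into Result<Result<T, E>, E>;
    // the first `?` handles "unsupported", the second the scheme's own error.
    let sig = scheme
        .fetch_signature(chat_id)
        .ok_or(SigError::NotSupportedByScheme)??;
    Ok(sig)
}
```

This keeps "unattested endpoint", "scheme lacks the feature", and "scheme failed" as three distinct errors for the caller.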
The 14-method InferenceProvider trait at crates/inference_providers/src/lib.rs:170 goes away. Callers talk to the router; the router talks to Protocol and AttestationScheme.
Decisions already locked in
These were resolved during design discussion — see [internal thread]:
| Question | Decision |
| --- | --- |
| Prefix-cache bucket routing | Drop it. Hash the prefix to one of the N explicit endpoints. The 64-bucket trie was a workaround for "only one URL to the SNI proxy." Revisit only if prefix-cache hit-rate regresses on production traffic. |
| Model-proxy in front of NEAR AI endpoints | Keep model-proxy, one endpoint per host behind it. Endpoints look like n1.completions.near.ai, n2.completions.near.ai, …; model-proxy routes by hostname (not rotation index). cloud-api connects to a known endpoint, no SNI rotation gymnastics. |
| Schema | New model_endpoints table. One row per (model, endpoint). Clean queries, easy per-endpoint flags (active, region, priority). |
| Rollout | Design-only RFC for now. Implementation phasing to be decided after design review. |
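
"Hash the prefix to one of the N explicit endpoints" could be as small as the sketch below. The choice of FNV-1a, the byte-length prefix window, and the function names are all assumptions made for illustration; the RFC does not specify a hash function or how much of the prompt counts as the prefix.

```rust
/// 64-bit FNV-1a. Chosen here only because it is tiny and gives the same
/// value on every host and Rust version; any stable hash would do.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h
}

/// Map the first `prefix_len` bytes of a prompt to an endpoint index.
/// Returns None when the model has no endpoints.
pub fn endpoint_for_prefix(prompt: &str, prefix_len: usize, n_endpoints: usize) -> Option<usize> {
    if n_endpoints == 0 {
        return None;
    }
    let bytes = prompt.as_bytes();
    let prefix = &bytes[..bytes.len().min(prefix_len)];
    Some((fnv1a64(prefix) % n_endpoints as u64) as usize)
}
```

Requests that share a prefix land on the same endpoint, which is the property the prefix cache needs. One known trade-off of plain modulo hashing is that changing N remaps most prefixes; consistent or rendezvous hashing would soften that if it turns out to matter.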
Mapping current → new
| Today | Tomorrow |
| --- | --- |
| InferenceProvider god-trait (14 methods, lib.rs:170) | Split: Protocol (data-plane) + AttestationScheme (TEE). |
| VLlmProvider (~3,400 lines, vllm/mod.rs) | Removed. Replaced by: NearAiAttestation scheme + standard OpenAiCompatibleProtocol + endpoints with attestation: Some(NearAiAttestation). |
| ExternalProvider facade (external/mod.rs:141) | Removed. Backends become Protocol impls used directly by the router. |
| OpenAiCompatibleBackend / AnthropicBackend / GeminiBackend | Renamed to …Protocol. Kept almost as-is. |
| PoolBackendVerifier (inference_provider_pool/mod.rs:225) | Renamed to NearAiAttestation. Same body, no longer wired through VLlmProvider::new_with_verifier. |
| VLlmConfig two-timeout setup | Endpoint::timeouts (per-endpoint), with model-level / global defaults. |
| vllm/prefix_router.rs (trie + 64 buckets) | Deleted (per decision above). |
| Rotation-SNI machinery: rotation_url, try_chat_completion_rotation, pending_rotation, signature_rotation, last_backend_count, build_rotation_client, set_backend_count | Deleted. Endpoints are explicit. |
| pending_buckets / signature_buckets | Collapses into router.sticky: HashMap<ChatId, Arc<Endpoint>>. |
| pin_chat_connection / unpin_chat_connection on the trait | Becomes router.sticky_pin / router.sticky_release. Not a trait method. |
| InferenceProviderPool::retry_with_fallback (mod.rs:1478) | Router::run: same idea, iterates model.endpoints directly instead of Vec<Arc<dyn InferenceProvider>>. |
| model_to_providers + pubkey_to_providers | models: HashMap<String, Arc<Model>>. Pubkey routing becomes a filter step: `model.endpoints.iter().filter(…)`. |
| provider_failure_counts keyed by Arc::as_ptr | HealthTracker keyed by stable EndpointId. |
| strip_internal_tracing_keys (external/mod.rs:58), prepare_tracing_headers, prepare_encryption_headers | Tracing IDs travel in ReqCtx, not in params.extra. The fragile #[serde(flatten)] strip-list disappears. |
| models.inference_url single column | New model_endpoints table (model_id, url, position, protocol, attestation, auth_ref, active). |
| Discovery (discover_model, apply_pin_update, complete-coverage logic) | Per-endpoint re-verification on a periodic interval + reconcile-from-DB. No "walk SNI by index to enumerate backends." |
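
One row above replaces failure counts keyed by `Arc::as_ptr` with a `HealthTracker` keyed by a stable `EndpointId`. A minimal sketch of what that might look like is below; the method names and the "success resets the streak" rule are assumptions. Keying on an id rather than a pointer means the count survives an endpoint object being rebuilt, for example after a reconcile from the database.

```rust
use std::collections::HashMap;

/// Hypothetical stable identifier; the RFC only names the type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct EndpointId(pub String);

/// Tracks consecutive failures per endpoint and demotes an endpoint once
/// the streak reaches `threshold`.
pub struct HealthTracker {
    threshold: u32,
    consecutive_failures: HashMap<EndpointId, u32>,
}

impl HealthTracker {
    pub fn new(threshold: u32) -> Self {
        HealthTracker { threshold, consecutive_failures: HashMap::new() }
    }

    pub fn record_failure(&mut self, id: &EndpointId) {
        *self.consecutive_failures.entry(id.clone()).or_insert(0) += 1;
    }

    /// Any success clears the streak (an assumption of this sketch).
    pub fn record_success(&mut self, id: &EndpointId) {
        self.consecutive_failures.remove(id);
    }

    pub fn is_demoted(&self, id: &EndpointId) -> bool {
        self.consecutive_failures.get(id).copied().unwrap_or(0) >= self.threshold
    }
}
```

Constructing it with `HealthTracker::new(10)` would mirror the "≥10 consecutive failures" rule that applies today.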
Expected reduction
Rough estimate from current file sizes:
- vllm/mod.rs: ~3,400 → ~400 lines (NearAiAttestation + small vllm-flavor toggle on the OpenAI-compat protocol)
- vllm/prefix_router.rs: deleted
- external/mod.rs: ~860 → small (facade gone, protocols kept)
- inference_provider_pool/mod.rs: ~4,900 → ~1,000 lines (Router + HealthTracker + discovery reconcile)
- BackendVerifier trait + lazy bucket dance in VLlmProvider → Endpoint::client() + OnceCell
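
The last item, lazy verification collapsing into `Endpoint::client()` plus a once-cell, can be illustrated with a synchronous, std-only sketch. It uses `std::sync::OnceLock` in place of the async `OnceCell` shown earlier, and `Client`, the `attested` flag, and the `verify` closure are hypothetical stand-ins for `reqwest::Client`, `Option<Arc<dyn AttestationScheme>>`, and `AttestationScheme::verify`.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a (possibly TLS-pinned) HTTP client.
pub struct Client {
    pub pinned: bool,
}

pub struct Endpoint {
    pub url: String,
    pub attested: bool,
    client: OnceLock<Client>,
}

impl Endpoint {
    pub fn new(url: &str, attested: bool) -> Self {
        Endpoint { url: url.to_string(), attested, client: OnceLock::new() }
    }

    /// Build the client on first use and reuse it afterwards.
    /// A failed verification is not cached, so the next call retries.
    pub fn client(
        &self,
        verify: impl FnOnce(&str) -> Result<Client, String>,
    ) -> Result<&Client, String> {
        if let Some(c) = self.client.get() {
            return Ok(c);
        }
        let built = if self.attested {
            verify(&self.url)? // attested: verify first, get a pinned client
        } else {
            Client { pinned: false } // unattested: plain client, no verification
        };
        // If two threads race here, both may verify; the first stored value wins.
        Ok(self.client.get_or_init(|| built))
    }
}
```

Unlike an async `get_or_try_init`, this stable-std version does not serialise concurrent initialisers, so it trades a possible duplicate verification for simplicity.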
What goes wrong if we don't do this
- Adding a new attestation scheme (e.g., AWS-Nitro) today means subclassing VLlmProvider or duplicating its 3,400 lines.
- Adding fan-out across model hosts requires a model-proxy rotation-SNI dance plus per-call rotation index tracking in vllm/mod.rs.
- Pool retry logic is tangled with attestation lifecycle (PoolBackendVerifier is constructed against a specific VLlmProvider), making it hard to test either in isolation.
- ChatCompletionParams.extra smuggles tracing/encryption keys that must be carefully stripped before forwarding to external providers (easy to leak).
Open questions for the design review
- Per-endpoint signing pubkey caching. Today the pool fetches both ECDSA and Ed25519 attestations to populate pubkey_to_providers. In the new design this lives on Endpoint as signing_pubkeys: HashMap<Algo, String>, populated on first verify. Confirm that's the right shape.
- Retry policy granularity. Should retry policy be per-model (some models can't retry because they're stateful) or global? Today it's global with per-error classification.
- Sticky map eviction. Today there's no TTL, so entries accumulate. Should the new sticky map have a TTL or a max size?
- E2EE pubkey routing. Verify that pubkey-based provider selection still works correctly when the same model has multiple endpoints with potentially different signing keys (post key-rotation).
- Health-tracker thresholds. Today "≥10 consecutive failures" demotes a provider. Keep that constant, or make it per-endpoint configurable?
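
To give the sticky-map eviction question something concrete to react to, here is one possible shape that combines a TTL with a maximum size. Everything in it is an assumption for discussion: the name `StickyMap`, evicting the oldest entry when full, and passing `now` explicitly so that expiry is easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical sticky map: chat_id -> (endpoint id, time of pin).
pub struct StickyMap {
    ttl: Duration,
    max_entries: usize,
    entries: HashMap<String, (u32, Instant)>,
}

impl StickyMap {
    pub fn new(ttl: Duration, max_entries: usize) -> Self {
        StickyMap { ttl, max_entries, entries: HashMap::new() }
    }

    pub fn pin(&mut self, chat_id: &str, endpoint: u32, now: Instant) {
        // Drop expired entries first.
        let ttl = self.ttl;
        self.entries
            .retain(|_, (_, seen)| now.saturating_duration_since(*seen) < ttl);
        // Still full and this is a new key: evict the oldest pin.
        if self.entries.len() >= self.max_entries && !self.entries.contains_key(chat_id) {
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, (_, seen))| *seen)
                .map(|(k, _)| k.clone());
            if let Some(k) = oldest {
                self.entries.remove(&k);
            }
        }
        self.entries.insert(chat_id.to_string(), (endpoint, now));
    }

    /// Returns the pinned endpoint unless the pin has expired.
    pub fn get(&self, chat_id: &str, now: Instant) -> Option<u32> {
        self.entries
            .get(chat_id)
            .filter(|(_, seen)| now.saturating_duration_since(*seen) < self.ttl)
            .map(|(ep, _)| *ep)
    }
}
```

The linear scan for the oldest entry is fine for a sketch but would want an ordered structure at production sizes. Whether a lookup should refresh the TTL is left open here; this version does not.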
Non-goals
- Changing the public OpenAI-compatible API surface: /v1/chat/completions request/response shapes stay identical.
- Changing the attestation algorithm: NearAiAttestation is byte-for-byte the same TDX+GPU+image-hash verification we do today.
- Replacing the SSE parser, the InterceptStream (TTFT/ITL/usage tracking), or the completion-service orchestration.
References
- crates/services/src/inference_provider_pool/mod.rs
- crates/inference_providers/src/vllm/mod.rs
- crates/inference_providers/src/external/mod.rs
- crates/inference_providers/src/lib.rs:170