RFC: simplify inference-provider layering — endpoints, protocols, pluggable attestation #670

@Evrard-Nil

Summary

Refactor the inference-provider stack so the three orthogonal concerns — wire protocol, attestation, and endpoint selection — become composable instead of welded together inside VLlmProvider.

Today VLlmProvider (crates/inference_providers/src/vllm/mod.rs, ~3,400 lines) bundles HTTP transport, TEE attestation, TLS-fingerprint pinning, 64-bucket prefix-cache routing, rotation-SNI fallback, sticky-chat connection pinning, and signature fetching. ExternalProvider is a thin facade that returns "not supported" for the TEE methods. Both implement a 14-method InferenceProvider god-trait. The pool (crates/services/src/inference_provider_pool/mod.rs, ~4,900 lines) duplicates routing logic on top.

This RFC proposes flattening that into:

  • Protocol (data-plane, today's ExternalBackend)
  • AttestationScheme (pluggable; NearAI today, room for AWS-Nitro / Azure-TEE / etc.)
  • Endpoint (URL + auth + protocol + optional attestation)
  • Model with Vec<Endpoint> — explicit n1/n2/n3 fan-out
  • Router that picks an endpoint and runs a protocol against it

Principles

  1. Endpoints are first-class. A model has Vec<Endpoint> (n1, n2, …). The router picks one; there is no in-provider rotation.
  2. Attestation is a property of an endpoint, not a provider class. Most paths don't care; the few that do (get_signature, get_attestation_report) ask endpoint.attestation.
  3. Wire protocol is orthogonal. OpenAI-compat / Anthropic / Gemini know how to convert params and stream chunks. They don't know about TEE, rotation, sticky routing, or LB.
  4. Selection lives in one place. Round-robin, sticky-by-chat-id, healthy-first — one router does all of it.
  5. The hot path is a small function: pick endpoint → run protocol → on error advance via retry policy.

New core types

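// 1. WIRE PROTOCOL (rename of today's ExternalBackend, used by all paths).
// Held as Arc<dyn Protocol>, so the async methods need #[async_trait] (or
// hand-boxed futures); native async fn in traits is not dyn-compatible.
#[async_trait::async_trait]
trait Protocol: Send + Sync {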
    fn name(&self) -> &'static str;
    async fn chat_completion_stream(&self, ep: &Endpoint, p: ChatCompletionParams, ctx: &ReqCtx)
        -> Result<StreamingResult, CompletionError>;
    async fn chat_completion(...) -> Result<..., CompletionError>;
    async fn image_generation(...) -> Result<..., ImageGenerationError> { default-error }
    async fn audio_transcription(...) -> Result<..., AudioTranscriptionError> { default-error }
    // etc.
}

// Impls: OpenAiCompatibleProtocol (with optional vllm-flavor toggle for
// X-Request-Hash + signing headers), AnthropicProtocol, GeminiProtocol.

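// 2. ATTESTATION — pluggable. Held as Arc<dyn AttestationScheme>, so the
// same #[async_trait] requirement applies here.
#[async_trait::async_trait]
trait AttestationScheme: Send + Sync {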
    fn name(&self) -> &'static str;
    /// Connect, verify, return a TLS-pinned client. Cached by endpoint.
    async fn verify(&self, base_url: &str) -> Result<reqwest::Client, AttestationError>;
    async fn fetch_report(&self, client: &Client, ep: &Endpoint, q: ReportQuery)
        -> Option<Result<AttestationReport, AttestationError>> { None }
    async fn fetch_signature(&self, client: &Client, ep: &Endpoint, chat_id: &str, algo: &str)
        -> Option<Result<ChatSignature, CompletionError>> { None }
}

// Impls: NearAiAttestation (TDX + GPU evidence + image hash + SPKI pin).
// Open for: AwsNitroAttestation, AzureTeeAttestation, GcpConfidentialVm, etc.

// 3. ENDPOINT — one URL, one protocol, optionally attested.
struct Endpoint {
    id: EndpointId,
    url: String,
    auth: Auth,
    upstream_model_name: Option<String>,
    timeouts: Timeouts,
    protocol: Arc<dyn Protocol>,
    attestation: Option<Arc<dyn AttestationScheme>>,
    client: OnceCell<reqwest::Client>,
}

impl Endpoint {
    async fn client(&self) -> Result<&Client, CompletionError> {
        self.client.get_or_try_init(|| async {
            match &self.attestation {
                Some(scheme) => scheme.verify(&self.url).await.map_err(into_completion),
                None => Ok(plain_http_client(&self.timeouts)),
            }
        }).await
    }
}

// 4. MODEL — canonical id + endpoints + selection policy.
struct Model {
    canonical_name: String,
    endpoints: Vec<Arc<Endpoint>>,
    selection: SelectionPolicy,
}

// 5. ROUTER — what InferenceProviderPool shrinks into.
struct Router {
    models: RwLock<HashMap<String, Arc<Model>>>,
    sticky: RwLock<HashMap<ChatId, Arc<Endpoint>>>,
    health: HealthTracker,
    retry: RetryPolicy,
}

impl Router {
    async fn run<R>(&self, model: &str, sticky: Option<&str>, op: Op<R>)
        -> Result<(R, Arc<Endpoint>), CompletionError>;
}

Hot path

let (stream, ep) = router.run(&model, sticky_chat_id, |ep, ctx|
    ep.protocol.chat_completion_stream(ep, params.clone(), ctx)
).await?;
// peek first chunk to learn chat_id, then router.sticky_pin(chat_id, ep)
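
To make the "pick endpoint → run protocol → on error advance" loop concrete, here is a minimal synchronous sketch of what Router::run could reduce to. It is an illustration under assumptions, not the proposed implementation: the Retryable / Fatal / NoEndpoints split of CompletionError, the numeric endpoint id, and the `start` index parameter are all hypothetical, and the real version would be async and consult the health tracker and retry policy.

```rust
use std::sync::Arc;

/// Hypothetical, simplified stand-ins for the RFC's types.
#[derive(Debug, Clone, PartialEq)]
pub enum CompletionError {
    /// Transient failure: the router may advance to the next endpoint.
    Retryable(String),
    /// Permanent failure: trying another endpoint would not help.
    Fatal(String),
    /// The model has no endpoints at all.
    NoEndpoints,
}

#[derive(Debug)]
pub struct Endpoint {
    pub id: u32,
    pub url: String,
}

pub struct Model {
    pub endpoints: Vec<Arc<Endpoint>>,
}

/// Try each endpoint once, starting at index `start` and wrapping around.
/// On success, return the result together with the endpoint that served it,
/// so the caller can pin the chat to that endpoint afterwards.
pub fn run<R>(
    model: &Model,
    start: usize,
    mut op: impl FnMut(&Endpoint) -> Result<R, CompletionError>,
) -> Result<(R, Arc<Endpoint>), CompletionError> {
    let n = model.endpoints.len();
    let mut last_err = CompletionError::NoEndpoints;
    for i in 0..n {
        let ep = &model.endpoints[(start + i) % n];
        match op(ep) {
            Ok(r) => return Ok((r, Arc::clone(ep))),
            // Transient error: remember it and advance to the next endpoint.
            Err(e @ CompletionError::Retryable(_)) => last_err = e,
            // Anything else is returned to the caller immediately.
            Err(e) => return Err(e),
        }
    }
    Err(last_err)
}
```

Returning the serving endpoint alongside the result is what lets the caller pin the chat after peeking the first chunk, without the router needing to know how chat ids are learned.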

Signature / attestation path

let ep = router.sticky_get(chat_id).ok_or(NotFound)?;
let scheme = ep.attestation.as_ref().ok_or(NotAttested)?;
scheme.fetch_signature(ep.client().await?, ep, chat_id, algo).await
    .ok_or(NotSupportedByScheme)??

The 14-method InferenceProvider trait at crates/inference_providers/src/lib.rs:170 goes away. Callers talk to the router; the router talks to Protocol and AttestationScheme.
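
The `.ok_or(NotSupportedByScheme)??` in the signature path relies on the scheme returning Option<Result<..>>: None means "this scheme has no signature support", Some(Err(..)) means "supported, but the fetch failed". The sketch below shows how that flattens into a single Result, in a synchronous form. The SigError enum and the ReportOnly / Signing schemes are hypothetical names used only for illustration.

```rust
#[derive(Debug, PartialEq)]
pub enum SigError {
    NotFound,             // no sticky entry for this chat id
    NotAttested,          // endpoint has no attestation scheme
    NotSupportedByScheme, // scheme does not implement signatures
    Upstream(String),     // scheme supports signatures but the fetch failed
}

pub trait AttestationScheme {
    /// Default: the scheme does not support signatures at all.
    fn fetch_signature(&self, _chat_id: &str) -> Option<Result<String, SigError>> {
        None
    }
}

/// Hypothetical scheme that relies on the default (no signature support).
pub struct ReportOnly;
impl AttestationScheme for ReportOnly {}

/// Hypothetical scheme that returns a placeholder signature.
pub struct Signing;
impl AttestationScheme for Signing {
    fn fetch_signature(&self, chat_id: &str) -> Option<Result<String, SigError>> {
        Some(Ok(format!("sig-for-{}", chat_id)))
    }
}

pub struct Endpoint {
    pub attestation: Option<Box<dyn AttestationScheme>>,
}

/// `ep` stands in for the result of the sticky lookup.
pub fn signature_for(ep: Option<&Endpoint>, chat_id: &str) -> Result<String, SigError> {
    let ep = ep.ok_or(SigError::NotFound)?;
    let scheme = ep.attestation.as_ref().ok_or(SigError::NotAttested)?;
    // Option<Result<T, E>> -> Result<Result<T, E>, E>; the first `?` handles
    // "not supported", the second propagates a failed fetch.
    let sig = scheme
        .fetch_signature(chat_id)
        .ok_or(SigError::NotSupportedByScheme)??;
    Ok(sig)
}
```

Each failure mode maps to a distinct error, so callers can tell "unknown chat" apart from "plain endpoint" and "scheme without signatures" without any provider-class checks.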

Decisions already locked in

These were resolved during design discussion — see [internal thread]:

| Question | Decision |
| --- | --- |
| Prefix-cache bucket routing | Drop it. Hash the prefix to one of the N explicit endpoints. The 64-bucket trie was a workaround for "only one URL to the SNI proxy." Revisit only if prefix-cache hit-rate regresses on production traffic. |
| Model-proxy in front of NEAR AI endpoints | Keep model-proxy, one endpoint per host behind it. Endpoints look like n1.completions.near.ai, n2.completions.near.ai, …; model-proxy routes by hostname (not rotation index). cloud-api connects to a known endpoint, no SNI rotation gymnastics. |
| Schema | New model_endpoints table. One row per (model, endpoint). Clean queries, easy per-endpoint flags (active, region, priority). |
| Rollout | Design-only RFC for now. Implementation phasing to be decided after design review. |
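
As a rough illustration of "hash the prefix to one of the N explicit endpoints", the sketch below maps a prompt prefix to an endpoint index. The function name, the `prefix_bytes` parameter, and the choice of std's DefaultHasher are assumptions for the example; std does not guarantee that hasher's output across Rust releases, so a real router would likely want a hash that is stable across processes and versions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map the first `prefix_bytes` bytes of a prompt to an index in
/// `0..n_endpoints`. Returns `None` when there are no endpoints.
pub fn endpoint_for_prefix(prompt: &str, prefix_bytes: usize, n_endpoints: usize) -> Option<usize> {
    if n_endpoints == 0 {
        return None;
    }
    // Back off to a char boundary so slicing never panics on multi-byte UTF-8.
    let mut end = prompt.len().min(prefix_bytes);
    while !prompt.is_char_boundary(end) {
        end -= 1;
    }
    let mut hasher = DefaultHasher::new();
    prompt[..end].hash(&mut hasher);
    Some((hasher.finish() % n_endpoints as u64) as usize)
}
```

Two requests that share the same leading bytes land on the same endpoint, which is the property the 64-bucket trie was approximating. Note that a plain modulo reshuffles most prefixes whenever N changes.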

Mapping current → new

| Today | Tomorrow |
| --- | --- |
| InferenceProvider god-trait (14 methods, lib.rs:170) | Split: Protocol (data-plane) + AttestationScheme (TEE). |
| VLlmProvider (~3,400 lines, vllm/mod.rs) | Removed. Replaced by: NearAiAttestation scheme + standard OpenAiCompatibleProtocol + endpoints with attestation: Some(NearAiAttestation). |
| ExternalProvider facade (external/mod.rs:141) | Removed. Backends become Protocol impls used directly by the router. |
| OpenAiCompatibleBackend / AnthropicBackend / GeminiBackend | Renamed to …Protocol. Kept almost as-is. |
| PoolBackendVerifier (inference_provider_pool/mod.rs:225) | Renamed to NearAiAttestation. Same body, no longer wired through VLlmProvider::new_with_verifier. |
| VLlmConfig two-timeout setup | Endpoint::timeouts (per-endpoint), with model-level / global defaults. |
| vllm/prefix_router.rs (trie + 64 buckets) | Deleted (per decision above). |
| Rotation-SNI machinery: rotation_url, try_chat_completion_rotation, pending_rotation, signature_rotation, last_backend_count, build_rotation_client, set_backend_count | Deleted. Endpoints are explicit. |
| pending_buckets / signature_buckets | Collapses into router.sticky: HashMap<ChatId, Arc<Endpoint>>. |
| pin_chat_connection / unpin_chat_connection on the trait | Becomes router.sticky_pin / router.sticky_release. Not a trait method. |
| InferenceProviderPool::retry_with_fallback (mod.rs:1478) | Router::run: same idea, iterates model.endpoints directly instead of Vec<Arc<dyn InferenceProvider>>. |
| model_to_providers + pubkey_to_providers | models: HashMap<String, Arc<Model>>. Pubkey routing becomes a filter step: `model.endpoints.iter().filter(…)`. |
| provider_failure_counts keyed by Arc::as_ptr | HealthTracker keyed by stable EndpointId. |
| strip_internal_tracing_keys (external/mod.rs:58), prepare_tracing_headers, prepare_encryption_headers | Tracing IDs travel in ReqCtx, not in params.extra. The fragile #[serde(flatten)] strip-list disappears. |
| models.inference_url single column | New model_endpoints table (model_id, url, position, protocol, attestation, auth_ref, active). |
| Discovery (discover_model, apply_pin_update, complete-coverage logic) | Per-endpoint re-verification on a periodic interval + reconcile-from-DB. No "walk SNI by index to enumerate backends." |
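
To illustrate the move from failure counts keyed by Arc::as_ptr to a HealthTracker keyed by a stable EndpointId, here is a minimal sketch. The method names and the "reset on success" rule are assumptions; the threshold of 10 used in the usage note mirrors the consecutive-failure demotion described for today's pool.

```rust
use std::collections::HashMap;

/// Stable identifier, e.g. derived from the model_endpoints row rather than
/// from a pointer that changes whenever the endpoint object is rebuilt.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct EndpointId(pub String);

pub struct HealthTracker {
    threshold: u32,
    consecutive_failures: HashMap<EndpointId, u32>,
}

impl HealthTracker {
    pub fn new(threshold: u32) -> Self {
        Self { threshold, consecutive_failures: HashMap::new() }
    }

    pub fn record_failure(&mut self, id: &EndpointId) {
        *self.consecutive_failures.entry(id.clone()).or_insert(0) += 1;
    }

    /// Any success clears the streak (assumed policy).
    pub fn record_success(&mut self, id: &EndpointId) {
        self.consecutive_failures.remove(id);
    }

    pub fn is_healthy(&self, id: &EndpointId) -> bool {
        self.consecutive_failures.get(id).copied().unwrap_or(0) < self.threshold
    }

    /// Healthy endpoints first, preserving relative order within each group.
    /// Unhealthy endpoints are demoted, not dropped, so they can still be
    /// tried as a last resort.
    pub fn healthy_first(&self, ids: &[EndpointId]) -> Vec<EndpointId> {
        let (mut healthy, unhealthy): (Vec<EndpointId>, Vec<EndpointId>) =
            ids.iter().cloned().partition(|id| self.is_healthy(id));
        healthy.extend(unhealthy);
        healthy
    }
}
```

With `HealthTracker::new(10)`, an endpoint stays in the healthy group through nine consecutive failures and is demoted on the tenth. Because the key is the id, the streak survives the endpoint object being reconstructed on a DB reconcile.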

Expected reduction

Rough estimate from current file sizes:

  • vllm/mod.rs: ~3,400 → ~400 lines (NearAiAttestation + small vllm-flavor toggle on the OpenAI-compat protocol)
  • vllm/prefix_router.rs: deleted
  • external/mod.rs: ~860 → small (facade gone, protocols kept)
  • inference_provider_pool/mod.rs: ~4,900 → ~1,000 lines (Router + HealthTracker + discovery reconcile)
  • BackendVerifier trait + lazy bucket dance in VLlmProvider → Endpoint::client() + OnceCell

What goes wrong if we don't do this

  • Adding a new attestation scheme (e.g., AWS-Nitro) today means wrapping or forking VLlmProvider, or duplicating its ~3,400 lines.
  • Adding fan-out across model hosts requires a model-proxy rotation-SNI dance plus per-call rotation index tracking in vllm/mod.rs.
  • Pool retry logic is tangled with attestation lifecycle (PoolBackendVerifier is constructed against a specific VLlmProvider), making it hard to test either in isolation.
  • ChatCompletionParams.extra smuggles tracing/encryption keys that must be carefully stripped before forwarding to external providers — easy to leak.

Open questions for the design review

  1. Per-endpoint signing pubkey caching. Today the pool fetches both ECDSA and Ed25519 attestations to populate pubkey_to_providers. In the new design this lives on Endpoint as signing_pubkeys: HashMap<Algo, String>, populated on first verify. Confirm that's the right shape.
  2. Retry policy granularity. Should retry policy be per-model (some models can't retry because they're stateful) or global? Today it's global with per-error classification.
  3. Sticky map eviction. Today there's no TTL — entries accumulate. Should the new sticky map have a TTL or a max size?
  4. E2EE pubkey routing. Verify that pubkey-based provider selection still works correctly when the same model has multiple endpoints with potentially different signing keys (post key-rotation).
  5. Health-tracker thresholds. Today "≥10 consecutive failures" demotes a provider. Keep that constant, or make it per-endpoint configurable?
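
For open question 3, one possible shape for a bounded sticky map is sketched below, combining a TTL with a max size. This is an illustration of the trade-off, not a decision: the StickyMap name, the string-typed ids, the explicit `now` parameter (used to keep the example deterministic), and the "evict the oldest pin when full" rule are all assumptions.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

pub struct StickyMap {
    ttl: Duration,
    max_entries: usize,
    /// chat id -> (endpoint id, time of pin)
    entries: HashMap<String, (String, Instant)>,
}

impl StickyMap {
    pub fn new(ttl: Duration, max_entries: usize) -> Self {
        Self { ttl, max_entries, entries: HashMap::new() }
    }

    pub fn pin(&mut self, chat_id: &str, endpoint_id: &str, now: Instant) {
        // Drop expired pins first.
        let ttl = self.ttl;
        self.entries.retain(|_, (_, at)| now.duration_since(*at) < ttl);
        // If the map is still full and this is a new chat, evict the oldest pin.
        if self.entries.len() >= self.max_entries && !self.entries.contains_key(chat_id) {
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, (_, at))| *at)
                .map(|(k, _)| k.clone());
            if let Some(k) = oldest {
                self.entries.remove(&k);
            }
        }
        self.entries
            .insert(chat_id.to_string(), (endpoint_id.to_string(), now));
    }

    /// Returns the pinned endpoint id, or `None` if absent or expired.
    pub fn get(&self, chat_id: &str, now: Instant) -> Option<&str> {
        self.entries
            .get(chat_id)
            .filter(|(_, at)| now.duration_since(*at) < self.ttl)
            .map(|(ep, _)| ep.as_str())
    }
}
```

The linear scan for the oldest entry is fine for a sketch but would want an ordered structure at production sizes. An evicted or expired pin also means a later signature fetch for that chat returns "not found", so the TTL has to cover the window in which signatures are requested.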

Non-goals

  • Changing the public OpenAI-compatible API surface — /v1/chat/completions request/response shapes stay identical.
  • Changing the attestation algorithm — NearAiAttestation is byte-for-byte the same TDX+GPU+image-hash verification we do today.
  • Replacing the SSE parser, the InterceptStream (TTFT/ITL/usage tracking), or the completion-service orchestration.
