RFC: simplify inference-provider layering — endpoints, protocols, pluggable attestation #670

## Summary

Refactor the inference-provider stack so the three orthogonal concerns — **wire protocol**, **attestation**, and **endpoint selection** — become composable instead of welded together inside `VLlmProvider`.

Today `VLlmProvider` (`crates/inference_providers/src/vllm/mod.rs`, ~3,400 lines) bundles HTTP transport, TEE attestation, TLS-fingerprint pinning, 64-bucket prefix-cache routing, rotation-SNI fallback, sticky-chat connection pinning, and signature fetching. `ExternalProvider` is a thin facade that returns "not supported" for the TEE methods. Both implement a 14-method `InferenceProvider` god-trait. The pool (`crates/services/src/inference_provider_pool/mod.rs`, ~4,900 lines) duplicates routing logic on top.

This RFC proposes flattening that into:

- `Protocol` (data-plane, today's `ExternalBackend`)
- `AttestationScheme` (pluggable; NearAI today, room for AWS-Nitro / Azure-TEE / etc.)
- `Endpoint` (URL + auth + protocol + optional attestation)
- `Model` with `Vec<Endpoint>` — explicit n1/n2/n3 fan-out
- `Router` that picks an endpoint and runs a protocol against it

## Principles

1. **Endpoints are first-class.** A model has `Vec<Endpoint>` (n1, n2, …). The router picks one; there is no in-provider rotation.
2. **Attestation is a property of an endpoint**, not a provider class. Most paths don't care; the few that do (`get_signature`, `get_attestation_report`) ask `endpoint.attestation`.
3. **Wire protocol is orthogonal.** OpenAI-compat / Anthropic / Gemini know how to convert params and stream chunks. They don't know about TEE, rotation, sticky routing, or LB.
4. **Selection lives in one place.** Round-robin, sticky-by-chat-id, healthy-first — one router does all of it.
5. **The hot path is a small function**: pick endpoint → run protocol → on error advance via retry policy.
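
To make principle 4 concrete, here is a minimal sketch of healthy-first ordering keyed by a stable endpoint id. The `HealthTracker` method names and the `u32` id type are assumptions for illustration only; the threshold is a constructor parameter because whether it stays a constant is still an open question.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the RFC's `EndpointId`.
type EndpointId = u32;

struct HealthTracker {
    // Consecutive failures at which an endpoint is demoted.
    threshold: u32,
    consecutive_failures: HashMap<EndpointId, u32>,
}

impl HealthTracker {
    fn new(threshold: u32) -> Self {
        Self { threshold, consecutive_failures: HashMap::new() }
    }

    // A success resets the consecutive-failure counter.
    fn record_success(&mut self, id: EndpointId) {
        self.consecutive_failures.remove(&id);
    }

    fn record_failure(&mut self, id: EndpointId) {
        *self.consecutive_failures.entry(id).or_insert(0) += 1;
    }

    fn is_healthy(&self, id: EndpointId) -> bool {
        self.consecutive_failures.get(&id).copied().unwrap_or(0) < self.threshold
    }

    // Healthy endpoints first; original order is preserved within each group,
    // so demoted endpoints remain available as a last resort.
    fn order(&self, ids: &[EndpointId]) -> Vec<EndpointId> {
        let (mut healthy, demoted): (Vec<EndpointId>, Vec<EndpointId>) =
            ids.iter().copied().partition(|id| self.is_healthy(*id));
        healthy.extend(demoted);
        healthy
    }
}
```

Because demoted endpoints are moved to the back rather than removed, a model whose endpoints are all unhealthy still gets tried.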

## New core types

```rust
// 1. WIRE PROTOCOL (rename of today's ExternalBackend, used by all paths).
trait Protocol: Send + Sync {
    fn name(&self) -> &'static str;
    async fn chat_completion_stream(&self, ep: &Endpoint, p: ChatCompletionParams, ctx: &ReqCtx)
        -> Result<StreamingResult, CompletionError>;
    async fn chat_completion(...) -> Result<..., CompletionError>;
    async fn image_generation(...) -> Result<..., ImageGenerationError> { default-error }
    async fn audio_transcription(...) -> Result<..., AudioTranscriptionError> { default-error }
    // etc.
}

// Impls: OpenAiCompatibleProtocol (with optional vllm-flavor toggle for
// X-Request-Hash + signing headers), AnthropicProtocol, GeminiProtocol.

// 2. ATTESTATION — pluggable.
trait AttestationScheme: Send + Sync {
    fn name(&self) -> &'static str;
    /// Connect, verify, return a TLS-pinned client. Cached by endpoint.
    async fn verify(&self, base_url: &str) -> Result<reqwest::Client, AttestationError>;
    async fn fetch_report(&self, client: &Client, ep: &Endpoint, q: ReportQuery)
        -> Option<Result<AttestationReport, AttestationError>> { None }
    async fn fetch_signature(&self, client: &Client, ep: &Endpoint, chat_id: &str, algo: &str)
        -> Option<Result<ChatSignature, CompletionError>> { None }
}

// Impls: NearAiAttestation (TDX + GPU evidence + image hash + SPKI pin).
// Open for: AwsNitroAttestation, AzureTeeAttestation, GcpConfidentialVm, etc.

// 3. ENDPOINT — one URL, one protocol, optionally attested.
struct Endpoint {
    id: EndpointId,
    url: String,
    auth: Auth,
    upstream_model_name: Option<String>,
    timeouts: Timeouts,
    protocol: Arc<dyn Protocol>,
    attestation: Option<Arc<dyn AttestationScheme>>,
    client: OnceCell<reqwest::Client>,
}

impl Endpoint {
    async fn client(&self) -> Result<&Client, CompletionError> {
        self.client.get_or_try_init(|| async {
            match &self.attestation {
                Some(scheme) => scheme.verify(&self.url).await.map_err(into_completion),
                None => Ok(plain_http_client(&self.timeouts)),
            }
        }).await
    }
}

// 4. MODEL — canonical id + endpoints + selection policy.
struct Model {
    canonical_name: String,
    endpoints: Vec<Arc<Endpoint>>,
    selection: SelectionPolicy,
}

// 5. ROUTER — what InferenceProviderPool shrinks into.
struct Router {
    models: RwLock<HashMap<String, Arc<Model>>>,
    sticky: RwLock<HashMap<ChatId, Arc<Endpoint>>>,
    health: HealthTracker,
    retry: RetryPolicy,
}

impl Router {
    async fn run<R>(&self, model: &str, sticky: Option<&str>, op: Op<R>)
        -> Result<(R, Arc<Endpoint>), CompletionError>;
}
```
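
The lazy `Endpoint::client()` behaviour can be sketched without async or `reqwest`. This is a synchronous, std-only approximation: `Client`, `CountingScheme`, and the `String` error type are placeholders, and `std::sync::OnceLock` stands in for the async `OnceCell`. It illustrates the intended contract that a successful verification is cached while a failed one is not.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

// Placeholder for a TLS-pinned HTTP client.
#[derive(Debug, Clone, PartialEq)]
struct Client {
    pinned: bool,
}

trait AttestationScheme: Send + Sync {
    fn verify(&self, base_url: &str) -> Result<Client, String>;
}

// Test double: counts verify() calls and can fail the first one.
struct CountingScheme {
    calls: AtomicUsize,
    fail_first: bool,
}

impl AttestationScheme for CountingScheme {
    fn verify(&self, _base_url: &str) -> Result<Client, String> {
        let n = self.calls.fetch_add(1, Ordering::SeqCst);
        if self.fail_first && n == 0 {
            return Err("attestation failed".to_string());
        }
        Ok(Client { pinned: true })
    }
}

struct Endpoint {
    url: String,
    attestation: Option<Arc<dyn AttestationScheme>>,
    client: OnceLock<Client>,
}

impl Endpoint {
    // Returns the cached client, verifying on first use. Errors are not
    // cached, so a later call retries verification. Two racing callers may
    // both verify; only the first result is stored.
    fn client(&self) -> Result<&Client, String> {
        if let Some(c) = self.client.get() {
            return Ok(c);
        }
        let fresh = match &self.attestation {
            Some(scheme) => scheme.verify(&self.url)?,
            None => Client { pinned: false },
        };
        Ok(self.client.get_or_init(|| fresh))
    }
}
```

An endpoint with `attestation: None` gets a plain client on first use and never touches a scheme.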

### Hot path

```rust
let (stream, ep) = router.run(&model, sticky_chat_id, |ep, ctx|
    ep.protocol.chat_completion_stream(ep, params.clone(), ctx)
).await?;
// peek first chunk to learn chat_id, then router.sticky_pin(chat_id, ep)
```
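
One possible shape for the loop inside `Router::run`, reduced to synchronous std-only Rust. The `CallError` classification, the `start` index, and the `max_attempts` cap are assumptions standing in for `SelectionPolicy` and `RetryPolicy`, neither of which this RFC specifies yet.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Endpoint {
    id: u32,
    url: String,
}

// Hypothetical error classification; the real policy classifies per error.
#[derive(Debug, PartialEq)]
enum CallError {
    Retryable(String),
    Fatal(String),
}

// Try endpoints in order beginning at `start`, advancing to the next endpoint
// on a retryable error and stopping immediately on a fatal one. Returns the
// result together with the endpoint that served it, so the caller can pin it.
fn run<R>(
    endpoints: &[Endpoint],
    start: usize,
    max_attempts: usize,
    mut op: impl FnMut(&Endpoint) -> Result<R, CallError>,
) -> Result<(R, Endpoint), CallError> {
    if endpoints.is_empty() {
        return Err(CallError::Fatal("no endpoints".into()));
    }
    let mut last = CallError::Fatal("no attempts made".into());
    // Each endpoint is tried at most once per call.
    for attempt in 0..max_attempts.min(endpoints.len()) {
        let ep = &endpoints[(start + attempt) % endpoints.len()];
        match op(ep) {
            Ok(r) => return Ok((r, ep.clone())),
            Err(CallError::Retryable(msg)) => last = CallError::Retryable(msg),
            Err(fatal) => return Err(fatal),
        }
    }
    Err(last)
}
```

Returning the serving endpoint alongside the result is what lets the caller pin the chat to it once the first chunk reveals the chat id.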

### Signature / attestation path

```rust
let ep = router.sticky_get(chat_id).ok_or(NotFound)?;
let scheme = ep.attestation.as_ref().ok_or(NotAttested)?;
scheme.fetch_signature(ep.client().await?, ep, chat_id, algo).await
    .ok_or(NotSupportedByScheme)??
```
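
The `Option<Result<…>>` return type encodes three outcomes: the scheme does not support the call (`None`), the call was attempted and failed (`Some(Err)`), or it succeeded (`Some(Ok)`). A small self-contained sketch of how `ok_or` plus `?` collapses them; `SigError` and `flatten_signature` are hypothetical names, and the signature is modelled as a plain `String`.

```rust
#[derive(Debug, PartialEq)]
enum SigError {
    // The scheme has no signature support at all.
    NotSupportedByScheme,
    // The scheme supports signatures but the fetch failed.
    Upstream(String),
}

// `ok_or` turns the outer `None` into an error; `?` then unwraps that outer
// layer, leaving the inner `Result` as the function's return value.
fn flatten_signature(
    raw: Option<Result<String, SigError>>,
) -> Result<String, SigError> {
    raw.ok_or(SigError::NotSupportedByScheme)?
}
```

Keeping "unsupported" distinct from "failed" lets a caller decide whether retrying on another endpoint could ever help.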

The 14-method `InferenceProvider` trait at `crates/inference_providers/src/lib.rs:170` **goes away**. Callers talk to the router; the router talks to `Protocol` and `AttestationScheme`.

## Decisions already locked in

These were resolved during design discussion — see [internal thread]:

| Question | Decision |
|---|---|
| Prefix-cache bucket routing | **Drop it.** Hash the prefix to one of the N explicit endpoints. The 64-bucket trie was a workaround for "only one URL to the SNI proxy." Revisit only if prefix-cache hit-rate regresses on production traffic. |
| Model-proxy in front of NEAR AI endpoints | **Keep model-proxy, one endpoint per host behind it.** Endpoints look like `n1.completions.near.ai`, `n2.completions.near.ai`, …; model-proxy routes by hostname (not rotation index). cloud-api connects to a known endpoint, no SNI rotation gymnastics. |
| Schema | **New `model_endpoints` table.** One row per (model, endpoint). Clean queries, easy per-endpoint flags (active, region, priority). |
| Rollout | **Design-only RFC for now.** Implementation phasing to be decided after design review. |
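
A minimal sketch of "hash the prefix to one of the N explicit endpoints", assuming the prefix is the first `prefix_bytes` bytes of the prompt text. The function name and the prefix length are illustrative, not decided. `DefaultHasher` keeps the example dependency-free; its output is not guaranteed stable across Rust releases, so a real implementation would likely pick a fixed hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a prompt to an endpoint index in `0..n_endpoints` based on its leading
// bytes, so prompts sharing a prefix land on the same endpoint.
fn endpoint_for_prefix(prompt: &str, prefix_bytes: usize, n_endpoints: usize) -> Option<usize> {
    if n_endpoints == 0 {
        return None;
    }
    // Clamp to the prompt length, then back off to a UTF-8 char boundary so
    // the slice below cannot panic.
    let mut end = prefix_bytes.min(prompt.len());
    while !prompt.is_char_boundary(end) {
        end -= 1;
    }
    let mut hasher = DefaultHasher::new();
    prompt[..end].hash(&mut hasher);
    Some((hasher.finish() % n_endpoints as u64) as usize)
}
```

Plain modulo hashing remaps most prefixes whenever N changes; whether that matters is part of the "revisit if hit-rate regresses" caveat in the decision.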

## Mapping current → new

| Today | Tomorrow |
|---|---|
| `InferenceProvider` god-trait (14 methods, `lib.rs:170`) | Split: `Protocol` (data-plane) + `AttestationScheme` (TEE). |
| `VLlmProvider` (~3,400 lines, `vllm/mod.rs`) | Removed. Replaced by: `NearAiAttestation` scheme + standard `OpenAiCompatibleProtocol` + endpoints with `attestation: Some(NearAiAttestation)`. |
| `ExternalProvider` facade (`external/mod.rs:141`) | Removed. Backends become `Protocol` impls used directly by the router. |
| `OpenAiCompatibleBackend` / `AnthropicBackend` / `GeminiBackend` | Renamed to `…Protocol`. Kept almost as-is. |
| `PoolBackendVerifier` (`inference_provider_pool/mod.rs:225`) | Renamed to `NearAiAttestation`. Same body, no longer wired through `VLlmProvider::new_with_verifier`. |
| `VLlmConfig` two-timeout setup | `Endpoint::timeouts` (per-endpoint), with model-level / global defaults. |
| `vllm/prefix_router.rs` (trie + 64 buckets) | **Deleted** (per decision above). |
| Rotation-SNI machinery: `rotation_url`, `try_chat_completion_rotation`, `pending_rotation`, `signature_rotation`, `last_backend_count`, `build_rotation_client`, `set_backend_count` | **Deleted.** Endpoints are explicit. |
| `pending_buckets` / `signature_buckets` | Collapses into `router.sticky: HashMap<ChatId, Arc<Endpoint>>`. |
| `pin_chat_connection` / `unpin_chat_connection` on the trait | Becomes `router.sticky_pin` / `router.sticky_release`. Not a trait method. |
| `InferenceProviderPool::retry_with_fallback` (`mod.rs:1478`) | `Router::run` — same idea, iterates `model.endpoints` directly instead of `Vec<Arc<dyn InferenceProvider>>`. |
| `model_to_providers` + `pubkey_to_providers` | `models: HashMap<String, Arc<Model>>`. Pubkey routing becomes a filter step: `model.endpoints.iter().filter(\|ep\| ep.signing_pubkey() == requested)`. |
| `provider_failure_counts` keyed by `Arc::as_ptr` | `HealthTracker` keyed by stable `EndpointId`. |
| `strip_internal_tracing_keys` (`external/mod.rs:58`), `prepare_tracing_headers`, `prepare_encryption_headers` | Tracing IDs travel in `ReqCtx`, not in `params.extra`. The fragile `#[serde(flatten)]` strip-list disappears. |
| `models.inference_url` single column | New `model_endpoints` table (model_id, url, position, protocol, attestation, auth_ref, active). |
| Discovery (`discover_model`, `apply_pin_update`, complete-coverage logic) | Per-endpoint re-verification on a periodic interval + reconcile-from-DB. No "walk SNI by index to enumerate backends." |
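
The pubkey filter step from the table can be sketched as follows. It assumes each endpoint carries one key per algorithm (`signing_pubkeys: HashMap<Algo, String>`), the shape floated in the open questions; the `Algo` enum and the function name are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Algo {
    Ecdsa,
    Ed25519,
}

struct Endpoint {
    id: u32,
    // Assumed to be populated on first successful verification.
    signing_pubkeys: HashMap<Algo, String>,
}

// Keep only the endpoints that advertise the requested signing key under any
// algorithm. An empty result means no endpoint can serve this pubkey.
fn endpoints_for_pubkey<'a>(endpoints: &'a [Endpoint], requested: &str) -> Vec<&'a Endpoint> {
    endpoints
        .iter()
        .filter(|ep| ep.signing_pubkeys.values().any(|k| k.as_str() == requested))
        .collect()
}
```

After a key rotation, endpoints of the same model may briefly advertise different keys; the filter then simply narrows the candidate set the router selects from.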

## Expected reduction

Rough estimate from current file sizes:

- `vllm/mod.rs`: ~3,400 → ~400 lines (`NearAiAttestation` + small vllm-flavor toggle on the OpenAI-compat protocol)
- `vllm/prefix_router.rs`: deleted
- `external/mod.rs`: ~860 → small (facade gone, protocols kept)
- `inference_provider_pool/mod.rs`: ~4,900 → ~1,000 lines (Router + HealthTracker + discovery reconcile)
- `BackendVerifier` trait + lazy bucket dance in `VLlmProvider` → `Endpoint::client()` + `OnceCell`

## What goes wrong if we don't do this

- Adding a new attestation scheme (e.g., AWS-Nitro) today means wrapping `VLlmProvider` (Rust has no subclassing) or duplicating its ~3,400 lines.
- Adding fan-out across model hosts requires a model-proxy rotation-SNI dance plus per-call rotation index tracking in `vllm/mod.rs`.
- Pool retry logic is tangled with attestation lifecycle (`PoolBackendVerifier` is constructed against a specific `VLlmProvider`), making it hard to test either in isolation.
- `ChatCompletionParams.extra` smuggles tracing/encryption keys that must be carefully stripped before forwarding to external providers — easy to leak.

## Open questions for the design review

1. **Per-endpoint signing pubkey caching.** Today the pool fetches both ECDSA and Ed25519 attestations to populate `pubkey_to_providers`. In the new design this lives on `Endpoint` as `signing_pubkeys: HashMap<Algo, String>`, populated on first verify. Confirm that's the right shape.
2. **Retry policy granularity.** Should retry policy be per-model (some models can't retry because they're stateful) or global? Today it's global with per-error classification.
3. **Sticky map eviction.** Today there's no TTL — entries accumulate. Should the new `sticky` map have a TTL or a max size?
4. **E2EE pubkey routing.** Verify that pubkey-based provider selection still works correctly when the same model has multiple endpoints with potentially different signing keys (post key-rotation).
5. **Health-tracker thresholds.** Today "≥10 consecutive failures" demotes a provider. Keep that constant, or make it per-endpoint configurable?
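
For question 3, one option is to combine a TTL with a size cap. The sketch below is only meant to inform the discussion: `StickyMap`, its fields, and the evict-oldest rule are assumptions, endpoints are reduced to a `u32` id, and the current time is passed in explicitly so the behaviour is deterministic.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct StickyMap {
    ttl: Duration,
    max_entries: usize,
    // chat_id -> (endpoint id, time the pin was created)
    entries: HashMap<String, (u32, Instant)>,
}

impl StickyMap {
    fn new(ttl: Duration, max_entries: usize) -> Self {
        Self { ttl, max_entries, entries: HashMap::new() }
    }

    // Pin a chat to an endpoint. Expired pins are swept first; if the map is
    // still full, the oldest pin is evicted to make room.
    fn pin(&mut self, chat_id: &str, endpoint_id: u32, now: Instant) {
        let ttl = self.ttl;
        self.entries.retain(|_, (_, at)| now.duration_since(*at) < ttl);
        if !self.entries.contains_key(chat_id) && self.entries.len() >= self.max_entries {
            let oldest = self
                .entries
                .iter()
                .min_by_key(|(_, (_, at))| *at)
                .map(|(k, _)| k.clone());
            if let Some(k) = oldest {
                self.entries.remove(&k);
            }
        }
        self.entries.insert(chat_id.to_string(), (endpoint_id, now));
    }

    // Look up a pin, treating entries older than the TTL as absent.
    fn get(&self, chat_id: &str, now: Instant) -> Option<u32> {
        self.entries
            .get(chat_id)
            .filter(|(_, at)| now.duration_since(*at) < self.ttl)
            .map(|(id, _)| *id)
    }
}
```

The sweep on every `pin` is O(n); if the map grows large, a periodic background sweep or an LRU structure would be a cheaper alternative.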

## Non-goals

- Changing the public OpenAI-compatible API surface — `/v1/chat/completions` request/response shapes stay identical.
- Changing the attestation algorithm — `NearAiAttestation` is byte-for-byte the same TDX+GPU+image-hash verification we do today.
- Replacing the SSE parser, the InterceptStream (TTFT/ITL/usage tracking), or the completion-service orchestration.

## References

- Current pool: `crates/services/src/inference_provider_pool/mod.rs`
- Current vLLM provider: `crates/inference_providers/src/vllm/mod.rs`
- Current external facade: `crates/inference_providers/src/external/mod.rs`
- Trait definition: `crates/inference_providers/src/lib.rs:170`
- Related: #587 (Bootstrap state during startup), #600 (rotation-SNI pre-warm), #573 (model substitution / attested name mismatch)
