Skip to content

Commit 84367f0

Browse files
authored
feat: TCP diversity + cumulative attestation discovery (#544)
* feat: TCP diversity + cumulative attestation discovery On staging, PR #543 eliminated the permanent provider-block deadlock, but revealed a second issue: every model registered with pinned_fingerprints=1 regardless of backend count. reqwest negotiates HTTP/2 over its warm connection pool, so 5 staggered calls on one Client multiplex onto a single TCP connection, hash to a single L4 backend, and only one fingerprint gets pinned. When model-proxy later L4-hashes a request to a different backend, TLS is rejected → provider fails → transient 502 while PR #537's path evicts and recreates the provider from scratch. Also noticed GLM-5/GLM-5.1 occasionally registering with pub_keys=0: the separate post-pin pubkey fetch was making a second pair of round- trips on the pinned connection and failing, leaving E2EE routing broken even though the provider served non-E2EE traffic fine. Three changes: 1. TCP diversity — each discovery call gets its own VLlmProvider (fresh reqwest::Client, fresh TCP connection). Peers share the same fingerprint_state Arc so pin updates propagate. With N separate TCP handshakes, each hashes independently through L4 → high chance of covering every backend within a few cycles. 2. Extract pubkeys at discovery time — the attestation report already contains signing_public_key for the requested algo. The discovery futures now alternate ecdsa/ed25519 and harvest pubkeys from every verified response. Since pubkeys are derived from the TEE compose hash they're identical across a model's backends, so first verified response per algo wins. Eliminates the separate post-pin fetch. 3. Cumulative discovery — each refresh cycle runs a small number (CUMULATIVE_DISCOVERY_CALLS=2) of additional discovery calls against every reused provider. New fingerprints accumulate into the shared FingerprintState; pubkeys backfill providers that lost their mapping. Over several cycles this covers every backend behind the L4 LB even when initial discovery only hit one. Also: - Evict reused providers whose fingerprint_state has transitioned to Blocked — a Blocked verifier rejects every handshake, so cumulative discovery can't recover in-place; eviction forces a fresh Bootstrap provider next cycle. - Legacy refetch path (for reused providers with no tracked fingerprint state, e.g. test-injected mocks) preserved as a fallback so #537's regression test still passes. - Structured logs per discovery call with algo, attempt index, elapsed_ms, and outcome (success/error/timeout) at debug level; per-cycle summary at info with successful/failed call counts, new_fingerprints, total_pinned, and harvested pubkey_algos. Tests: 26 pool tests pass (was 25), including a new test that verifies a Blocked fingerprint state causes eviction from all three tracking maps without making any network calls. * address review: parallelize cumulative discovery, drop unused field Three fixes from @claude's review on PR #544: 1. Parallelize cumulative discovery across reused providers. The loop was awaiting each `discover_model` call sequentially; with CUMULATIVE_DISCOVERY_CALLS=2 the per-provider cost is ~10.2s worst case (stagger + per-call timeout), so N providers sequentially could add minutes to a refresh cycle and starve the background task. Classify each reused provider upfront (Blocked/Discover/LegacyRefetch/ Skip), then drive the Discover and LegacyRefetch buckets through `buffer_unordered(10)` concurrently via `tokio::join!`. Evictions from the Blocked path are collected synchronously (no network). Collection uses an explicit `while let Some(x) = stream.next().await` loop rather than `.collect::<Vec<_>>()` because the collector's HRTB inference trips on tuples containing `Arc<dyn InferenceProvider>`. 2. Move `attestation_reports.push(report)` out — unverified reports were being collected into the same Vec as verified ones. 3. Drop `DiscoveryOutcome::attestation_reports` entirely. Every caller discarded it; the field was `#[allow(dead_code)]` and just added allocation and noise. Fixes (2) implicitly. * address review: fix self-blocking, lightweight client, proper pruning Addresses feedback from @gemini-code-assist and @copilot on PR #544. ## Critical correctness ### Self-blocking during discovery (Gemini) Discovery calls shared the caller's `FingerprintState` Arc. The first call that completed transitioned the state from `Bootstrap` to `Pinned({A})` — subsequent calls to different backends (exactly the purpose of TCP diversity) then had their TLS handshakes rejected inside `SpkiFingerprintVerifier` for not matching {A}. Explains why staging showed `pinned_fingerprints=1` for every model even with the fresh-client path in place. Each call now uses its own isolated `Bootstrap` state for the TLS verifier. Verified fingerprints are merged into the caller's shared accumulator *after* the HTTP call returns, outside the per-call future. No cross-call interference. ### `known_ptrs` mutation race (Copilot) `let has = !known_ptrs.insert(ptr)` mutated the set as we iterated `reused`. When one provider Arc backed multiple models, the second model's iteration wrongly saw "already mapped" because the first iteration had just inserted the pointer. Backfill was then skipped for the second model. Replaced with an immutable `mapped_ptrs` snapshot (used for membership checks) and a separate `processed_ptrs` set (used only to deduplicate work per unique provider Arc). ## Resource efficiency ### 645 reqwest::Client per discovery pass (Gemini) Each `VLlmProvider::new*` spins up 128 `bucket_clients` plus a main client. With 5 discovery calls per model that's 645 reqwest clients spun up and dropped per model per refresh just to make 5 HTTP GETs. Discovery now uses a minimal `reqwest::Client` built directly from a shared `SharedTlsRoots` (loaded once at pool creation, not per call) and a custom `SpkiFingerprintVerifier`. ~1 client per call. Also adds `tls_roots: SharedTlsRoots` field on `InferenceProviderPool` so the native-cert DER parsing (~150KB) doesn't run every refresh. ## Correctness ### Per-algo pubkey backfill (Copilot) Previous gate was binary: "provider has ANY pubkey mapping → skip backfill". If initial discovery only produced one algo's pubkey (or a prior cycle partially populated), the missing algo would never get backfilled. Now always attempts to merge harvested pubkeys, deduplicating at the `(pubkey, provider_ptr)` pair level against an immutable snapshot of `pubkey_to_providers`. Partial mappings self-heal on the next cycle. ### Stale `inference_url_fingerprint_states` entries (Copilot, Gemini) The states map was appended to but never pruned when a model URL dropped out of the active set. Small leak that would grow over long uptimes as models rotate. After the URL cache is updated, prune the states map to match (`retain(|url, _| active_urls.contains_key(url))`). ### Evicted providers left in `pubkey_to_providers` (Copilot) Eviction removed entries from `model_to_providers`, `inference_url_providers`, and the states map — but not from `pubkey_to_providers`. Evicted provider Arcs stayed alive via their pubkey mapping, and future `mapped_ptrs` snapshots would wrongly see them as "registered" providers. Eviction now tracks evicted provider pointers and prunes any `pubkey_to_providers` entries referencing them in the same atomic write. ## Test plan - [x] `cargo test -p services --lib` — 196 pass - [x] `cargo fmt --all -- --check` — clean - [x] `cargo clippy -p services --lib -- -D warnings` — clean
1 parent 30a33c3 commit 84367f0

4 files changed

Lines changed: 805 additions & 200 deletions

File tree

Cargo.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

crates/services/Cargo.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@ description.workspace = true
1010
async-trait = "0.1"
1111
serde = { version = "1.0", features = ["derive"] }
1212
serde_json = "1.0"
13+
serde_urlencoded = "0.7"
1314
uuid = { version = "1.22", features = ["v4", "v5", "serde"] }
1415
opentelemetry = { version = "0.31", features = ["metrics"] }
1516
opentelemetry_sdk = { version = "0.31", features = ["rt-tokio", "metrics"] }

crates/services/src/attestation/verification.rs

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,15 @@ const NVIDIA_NRAS_URL: &str = "https://nras.attestation.nvidia.com/v3/attest/gpu
1616
/// this modest to avoid piling attestation work on inference backends.
1717
pub const ATTESTATION_DISCOVERY_PARALLELISM: usize = 5;
1818

19+
/// Number of cumulative attestation calls per reused provider on each refresh.
20+
///
21+
/// Each cycle adds a small number of fresh-TCP discovery calls to a reused
22+
/// provider, which accumulates new backend fingerprints into the shared
23+
/// `FingerprintState`. Over several cycles this covers every backend behind
24+
/// the L4 LB, even when the initial discovery only hit one. Kept small so
25+
/// steady-state refresh load stays low.
26+
pub const CUMULATIVE_DISCOVERY_CALLS: usize = 2;
27+
1928
/// Result of verifying an attestation report from an inference backend.
2029
#[derive(Debug, Clone)]
2130
pub struct VerifiedAttestation {

0 commit comments

Comments
 (0)