Commit 84367f0
authored
feat: TCP diversity + cumulative attestation discovery (#544)
* feat: TCP diversity + cumulative attestation discovery
On staging, PR #543 eliminated the permanent provider-block deadlock,
but revealed a second issue: every model registered with
pinned_fingerprints=1 regardless of backend count. reqwest negotiates
HTTP/2 over its warm connection pool, so 5 staggered calls on one
Client multiplex onto a single TCP connection, hash to a single L4
backend, and only one fingerprint gets pinned. When model-proxy later
L4-hashes a request to a different backend, TLS is rejected → provider
fails → transient 502 while PR #537's path evicts and recreates the
provider from scratch.
Also noticed GLM-5/GLM-5.1 occasionally registering with pub_keys=0:
the separate post-pin pubkey fetch was making a second pair of round-
trips on the pinned connection and failing, leaving E2EE routing
broken even though the provider served non-E2EE traffic fine.
Three changes:
1. TCP diversity — each discovery call gets its own VLlmProvider
(fresh reqwest::Client, fresh TCP connection). Peers share the same
fingerprint_state Arc so pin updates propagate. With N separate TCP
handshakes, each hashes independently through L4 → high chance of
covering every backend within a few cycles.
2. Extract pubkeys at discovery time — the attestation report already
contains signing_public_key for the requested algo. The discovery
futures now alternate ecdsa/ed25519 and harvest pubkeys from every
verified response. Since pubkeys are derived from the TEE compose
hash they're identical across a model's backends, so first verified
response per algo wins. Eliminates the separate post-pin fetch.
3. Cumulative discovery — each refresh cycle runs a small number
(CUMULATIVE_DISCOVERY_CALLS=2) of additional discovery calls against
every reused provider. New fingerprints accumulate into the shared
FingerprintState; pubkeys backfill providers that lost their mapping.
Over several cycles this covers every backend behind the L4 LB even
when initial discovery only hit one.
Also:
- Evict reused providers whose fingerprint_state has transitioned to
Blocked — a Blocked verifier rejects every handshake, so cumulative
discovery can't recover in-place; eviction forces a fresh Bootstrap
provider next cycle.
- Legacy refetch path (for reused providers with no tracked
fingerprint state, e.g. test-injected mocks) preserved as a
fallback so #537's regression test still passes.
- Structured logs per discovery call with algo, attempt index,
elapsed_ms, and outcome (success/error/timeout) at debug level;
per-cycle summary at info with successful/failed call counts,
new_fingerprints, total_pinned, and harvested pubkey_algos.
Tests: 26 pool tests pass (was 25), including a new test that verifies
a Blocked fingerprint state causes eviction from all three tracking
maps without making any network calls.
* address review: parallelize cumulative discovery, drop unused field
Three fixes from @claude's review on PR #544:
1. Parallelize cumulative discovery across reused providers. The loop
was awaiting each `discover_model` call sequentially; with
CUMULATIVE_DISCOVERY_CALLS=2 the per-provider cost is ~10.2s worst
case (stagger + per-call timeout), so N providers sequentially could
add minutes to a refresh cycle and starve the background task.
Classify each reused provider upfront (Blocked/Discover/LegacyRefetch/
Skip), then drive the Discover and LegacyRefetch buckets through
`buffer_unordered(10)` concurrently via `tokio::join!`. Evictions
from the Blocked path are collected synchronously (no network).
Collection uses an explicit `while let Some(x) = stream.next().await`
loop rather than `.collect::<Vec<_>>()` because the collector's HRTB
inference trips on tuples containing `Arc<dyn InferenceProvider>`.
2. Move `attestation_reports.push(report)` out — unverified reports
were being collected into the same Vec as verified ones.
3. Drop `DiscoveryOutcome::attestation_reports` entirely. Every caller
discarded it; the field was `#[allow(dead_code)]` and just added
allocation and noise. Fixes (2) implicitly.
* address review: fix self-blocking, lightweight client, proper pruning
Addresses feedback from @gemini-code-assist and @copilot on PR #544.
## Critical correctness
### Self-blocking during discovery (Gemini)
Discovery calls shared the caller's `FingerprintState` Arc. The first
call that completed transitioned the state from `Bootstrap` to
`Pinned({A})` — subsequent calls to different backends (exactly the
purpose of TCP diversity) then had their TLS handshakes rejected
inside `SpkiFingerprintVerifier` for not matching {A}. Explains why
staging showed `pinned_fingerprints=1` for every model even with the
fresh-client path in place.
Each call now uses its own isolated `Bootstrap` state for the TLS
verifier. Verified fingerprints are merged into the caller's shared
accumulator *after* the HTTP call returns, outside the per-call
future. No cross-call interference.
### `known_ptrs` mutation race (Copilot)
`let has = !known_ptrs.insert(ptr)` mutated the set as we iterated
`reused`. When one provider Arc backed multiple models, the second
model's iteration wrongly saw "already mapped" because the first
iteration had just inserted the pointer. Backfill was then skipped
for the second model.
Replaced with an immutable `mapped_ptrs` snapshot (used for membership
checks) and a separate `processed_ptrs` set (used only to deduplicate
work per unique provider Arc).
## Resource efficiency
### 645 reqwest::Client per discovery pass (Gemini)
Each `VLlmProvider::new*` spins up 128 `bucket_clients` plus a main
client. With 5 discovery calls per model that's 645 reqwest clients
spun up and dropped per model per refresh just to make 5 HTTP GETs.
Discovery now uses a minimal `reqwest::Client` built directly from a
shared `SharedTlsRoots` (loaded once at pool creation, not per call)
and a custom `SpkiFingerprintVerifier`. ~1 client per call.
Also adds `tls_roots: SharedTlsRoots` field on `InferenceProviderPool`
so the native-cert DER parsing (~150KB) doesn't run every refresh.
## Correctness
### Per-algo pubkey backfill (Copilot)
Previous gate was binary: "provider has ANY pubkey mapping → skip
backfill". If initial discovery only produced one algo's pubkey (or a
prior cycle partially populated), the missing algo would never get
backfilled.
Now always attempts to merge harvested pubkeys, deduplicating at the
`(pubkey, provider_ptr)` pair level against an immutable snapshot of
`pubkey_to_providers`. Partial mappings self-heal on the next cycle.
### Stale `inference_url_fingerprint_states` entries (Copilot, Gemini)
The states map was appended to but never pruned when a model URL
dropped out of the active set. Small leak that would grow over long
uptimes as models rotate.
After the URL cache is updated, prune the states map to match
(`retain(|url, _| active_urls.contains_key(url))`).
### Evicted providers left in `pubkey_to_providers` (Copilot)
Eviction removed entries from `model_to_providers`,
`inference_url_providers`, and the states map — but not from
`pubkey_to_providers`. Evicted provider Arcs stayed alive via their
pubkey mapping, and future `mapped_ptrs` snapshots would wrongly see
them as "registered" providers.
Eviction now tracks evicted provider pointers and prunes any
`pubkey_to_providers` entries referencing them in the same atomic
write.
## Test plan
- [x] `cargo test -p services --lib` — 196 pass
- [x] `cargo fmt --all -- --check` — clean
- [x] `cargo clippy -p services --lib -- -D warnings` — clean1 parent 30a33c3 commit 84367f0
4 files changed
Lines changed: 805 additions & 200 deletions
File tree
- crates/services
- src
- attestation
- inference_provider_pool
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
10 | 10 | | |
11 | 11 | | |
12 | 12 | | |
| 13 | + | |
13 | 14 | | |
14 | 15 | | |
15 | 16 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
16 | 16 | | |
17 | 17 | | |
18 | 18 | | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
19 | 28 | | |
20 | 29 | | |
21 | 30 | | |
| |||
0 commit comments