
Commit ca6cc73

feat: rotation-SNI discovery + rapid-eviction pin set (#603)
* feat: rotation-SNI discovery + rapid-eviction pin set

Cumulative discovery used to stall on multi-backend models because the proxy's least-connections LB collapses fresh-TCP probes onto a stable subset of backends. We worked around this with parallelism (5 calls per new provider, 2 per refresh cycle) and inter-model staggering, but the shape was fundamentally O(luck): some replicas kept missing some backends forever and got TLS-handshake-rejected when the LB later routed them there. The customer-visible symptom is ~42% of /v1/attestation/report calls for GLM-5.1-FP8 failing with "error sending request".

model-proxy PR #27 published a deterministic routing knob: rotation SNI '<canonical>-i<N>.<base>' routes to 'healthy_backends_sorted[N % healthy]', and GET /backends/count?domain=<host> reports the current healthy count. This PR rewrites discover_model on top of those two pieces.

Per-cycle flow:
- Fetch the healthy backend count from /backends/count.
- Fan out one fresh-TCP attestation call per backend index, in parallel, no stagger. Each call lands on a distinct backend by construction, so per-backend GPU evidence pressure per cycle is exactly one attestation regardless of how many models refresh together.
- Apply the cycle's verified fingerprints to the shared pin set according to apply_pin_update():
  * Complete coverage (no failures, verify_failures == 0, distinct observed fingerprints == backend_count): REPLACE the pin set with the observed set. A backend that just went unhealthy or had its cert rotated drops out of the pin set within one refresh interval (rapid eviction).
  * Anything less: additive merge. A transient hiccup never evicts verified fingerprints we just couldn't reconfirm.

Eliminates:
- ATTESTATION_DISCOVERY_PARALLELISM (was 5)
- CUMULATIVE_DISCOVERY_CALLS (was 2)
- STAGGER_MS (intra-model, was 200)
- MODEL_DISCOVERY_STAGGER_MS (inter-model, was 2_000)

discover_model loses the num_calls parameter. Both call sites (the new-provider phase in load_inference_url_models and the cumulative refresh path) become identical.

DiscoveryOutcome gains:
- backend_count: healthy count from /backends/count this cycle, 0 if the fetch failed (failure_reasons[0] then carries the reason).
- replaced_state: true iff this cycle achieved complete coverage and the pin set was wholesale replaced rather than additively merged.

Both fields are surfaced on the existing INFO log lines (initial discovery, cumulative expansion, cumulative no-new-fingerprints) for DD-side observability.

URL handling derives the base domain by stripping the leftmost DNS label of the inference URL host. Works for every URL we have today ('*.completions{,-stg}.near.ai'); URLs that don't fit (one-label hosts, IP literals) return an empty outcome with a 'url_parse:' failure reason and the existing fail-closed path handles eviction.

Tests:
- spki_verifier: replace_with state transitions (Bootstrap->Pinned, Pinned shrink, Blocked->Pinned recovery, empty set).
- rotation: 10 URL-helper tests covering canonicals with internal dashes, case insensitivity, port preservation, IP/one-label rejection, count-URL shape.
- inference_provider_pool: 8 apply_pin_update policy tests covering steady state, eviction on shrinking count, partial-cycle additive preservation, duplicate-observation safety, verify_failure blocking replacement, zero-count safety, bootstrap first cycle.

Followup #600: rotation SNI for chat-completion bucket pre-warm.

* review: distinguish count_zero, cap fan-out, redact reqwest URLs

Address bot review feedback on #603:
- count_zero vs count-fetch-failure are now distinguishable in failure_reasons. Previously both rendered as empty / generic count_*:; now Ok(0) records 'count_zero: proxy reports 0 healthy backends' explicitly.
- Sanity-cap rotation fan-out at 256 backends per model per cycle. A bogus registry reading (race during deploy, partial split) would otherwise spawn an unbounded number of fresh-TCP TLS handshakes. Hitting the cap is logged and recorded in failure_reasons.
- Strip the request URL from every reqwest error in failure_reasons via Error::without_url(). The URLs embed our random per-call nonce, which would otherwise create unbounded label cardinality in DD when any reqwest error path fires. Full error stays available at DEBUG via the existing debug! lines.
- pin_update_verify_failure_blocks_replacement test now uses an input shape that the production caller can actually produce (backend_count=4, verified=3, verify_failures=1). The policy assertion is unchanged.
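The rotation-SNI naming scheme and the base-domain derivation described in the commit message can be illustrated with a small sketch. This is not the code from the PR: the function names `split_host`, `rotation_sni`, and `routed_backend` and the example host are hypothetical, port handling is omitted, and `routed_backend` only mirrors the proxy-side rule `healthy_backends_sorted[N % healthy]` as quoted above.

```rust
use std::net::IpAddr;

/// Splits a host into (canonical label, base domain) by stripping the
/// leftmost DNS label. Hypothetical helper; rejects IP literals and
/// one-label hosts with a `url_parse:` reason, as the message describes.
pub fn split_host(host: &str) -> Result<(String, String), String> {
    // Hostnames are case-insensitive, so normalise before splitting.
    let host = host.to_ascii_lowercase();
    if host.parse::<IpAddr>().is_ok() {
        return Err(format!("url_parse: IP literal host '{}'", host));
    }
    match host.split_once('.') {
        Some((canonical, base)) if !canonical.is_empty() && !base.is_empty() => {
            Ok((canonical.to_string(), base.to_string()))
        }
        _ => Err(format!("url_parse: host '{}' has no base domain", host)),
    }
}

/// Builds the rotation SNI `<canonical>-i<N>.<base>` for backend index `index`.
pub fn rotation_sni(host: &str, index: usize) -> Result<String, String> {
    let (canonical, base) = split_host(host)?;
    Ok(format!("{}-i{}.{}", canonical, index, base))
}

/// Mirrors the routing rule quoted in the message: index N selects
/// `healthy_backends_sorted[N % healthy]`. Returns None when nothing is healthy.
pub fn routed_backend<'a>(healthy_sorted: &[&'a str], index: usize) -> Option<&'a str> {
    if healthy_sorted.is_empty() {
        None
    } else {
        Some(healthy_sorted[index % healthy_sorted.len()])
    }
}
```

Under these assumptions, fanning out indices `0..backend_count` yields one SNI per healthy backend, which is why each call lands on a distinct backend within a cycle.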
1 parent 5a81e72 commit ca6cc73

4 files changed

Lines changed: 768 additions & 162 deletions


crates/inference_providers/src/spki_verifier.rs

Lines changed: 79 additions & 0 deletions
@@ -57,6 +57,24 @@ impl FingerprintState {
         // Don't block if already Pinned — keep existing verified fingerprints
     }
 
+    /// Replace the pinned set wholesale.
+    ///
+    /// Called once per discovery cycle when the cycle achieved complete
+    /// coverage (every healthy backend produced exactly one verified
+    /// fingerprint). Lets the pin set track the *current* healthy set rather
+    /// than accumulating every backend the proxy ever routed to — when a
+    /// backend goes unhealthy or its cert rotates, its old fingerprint is
+    /// dropped within one refresh interval.
+    ///
+    /// Transitions Bootstrap → Pinned and Blocked → Pinned, matching
+    /// `add_fingerprint`. An empty `fps` is permitted; callers treat that as
+    /// "no healthy backends right now" and the provider-level fail-closed
+    /// path keeps connections rejected until a future cycle re-pins
+    /// something.
+    pub fn replace_with(&mut self, fps: HashSet<String>) {
+        *self = FingerprintState::Pinned(fps);
+    }
+
     /// Number of pinned fingerprints (0 for Bootstrap/Blocked).
     pub fn pinned_count(&self) -> usize {
         match self {
@@ -266,4 +284,65 @@ mod tests {
         assert!(matches!(state, FingerprintState::Pinned(_)));
         assert_eq!(state.pinned_count(), 1);
     }
+
+    #[test]
+    fn test_replace_with_from_bootstrap() {
+        let mut state = FingerprintState::Bootstrap;
+        let mut fps = HashSet::new();
+        fps.insert("a".to_string());
+        fps.insert("b".to_string());
+        state.replace_with(fps);
+        assert!(matches!(state, FingerprintState::Pinned(_)));
+        assert_eq!(state.pinned_count(), 2);
+    }
+
+    #[test]
+    fn test_replace_with_shrinks_pinned() {
+        let mut state = FingerprintState::Bootstrap;
+        for fp in ["a", "b", "c", "d", "e"] {
+            state.add_fingerprint(fp.to_string());
+        }
+        assert_eq!(state.pinned_count(), 5);
+
+        // Backend went away — pin set tracks the new healthy set.
+        let mut shrunk = HashSet::new();
+        shrunk.insert("a".to_string());
+        shrunk.insert("b".to_string());
+        shrunk.insert("c".to_string());
+        shrunk.insert("d".to_string());
+        state.replace_with(shrunk);
+        assert_eq!(state.pinned_count(), 4);
+        if let FingerprintState::Pinned(set) = &state {
+            assert!(set.contains("a"));
+            assert!(!set.contains("e"), "evicted fingerprint must be gone");
+        } else {
+            panic!("expected Pinned");
+        }
+    }
+
+    #[test]
+    fn test_replace_with_from_blocked() {
+        // Blocked → Pinned mirrors add_fingerprint's recovery path.
+        let mut state = FingerprintState::Bootstrap;
+        state.block();
+        assert!(matches!(state, FingerprintState::Blocked));
+
+        let mut fps = HashSet::new();
+        fps.insert("recovered".to_string());
+        state.replace_with(fps);
+        assert!(matches!(state, FingerprintState::Pinned(_)));
+        assert_eq!(state.pinned_count(), 1);
+    }
+
+    #[test]
+    fn test_replace_with_empty_set_is_permitted() {
+        // Caller may pass an empty set to express "no healthy backends".
+        // The provider-level fail-closed path is responsible for rejecting
+        // connections; FingerprintState just stores the (empty) Pinned set.
+        let mut state = FingerprintState::Bootstrap;
+        state.add_fingerprint("a".to_string());
+        state.replace_with(HashSet::new());
+        assert!(matches!(state, FingerprintState::Pinned(_)));
+        assert_eq!(state.pinned_count(), 0);
+    }
 }
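The doc comment on `replace_with` says it is called only when a cycle achieved complete coverage. A minimal sketch of that caller-side decision, following the replace-versus-merge policy stated in the commit message, might look like the following. It is not the `apply_pin_update` from this PR: `CycleResult` and `apply_pin_update_sketch` are hypothetical names, the pin set is modelled as a plain `HashSet<String>` instead of `FingerprintState`, and the `backend_count > 0` guard is an assumption inferred from the "zero-count safety" test mentioned in the message.

```rust
use std::collections::HashSet;

/// Outcome of one discovery cycle (hypothetical shape, for illustration only).
pub struct CycleResult {
    /// Healthy backend count reported by the proxy for this cycle.
    pub backend_count: usize,
    /// Fingerprints that passed attestation verification (may contain duplicates).
    pub verified: Vec<String>,
    /// Calls that failed before verification (e.g. transport errors).
    pub call_failures: usize,
    /// Calls whose attestation did not verify.
    pub verify_failures: usize,
}

/// Applies one cycle to the pin set. Returns true if the set was replaced
/// wholesale, false if the cycle was merged additively.
pub fn apply_pin_update_sketch(pins: &mut HashSet<String>, cycle: &CycleResult) -> bool {
    // Deduplicate: two calls observing the same fingerprint count once.
    let observed: HashSet<String> = cycle.verified.iter().cloned().collect();

    // Complete coverage: no failures of either kind, and exactly one distinct
    // verified fingerprint per healthy backend. The non-zero guard is an assumption.
    let complete = cycle.backend_count > 0
        && cycle.call_failures == 0
        && cycle.verify_failures == 0
        && observed.len() == cycle.backend_count;

    if complete {
        // Rapid eviction: anything not observed this cycle is dropped.
        *pins = observed;
        true
    } else {
        // Partial cycle: keep what was already verified, add what is new.
        pins.extend(observed);
        false
    }
}
```

With this shape, a cycle reporting `backend_count = 4`, three verified fingerprints, and one verify failure merges additively and evicts nothing, which matches the input shape the review commit describes for `pin_update_verify_failure_blocks_replacement`.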

crates/services/src/attestation/verification.rs

Lines changed: 0 additions & 35 deletions
@@ -8,41 +8,6 @@ use std::collections::HashSet;
 
 const NVIDIA_NRAS_URL: &str = "https://nras.attestation.nvidia.com/v3/attest/gpu";
 
-/// Number of parallel attestation calls per model to discover TLS fingerprints
-/// from multiple backends behind L4 load balancing.
-///
-/// Each cloud-api instance runs its own discovery, so the effective load on a
-/// model is `PARALLELISM * cloud-api instance count` per refresh cycle. Keep
-/// this modest to avoid piling attestation work on inference backends.
-pub const ATTESTATION_DISCOVERY_PARALLELISM: usize = 5;
-
-/// Number of cumulative attestation calls per reused provider on each refresh.
-///
-/// Each cycle adds a small number of fresh-TCP discovery calls to a reused
-/// provider, which accumulates new backend fingerprints into the shared
-/// `FingerprintState`. Over several cycles this covers every backend behind
-/// the L4 LB, even when the initial discovery only hit one. Kept small so
-/// steady-state refresh load stays low.
-pub const CUMULATIVE_DISCOVERY_CALLS: usize = 2;
-
-/// Inter-model stagger for cumulative discovery on each refresh cycle (milliseconds).
-///
-/// When the provider pool refreshes, it runs cumulative attestation discovery
-/// for every reused model. Without staggering, all models fire their first
-/// discovery call at t=0, creating a burst that saturates the GPU evidence
-/// worker on dense hosts (e.g. gpu04 runs 8+ model instances).
-///
-/// With this stagger, model i starts its discovery after `i * MODEL_DISCOVERY_STAGGER_MS`
-/// delay. At 2 s/model the burst is spread across tens of seconds rather than
-/// a single wall-clock instant, while still completing well within the 5-minute
-/// refresh interval even for large pools.
-///
-/// Note: the cumulative discovery loop runs inside `buffer_unordered(10)`, so
-/// tasks at index >= 10 begin their sleep only after a concurrency slot opens.
-/// Their effective wall-clock delay is therefore ≥ i × STAGGER_MS, making the
-/// spread more conservative (not less) for pools larger than 10 models.
-pub const MODEL_DISCOVERY_STAGGER_MS: u64 = 2_000;
-
 /// Result of verifying an attestation report from an inference backend.
 #[derive(Debug, Clone)]
 pub struct VerifiedAttestation {
