Commit 74e9706

Merge pull request #122 from nearai/feat/gpu-evidence-delegate-proxy
feat: delegate GPU evidence to a sibling proxy (host-level NVML serialization)
2 parents: 1ec9456 + 4e7e70e

12 files changed

Lines changed: 702 additions & 8 deletions

CLAUDE.md

Lines changed: 7 additions & 0 deletions
@@ -12,6 +12,13 @@ cargo fmt # Format code
 
 No special env vars needed for tests — integration tests use wiremock and fixed signing keys.
 
+For changes that touch NVML, the libnvat SDK FFI, dstack TDX, or
+proxy-to-proxy contracts (e.g. `/internal/gpu_evidence`), `cargo test`
+isn't enough — see [docs/testing-on-cvm.md](docs/testing-on-cvm.md)
+for the real-CVM smoke-test recipe (build a branch image with
+`gh workflow run build.yml --ref <branch>`, deploy a 2-proxy stack
+inside a gpu0X CVM, probe both happy and leader-down paths).
+
 ## Architecture
 
 This is a Rust rewrite of [nearai/vllm-proxy](https://github.com/nearai/vllm-proxy). It proxies OpenAI-compatible API requests to a vLLM/sglang backend, adding cryptographic signing and TEE attestation.

benches/e2e.rs

Lines changed: 2 additions & 0 deletions
@@ -84,6 +84,8 @@ fn build_test_app(mock_url: &str) -> axum::Router {
         ohttp_enabled: false,
         listen_port: 8000,
         dstack_socket_path: "/var/run/dstack.sock".to_string(),
+        gpu_evidence_delegate_url: None,
+        gpu_evidence_delegate_timeout_secs: 30,
     };
 
     let ecdsa = signing::EcdsaContext::from_key_bytes(&ECDSA_KEY).unwrap();

docs/testing-on-cvm.md

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
+# Testing inference-proxy on a real CVM
+
+`cargo test` covers the unit + integration suites with wiremock-mocked
+upstreams. Some changes — anything that touches NVML, dstack TDX, or
+the SDK FFI — need a real CVM to validate. This doc is the recipe.
+
+## When to do this
+
+- Changes to `attestation.rs` GPU evidence dispatch
+- Changes to the libnvat SDK call (`attestation_sdk.rs`) or its mutex
+- New env vars that gate which evidence path is taken
+- Anything that adds a new HTTP endpoint to a proxy-to-proxy contract
+  (e.g. `/internal/gpu_evidence`)
+
+If you're only changing pure-Rust logic with no FFI / dstack / NVIDIA
+surface, `cargo test` is enough.
+
+## Where to run
+
+A spare GPU CVM with `USE_NV_ATTESTATION_SDK=true` and a working
+`/var/run/dstack.sock`. As of 2026-05-08 that's any `gpu0X` host. Pick
+one that isn't load-bearing — gpu07 is the usual canary. Tester is
+responsible for not disturbing whatever production model is already
+running on the host.
+
+## Build a branch image
+
+The `Build & Deploy` workflow only auto-fires on `main` and tags. For
+branch testing, dispatch it manually:
+
+```bash
+gh workflow run build.yml --ref <branch> --repo nearai/inference-proxy
+gh run list --workflow=build.yml --branch <branch> --repo nearai/inference-proxy --limit 1
+```
+
+It tags `:dev` (shared with all non-main branches — pin by digest in
+your test compose, not by tag) and prints the digest in the run log
+(`IMAGE_DIGEST: sha256:...`).
+
+## 2-proxy delegate smoke test
+
+Validates `GPU_EVIDENCE_DELEGATE_URL` end-to-end: one leader proxy
+owns NVML, the other delegates. Created for [PR #122][pr122].
+
+[pr122]: https://github.com/nearai/inference-proxy/pull/122
+
+### Compose file
+
+```yaml
+# test-delegate.yaml
+x-nvidia: &nvidia
+  runtime: nvidia
+  ipc: host
+  deploy:
+    resources:
+      reservations:
+        devices:
+          - driver: nvidia
+            count: all
+            capabilities: [gpu]
+
+services:
+  delegate-leader:
+    <<: *nvidia
+    image: ${PROXY_IMAGE}
+    container_name: delegate-leader
+    user: root
+    privileged: true
+    ports:
+      - "127.0.0.1:18001:8000" # CVM-loopback only, no host exposure
+    volumes:
+      - /var/run/dstack.sock:/var/run/dstack.sock
+    environment:
+      - MODEL_NAME=zai-org/GLM-5-FP8
+      - TOKEN=${PROXY_TOKEN}
+      - VLLM_BASE_URL=http://glm:8000
+      - USE_NV_ATTESTATION_SDK=true
+      - LOG_FORMAT=json
+      - OPENAI_CHAT_COMPATIBILITY_CHECK=false # don't gate on upstream
+    restart: "no"
+
+  delegate-follower:
+    <<: *nvidia
+    image: ${PROXY_IMAGE}
+    container_name: delegate-follower
+    user: root
+    privileged: true
+    ports:
+      - "127.0.0.1:18002:8000"
+    volumes:
+      - /var/run/dstack.sock:/var/run/dstack.sock
+    environment:
+      - MODEL_NAME=zai-org/GLM-5-FP8
+      - TOKEN=${PROXY_TOKEN}
+      - VLLM_BASE_URL=http://glm:8000
+      - USE_NV_ATTESTATION_SDK=true
+      - GPU_EVIDENCE_DELEGATE_URL=http://delegate-leader:8000
+      - LOG_FORMAT=json
+      - OPENAI_CHAT_COMPATIBILITY_CHECK=false
+    depends_on:
+      - delegate-leader
+    restart: "no"
+```
+
+`MODEL_NAME` is just a label here — neither proxy serves real
+inference in this test, so set it to whatever the running model on
+the host is so logs aren't confusing. `VLLM_BASE_URL` only matters if
+you flip `OPENAI_CHAT_COMPATIBILITY_CHECK=true`.
+
+### Deploy and probe
+
+CVM access on gpu0X is via host jump: `ssh gpuNN` then
+`ssh -p 10022 root@localhost`. The CVM's `/tmp` is writable; `/root`
+is not.
+
+```bash
+# scp the file in (two-hop)
+scp test-delegate.yaml gpu07:/tmp/
+ssh gpu07 'scp -P 10022 /tmp/test-delegate.yaml root@localhost:/tmp/'
+
+# run inside the CVM
+ssh gpu07 'ssh -p 10022 root@localhost' <<'CVM'
+mkdir -p /tmp/deltest && cd /tmp/deltest && mv /tmp/test-delegate.yaml .
+PROXY_IMAGE='nearaidev/vllm-proxy-rs@sha256:<digest from build run>' \
+PROXY_TOKEN=delegate-test-token-1234 \
+docker compose -f test-delegate.yaml -p deltest up -d
+CVM
+```
+
+### What to verify
+
+```bash
+# happy path — fresh nonce, leader up
+NONCE=$(openssl rand -hex 32)
+curl -w "code=%{http_code} t=%{time_total}\n" -o /tmp/r.json \
+  "http://127.0.0.1:18002/v1/attestation/report?signing_algo=ed25519&nonce=$NONCE"
+# expect: 200, ~290 KB body, request_nonce matches
+
+# loop-guard / dependency proof — fresh nonce, leader DOWN
+docker stop delegate-leader
+NONCE=$(openssl rand -hex 32)
+curl -w "code=%{http_code} t=%{time_total}\n" \
+  "http://127.0.0.1:18002/v1/attestation/report?signing_algo=ed25519&nonce=$NONCE"
+# expect: 500
+# follower logs: "delegate request to http://delegate-leader:8000/internal/gpu_evidence failed"
+
+# isolation check — leader's logs should have all libnvat output
+docker logs delegate-leader 2>&1 | grep '\[nvat\]' | head # many lines
+docker logs delegate-follower 2>&1 | grep '\[nvat\]' | head # zero lines
+```
+
+### Tear down
+
+```bash
+docker compose -f test-delegate.yaml -p deltest down -v
+rm -rf /tmp/deltest
+```
+
+Then on the host: confirm `docker ps --filter name=delegate` is empty
+and the production model (e.g. `glm51`, `qwen3-vl`) shows
+`RestartCount=0` in `docker inspect`.
+
+## CVM gotchas
+
+- The dstack OS is busybox, not Ubuntu. `head -c`, `head -3` etc.
+  don't work — use `dd if=… bs=N count=1` for byte-cap reads.
+- `/root` is read-only at SSH level; use `/tmp/<subdir>` for any test
+  artifacts.
+- `python3` is absent — use `jq` for JSON inspection.
+- `--gpus all` is fine for read-only NVML access; you don't need to
+  unplug the running model. The whole point of PR #122 is that
+  multiple proxies CAN share GPUs as long as only one talks to NVML.
+- The `:dev` image tag is shared across all non-main branches. **Pin
+  by digest** in test compose files so a parallel branch build can't
+  swap the image under you.

src/attestation.rs

Lines changed: 58 additions & 8 deletions
@@ -433,6 +433,15 @@ pub struct ComposeManagerConfig {
     pub url: String,
 }
 
+/// Owned-lifetime version of `DelegateContext` used by the background
+/// cache refresh task (which doesn't have access to the request-scoped
+/// `&Config`/`&Client`). Holds a clone of the `reqwest::Client` and an
+/// `Arc<Config>` so the spawned task is `'static`.
+pub struct DelegateRefreshConfig {
+    pub config: Arc<crate::config::Config>,
+    pub http_client: reqwest::Client,
+}
+
 /// Build OHTTP attestation payload for the process-wide OHTTP gateway config.
 pub fn build_ohttp_attestation(
     signing: &crate::signing::SigningPair,
@@ -462,6 +471,7 @@ pub fn spawn_cache_refresh_task(
     refresh_interval_secs: u64,
     compose_manager: Option<ComposeManagerConfig>,
     ohttp_attestation_ed25519: Option<crate::types::OhttpAttestation>,
+    delegate_refresh: Option<DelegateRefreshConfig>,
 ) {
     tokio::spawn(async move {
         // Initial delay to let the server start up.
@@ -492,6 +502,10 @@ pub fn spawn_cache_refresh_task(
 
             // Refresh without TLS fingerprint (most common).
             // GPU evidence serialization is handled by the worker Mutex.
+            let delegate_ctx = delegate_refresh.as_ref().map(|d| DelegateContext {
+                config: &d.config,
+                http_client: &d.http_client,
+            });
             match generate_attestation_inner(
                 AttestationParams {
                     model_name: &model_name,
@@ -504,6 +518,7 @@ pub fn spawn_cache_refresh_task(
                     tls_cert_fingerprint: None,
                 },
                 Some(&cache),
+                delegate_ctx.as_ref(),
             )
             .await
             {
@@ -526,6 +541,10 @@ pub fn spawn_cache_refresh_task(
 
             // Also refresh with TLS fingerprint if configured.
             if let Some(ref fp) = tls_cert_fingerprint {
+                let delegate_ctx = delegate_refresh.as_ref().map(|d| DelegateContext {
+                    config: &d.config,
+                    http_client: &d.http_client,
+                });
                 match generate_attestation_inner(
                     AttestationParams {
                         model_name: &model_name,
@@ -538,6 +557,7 @@ pub fn spawn_cache_refresh_task(
                         tls_cert_fingerprint: Some(fp.as_str()),
                     },
                     Some(&cache),
+                    delegate_ctx.as_ref(),
                 )
                 .await
                 {
@@ -828,6 +848,17 @@ pub struct AttestationParams<'a> {
     pub tls_cert_fingerprint: Option<&'a str>,
 }
 
+/// Context the delegate-dispatch path needs at the call site.
+///
+/// Carries the resolved `Config` (for the delegate URL/timeout/auth
+/// token) and the shared `reqwest::Client` we use across the proxy.
+/// Lifetime-bound to the caller's `AppState` so we don't clone the
+/// client per request.
+pub struct DelegateContext<'a> {
+    pub config: &'a crate::config::Config,
+    pub http_client: &'a reqwest::Client,
+}
+
 /// Maximum attempts for `collect_gpu_evidence_with_nonce_check`.
 ///
 /// 4 attempts (1 initial + 3 retries) with exponential backoff between
@@ -954,11 +985,12 @@ fn check_evidence_nonce_binding(
 /// Failures (transport errors, repeated nonce mismatches) bubble up so
 /// cloud-api can rotate to a different backend instead of submitting
 /// known-bad evidence to NRAS.
-async fn collect_gpu_evidence_with_nonce_check(
+pub(crate) async fn collect_gpu_evidence_with_nonce_check(
     nonce_hex: &str,
     nonce_bytes: &[u8; 32],
     gpu_no_hw_mode: bool,
     cache: Option<&AttestationCache>,
+    delegate_ctx: Option<&DelegateContext<'_>>,
 ) -> anyhow::Result<serde_json::Value> {
     let mut last_failure: Option<NonceMismatch> = None;
 
@@ -969,13 +1001,28 @@ async fn collect_gpu_evidence_with_nonce_check(
             tokio::time::sleep(std::time::Duration::from_millis(delay_ms)).await;
         }
 
-        // Three backends, in priority order:
-        // 1. nv-attestation-sdk (Rust → C FFI, opt-in via env var)
-        // 2. cache's persistent Python worker (existing default)
-        // 3. one-shot Python subprocess (fallback when no cache)
+        // Four backends, in priority order:
+        // 1. delegate proxy (HTTP, opt-in via GPU_EVIDENCE_DELEGATE_URL)
+        // — used to serialize NVML across multiple proxies sharing a
+        // host. Only the delegate touches local NVML.
+        // 2. nv-attestation-sdk (Rust → C FFI, opt-in via env var)
+        // 3. cache's persistent Python worker (existing default)
+        // 4. one-shot Python subprocess (fallback when no cache)
         // The self-check + retry below applies regardless of which one
-        // produced the evidence.
-        let evidence = if crate::attestation_sdk::is_active() && !gpu_no_hw_mode {
+        // produced the evidence — including evidence returned by the
+        // delegate (defense in depth, plus catches the rare "delegate
+        // returned 200 but with bytes from a different request").
+        let evidence = if let Some(dctx) = delegate_ctx.filter(|_| !gpu_no_hw_mode) {
+            // Delegate path. `gpu_no_hw_mode` doesn't make sense across
+            // an HTTP hop; fall through to local paths if it's set.
+            crate::gpu_evidence_delegate::collect_via_delegate(
+                dctx.config,
+                dctx.http_client,
+                nonce_hex,
+                gpu_no_hw_mode,
+            )
+            .await?
+        } else if crate::attestation_sdk::is_active() && !gpu_no_hw_mode {
             // SDK path doesn't support no_gpu_mode (it requires real
             // hardware via NVML); fall back to the Python paths for
             // dev/test environments without GPUs.
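
The priority order in the comment above can be read as a small decision function. The following is a minimal standalone sketch, not code from this commit: the `Backend` enum and `select_backend` function are hypothetical names, and the split between backends 3 and 4 on cache presence is inferred from the comment's wording ("fallback when no cache") rather than from code visible in this hunk.

```rust
/// Which evidence backend a single collection attempt would use.
/// Hypothetical enum for illustration; the commit branches inline.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Backend {
    Delegate,
    Sdk,
    CachedPythonWorker,
    OneShotPython,
}

/// Mirrors the four-step priority order from the diff comment.
/// `delegate_configured` stands in for `delegate_ctx.is_some()` and
/// `sdk_active` for `crate::attestation_sdk::is_active()`.
fn select_backend(
    delegate_configured: bool,
    sdk_active: bool,
    gpu_no_hw_mode: bool,
    has_cache: bool,
) -> Backend {
    if delegate_configured && !gpu_no_hw_mode {
        // 1. Forward to the sibling proxy; only it touches NVML.
        Backend::Delegate
    } else if sdk_active && !gpu_no_hw_mode {
        // 2. Local SDK path, which needs real hardware.
        Backend::Sdk
    } else if has_cache {
        // 3. Persistent Python worker owned by the cache (assumed condition).
        Backend::CachedPythonWorker
    } else {
        // 4. One-shot Python subprocess.
        Backend::OneShotPython
    }
}
```

Under these assumptions, setting `gpu_no_hw_mode` skips both the delegate and the SDK even when both are configured, which matches the `delegate_ctx.filter(|_| !gpu_no_hw_mode)` guard in the hunk.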
@@ -1030,6 +1077,7 @@ async fn collect_gpu_evidence_with_nonce_check(
 async fn generate_attestation_inner(
     params: AttestationParams<'_>,
     cache: Option<&AttestationCache>,
+    delegate_ctx: Option<&DelegateContext<'_>>,
 ) -> Result<AttestationReport, AttestationError> {
     let nonce_bytes = parse_nonce(params.nonce)?;
     let nonce_hex = hex::encode(nonce_bytes);
@@ -1068,6 +1116,7 @@ async fn generate_attestation_inner(
             &nonce_bytes_for_verify,
             gpu_no_hw_mode,
             cache,
+            delegate_ctx,
         )
         .await
         .map_err(AttestationError::Internal)
@@ -1125,6 +1174,7 @@ pub enum AttestationResult {
 pub async fn generate_attestation(
     params: AttestationParams<'_>,
     cache: Option<&AttestationCache>,
+    delegate_ctx: Option<&DelegateContext<'_>>,
 ) -> Result<AttestationResult, AttestationError> {
     let is_nonceless = params.nonce.is_none();
     let include_tls = params.tls_cert_fingerprint.is_some();
@@ -1141,7 +1191,7 @@ pub async fn generate_attestation(
 
     // Generate fresh report. GPU evidence is serialized by the worker Mutex,
     // but TDX quote runs concurrently.
-    let report = generate_attestation_inner(params, cache).await?;
+    let report = generate_attestation_inner(params, cache, delegate_ctx).await?;
     // Don't cache here — the caller (route handler) caches after fetching
     // compose-manager attestation so cached responses include the full chain.
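
The cache refresh hunks earlier in this file pair an owned `DelegateRefreshConfig` with a borrowed `DelegateContext<'a>`. Below is a minimal sketch of that owned-to-borrowed conversion using only the standard library. The `Config` and `Client` types here are stand-ins (assumptions replacing `crate::config::Config` and `reqwest::Client`), and `borrow_context` is a hypothetical helper; the commit performs the same mapping inline with `as_ref().map(...)`.

```rust
use std::sync::Arc;

/// Stand-in for `crate::config::Config` (assumption for this sketch).
struct Config {
    delegate_url: Option<String>,
}

/// Stand-in for `reqwest::Client` (assumption for this sketch).
struct Client;

/// Owned version: can be moved into a `'static` spawned task.
struct DelegateRefreshConfig {
    config: Arc<Config>,
    http_client: Client,
}

/// Borrowed version: what the dispatch path consumes per call.
struct DelegateContext<'a> {
    config: &'a Config,
    http_client: &'a Client,
}

/// Hypothetical helper: borrow a context from the owned config
/// without cloning the `Arc` or the client.
fn borrow_context(owned: Option<&DelegateRefreshConfig>) -> Option<DelegateContext<'_>> {
    owned.map(|d| DelegateContext {
        // `&d.config` is `&Arc<Config>`; deref coercion yields `&Config`.
        config: &d.config,
        http_client: &d.http_client,
    })
}
```

In this sketch the borrow is created per refresh iteration, so the task keeps ownership of the `Arc` and the reference count is left unchanged.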

src/config.rs

Lines changed: 23 additions & 0 deletions
@@ -93,6 +93,23 @@ pub struct Config {
     // Compose-manager attestation (deployment actions attestation)
     pub compose_manager_url: Option<String>,
 
+    // GPU evidence delegation (host-level NVML serialization)
+    /// HTTP base URL of another inference-proxy on the same host that
+    /// owns NVML evidence collection (e.g. `http://vllm-proxy-leader:8000`).
+    /// When set, this proxy forwards GPU evidence requests to the
+    /// delegate's `POST /internal/gpu_evidence` endpoint instead of
+    /// calling NVML locally. The intent is to serialize NVML access
+    /// across the *host*, not just within one process — multiple
+    /// inference-proxy instances sharing the same physical GPUs were
+    /// observed to race at the firmware level (see #107). When unset,
+    /// the proxy collects evidence locally via the SDK or Python path.
+    pub gpu_evidence_delegate_url: Option<String>,
+    /// Per-attempt timeout for the delegate HTTP call. Default 30s —
+    /// the delegate's own evidence collection plus its NVML wait
+    /// dominates this; we want enough headroom to not surface as
+    /// timeouts under contended load.
+    pub gpu_evidence_delegate_timeout_secs: u64,
+
     // OpenAI Chat Compatibility Checks
     // Validates that hosted models (qwen, glm, etc.) send OpenAI-compliant responses:
     // - /v1/models API format
@@ -249,6 +266,12 @@ impl Config {
                 as u64,
             cloud_api_auth_timeout_secs: env_int("CLOUD_API_AUTH_TIMEOUT_SECS", 5) as u64,
             compose_manager_url,
+            gpu_evidence_delegate_url: env::var("GPU_EVIDENCE_DELEGATE_URL")
+                .ok()
+                .filter(|s| !s.is_empty())
+                .map(|s| s.trim_end_matches('/').to_string()),
+            gpu_evidence_delegate_timeout_secs: env_int("GPU_EVIDENCE_DELEGATE_TIMEOUT_SECS", 30)
+                as u64,
             tls_cert_path,
             max_keepalive: env_int("VLLM_PROXY_MAX_KEEPALIVE", 100),
             pool_idle_timeout_secs: env_int("VLLM_PROXY_POOL_IDLE_TIMEOUT_SECS", 60) as u64,
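
The `GPU_EVIDENCE_DELEGATE_URL` parsing above can be exercised in isolation. This is a minimal sketch with the environment read factored out so the normalization is testable. `normalize_delegate_url` and `evidence_endpoint` are hypothetical names that do not appear in the commit, and the way the endpoint path is joined to the base URL is an assumption based on the log line quoted in docs/testing-on-cvm.md.

```rust
/// Same chain as the diff, applied to an already-read value:
/// drop empty strings, then strip any trailing slashes.
fn normalize_delegate_url(raw: Option<String>) -> Option<String> {
    raw.filter(|s| !s.is_empty())
        .map(|s| s.trim_end_matches('/').to_string())
}

/// Assumed join of base URL and endpoint path (hypothetical helper).
fn evidence_endpoint(base: &str) -> String {
    format!("{base}/internal/gpu_evidence")
}
```

One property worth noting: the emptiness filter runs before the trim, so a value consisting only of `/` passes the filter and normalizes to an empty string rather than `None`.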
