feat(proxy): bound ML pipeline concurrency#241
Merged
Conversation
Pre-request ML detection is CPU-bound (ensemble + jailbreak + fusion). Without a cap, a flood of parallel requests in a single pod saturates the CPU and degrades every concurrent request uniformly. Bound the path with a per-pod tokio semaphore sized from `ml_pipeline.max_concurrent_requests` (default 8, env override `LLMTRACE_ML_MAX_CONCURRENT`). Saturation returns 503 with `Retry-After: 1` via `try_acquire` — fast, queue-free backpressure rather than every in-flight request stalling. Adds gauge `llmtrace_ml_inflight_requests` (permits-in-use) and counter `llmtrace_ml_rejected_total` to /metrics. Integration test `ml_pipeline_semaphore_rejects_excess_concurrent_requests` fires N+1 concurrent requests against a real proxy with a slow `SecurityAnalyzer` impl and asserts exactly 1 rejection with `Retry-After: 1` and exactly N successful 200s.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
tokio::sync::SemaphoreonAppState. Sized fromcfg.ml_pipeline.max_concurrent_requests(default 8, env overrideLLMTRACE_ML_MAX_CONCURRENT).try_acquireon the synchronous pre-request enforcement block; on failure, immediately respond503 Service UnavailablewithRetry-After: 1. No queueing. Why Option A: the brief recommends it as the default, and the codebase's other rejection paths (rate limiting, cost caps) follow the same "fast reject, let the client retry" pattern. Queueing would mask saturation as latency and is harder to alert on.llmtrace_ml_inflight_requests(permits-in-use) and counterllmtrace_ml_rejected_totalto/metrics, following the existingMetricsregistration pattern.run_security_analysisin the spawned task) is left unbounded — it runs after the upstream response has been streamed to the client and does not block client latency. Caveat documented below.Files changed
crates/llmtrace-core/src/lib.rs— newMlPipelineConfigstruct, wired intoProxyConfig.crates/llmtrace-proxy/src/config.rs— env override + validation forLLMTRACE_ML_MAX_CONCURRENT.crates/llmtrace-proxy/src/metrics.rs— new gauge + counter, registered and exposed.crates/llmtrace-proxy/src/proxy.rs—AppState::ml_pipeline_semaphorefield,try_acquirearound enforcement,ml_saturated_responsehelper, module-level doc update.crates/llmtrace-proxy/src/main.rs— semaphore construction from config.crates/llmtrace-proxy/src/{api,auth,compliance,feature_flags_api,grpc,otel,tenant_api}.rs—AppStatetest-fixture literals updated.crates/llmtrace-proxy/tests/integration_test.rs— new test plus a realSlowAnalyzerSecurityAnalyzerimpl (NOT a mock — it implements the trait fully, just sleeps inanalyze_request).deployments/basilica/README.md— operator-facing note on the new env var and metrics.Saturation strategy + behavior under load
if cfg.enable_security_analysis { ... }block and released right afteraction_router.execute_inline(...)and before the upstream HTTP forward. This bounds only the CPU-bound ML/regex enforcement window; the upstream HTTP call and the response-streaming path are not counted against the cap.try_acquire_ownedis used so the permit is'static-compatible if needed; on saturation we incrementllmtrace_ml_rejected_total, decrementactive_connections, log awarn!, and return 503 withRetry-After: 1.Validation
The new test
ml_pipeline_semaphore_rejects_excess_concurrent_requestsbuilds a real proxy withSlowAnalyzer(sleeps 250 ms inanalyze_request), sets the cap to 3, fires 4 concurrent HTTP requests through a liveaxum::servelistener, and asserts:503 Service Unavailable,Retry-After: 1,200 OK,llmtrace_ml_rejected_total == 1,llmtrace_ml_inflight_requestsdrains to 0 after the admitted requests release their permit.Caveats / unvalidated
run_security_analysisin the spawned task) is intentionally not bounded by the semaphore in this PR. Bounding it would either add latency (waiting for a permit after the client has already received the response) or skip analysis. Left for a follow-up if production CPU profiles show this background path is a contributor.try_acquirestrategy means the cap is "hard" — a brief burst above the cap will produce 503s rather than smoothing latency. Operators tuningLLMTRACE_ML_MAX_CONCURRENTshould watch the newllmtrace_ml_rejected_totaland size the cap to match the pod's CPU budget; a sustained rejection rate is the alert signal.