
feat(proxy): bound ML pipeline concurrency#241

Merged
epappas merged 1 commit into main from feat/proxy-ml-concurrency-cap on May 19, 2026


epappas commented May 19, 2026

Summary

  • Bounds intra-pod concurrency of the CPU-bound ML detection pipeline via a tokio::sync::Semaphore on AppState. Sized from cfg.ml_pipeline.max_concurrent_requests (default 8, env override LLMTRACE_ML_MAX_CONCURRENT).
  • Saturation strategy: Option A, try_acquire on the synchronous pre-request enforcement block; on failure, immediately respond 503 Service Unavailable with Retry-After: 1. No queueing. Why Option A: the brief recommends it as the default, and the codebase's other rejection paths (rate limiting, cost caps) follow the same "fast reject, let the client retry" pattern. Queueing would mask saturation as latency and is harder to alert on.
  • Adds Prometheus gauge llmtrace_ml_inflight_requests (permits-in-use) and counter llmtrace_ml_rejected_total to /metrics, following the existing Metrics registration pattern.
  • Scope: the cap is applied only to the synchronous, request-blocking pre-request enforcement path (where 503 is meaningful and prevents client-visible latency stalls). The post-upstream background analysis path (run_security_analysis in the spawned task) is left unbounded — it runs after the upstream response has been streamed to the client and does not block client latency. Caveat documented below.
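
A minimal sketch of the fast-reject admission described above, for readers who want the shape without opening the diff. It is a std-only stand-in, not the PR's code: the PR uses tokio::sync::Semaphore on AppState, whereas MlLimiter, Permit, Rejection, and admit here are hypothetical names, and the Rejection struct stands in for the real HTTP response. The active_connections decrement and the warn! log are left out.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;

/// Std-only stand-in for a counting semaphore with a non-blocking acquire.
/// (Hypothetical type; the PR itself uses tokio::sync::Semaphore.)
pub struct MlLimiter {
    max: usize,
    in_use: AtomicUsize, // plays the role of the in-flight gauge
    rejected: AtomicU64, // plays the role of the rejected counter
}

/// RAII permit: dropping it frees the slot.
pub struct Permit {
    limiter: Arc<MlLimiter>,
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.limiter.in_use.fetch_sub(1, Ordering::AcqRel);
    }
}

/// Hypothetical value standing in for the 503 response.
#[derive(Debug, PartialEq, Eq)]
pub struct Rejection {
    pub status: u16,
    pub retry_after_secs: u64,
}

impl MlLimiter {
    pub fn new(max: usize) -> Arc<Self> {
        Arc::new(Self {
            max,
            in_use: AtomicUsize::new(0),
            rejected: AtomicU64::new(0),
        })
    }

    /// Never blocks and never queues: either takes a slot or rejects.
    pub fn admit(self: &Arc<Self>) -> Result<Permit, Rejection> {
        // Atomically increment only while below the cap.
        let taken = self
            .in_use
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < self.max {
                    Some(n + 1)
                } else {
                    None
                }
            });
        match taken {
            Ok(_) => Ok(Permit {
                limiter: Arc::clone(self),
            }),
            Err(_) => {
                self.rejected.fetch_add(1, Ordering::Relaxed);
                Err(Rejection {
                    status: 503,
                    retry_after_secs: 1,
                })
            }
        }
    }

    pub fn inflight(&self) -> usize {
        self.in_use.load(Ordering::Acquire)
    }

    pub fn rejected_total(&self) -> u64 {
        self.rejected.load(Ordering::Relaxed)
    }
}
```

With a cap of 2, two calls to admit succeed, a third returns the 503-style rejection and bumps the counter, and dropping a permit makes a slot available again.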

Files changed

  • crates/llmtrace-core/src/lib.rs — new MlPipelineConfig struct, wired into ProxyConfig.
  • crates/llmtrace-proxy/src/config.rs — env override + validation for LLMTRACE_ML_MAX_CONCURRENT.
  • crates/llmtrace-proxy/src/metrics.rs — new gauge + counter, registered and exposed.
  • crates/llmtrace-proxy/src/proxy.rs — AppState::ml_pipeline_semaphore field, try_acquire around enforcement, ml_saturated_response helper, module-level doc update.
  • crates/llmtrace-proxy/src/main.rs — semaphore construction from config.
  • crates/llmtrace-proxy/src/{api,auth,compliance,feature_flags_api,grpc,otel,tenant_api}.rs — AppState test-fixture literals updated.
  • crates/llmtrace-proxy/tests/integration_test.rs — new test plus a real SlowAnalyzer SecurityAnalyzer impl (NOT a mock — it implements the trait fully, just sleeps in analyze_request).
  • deployments/basilica/README.md — operator-facing note on the new env var and metrics.
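
For operators, a rough sketch of how an env override with validation for LLMTRACE_ML_MAX_CONCURRENT could resolve. This is an illustration only: the function names are hypothetical, and the specific validation rule (reject zero and non-numeric values) is an assumption, since the PR text says only that validation was added.

```rust
/// Default cap stated in the PR description.
pub const DEFAULT_MAX_CONCURRENT: usize = 8;

/// Resolve the cap from an optional raw env value.
/// Assumption: zero and non-integer values are treated as config errors.
pub fn resolve_max_concurrent(raw: Option<&str>) -> Result<usize, String> {
    let raw = match raw {
        Some(v) => v,
        // No override set: fall back to the configured default.
        None => return Ok(DEFAULT_MAX_CONCURRENT),
    };
    let n: usize = raw.trim().parse().map_err(|e| {
        format!("LLMTRACE_ML_MAX_CONCURRENT={raw:?} is not a valid integer: {e}")
    })?;
    if n == 0 {
        // A zero-permit semaphore would reject every request.
        return Err("LLMTRACE_ML_MAX_CONCURRENT must be at least 1".to_string());
    }
    Ok(n)
}

/// Convenience wrapper that reads the real process environment.
pub fn max_concurrent_from_env() -> Result<usize, String> {
    resolve_max_concurrent(std::env::var("LLMTRACE_ML_MAX_CONCURRENT").ok().as_deref())
}
```

Keeping the parsing in a function that takes Option<&str> makes the rule testable without mutating process environment variables.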

Saturation strategy + behavior under load

  • Permit is acquired at the start of the if cfg.enable_security_analysis { ... } block and released right after action_router.execute_inline(...) and before the upstream HTTP forward. This bounds only the CPU-bound ML/regex enforcement window; the upstream HTTP call and the response-streaming path are not counted against the cap.
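  • try_acquire_owned is used so the permit owns its handle to the semaphore instead of borrowing it, which keeps it 'static and movable into a spawned task if that is ever needed. On saturation we increment llmtrace_ml_rejected_total, decrement active_connections, log a warn!, and return 503 with Retry-After: 1.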
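
The permit lifetime described in this section can be sketched as a scoped guard. This is a simplified illustration under stated assumptions: InflightGuard and handle_request are hypothetical names, the guard is a plain counter rather than a tokio permit, and the admission check itself is omitted so that only the scoping is shown.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Guard standing in for an owned semaphore permit (hypothetical type).
struct InflightGuard(Arc<AtomicUsize>);

impl InflightGuard {
    fn enter(counter: &Arc<AtomicUsize>) -> Self {
        counter.fetch_add(1, Ordering::AcqRel);
        Self(Arc::clone(counter))
    }
}

impl Drop for InflightGuard {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::AcqRel);
    }
}

/// Hypothetical handler. Returns the in-flight count observed
/// (during enforcement, during the upstream forward).
pub fn handle_request(inflight: &Arc<AtomicUsize>, security_enabled: bool) -> (usize, usize) {
    let mut during_enforcement = 0;
    if security_enabled {
        // Permit taken at the start of the enforcement block.
        let _permit = InflightGuard::enter(inflight);
        // ... ML/regex analysis and inline action routing would run here ...
        during_enforcement = inflight.load(Ordering::Acquire);
    } // Permit dropped here, before the upstream call.

    // The upstream forward and response streaming would run here,
    // without holding a slot against the cap.
    let during_upstream = inflight.load(Ordering::Acquire);
    (during_enforcement, during_upstream)
}
```

Tying the release to the end of the block means early returns inside the enforcement window also free the slot, with no explicit release call to forget.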

Validation

cargo fmt --all                       # clean
cargo build -p llmtrace               # clean
cargo build --workspace               # clean
cargo test -p llmtrace-core           # 98 passed
cargo test -p llmtrace                # lib: 595 passed; main: 15 passed; integration: 19 passed
cargo clippy -p llmtrace --tests      # no new warnings from this PR

The new test ml_pipeline_semaphore_rejects_excess_concurrent_requests builds a real proxy with SlowAnalyzer (sleeps 250 ms in analyze_request), sets the cap to 3, fires 4 concurrent HTTP requests through a live axum::serve listener, and asserts:

  • exactly 1 of 4 returns 503 Service Unavailable,
  • the 503 carries Retry-After: 1,
  • the other 3 return 200 OK,
  • llmtrace_ml_rejected_total == 1,
  • llmtrace_ml_inflight_requests drains to 0 after the admitted requests release their permit.
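
The N+1 arithmetic behind these assertions can be reproduced without a proxy. The sketch below is not the integration test: it uses OS threads and a std Barrier in place of HTTP requests and the 250 ms sleep, and fire_concurrent is a hypothetical helper. The barrier guarantees that every request attempts admission before any slot is released, which is the overlap the sleep provides in the real test.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Fire `requests` overlapping admission attempts against a cap.
/// Returns (admitted, rejected, in-flight count after all finish).
pub fn fire_concurrent(cap: usize, requests: usize) -> (usize, usize, usize) {
    let in_use = Arc::new(AtomicUsize::new(0));
    let barrier = Arc::new(Barrier::new(requests));

    let handles: Vec<_> = (0..requests)
        .map(|_| {
            let in_use = Arc::clone(&in_use);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                // Non-blocking acquire: succeed only while below the cap.
                let admitted = in_use
                    .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                        if n < cap {
                            Some(n + 1)
                        } else {
                            None
                        }
                    })
                    .is_ok();
                // Hold the slot until every request has attempted admission.
                barrier.wait();
                if admitted {
                    in_use.fetch_sub(1, Ordering::AcqRel);
                }
                admitted
            })
        })
        .collect();

    let admitted = handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .filter(|&ok| ok)
        .count();
    (admitted, requests - admitted, in_use.load(Ordering::Acquire))
}
```

With a cap of 3 and 4 overlapping requests, 3 are admitted, 1 is rejected, and the in-flight count returns to 0 once the admitted requests release their slots.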

Caveats / unvalidated

  • The post-upstream background ML analysis path (run_security_analysis in the spawned task) is intentionally not bounded by the semaphore in this PR. Bounding it would either add latency (waiting for a permit after the client has already received the response) or skip analysis. Left for a follow-up if production CPU profiles show this background path is a contributor.
  • The try_acquire strategy means the cap is "hard" — a brief burst above the cap will produce 503s rather than smoothing latency. Operators tuning LLMTRACE_ML_MAX_CONCURRENT should watch the new llmtrace_ml_rejected_total and size the cap to match the pod's CPU budget; a sustained rejection rate is the alert signal.

Pre-request ML detection is CPU-bound (ensemble + jailbreak + fusion).
Without a cap, a flood of parallel requests in a single pod saturates
the CPU and degrades every concurrent request uniformly. Bound the path
with a per-pod tokio semaphore sized from
`ml_pipeline.max_concurrent_requests` (default 8, env override
`LLMTRACE_ML_MAX_CONCURRENT`). Saturation returns 503 with
`Retry-After: 1` via `try_acquire` — fast, queue-free backpressure
rather than every in-flight request stalling.

Adds gauge `llmtrace_ml_inflight_requests` (permits-in-use) and counter
`llmtrace_ml_rejected_total` to /metrics. Integration test
`ml_pipeline_semaphore_rejects_excess_concurrent_requests` fires N+1
concurrent requests against a real proxy with a slow `SecurityAnalyzer`
impl and asserts exactly 1 rejection with `Retry-After: 1` and exactly
N successful 200s.
@epappas epappas merged commit 458fc36 into main May 19, 2026
15 checks passed