feat(proxy): bound ML pipeline concurrency by epappas · Pull Request #241 · techlab-innov/llmtrace

epappas · 2026-05-19T18:05:53Z

Summary

Bounds intra-pod concurrency of the CPU-bound ML detection pipeline via a tokio::sync::Semaphore on AppState. Sized from cfg.ml_pipeline.max_concurrent_requests (default 8, env override LLMTRACE_ML_MAX_CONCURRENT).
Saturation strategy: Option A, a non-blocking try_acquire (try_acquire_owned in the code) around the synchronous pre-request enforcement block; on failure, respond immediately with 503 Service Unavailable and Retry-After: 1. No queueing. Why Option A: the brief recommends it as the default, and the codebase's other rejection paths (rate limiting, cost caps) follow the same "fast reject, let the client retry" pattern. Queueing would mask saturation as latency and is harder to alert on.
Adds Prometheus gauge llmtrace_ml_inflight_requests (permits-in-use) and counter llmtrace_ml_rejected_total to /metrics, following the existing Metrics registration pattern.
Scope: the cap is applied only to the synchronous, request-blocking pre-request enforcement path (where 503 is meaningful and prevents client-visible latency stalls). The post-upstream background analysis path (run_security_analysis in the spawned task) is left unbounded — it runs after the upstream response has been streamed to the client and does not block client latency. Caveat documented below.

Files changed

crates/llmtrace-core/src/lib.rs — new MlPipelineConfig struct, wired into ProxyConfig.
crates/llmtrace-proxy/src/config.rs — env override + validation for LLMTRACE_ML_MAX_CONCURRENT.
crates/llmtrace-proxy/src/metrics.rs — new gauge + counter, registered and exposed.
crates/llmtrace-proxy/src/proxy.rs — AppState::ml_pipeline_semaphore field, try_acquire around enforcement, ml_saturated_response helper, module-level doc update.
crates/llmtrace-proxy/src/main.rs — semaphore construction from config.
crates/llmtrace-proxy/src/{api,auth,compliance,feature_flags_api,grpc,otel,tenant_api}.rs — AppState test-fixture literals updated.
crates/llmtrace-proxy/tests/integration_test.rs — new test plus a real SlowAnalyzer SecurityAnalyzer impl (NOT a mock — it implements the trait fully, just sleeps in analyze_request).
deployments/basilica/README.md — operator-facing note on the new env var and metrics.
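
A minimal, std-only sketch of the config and env-override wiring listed above. The struct and field names come from this PR's text; the helper names `parse_max_concurrent` and `apply_env_override`, the error messages, and the rule that zero is rejected are assumptions for illustration, not the actual code in config.rs.

```rust
use std::env;

/// Illustrative stand-in for the PR's `MlPipelineConfig`.
#[derive(Debug, Clone, PartialEq)]
pub struct MlPipelineConfig {
    pub max_concurrent_requests: usize,
}

impl Default for MlPipelineConfig {
    fn default() -> Self {
        // Default of 8 is taken from the PR summary.
        Self { max_concurrent_requests: 8 }
    }
}

/// Hypothetical validation: the value must be a positive integer.
/// Rejecting 0 is an assumption; a zero-permit semaphore would reject everything.
pub fn parse_max_concurrent(raw: &str) -> Result<usize, String> {
    match raw.trim().parse::<usize>() {
        Ok(0) => Err("LLMTRACE_ML_MAX_CONCURRENT must be at least 1".to_string()),
        Ok(n) => Ok(n),
        Err(e) => Err(format!("invalid LLMTRACE_ML_MAX_CONCURRENT {raw:?}: {e}")),
    }
}

/// Apply the env override on top of the file/default config, if the variable is set.
pub fn apply_env_override(cfg: &mut MlPipelineConfig) -> Result<(), String> {
    if let Ok(raw) = env::var("LLMTRACE_ML_MAX_CONCURRENT") {
        cfg.max_concurrent_requests = parse_max_concurrent(&raw)?;
    }
    Ok(())
}

fn main() {
    let mut cfg = MlPipelineConfig::default();
    match apply_env_override(&mut cfg) {
        Ok(()) => println!("ml cap = {}", cfg.max_concurrent_requests),
        Err(e) => eprintln!("config error: {e}"),
    }
}
```

Failing fast on a malformed value at startup, rather than silently falling back to the default, keeps a typo in the deployment manifest from going unnoticed.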

Saturation strategy + behavior under load

The permit is acquired at the start of the if cfg.enable_security_analysis { ... } block and released right after action_router.execute_inline(...), before the upstream HTTP forward. This bounds only the CPU-bound ML/regex enforcement window; the upstream HTTP call and the response-streaming path are not counted against the cap.
try_acquire_owned is used so that the permit owns its reference to the semaphore and is 'static-compatible if it ever needs to move into a spawned task. On saturation we increment llmtrace_ml_rejected_total, decrement active_connections, log a warn!, and return 503 with Retry-After: 1.
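
The acquire, reject, and release flow described above can be sketched without any dependencies. This is a std-only stand-in, not the PR's code: `Limiter`, `Permit`, and `handle` are hypothetical names, the atomics stand in for tokio::sync::Semaphore and the two Prometheus metrics, and the real handler also decrements active_connections and logs a warn!, which is omitted here.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;

/// Non-blocking counting limiter (stand-in for the tokio semaphore on AppState).
pub struct Limiter {
    cap: usize,
    inflight: AtomicUsize, // plays the role of llmtrace_ml_inflight_requests
    rejected: AtomicU64,   // plays the role of llmtrace_ml_rejected_total
}

/// Owned permit: holds an Arc to the limiter, so it is 'static and releases on drop.
pub struct Permit(Arc<Limiter>);

impl Drop for Permit {
    fn drop(&mut self) {
        self.0.inflight.fetch_sub(1, Ordering::AcqRel);
    }
}

impl Limiter {
    pub fn new(cap: usize) -> Arc<Self> {
        Arc::new(Self { cap, inflight: AtomicUsize::new(0), rejected: AtomicU64::new(0) })
    }

    /// Never waits: returns None (and counts a rejection) when the cap is reached.
    pub fn try_acquire(this: &Arc<Self>) -> Option<Permit> {
        let mut cur = this.inflight.load(Ordering::Acquire);
        loop {
            if cur >= this.cap {
                this.rejected.fetch_add(1, Ordering::Relaxed);
                return None;
            }
            match this.inflight.compare_exchange_weak(
                cur, cur + 1, Ordering::AcqRel, Ordering::Acquire,
            ) {
                Ok(_) => return Some(Permit(Arc::clone(this))),
                Err(seen) => cur = seen, // lost a race; retry with the fresh value
            }
        }
    }

    pub fn inflight(&self) -> usize { self.inflight.load(Ordering::Acquire) }
    pub fn rejected(&self) -> u64 { self.rejected.load(Ordering::Relaxed) }
}

/// Returns (status code, Retry-After header value) for one request.
pub fn handle(limiter: &Arc<Limiter>, enforce: impl FnOnce()) -> (u16, Option<&'static str>) {
    let Some(permit) = Limiter::try_acquire(limiter) else {
        return (503, Some("1")); // saturated: fast reject, no queueing
    };
    enforce();    // CPU-bound enforcement window, counted against the cap
    drop(permit); // release before the upstream forward
    // The upstream HTTP call would run here, outside the cap.
    (200, None)
}

fn main() {
    let limiter = Limiter::new(1);
    let held = Limiter::try_acquire(&limiter); // occupy the only permit
    println!("{:?}", handle(&limiter, || {}));
    drop(held);
    println!("{:?}", handle(&limiter, || {}));
}
```

Tying the release to `Drop` means an early return or panic inside the enforcement window cannot leak a permit, which is the same property the owned tokio permit provides.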

Validation

cargo fmt --all                       # clean
cargo build -p llmtrace               # clean
cargo build --workspace               # clean
cargo test -p llmtrace-core           # 98 passed
cargo test -p llmtrace                # lib: 595 passed; main: 15 passed; integration: 19 passed
cargo clippy -p llmtrace --tests      # no new warnings from this PR

The new test ml_pipeline_semaphore_rejects_excess_concurrent_requests builds a real proxy with SlowAnalyzer (sleeps 250 ms in analyze_request), sets the cap to 3, fires 4 concurrent HTTP requests through a live axum::serve listener, and asserts:

exactly 1 of 4 returns 503 Service Unavailable,
the 503 carries Retry-After: 1,
the other 3 return 200 OK,
llmtrace_ml_rejected_total == 1,
llmtrace_ml_inflight_requests drains to 0 after the admitted requests release their permit.
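
The shape of that assertion set can be reproduced in a dependency-free sketch. This is not the integration test itself: there is no proxy, HTTP listener, or SlowAnalyzer here. Threads stand in for concurrent requests, and a `Barrier` stands in for the 250 ms sleep by holding every admitted permit until all requests have attempted to acquire. The function names `try_acquire` and `fire` are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Take a slot if fewer than `cap` are in use; never blocks.
fn try_acquire(inflight: &AtomicUsize, cap: usize) -> bool {
    inflight
        .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| (n < cap).then_some(n + 1))
        .is_ok()
}

/// Fire `requests` concurrent attempts against a cap.
/// Returns (count of 200s, count of 503s, in-flight gauge after draining).
pub fn fire(cap: usize, requests: usize) -> (usize, usize, usize) {
    let inflight = Arc::new(AtomicUsize::new(0));
    let barrier = Arc::new(Barrier::new(requests));

    let handles: Vec<_> = (0..requests)
        .map(|_| {
            let inflight = Arc::clone(&inflight);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                let admitted = try_acquire(&inflight, cap);
                // No permit is released until every request has tried to acquire.
                barrier.wait();
                if admitted {
                    inflight.fetch_sub(1, Ordering::AcqRel); // release the slot
                    200u16
                } else {
                    503u16
                }
            })
        })
        .collect();

    let statuses: Vec<u16> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    let ok = statuses.iter().filter(|&&s| s == 200).count();
    let rejected = statuses.iter().filter(|&&s| s == 503).count();
    (ok, rejected, inflight.load(Ordering::Acquire))
}

fn main() {
    // Cap of 3 with 4 concurrent requests, as in the PR's test.
    let (ok, rejected, drained) = fire(3, 4);
    println!("ok={ok} rejected={rejected} inflight_after={drained}");
}
```

Because the barrier guarantees that all attempts overlap, the split between admitted and rejected requests does not depend on thread scheduling; the real test relies on the analyzer's sleep for the same overlap.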

Caveats / unvalidated

The post-upstream background ML analysis path (run_security_analysis in the spawned task) is intentionally not bounded by the semaphore in this PR. Bounding it would either delay the analysis (the task would wait for a permit after the client has already received the response) or skip analysis under saturation. Left for a follow-up if production CPU profiles show this background path is a contributor.
The try_acquire strategy means the cap is "hard" — a brief burst above the cap will produce 503s rather than smoothing latency. Operators tuning LLMTRACE_ML_MAX_CONCURRENT should watch the new llmtrace_ml_rejected_total and size the cap to match the pod's CPU budget; a sustained rejection rate is the alert signal.

Pre-request ML detection is CPU-bound (ensemble + jailbreak + fusion). Without a cap, a flood of parallel requests in a single pod saturates the CPU and degrades every concurrent request uniformly. Bound the path with a per-pod tokio semaphore sized from `ml_pipeline.max_concurrent_requests` (default 8, env override `LLMTRACE_ML_MAX_CONCURRENT`). Saturation returns 503 with `Retry-After: 1` via `try_acquire` — fast, queue-free backpressure rather than every in-flight request stalling. Adds gauge `llmtrace_ml_inflight_requests` (permits-in-use) and counter `llmtrace_ml_rejected_total` to /metrics. Integration test `ml_pipeline_semaphore_rejects_excess_concurrent_requests` fires N+1 concurrent requests against a real proxy with a slow `SecurityAnalyzer` impl and asserts exactly 1 rejection with `Retry-After: 1` and exactly N successful 200s.

epappas merged commit 458fc36 into main May 19, 2026
15 checks passed

epappas mentioned this pull request May 20, 2026

ops(basilica): default startup_timeout too tight when ML preload is on #243

Closed

epappas merged 1 commit into main from feat/proxy-ml-concurrency-cap
