
Add cross-model global budget rate limiting with Valkey provider #1727

Open
fbalicchia wants to merge 29 commits into vllm-project:main from fbalicchia:feat/global-model-ratelimit-clean

@fbalicchia
Contributor

Purpose

  • What: Adds a Valkey-backed, cross-model global budget rate limiter that enforces a single spending envelope per user across all AI models (Haiku, Sonnet, Opus), regardless of which model serves the request. Also extends cost tracking to include prompt caching
    tokens (cache_read/cache_write).
  • Why: Envoy AI Gateway's native rate limiting is per-route (i.e. per-model), so a user with a $3/month budget could spend $3 on each model independently. With the Semantic Router dynamically selecting models, a true cross-model budget is essential to prevent
    budget overruns.
  • Modules affected: Router / CI/Build
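
The difference between per-route and cross-model accounting can be sketched with a small in-memory counter. This is only an illustration of the keying choice, not the PR's implementation: `Budget`, `Charge`, and `simulate` are hypothetical names, amounts are in cents, and the $3 limit and $0.50 request cost are made-up figures.

```go
package main

import "fmt"

// Budget is a hypothetical fixed-limit counter. Amounts are in cents.
type Budget struct {
	limit int64
	spent map[string]int64
}

func NewBudget(limit int64) *Budget {
	return &Budget{limit: limit, spent: make(map[string]int64)}
}

// Charge adds cost to the counter for key and reports whether the
// request fits within that key's limit. Rejected requests are not counted.
func (b *Budget) Charge(key string, cost int64) bool {
	if b.spent[key]+cost > b.limit {
		return false
	}
	b.spent[key] += cost
	return true
}

// TotalSpent sums the spend across all keys.
func (b *Budget) TotalSpent() int64 {
	var total int64
	for _, v := range b.spent {
		total += v
	}
	return total
}

// simulate sends ten $0.50 requests to each of three models for one user
// and returns the total accepted spend under the given keying scheme.
func simulate(keyFor func(user, model string) string) int64 {
	b := NewBudget(300) // $3.00 limit per key
	for _, model := range []string{"haiku", "sonnet", "opus"} {
		for i := 0; i < 10; i++ {
			b.Charge(keyFor("alice", model), 50)
		}
	}
	return b.TotalSpent()
}

func main() {
	perRoute := simulate(func(user, model string) string { return user + ":" + model })
	global := simulate(func(user, _ string) string { return user })
	fmt.Println("per-route keys, total cents:", perRoute) // 900
	fmt.Println("global key, total cents:", global)       // 300
}
```

With one counter per (user, model) pair the user is allowed $3 on each of the three models, $9 in total. With a single counter per user the total stays at $3 no matter which model serves the request.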

Key changes

  • valkey_provider.go — new valkey-limiter rate limit provider using Valkey INCRBY with TTL-based windows. Cost expressed in CEL units ($10⁻⁸) for parity with AI Gateway formulas. Built on the valkey-glide Go client.
  • Prompt caching cost tracking — extended TokenUsage, response usage parsing, and ModelPricing config with CacheReadPer1M / CacheWritePer1M rates (Anthropic & OpenAI formats).
  • Config additions: DB field on rate limit provider config, GetModelPricingFull() helper, provider type renamed from redis-limiter to valkey-limiter.
  • Streaming support: extractStreamingUsage and reportStreamingUsageMetrics now propagate cache token counts.
  • Dockerfile fix — restructured Dockerfile.extproc for reliable multi-arch (amd64 + arm64) builds including nlp-binding.
  • Build fix — removed duplicate resolveModelConfig in helper.go.
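
As a rough illustration of the cost arithmetic, the sketch below converts token counts into integer units of $10⁻⁸ using per-1M-token prices. It is not the code from this PR: the `usage` and `pricing` types, the `PromptPer1M` and `CompletionPer1M` field names, and the prices are assumptions made for the example (only `CacheReadPer1M` and `CacheWritePer1M` are named in the description). It also assumes the four token counts are disjoint, which depends on how the upstream provider reports cached tokens.

```go
package main

import (
	"fmt"
	"math"
)

// usage holds token counts for one response (hypothetical shape).
// The four counts are assumed not to overlap.
type usage struct {
	Prompt, Completion, CacheRead, CacheWrite int64
}

// pricing holds dollar prices per one million tokens (hypothetical shape).
type pricing struct {
	PromptPer1M, CompletionPer1M, CacheReadPer1M, CacheWritePer1M float64
}

// unitsPerDollar expresses cost in units of $10^-8.
const unitsPerDollar = 1e8

// costUnits returns the request cost as an integer number of $10^-8 units,
// suitable for an integer counter increment.
func costUnits(u usage, p pricing) int64 {
	dollars := (float64(u.Prompt)*p.PromptPer1M +
		float64(u.Completion)*p.CompletionPer1M +
		float64(u.CacheRead)*p.CacheReadPer1M +
		float64(u.CacheWrite)*p.CacheWritePer1M) / 1e6
	return int64(math.Round(dollars * unitsPerDollar))
}

func main() {
	// Illustrative prices only; they are not taken from the PR.
	p := pricing{PromptPer1M: 3.00, CompletionPer1M: 15.00, CacheReadPer1M: 0.30, CacheWritePer1M: 3.75}
	u := usage{Prompt: 1000, Completion: 500, CacheRead: 2000, CacheWrite: 1000}
	fmt.Println(costUnits(u, p)) // 1485000 units, i.e. $0.01485
}
```

Working in integer units keeps the counter increment exact, so a single integer INCRBY can carry the whole cost of a request.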

Test Plan

  • go test ./... -run TestValkey — unit tests for the Valkey limiter (valkey_provider_test.go)
  • go test ./... -run TestCELParity — verifies CEL cost calculations match Envoy AI Gateway formulas (cel_parity_test.go)
  • e2e/scripts/test_ratelimit_e2e.sh — end-to-end rate limit validation against a running Valkey instance
  • docker buildx build --platform linux/amd64,linux/arm64 -f Dockerfile.extproc . — confirms multi-arch build succeeds

Test Result

  • All unit tests pass; CEL parity tests confirm cost calculations match the Envoy AI Gateway formulas within rounding tolerance.
  • Multi-arch Docker build completes successfully.
  • Follow-up: E2E test requires a running Valkey instance; CI integration for this is not yet wired up.

Signed-off-by: fbalicchia <fbalicchia@cuebiq.com>
@netlify

netlify bot commented Apr 8, 2026

Deploy Preview for vllm-semantic-router ready!

Name Link
🔨 Latest commit 341428a
🔍 Latest deploy log https://app.netlify.com/projects/vllm-semantic-router/deploys/69d96355a10a0c000980cc30
😎 Deploy Preview https://deploy-preview-1727--vllm-semantic-router.netlify.app

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 config

Owners: @rootfs, @Xunzhuo
Files changed:

  • config/config.yaml

📁 src/semantic-router

Owners: @rootfs, @Xunzhuo, @szedan-rh, @yehuditkerido, @abdallahsamabd, @asaadbalum, @liavweiss, @noalimoy
Files changed:

  • src/semantic-router/pkg/cache/valkey_cache_helpers.go
  • src/semantic-router/pkg/cache/valkey_cache_integration_test.go
  • src/semantic-router/pkg/config/config.go
  • src/semantic-router/pkg/config/helper.go
  • src/semantic-router/pkg/config/helper_provider.go
  • src/semantic-router/pkg/config/model_config_types.go
  • src/semantic-router/pkg/extproc/processor_req_body_prepare.go
  • src/semantic-router/pkg/extproc/processor_res_body_streaming.go
  • src/semantic-router/pkg/extproc/processor_res_cache.go
  • src/semantic-router/pkg/extproc/processor_res_usage.go
  • src/semantic-router/pkg/extproc/processor_res_usage_test.go
  • src/semantic-router/pkg/extproc/router_resolvers.go
  • src/semantic-router/pkg/ratelimit/cel_parity_test.go
  • src/semantic-router/pkg/ratelimit/local_provider.go
  • src/semantic-router/pkg/ratelimit/provider.go
  • src/semantic-router/pkg/ratelimit/ratelimit_test.go
  • src/semantic-router/pkg/ratelimit/valkey_provider.go
  • src/semantic-router/pkg/ratelimit/valkey_provider_test.go


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

✅ Supply Chain Security Report — All Clear

Scanner Status Findings
AST Codebase Scan (Py, Go, JS/TS, Rust) 27 finding(s) — MEDIUM: 21 · LOW: 6
AST PR Diff Scan No issues detected
Regex Fallback Scan No issues detected

Scanned at 2026-04-10T20:58:18.789Z · View full workflow logs

Signed-off-by: fbalicchia <fbalicchia@cuebiq.com>
@fbalicchia force-pushed the feat/global-model-ratelimit-clean branch from bf67f8f to bfca243 on April 8, 2026 20:20
fbalicchia and others added 2 commits April 9, 2026 18:04
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Filippo Balicchia <fbalicchia@gmail.com>
Signed-off-by: fbalicchia <fbalicchia@cuebiq.com>
Contributor

@daric93 daric93 left a comment


Thanks for this PR. There are a few issues that need to be addressed before merging:

  • This PR modifies the existing Valkey cache (valkey_cache.go, valkey_cache_helpers.go) in ways that appear unrelated to the rate limiter feature. These changes (TEXT→TAG schema change, FT.SEARCH replaced with reverse-lookup keys, validation code removed) introduce new complexity and reduce parity with the Milvus backend. If the rate limiter needs additional cache capabilities, it could add new methods rather than changing existing logic that has no issues.

  • Bug fix should not be blocked: the router_components.go fix that wires up the Valkey config into createSemanticCache is important. Please consider merging it as a separate small PR so it doesn't get delayed by iteration on the rate limiter. Update: this fix has been split out into #1737.

  • Atomicity concerns — Both the new pending:* keys in the cache and the INCRBY + EXPIRE in the rate limiter lack atomicity guarantees.

See inline comments for details.

Contributor

@daric93 daric93 left a comment


Note: the router_components.go Valkey config wiring fix has been split out into #1737 so it can be merged independently.

@fbalicchia
Contributor Author

@daric93 thanks for the review — here’s a brief recap. Let me know if I missed anything.

  • Revert valkey_cache.go to main: remove reverse-lookup pending:* keys
    and restore FT.SEARCH-based pending entry resolution (to be submitted
    as a separate PR)
  • Add debug logging when no ratelimit rule matches a request (fail-open)
    in both LocalLimiter and ValkeyLimiterProvider, helping operators catch
    misconfigured rule sets
  • Replace manual reverse-scan in parseHostPort with net.SplitHostPort
    for correct IPv6 address handling

Also, Dockerfile.extproc was reverted.

@fbalicchia fbalicchia requested a review from daric93 April 10, 2026 21:54