Add cross-model global budget rate limiting with Valkey provider #1727
fbalicchia wants to merge 29 commits into vllm-project:main
Signed-off-by: fbalicchia <fbalicchia@cuebiq.com>
bf67f8f to bfca243
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Filippo Balicchia <fbalicchia@gmail.com>
Thanks for this PR. There are a few issues that need to be addressed before merging:
- This PR modifies the existing Valkey cache (valkey_cache.go, valkey_cache_helpers.go) in ways that appear unrelated to the rate limiter feature. These changes (TEXT→TAG schema change, FT.SEARCH replaced with reverse-lookup keys, validation code removed) introduce new complexity and reduce parity with the Milvus backend. If the rate limiter needs additional cache capabilities, it could add new methods rather than changing existing logic that has no issues.
- Bug fix should not be blocked: The router_components.go fix that wires up the Valkey config into createSemanticCache is important. Please consider merging it as a separate small PR so it doesn't get delayed by iteration on the rate limiter. Update: this has been split out into #1737.
- Atomicity concerns: Both the new pending:* keys in the cache and the INCRBY + EXPIRE in the rate limiter lack atomicity guarantees.

See inline comments for details.
@daric93 thanks for the review — here’s a brief recap. Let me know if I missed anything.
Also, Dockerfile.extproc was reverted.

Purpose
tokens (cache_read/cache_write).
budget overruns.
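As a rough illustration of how cache tokens could feed into a budget, the sketch below converts token counts and per-1M-token prices into integer cost units of $10⁻⁸, the unit this PR's description uses. It is an assumption-laden sketch: the type names `TokenUsage` and `ModelPricing` and the fields `CacheReadPer1M`/`CacheWritePer1M` come from the PR description, while every other field, the `CostUnits` function, and the choice to treat the four token counts as disjoint are hypothetical.

```go
package main

import "math"

// TokenUsage holds per-request token counts. The field set is illustrative;
// the four counts are assumed not to overlap.
type TokenUsage struct {
	Prompt, Completion, CacheRead, CacheWrite int64
}

// ModelPricing holds prices in USD per 1M tokens. PromptPer1M and
// CompletionPer1M are hypothetical field names.
type ModelPricing struct {
	PromptPer1M, CompletionPer1M, CacheReadPer1M, CacheWritePer1M float64
}

// unitsFactor converts (tokens * USD per 1M tokens) into units of 1e-8 USD:
// 1e8 units per dollar divided by 1e6 tokens per pricing unit = 100.
const unitsFactor = 100.0

// CostUnits returns the request cost in integer units of 1e-8 USD, so the
// value can be passed directly to an integer counter such as INCRBY.
func CostUnits(u TokenUsage, p ModelPricing) int64 {
	weighted := float64(u.Prompt)*p.PromptPer1M +
		float64(u.Completion)*p.CompletionPer1M +
		float64(u.CacheRead)*p.CacheReadPer1M +
		float64(u.CacheWrite)*p.CacheWritePer1M
	return int64(math.Round(weighted * unitsFactor))
}
```

For example, 1,000 prompt tokens at $3 per 1M tokens cost $0.003, which is 300,000 units. Keeping the cost as an integer avoids accumulating floating-point error in the shared counter.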
Router/CI/Build
Key changes
- valkey_provider.go: new valkey-limiter rate limit provider using Valkey INCRBY with TTL-based windows. Cost expressed in CEL units ($10⁻⁸) for parity with AI Gateway formulas. Built on the valkey-glide Go client.
- TokenUsage, response usage parsing, and ModelPricing config with CacheReadPer1M/CacheWritePer1M rates (Anthropic & OpenAI formats).
- DB field on rate limit provider config, GetModelPricingFull() helper, provider type renamed from redis-limiter to valkey-limiter.
- extractStreamingUsage and reportStreamingUsageMetrics now propagate cache token counts.
- Dockerfile.extproc for reliable multi-arch (amd64 + arm64) builds including nlp-binding.
- resolveModelConfig in helper.go.

Test Plan
- go test ./... -run TestValkey: unit tests for the Valkey limiter (valkey_provider_test.go)
- go test ./... -run TestCELParity: verifies CEL cost calculations match Envoy AI Gateway formulas (cel_parity_test.go)
- e2e/scripts/test_ratelimit_e2e.sh: end-to-end rate limit validation against a running Valkey instance
- docker buildx build --platform linux/amd64,linux/arm64 -f Dockerfile.extproc . : confirms multi-arch build succeeds

Test Result