[Feat]: add scheduled memory pruning with two-path strategy#1373
abdallahsamabd wants to merge 1 commit into vllm-project:main
@abdallahsamabd how is this different from #1313? Do you have any benchmarks for time-decay or quota-based pruning strategies, like those in the MemoryBank paper?
Pull request overview
This PR implements MemoryBank-style memory pruning with a retention scoring system (R=exp(-t/S)) and two complementary pruning strategies to prevent unbounded memory growth.
Changes:
- Added event-driven cap enforcement (Path 1) that asynchronously prunes memories when users exceed `max_memories_per_user` on Store()
- Implemented background sweep mechanism (Path 2) using a periodic ticker to prune decayed memories for inactive users in batches
- Added Prometheus metrics for monitoring pruning activity, sweep performance, and error tracking
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| website/docs/proposals/agentic-memory.md | Updated feature status table to mark memory pruning and quotas as implemented |
| website/docs/installation/configuration.md | Added comprehensive documentation for memory pruning configuration, metrics, and multi-replica deployment |
| src/vllm-sr/cli/templates/config.template.yaml | Updated config template with new pruning parameters and two-path strategy explanation |
| src/semantic-router/pkg/memory/pruner_test.go | Added comprehensive test coverage for pruning functionality including cap enforcement, sweep operations, and edge cases |
| src/semantic-router/pkg/memory/pruner.go | Implemented background sweep goroutine with batch processing and graceful shutdown |
| src/semantic-router/pkg/memory/prune_metrics.go | Defined Prometheus metrics for tracking pruning operations and performance |
| src/semantic-router/pkg/memory/milvus_store.go | Added event-driven cap enforcement, helper methods for counting and querying stale memories |
| src/semantic-router/pkg/extproc/server.go | Added graceful shutdown of prune sweep goroutine in Stop() |
| src/semantic-router/pkg/extproc/router.go | Integrated prune sweep startup and added StopPruneSweep field to router |
| src/semantic-router/pkg/config/config.go | Added configuration fields for prune interval, batch size, and sweep enablement |
    // Path 1: event-driven cap enforcement — async prune if user exceeds max_memories_per_user
    if m.config.QualityScoring.MaxMemoriesPerUser > 0 {
        go m.pruneIfOverCap(context.Background(), memory.UserID)
Using context.Background() in a goroutine ignores the parent context's cancellation and deadline. Consider propagating a detached context derived from the parent (e.g., using a custom function to extract values without cancellation) or document why ignoring cancellation is acceptable for async pruning.
Suggested change:

    - go m.pruneIfOverCap(context.Background(), memory.UserID)
    + go m.pruneIfOverCap(ctx, memory.UserID)
context.Background() is intentional here. The goroutine must outlive the HTTP/gRPC request: if we passed ctx, the pruning would be cancelled as soon as Store() returns to the caller.
@abdallahsamabd @yehudit1987 Let's have a design review on memory pruning.
Hi @rootfs @yehudit1987
@abdallahsamabd thanks for writing the design doc. Memory injection has to be handled with care. Since the router makes decisions on behalf of users, injecting conflicting, wrong, or stale memories will have poor consequences (see this). This is the top concern at the moment; pruning comes after that. For any injection or pruning strategy, we need to mitigate the risk by relying on well-validated, highly cited research rather than hand-wavy ideas. The MemoryBank solution makes that cut. If you can support any of your PRs on that basis, it would make them much stronger.
@rootfs Regarding memory injection, I opened issue #1386.
…lm-project#1350) Signed-off-by: Abdallah Samara <abdallahsamabd@gmail.com>
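Implement MemoryBank-style retention scoring R=exp(-t/S) with two complementary pruning paths:
- Path 1: event-driven cap enforcement that asynchronously prunes on Store() when a user exceeds max_memories_per_user
- Path 2: background sweep on a periodic ticker that prunes decayed memories for inactive users in batches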
Includes Prometheus metrics, graceful shutdown, multi-replica support via prune_sweep_enabled flag, config template, and documentation.
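For the cap-enforcement path, one plausible selection rule is to drop the lowest-retention memories until the user is back at the cap. The sketch below assumes that rule; the type `scoredMemory` and the function `overCapVictims` are hypothetical, and the actual eviction order used in the PR may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// scoredMemory pairs a memory ID with its retention score. Hypothetical type.
type scoredMemory struct {
	ID        string
	Retention float64
}

// overCapVictims returns the IDs of the lowest-retention memories that would
// have to be removed so that at most maxPerUser remain. A cap <= 0 disables
// enforcement, mirroring the MaxMemoriesPerUser > 0 guard in the diff.
func overCapVictims(memories []scoredMemory, maxPerUser int) []string {
	if maxPerUser <= 0 || len(memories) <= maxPerUser {
		return nil
	}
	// Sort a copy so the caller's slice order is left untouched.
	sorted := append([]scoredMemory(nil), memories...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Retention < sorted[j].Retention
	})
	excess := len(sorted) - maxPerUser
	ids := make([]string, 0, excess)
	for _, m := range sorted[:excess] {
		ids = append(ids, m.ID)
	}
	return ids
}

func main() {
	mems := []scoredMemory{{"a", 0.9}, {"b", 0.1}, {"c", 0.5}, {"d", 0.3}}
	fmt.Println(overCapVictims(mems, 2)) // [b d]
}
```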