
[Feat]: add scheduled memory pruning with two-path strategy #1373

Open
abdallahsamabd wants to merge 1 commit into vllm-project:main from abdallahsamabd:feat/1350

Conversation

@abdallahsamabd
Collaborator

@abdallahsamabd abdallahsamabd commented Feb 23, 2026

Implement MemoryBank-style retention scoring R=exp(-t/S) with two complementary pruning paths:

  • Path 1 (event-driven): async cap enforcement on Store() when a user exceeds max_memories_per_user
  • Path 2 (background sweep): periodic time.Ticker goroutine prunes decayed memories for inactive users in batches

Includes Prometheus metrics, graceful shutdown, multi-replica support via prune_sweep_enabled flag, config template, and documentation.
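
A minimal sketch of how the Path 2 background sweep could be structured: a `time.Ticker` loop in a goroutine, plus a stop function that blocks until the goroutine has exited. The names `startSweep` and `countSweeps` are hypothetical and not taken from this PR; the real sweep also batches users and records metrics, which is omitted here.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// startSweep (hypothetical name) calls sweepOnce every interval until the
// returned stop function is called. stop blocks until the goroutine exits,
// which is what makes the shutdown graceful.
func startSweep(interval time.Duration, sweepOnce func()) (stop func()) {
	done := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				sweepOnce() // in the PR this would prune one batch of users
			case <-done:
				return
			}
		}
	}()
	var once sync.Once
	return func() {
		once.Do(func() { close(done) }) // safe to call more than once
		wg.Wait()
	}
}

// countSweeps runs a counting sweep for runFor and reports how many ticks fired.
func countSweeps(interval, runFor time.Duration) int64 {
	var runs atomic.Int64
	stop := startSweep(interval, func() { runs.Add(1) })
	time.Sleep(runFor)
	stop()
	stop() // second call is a no-op
	return runs.Load()
}

func main() {
	fmt.Println("sweeps completed:", countSweeps(10*time.Millisecond, 55*time.Millisecond))
}
```

In a multi-replica deployment, a flag such as prune_sweep_enabled would presumably gate the call to the start function so that only one replica runs the loop.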


@netlify

netlify Bot commented Feb 23, 2026

Deploy Preview for vllm-semantic-router ready!

Name Link
🔨 Latest commit e058375
🔍 Latest deploy log https://app.netlify.com/projects/vllm-semantic-router/deploys/69adfa8861deaf0008a8e77b
😎 Deploy Preview https://deploy-preview-1373--vllm-semantic-router.netlify.app

@github-actions
Contributor

github-actions Bot commented Feb 23, 2026

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 src

Owners: @rootfs, @Xunzhuo, @wangchen615
Files changed:

  • src/semantic-router/pkg/config/runtime_config.go
  • src/semantic-router/pkg/extproc/router.go
  • src/semantic-router/pkg/extproc/router_build.go
  • src/semantic-router/pkg/extproc/server.go
  • src/semantic-router/pkg/memory/milvus_retry.go
  • src/semantic-router/pkg/memory/milvus_store.go
  • src/semantic-router/pkg/memory/milvus_store_prune.go
  • src/semantic-router/pkg/memory/prune_metrics.go
  • src/semantic-router/pkg/memory/pruner.go
  • src/semantic-router/pkg/memory/pruner_test.go
  • src/semantic-router/pkg/memory/score.go

📁 tools

Owners: @yuluo-yx, @rootfs, @Xunzhuo
Files changed:

  • tools/agent/structure-rules.yaml

📁 website

Owners: @Xunzhuo, @rootfs, @yuluo-yx
Files changed:

  • website/docs/installation/configuration.md
  • website/docs/proposals/agentic-memory.md


🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

@rootfs
Collaborator

rootfs commented Feb 23, 2026

@abdallahsamabd how is this different from #1313? Do you have any benchmarks on time-decay or quota-based pruning strategies, like those in the MemoryBank paper?

@abdallahsamabd
Collaborator Author

abdallahsamabd commented Feb 23, 2026

@rootfs
As described in ticket #1350, PruneUser is currently only callable programmatically; there is no automated job that runs it on a schedule.

Contributor

Copilot AI left a comment


Pull request overview

This PR implements MemoryBank-style memory pruning with a retention scoring system (R=exp(-t/S)) and two complementary pruning strategies to prevent unbounded memory growth.

Changes:

  • Added event-driven cap enforcement (Path 1) that asynchronously prunes memories when users exceed max_memories_per_user on Store()
  • Implemented background sweep mechanism (Path 2) using a periodic ticker to prune decayed memories for inactive users in batches
  • Added Prometheus metrics for monitoring pruning activity, sweep performance, and error tracking
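
A rough sketch of the selection step behind cap enforcement, assuming the lowest-retention memories are evicted first; the PR text does not state the eviction order, so treat that as an assumption. The `scored` type and `overCapVictims` function are hypothetical names for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// scored is a hypothetical view of a stored memory: its ID and retention score R.
type scored struct {
	ID        string
	Retention float64
}

// overCapVictims returns the IDs that would have to be deleted to bring a
// user back under maxPerUser, picking the lowest-retention memories first.
// A non-positive maxPerUser disables the cap, mirroring the
// MaxMemoriesPerUser > 0 guard quoted in the review thread.
func overCapVictims(mems []scored, maxPerUser int) []string {
	if maxPerUser <= 0 || len(mems) <= maxPerUser {
		return nil
	}
	sorted := append([]scored(nil), mems...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Retention < sorted[j].Retention
	})
	excess := len(sorted) - maxPerUser
	ids := make([]string, 0, excess)
	for _, m := range sorted[:excess] {
		ids = append(ids, m.ID)
	}
	return ids
}

func main() {
	mems := []scored{{"a", 0.9}, {"b", 0.05}, {"c", 0.5}}
	fmt.Println(overCapVictims(mems, 2))
}
```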

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| website/docs/proposals/agentic-memory.md | Updated feature status table to mark memory pruning and quotas as implemented |
| website/docs/installation/configuration.md | Added comprehensive documentation for memory pruning configuration, metrics, and multi-replica deployment |
| src/vllm-sr/cli/templates/config.template.yaml | Updated config template with new pruning parameters and two-path strategy explanation |
| src/semantic-router/pkg/memory/pruner_test.go | Added comprehensive test coverage for pruning functionality including cap enforcement, sweep operations, and edge cases |
| src/semantic-router/pkg/memory/pruner.go | Implemented background sweep goroutine with batch processing and graceful shutdown |
| src/semantic-router/pkg/memory/prune_metrics.go | Defined Prometheus metrics for tracking pruning operations and performance |
| src/semantic-router/pkg/memory/milvus_store.go | Added event-driven cap enforcement, helper methods for counting and querying stale memories |
| src/semantic-router/pkg/extproc/server.go | Added graceful shutdown of prune sweep goroutine in Stop() |
| src/semantic-router/pkg/extproc/router.go | Integrated prune sweep startup and added StopPruneSweep field to router |
| src/semantic-router/pkg/config/config.go | Added configuration fields for prune interval, batch size, and sweep enablement |


Comment thread website/docs/proposals/agentic-memory.md Outdated
Comment thread website/docs/installation/configuration.md Outdated

// Path 1: event-driven cap enforcement — async prune if user exceeds max_memories_per_user
if m.config.QualityScoring.MaxMemoriesPerUser > 0 {
	go m.pruneIfOverCap(context.Background(), memory.UserID)

Copilot AI Feb 24, 2026


Using context.Background() in a goroutine ignores the parent context's cancellation and deadline. Consider propagating a detached context derived from the parent (e.g., using a custom function to extract values without cancellation) or document why ignoring cancellation is acceptable for async pruning.

Suggested change
- go m.pruneIfOverCap(context.Background(), memory.UserID)
+ go m.pruneIfOverCap(ctx, memory.UserID)

Collaborator Author


context.Background() is intentional here. The goroutine must outlive the HTTP/gRPC request — if we passed ctx, the pruning would be cancelled as soon as Store() returns to the caller
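
For reference, a middle ground between the two positions in this thread can be sketched with `context.WithoutCancel` (standard library, Go 1.21 and later): the derived context keeps the request's values but ignores its cancellation, and a timeout bounds the background work. This is not what the PR does; `detach`, `detachedSurvives`, the `request_id` key, and the 30-second timeout are all hypothetical choices for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type ctxKey string

// detach (hypothetical helper) returns a context that carries parent's values
// but is not cancelled when parent is. The timeout keeps the background work
// from running forever.
func detach(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(parent), timeout)
}

// detachedSurvives cancels a request-like parent context and reports whether
// the detached context is still usable and still carries the parent's value.
func detachedSurvives() (alive bool, value any) {
	base := context.WithValue(context.Background(), ctxKey("request_id"), "req-42")
	parent, cancelParent := context.WithCancel(base)

	bg, cancelBg := detach(parent, 30*time.Second)
	defer cancelBg()

	cancelParent() // simulates Store() returning and the request ending
	return bg.Err() == nil, bg.Value(ctxKey("request_id"))
}

func main() {
	alive, id := detachedSurvives()
	fmt.Println("detached context alive:", alive, "request_id:", id)
}
```

Compared with a bare context.Background(), this keeps request-scoped values such as trace IDs available to the pruning goroutine, at the cost of requiring Go 1.21 or later.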

@abdallahsamabd abdallahsamabd force-pushed the feat/1350 branch 6 times, most recently from 7b5ed7f to 87c6c36 Compare February 24, 2026 14:01
@rootfs
Collaborator

rootfs commented Feb 24, 2026

@abdallahsamabd @yehudit1987 Let's have a design review on memory pruning.

  • Pruning is not just a storage issue; it is a semantic issue too: if related context is pruned, the remaining memory could be corrupted.
  • Pruning needs to scale with the number of users.

@abdallahsamabd
Collaborator Author

Hi @rootfs @yehudit1987,
please review this design document:
memory-pruning-design.html
Thanks.

@rootfs
Collaborator

rootfs commented Feb 25, 2026

@abdallahsamabd thanks for writing the design doc.

Memory injection has to be handled with care. Since the router makes decisions on behalf of users, injecting conflicting, wrong, or stale memory will have poor consequences (see this). This is the top concern at the moment; pruning comes after that.

For any injection and pruning strategy, we need to mitigate the risk by relying on well-validated, highly cited research rather than hand-wavy ideas. The MemoryBank solution makes that cut. If you can support any of your PRs on that basis, it would make them much stronger.

@abdallahsamabd
Collaborator Author

abdallahsamabd commented Feb 25, 2026

@rootfs
The retention scoring and pruning in PR #1373 are directly based on the MemoryBank paper (Zhong et al., arXiv:2305.10250), which uses the Ebbinghaus forgetting curve for memory lifecycle management:

  • R = exp(-t/S), where t = days since last access, S = S0 + access_count
  • Memories that are retrieved frequently build up strength (higher S) and therefore decay more slowly
  • Memories below the threshold R < 0.1 are pruned
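
The scoring rule above can be sketched in a few lines of Go. The function names and the choice of S0 = 1.0 in the demo are assumptions for illustration; the PR's actual code in score.go is not shown in this thread.

```go
package main

import (
	"fmt"
	"math"
)

// retention computes R = exp(-t/S), with t in days since last access and
// S = s0 + accessCount. The name and signature are hypothetical.
func retention(daysSinceAccess, s0 float64, accessCount int) float64 {
	s := s0 + float64(accessCount)
	if s <= 0 {
		return 0 // no strength at all: treat the memory as fully decayed
	}
	return math.Exp(-daysSinceAccess / s)
}

// shouldPrune reports whether a memory's retention is below the threshold.
func shouldPrune(r, threshold float64) bool {
	return r < threshold
}

func main() {
	const s0 = 1.0        // assumed initial strength, not taken from the PR
	const threshold = 0.1 // threshold quoted in the comment above
	cases := []struct {
		days   float64
		access int
	}{{1, 0}, {30, 0}, {30, 20}}
	for _, c := range cases {
		r := retention(c.days, s0, c.access)
		fmt.Printf("days=%v access=%d R=%.4f prune=%v\n",
			c.days, c.access, r, shouldPrune(r, threshold))
	}
}
```

With these assumed parameters, a memory untouched for 30 days and never retrieved falls below the threshold, while one with 20 retrievals over the same period does not, which is the behaviour the second bullet describes.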

Regarding memory injection, I opened issue #1386.

@abdallahsamabd abdallahsamabd force-pushed the feat/1350 branch 2 times, most recently from 91e9850 to ad1df7f Compare March 8, 2026 10:41
@abdallahsamabd abdallahsamabd force-pushed the feat/1350 branch 6 times, most recently from b5a0888 to bfe7794 Compare March 8, 2026 20:41
…lm-project#1350)

Signed-off-by: Abdallah Samara <abdallahsamabd@gmail.com>


7 participants