Fix/18143 flush mm cache #18942

Open

edwingao28 wants to merge 3 commits into sgl-project:main from edwingao28:fix/18141-flush-mm-cache

Conversation

@edwingao28 (Contributor) commented Feb 17, 2026

Motivation

Fix #18143

The existing /flush_cache endpoint clears the KV pool, radix cache, and grammar manager, but does not flush the MM embedding cache. This means stale MM embeddings can persist across cache flushes, which is unexpected for users who rely on /flush_cache to fully reset server state.

Modifications

  • Fix: add flush_mm_embedding_cache() and invoke it from Scheduler.flush_cache() (minimal sketch below).
  • Endpoint message update: /flush_cache now states that the MM embedding cache is included.
  • Test coverage: 8 new tests (three unit tests, two mocked scheduler tests, and three integration tests, including an in-flight 400 case).
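A minimal sketch of the new helper, assuming the MM embedding cache is a module-level object in mm_utils.py with a clear()-style method (the internals shown here are illustrative, not the exact implementation):

```python
# python/sglang/srt/managers/mm_utils.py (illustrative sketch)

# Assumption: the global MM embedding cache is a module-level object that is
# initialized elsewhere and exposes a clear() method.
embedding_cache = None


def flush_mm_embedding_cache() -> None:
    """Clear all entries from the global multimodal embedding cache."""
    if embedding_cache is not None:
        embedding_cache.clear()
```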

Accuracy Tests

N/A

Benchmarking and Profiling

N/A

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @edwingao28, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue where the /flush_cache endpoint did not clear the multimodal embedding cache, which could lead to stale data. The changes integrate the MM embedding cache into the existing flush mechanism, ensuring that the server state can be fully reset as intended. This enhancement improves data consistency and provides a more reliable cache management experience for users.

Highlights

  • Multimodal Embedding Cache Flushing: The /flush_cache endpoint now correctly clears the multimodal (MM) embedding cache, preventing stale data persistence and ensuring a complete server state reset.
  • API Response Update: The /flush_cache endpoint's response message has been updated to explicitly state that the MM embedding cache is included in the flush operation, providing clearer user feedback.
  • Comprehensive Test Coverage: New unit, mocked scheduler, and integration tests have been added to ensure the correct functionality and robustness of the MM embedding cache flushing mechanism, including handling scenarios with in-flight requests.


Changelog
  • python/sglang/srt/entrypoints/http_server.py
    • Updated the docstring for the /flush_cache endpoint to mention multimodal embedding cache.
    • Modified the response content of the /flush_cache endpoint to explicitly state that the MM embedding cache is flushed.
  • python/sglang/srt/managers/mm_utils.py
    • Added a new function flush_mm_embedding_cache to clear the global multimodal embedding cache.
  • python/sglang/srt/managers/scheduler.py
    • Imported the new flush_mm_embedding_cache function.
    • Invoked flush_mm_embedding_cache within the Scheduler.flush_cache method when no requests are pending.
  • test/registered/core/test_flush_mm_cache.py
    • Added a new test file containing unit tests for flush_mm_embedding_cache.
    • Included mocked tests for Scheduler.flush_cache integration.
    • Provided integration tests for the /flush_cache endpoint, including a scenario with in-flight requests.
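A rough sketch of the unit-level check described in the changelog above (only flush_mm_embedding_cache is confirmed by this PR; the module-level cache attribute name is an assumption):

```python
# Sketch of a unit test for flush_mm_embedding_cache; names besides the
# function under test are assumptions.
import unittest
from unittest import mock

from sglang.srt.managers import mm_utils


class TestFlushMMEmbeddingCache(unittest.TestCase):
    def test_flush_clears_the_cache(self):
        fake_cache = mock.MagicMock()
        # Swap in a fake cache object so no real allocations are needed.
        with mock.patch.object(mm_utils, "embedding_cache", fake_cache):
            mm_utils.flush_mm_embedding_cache()
        fake_cache.clear.assert_called_once()


if __name__ == "__main__":
    unittest.main()
```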
Activity
  • The author, edwingao28, created 8 new tests, including three unit tests, two mocked scheduler tests, and three integration tests (one covering an in-flight 400 case), to ensure the robustness of the new functionality.
  • No review comments or specific activity from other users have been recorded yet.

@gemini-code-assist (bot) left a comment


Code Review

The pull request correctly addresses the issue where the multimodal (MM) embedding cache was not cleared during a cache flush. By adding flush_mm_embedding_cache and invoking it within the Scheduler.flush_cache method, the PR ensures that stale MM embeddings do not persist across flushes, which is essential for a full reset of the server state. The implementation is safe because it is guarded by the _is_no_request() check, preventing flushes during active processing (sketched below). The updated endpoint message and the new test suite (covering unit, mock, and integration scenarios) are well implemented and provide good confidence in the changes.
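The guarded call is roughly the following (surrounding resets elided; this is a sketch, not the verbatim diff):

```python
# Sketch of the guarded flush path in scheduler.py (resets elided).
from sglang.srt.managers.mm_utils import flush_mm_embedding_cache


class Scheduler:
    def flush_cache(self) -> bool:
        if self._is_no_request():
            # Existing resets: KV pool, radix cache, grammar manager, ...
            self.tree_cache.reset()     # illustrative
            flush_mm_embedding_cache()  # new: MM embeddings are cleared too
            return True
        # In-flight requests: refuse; the endpoint surfaces this as HTTP 400.
        return False
```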

@edwingao28 (Author)

I ran my unit tests on an NVIDIA H100 80GB HBM3 and all 8 tests passed. I used Qwen2.5-1.5B-Instruct for the integration tests.

edwingao28 changed the title from Fix/18141 flush mm cache to Fix/18143 flush mm cache on Feb 18, 2026
@vincentzed (Contributor)

Can you try with a VL model like https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct? It shouldn't do anything for text-only requests.
Also test with a larger sgl VLM cache size.

@edwingao28 (Author)

> Can you try with a VL model like https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct? It shouldn't do anything for text-only requests. Also test with a larger sgl VLM cache size.

Sure, I will do that today.

@edwingao28 (Author)

Hi @vincentzed,

I tested /flush_cache with the VL model Qwen3-VL-8B-Instruct and with a larger VLM cache size.

Configuration:

  • Model: Qwen3-VL-8B-Instruct
  • VLM cache size: I tested with SGLANG_VLM_CACHE_SIZE_MB=512, SGLANG_VLM_CACHE_SIZE_MB=1024, and SGLANG_VLM_CACHE_SIZE_MB=4096

Results:

  1. Text-only requests:

    • /flush_cache succeeds
    • Text generation works normally afterward
    • No MM cache interaction, as expected
  2. Multimodal requests:

    • Image inference works correctly
    • /flush_cache succeeds and clears MM embedding cache
    • Multimodal inference continues to work correctly after flushing
  3. Integration tests:

    • All TestFlushCacheEndpoint tests passed
    • All unit and scheduler tests passed
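For anyone reproducing the multimodal check above, the manual flow was roughly the following (the port, image URL, and the POST method for /flush_cache are assumptions; adjust for your setup):

```python
import requests

BASE = "http://localhost:30000"  # illustrative port
IMAGE_URL = "https://example.com/cat.jpg"  # illustrative image


def ask_about_image() -> str:
    # OpenAI-compatible multimodal chat completion against the sglang server.
    resp = requests.post(
        f"{BASE}/v1/chat/completions",
        json={
            "model": "Qwen/Qwen3-VL-8B-Instruct",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                    {"type": "text", "text": "Describe this image."},
                ],
            }],
        },
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


print(ask_about_image())                          # populates the MM embedding cache
print(requests.post(f"{BASE}/flush_cache").text)  # flush; MM cache is now included
print(ask_about_image())                          # multimodal inference still works
```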

Full terminal logs are attached for reference; flush_cache_vl_1024.log also includes the integration test with the 4096 MB cache size at the end.
flush_cache_vl_512.log
flush_cache_vl_1024.log

@edwingao28 (Author)

In addition to the current MM cache flush implementation, I noticed a few follow-ups while reading the codebase:

  1. Standalone /flush_mm_cache endpoint
    SGLang already exposes granular endpoints for other resources (KV cache, HiCache, profiling, LoRA), so adding a dedicated MM cache flush endpoint would align with existing conventions. This would also allow operators to reclaim MM cache memory without flushing KV/radix caches (hypothetical sketch at the end of this comment).

  2. Disaggregation encode_server flush path
    encode_server.py creates its own MultiModalStaticCache instance, but currently exposes no flush endpoint. This means the MM cache in EPD deployments cannot be flushed, which could lead to inconsistent behavior compared to the main scheduler.

  3. DP aggregation consistency
    In tokenizer_communicator_mixin.py, flush aggregation currently returns [0] without using merge_results or asserting dp_size == 1. While DP workers typically behave identically, aligning this with the merge_results pattern would improve correctness and consistency.

  4. Flush diagnostics visibility
    It may also be useful to log MM cache size before and after flush to improve observability and assist debugging.

Happy to help work on any of these if they are considered useful.
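For (1), a dedicated endpoint could mirror the existing /flush_cache wiring. A purely hypothetical sketch (none of these names exist yet; the tokenizer-manager plumbing is assumed):

```python
# Hypothetical /flush_mm_cache endpoint for http_server.py (not part of this
# PR); assumes the module-level FastAPI `app` and `_global_state` from that
# file, and a tokenizer-manager hook that forwards the request to the
# scheduler, which would then call flush_mm_embedding_cache().
from fastapi.responses import Response


@app.post("/flush_mm_cache")
async def flush_mm_cache():
    """Flush only the MM embedding cache, leaving KV/radix caches intact."""
    success = await _global_state.tokenizer_manager.flush_mm_cache()
    return Response(
        content="MM embedding cache flushed.\n" if success else "Flush failed.\n",
        status_code=200 if success else 400,
    )
```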
