Fix/18143 flush mm cache #18942

Open

edwingao28 wants to merge 3 commits into sgl-project:main from edwingao28:fix/18141-flush-mm-cache

Conversation

@edwingao28 (Contributor) commented Feb 17, 2026

Motivation

Fix #18143

The existing /flush_cache endpoint clears the KV pool, radix cache, and grammar manager, but does not flush the MM embedding cache. This means stale MM embeddings can persist across cache flushes, which is unexpected for users who rely on /flush_cache to fully reset server state.

Modifications

  • Fix: add flush_mm_embedding_cache() and invoke it from Scheduler.flush_cache() (minimal sketch below).
  • Endpoint message update: /flush_cache now states that the MM embedding cache is included.
  • Test coverage: 8 new tests (three unit tests, two mocked scheduler tests, and three integration tests, including an in-flight 400 case).
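A minimal sketch of the new helper, assuming the MM embedding cache is a module-level object in mm_utils.py with a clear()-style method (the internals shown here are illustrative, not the exact implementation):

```python
# python/sglang/srt/managers/mm_utils.py (illustrative sketch)

# Assumption: the global MM embedding cache is a module-level object that is
# initialized elsewhere and exposes a clear() method.
embedding_cache = None


def flush_mm_embedding_cache() -> None:
    """Clear all entries from the global multimodal embedding cache."""
    if embedding_cache is not None:
        embedding_cache.clear()
```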

Accuracy Tests

N/A

Benchmarking and Profiling

N/A

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @edwingao28, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue where the /flush_cache endpoint did not clear the multimodal embedding cache, which could lead to stale data. The changes integrate the MM embedding cache into the existing flush mechanism, ensuring that the server state can be fully reset as intended. This enhancement improves data consistency and provides a more reliable cache management experience for users.

Highlights

  • Multimodal Embedding Cache Flushing: The /flush_cache endpoint now correctly clears the multimodal (MM) embedding cache, preventing stale data persistence and ensuring a complete server state reset.
  • API Response Update: The /flush_cache endpoint's response message has been updated to explicitly state that the MM embedding cache is included in the flush operation, providing clearer user feedback.
  • Comprehensive Test Coverage: New unit, mocked scheduler, and integration tests have been added to ensure the correct functionality and robustness of the MM embedding cache flushing mechanism, including handling scenarios with in-flight requests.


Changelog
  • python/sglang/srt/entrypoints/http_server.py
    • Updated the docstring for the /flush_cache endpoint to mention multimodal embedding cache.
    • Modified the response content of the /flush_cache endpoint to explicitly state that the MM embedding cache is flushed.
  • python/sglang/srt/managers/mm_utils.py
    • Added a new function flush_mm_embedding_cache to clear the global multimodal embedding cache.
  • python/sglang/srt/managers/scheduler.py
    • Imported the new flush_mm_embedding_cache function.
    • Invoked flush_mm_embedding_cache within the Scheduler.flush_cache method when no requests are pending.
  • test/registered/core/test_flush_mm_cache.py
    • Added a new test file containing unit tests for flush_mm_embedding_cache.
    • Included mocked tests for Scheduler.flush_cache integration.
    • Provided integration tests for the /flush_cache endpoint, including a scenario with in-flight requests.
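A rough sketch of the unit-level check described in the changelog above (only flush_mm_embedding_cache is confirmed by this PR; the module-level cache attribute name is an assumption):

```python
# Sketch of a unit test for flush_mm_embedding_cache; names besides the
# function under test are assumptions.
import unittest
from unittest import mock

from sglang.srt.managers import mm_utils


class TestFlushMMEmbeddingCache(unittest.TestCase):
    def test_flush_clears_the_cache(self):
        fake_cache = mock.MagicMock()
        # Swap in a fake cache object so no real allocations are needed.
        with mock.patch.object(mm_utils, "embedding_cache", fake_cache):
            mm_utils.flush_mm_embedding_cache()
        fake_cache.clear.assert_called_once()


if __name__ == "__main__":
    unittest.main()
```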
Activity
  • The author, edwingao28, created 8 new tests, including three unit tests, two mocked scheduler tests, and three integration tests (one covering an in-flight 400 case), to ensure the robustness of the new functionality.
  • No review comments or specific activity from other users have been recorded yet.

@gemini-code-assist (bot) left a comment


Code Review

The pull request correctly addresses the issue where the multimodal (MM) embedding cache was not cleared during a cache flush. By adding flush_mm_embedding_cache and invoking it within the Scheduler.flush_cache method, the PR ensures that stale MM embeddings do not persist across flushes, which is essential for a full reset of the server state. The implementation is safe because it is guarded by the _is_no_request() check, preventing flushes during active processing (sketched below). The updated endpoint message and the new test suite (covering unit, mock, and integration scenarios) are well implemented and provide good confidence in the changes.
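The guarded call is roughly the following (surrounding resets elided; this is a sketch, not the verbatim diff):

```python
# Sketch of the guarded flush path in scheduler.py (resets elided).
from sglang.srt.managers.mm_utils import flush_mm_embedding_cache


class Scheduler:
    def flush_cache(self) -> bool:
        if self._is_no_request():
            # Existing resets: KV pool, radix cache, grammar manager, ...
            self.tree_cache.reset()     # illustrative
            flush_mm_embedding_cache()  # new: MM embeddings are cleared too
            return True
        # In-flight requests: refuse; the endpoint surfaces this as HTTP 400.
        return False
```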

@edwingao28 (Author)

I ran my unit tests on an NVIDIA H100 80GB HBM3 and all 8 tests passed. I used Qwen2.5-1.5B-Instruct for the integration tests.

edwingao28 changed the title from Fix/18141 flush mm cache to Fix/18143 flush mm cache on Feb 18, 2026
@vincentzed (Contributor)

Can you try with a VL model like https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct? It shouldn't do anything for text-only requests.
Also test with a larger sgl VLM cache size.

@edwingao28 (Author)

> Can you try with a VL model like https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct? It shouldn't do anything for text-only requests. Also test with a larger sgl VLM cache size.

Sure, I will do that today.

@edwingao28 (Author)

Hi @vincentzed,

I tested /flush_cache with the VL model Qwen3-VL-8B-Instruct and with a larger VLM cache size.

Configuration:

  • Model: Qwen3-VL-8B-Instruct
  • VLM cache size: I tested with SGLANG_VLM_CACHE_SIZE_MB=512, SGLANG_VLM_CACHE_SIZE_MB=1024, and SGLANG_VLM_CACHE_SIZE_MB=4096

Results:

  1. Text-only requests:

    • /flush_cache succeeds
    • Text generation works normally afterward
    • No MM cache interaction, as expected
  2. Multimodal requests:

    • Image inference works correctly
    • /flush_cache succeeds and clears MM embedding cache
    • Multimodal inference continues to work correctly after flushing
  3. Integration tests:

    • All TestFlushCacheEndpoint tests passed
    • All unit and scheduler tests passed
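For anyone reproducing the multimodal check above, the manual flow was roughly the following (the port, image URL, and the POST method for /flush_cache are assumptions; adjust for your setup):

```python
import requests

BASE = "http://localhost:30000"  # illustrative port
IMAGE_URL = "https://example.com/cat.jpg"  # illustrative image


def ask_about_image() -> str:
    # OpenAI-compatible multimodal chat completion against the sglang server.
    resp = requests.post(
        f"{BASE}/v1/chat/completions",
        json={
            "model": "Qwen/Qwen3-VL-8B-Instruct",
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                    {"type": "text", "text": "Describe this image."},
                ],
            }],
        },
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


print(ask_about_image())                          # populates the MM embedding cache
print(requests.post(f"{BASE}/flush_cache").text)  # flush; MM cache is now included
print(ask_about_image())                          # multimodal inference still works
```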

Full terminal logs are attached for reference; flush_cache_vl_1024.log also includes the integration test with the 4096 MB cache size at the end.
flush_cache_vl_512.log
flush_cache_vl_1024.log

@edwingao28 (Author)

In addition to the current MM cache flush implementation, I noticed a few follow-ups while reading the codebase:

  1. Standalone /flush_mm_cache endpoint
    SGLang already exposes granular endpoints for other resources (KV cache, HiCache, profiling, LoRA), so adding a dedicated MM cache flush endpoint would align with existing conventions. This would also allow operators to reclaim MM cache memory without flushing KV/radix caches (hypothetical sketch at the end of this comment).

  2. Disaggregation encode_server flush path
    encode_server.py creates its own MultiModalStaticCache instance, but currently exposes no flush endpoint. This means the MM cache in EPD deployments cannot be flushed, which could lead to inconsistent behavior compared to the main scheduler.

  3. DP aggregation consistency
    In tokenizer_communicator_mixin.py, flush aggregation currently returns [0] without using merge_results or asserting dp_size == 1. While DP workers typically behave identically, aligning this with the merge_results pattern would improve correctness and consistency.

  4. Flush diagnostics visibility
    It may also be useful to log MM cache size before and after flush to improve observability and assist debugging.

Happy to help work on any of these if they are considered useful.
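For (1), a dedicated endpoint could mirror the existing /flush_cache wiring. A purely hypothetical sketch (none of these names exist yet; the tokenizer-manager plumbing is assumed):

```python
# Hypothetical /flush_mm_cache endpoint for http_server.py (not part of this
# PR); assumes the module-level FastAPI `app` and `_global_state` from that
# file, and a tokenizer-manager hook that forwards the request to the
# scheduler, which would then call flush_mm_embedding_cache().
from fastapi.responses import Response


@app.post("/flush_mm_cache")
async def flush_mm_cache():
    """Flush only the MM embedding cache, leaving KV/radix caches intact."""
    success = await _global_state.tokenizer_manager.flush_mm_cache()
    return Response(
        content="MM embedding cache flushed.\n" if success else "Flush failed.\n",
        status_code=200 if success else 400,
    )
```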
