Support dynamic re-configure, enhance override_enable, and env var in dumper#19018
Support dynamic re-configure, enhance override_enable, and env var in dumper#19018fzyzcjy wants to merge 17 commits intosgl-project:mainfrom
Conversation
When SGLANG_DUMPER_CLEANUP=1, all existing sglang_dump_* directories under the base dir are removed before the first dump write to disk. Rank 0 performs the cleanup with a distributed barrier for sync.
Collective operations (broadcast_object_list, all_gather_object) hang silently when not all ranks participate. Add a configurable timeout (default 60s) that prints a warning if a collective op doesn't complete, helping users diagnose missing rank participation.
Add unit test (TestCollectiveTimeout) verifying the watchdog fires and prints a warning when a collective op exceeds the timeout, and a distributed test in TestDumperDistributed exercising the real on_forward_pass_start broadcast with staggered rank joins.
(cherry picked from commit 402ad07)
… OUTPUT_CONSOLE Replace single `SGLANG_DUMPER_WRITE_FILE` env var with two independent controls: - `SGLANG_DUMPER_OUTPUT_FILE` (default 1): controls writing .pt files to disk - `SGLANG_DUMPER_OUTPUT_CONSOLE` (default 1): controls printing dump info to stdout This lets users disable console output independently from file output.
…stOutputControl class
…utControl.test_file_disabled
… capture Captures dump data into a dict instead of writing to disk. The capture happens at the _dump_single save point, so enable/filter checks and console output still apply normally. Values are cloned to prevent in-place mutation after capture.
Summary of ChangesHello @fzyzcjy, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the flexibility and robustness of the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request significantly enhances the dumper utility by generalizing its configuration and improving its robustness in distributed environments. Key changes include adding options to control console and file output, a feature to capture dumps in memory, and a mechanism to clean up old dumps. A major improvement is the introduction of a watchdog with a configurable timeout for collective operations, which will help diagnose hangs in distributed runs. The changes are well-tested with new unit and distributed tests. My only suggestion is to make the collective_timeout configurable via an environment variable for greater flexibility.
| "SGLANG_ENABLE_DUMPER_HTTP_SERVER", "1" | ||
| ), | ||
| cleanup_previous=get_bool_env_var("SGLANG_DUMPER_CLEANUP_PREVIOUS", "0"), | ||
| collective_timeout=60, |
There was a problem hiding this comment.
The collective_timeout is hardcoded to 60 seconds. While this is a reasonable default, it would be more flexible to allow users to configure this value through an environment variable, similar to other settings. This is particularly useful in environments with slower networks or for debugging complex distributed scenarios where collectives might take longer than expected.
| collective_timeout=60, | |
| collective_timeout=get_int_env_var("SGLANG_DUMPER_COLLECTIVE_TIMEOUT", 60), |
Move all 13 _Dumper init parameters into a @DataClass(frozen=True) _DumperConfig. _Dumper.__init__ now takes a single config parameter. - Runtime mutations (enable via HTTP, lazy partial_name) use dataclasses.replace() to swap the frozen config - _DumperConfig.from_env() centralizes env var parsing with defaults matching field defaults (verified by new UT) - Rename _pending_cleanup to _cleanup_previous_handled for consistency with _http_server_handled
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci