
Support dynamic re-configure, enhance override_enable, and env var in dumper #19018

Open
fzyzcjy wants to merge 17 commits into sgl-project:main from fzyzcjy:ac8398/3

Collaborator

@fzyzcjy fzyzcjy commented Feb 19, 2026

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

When SGLANG_DUMPER_CLEANUP=1, all existing sglang_dump_* directories
under the base dir are removed before the first dump write to disk.
Rank 0 performs the cleanup with a distributed barrier for sync.
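
The cleanup step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name `cleanup_old_dumps`, its parameters, and the injectable `barrier` callable are assumptions made for the example. In a real distributed run the barrier would presumably be something like `torch.distributed.barrier`.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def cleanup_old_dumps(
    base_dir: Path,
    rank: int,
    barrier: Optional[Callable[[], None]] = None,
    prefix: str = "sglang_dump_",
) -> List[str]:
    """Remove earlier dump directories; only rank 0 touches the filesystem.

    Hypothetical helper: name and signature are illustrative only.
    """
    removed: List[str] = []
    if rank == 0:
        for entry in sorted(base_dir.glob(prefix + "*")):
            if entry.is_dir():
                shutil.rmtree(entry)
                removed.append(entry.name)
    # Every rank waits here so that no rank writes a new dump
    # while rank 0 is still deleting old ones.
    if barrier is not None:
        barrier()
    return removed
```

Passing the barrier in as a callable keeps the sketch runnable in a single process; a no-op (or `None`) stands in for the distributed sync.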
Collective operations (broadcast_object_list, all_gather_object) hang
silently when not all ranks participate. Add a configurable timeout
(default 60s) that prints a warning if a collective op doesn't complete,
helping users diagnose missing rank participation.
Add unit test (TestCollectiveTimeout) verifying the watchdog fires and
prints a warning when a collective op exceeds the timeout, and a
distributed test in TestDumperDistributed exercising the real
on_forward_pass_start broadcast with staggered rank joins.
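
One way such a watchdog could work is sketched below, assuming a background thread that waits on an event. The signature, the `warn` parameter, and the message text are assumptions for illustration; the PR's actual `_collective_with_timeout` may differ. Note that the sketch only warns; it does not cancel the blocked call.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


def collective_with_timeout(
    fn: Callable[[], T],
    operation_name: str,
    timeout_seconds: float = 60.0,
    warn: Callable[[str], None] = print,
) -> T:
    """Run fn(); emit a warning if it is still running after timeout_seconds."""
    completed = threading.Event()

    def watchdog() -> None:
        # wait() returns False only when the timeout elapsed without completion.
        if not completed.wait(timeout=timeout_seconds):
            warn(
                f"WARNING: {operation_name} has not completed after "
                f"{timeout_seconds}s; are all ranks participating?"
            )

    thread = threading.Thread(target=watchdog, daemon=True)
    thread.start()
    try:
        return fn()
    finally:
        # Wake the watchdog immediately so a fast op does not wait out the timeout.
        completed.set()
        thread.join()
```

A caller would wrap the blocking collective in a closure, for example `collective_with_timeout(lambda: do_broadcast(), "broadcast_object_list")`, where `do_broadcast` is a placeholder for the real call.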
… OUTPUT_CONSOLE

Replace single `SGLANG_DUMPER_WRITE_FILE` env var with two independent controls:
- `SGLANG_DUMPER_OUTPUT_FILE` (default 1): controls writing .pt files to disk
- `SGLANG_DUMPER_OUTPUT_CONSOLE` (default 1): controls printing dump info to stdout

This lets users disable console output independently from file output.
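
As a rough sketch of how the two independent flags might be read: the helper names `read_bool_env` and `read_output_flags` and the set of accepted truthy strings are assumptions here, not the project's own env var parsing.

```python
import os
from typing import Dict, Mapping


def read_bool_env(name: str, default: str, env: Mapping[str, str] = os.environ) -> bool:
    """Interpret an env var as a boolean (accepted spellings are an assumption)."""
    return env.get(name, default).strip().lower() in ("1", "true", "yes")


def read_output_flags(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Read the two output controls; both default to enabled."""
    return {
        "enable_output_file": read_bool_env("SGLANG_DUMPER_OUTPUT_FILE", "1", env),
        "enable_output_console": read_bool_env("SGLANG_DUMPER_OUTPUT_CONSOLE", "1", env),
    }
```

With this shape, setting `SGLANG_DUMPER_OUTPUT_CONSOLE=0` silences stdout while .pt files are still written.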
… capture

Captures dump data into a dict instead of writing to disk. The capture
happens at the _dump_single save point, so enable/filter checks and
console output still apply normally. Values are cloned to prevent
in-place mutation after capture.
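
The capture behaviour can be illustrated with a toy stand-in. `MiniDumper` below is hypothetical and uses `copy.deepcopy` where the real dumper would presumably clone tensors; only the `capture_output` name comes from the PR.

```python
import contextlib
import copy
from typing import Any, Dict, Iterator, List, Optional


class MiniDumper:
    """Toy stand-in for the dumper, showing only the capture path."""

    def __init__(self) -> None:
        self._captured: Optional[Dict[str, Any]] = None
        self.files_written: List[str] = []  # stand-in for real disk writes

    @contextlib.contextmanager
    def capture_output(self) -> Iterator[Dict[str, Any]]:
        captured: Dict[str, Any] = {}
        self._captured = captured
        try:
            yield captured
        finally:
            self._captured = None

    def dump(self, name: str, value: Any) -> None:
        if self._captured is not None:
            # Copy so later in-place mutation does not alter the captured value.
            self._captured[name] = copy.deepcopy(value)
        else:
            self.files_written.append(name)
```

In a test, this lets assertions run directly against the captured dict:

```python
dumper = MiniDumper()
values = [1, 2]
with dumper.capture_output() as captured:
    dumper.dump("x", values)
    values.append(3)  # mutation after capture
assert captured["x"] == [1, 2]
```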
@gemini-code-assist
Contributor

Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility and robustness of the dumper utility by generalizing its configuration and introducing new output control mechanisms. It allows users to precisely control where and how dump data is handled, whether through files, console output, or in-memory capture. Additionally, it improves the reliability of distributed operations within the dumper by implementing timeouts for collective calls and adds a convenient feature for automatically cleaning up old dump files.

Highlights

  • Enhanced Dumper Configuration: Introduced new configuration options for the dumper, allowing separate control over writing output to files (enable_output_file) and printing to the console (enable_output_console). Previously, only file writing could be enabled/disabled.
  • Output Capture Mechanism: Added a capture_output context manager to the dumper, enabling programmatic capture of dumped data into memory. This is particularly useful for testing and in-memory analysis without writing to disk.
  • Automatic Dump Cleanup: Implemented a cleanup_previous option that, when enabled, automatically removes old dump directories before the first write operation, helping to manage disk space.
  • Distributed Collective Operation Timeout: Integrated a collective_with_timeout mechanism for distributed operations within the dumper (e.g., _get_partial_name, RPC setup). This adds robustness by preventing hangs if not all ranks participate in a collective operation, printing a warning instead.
  • Refactored Output Handling: Modified the _dump_single method to conditionally print to console and write to file based on the new configuration, and to handle captured output data.
  • Comprehensive Testing: Added new test cases to cover the new output control options, the capture_output context manager, the collective_with_timeout functionality, and the automatic cleanup feature, ensuring the robustness of the dumper utility.

Changelog
  • python/sglang/srt/debug_utils/dumper.py
    • Imported contextlib for context managers.
    • Renamed enable_write_file to enable_output_file and added enable_output_console, cleanup_previous, and collective_timeout parameters to the _Dumper constructor.
    • Updated _Dumper.from_env to read new environment variables SGLANG_DUMPER_OUTPUT_FILE, SGLANG_DUMPER_OUTPUT_CONSOLE, and SGLANG_DUMPER_CLEANUP_PREVIOUS.
    • Introduced _captured_output_data and _pending_cleanup state variables.
    • Modified _ensure_http_server and _ensure_partial_name to accept and utilize timeout_seconds.
    • Added a capture_output context manager for in-memory data capture.
    • Updated _dump_single to conditionally print to console, handle captured output, and trigger cleanup of old dumps.
    • Implemented _collective_with_timeout function to add timeout functionality to distributed collective operations.
    • Modified _get_partial_name to use _collective_with_timeout.
    • Added _cleanup_old_dumps function to remove previous dump directories.
    • Introduced _deepcopy_or_clone helper function for handling tensor cloning or deep copying.
    • Updated _start_maybe_http_server and _create_zmq_rpc_handles to pass timeout_seconds and use _collective_with_timeout for distributed calls.
  • test/registered/debug_utils/test_dumper.py
    • Added imports for io, sys, threading, and contextlib.
    • Imported _collective_with_timeout for testing.
    • Created _capture_stdout context manager for testing console output.
    • Added TestCollectiveTimeout class to test the new _collective_with_timeout function.
    • Introduced test_collective_timeout and _test_collective_timeout_func within TestDumperDistributed to verify distributed timeout behavior.
    • Removed the test_write_disabled test case.
    • Added TestOutputControl class with tests for enable_output_file, enable_output_console, and capture_output functionality, including cloning of captured values and filter adherence.
    • Added TestCleanup class to verify the cleanup_previous functionality.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@fzyzcjy fzyzcjy changed the title from "Generalize configuration and override_enable in dumper" to "Support dynamic re-configure and enhance override_enable in dumper" on Feb 19, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the dumper utility by generalizing its configuration and improving its robustness in distributed environments. Key changes include adding options to control console and file output, a feature to capture dumps in memory, and a mechanism to clean up old dumps. A major improvement is the introduction of a watchdog with a configurable timeout for collective operations, which will help diagnose hangs in distributed runs. The changes are well-tested with new unit and distributed tests. My only suggestion is to make the collective_timeout configurable via an environment variable for greater flexibility.

"SGLANG_ENABLE_DUMPER_HTTP_SERVER", "1"
),
cleanup_previous=get_bool_env_var("SGLANG_DUMPER_CLEANUP_PREVIOUS", "0"),
collective_timeout=60,

medium

The collective_timeout is hardcoded to 60 seconds. While this is a reasonable default, it would be more flexible to allow users to configure this value through an environment variable, similar to other settings. This is particularly useful in environments with slower networks or for debugging complex distributed scenarios where collectives might take longer than expected.

Suggested change
- collective_timeout=60,
+ collective_timeout=get_int_env_var("SGLANG_DUMPER_COLLECTIVE_TIMEOUT", 60),

@fzyzcjy fzyzcjy changed the title from "Support dynamic re-configure and enhance override_enable in dumper" to "Support dynamic re-configure, enhance override_enable, and env var in dumper" on Feb 19, 2026
Move all 13 _Dumper init parameters into a @dataclass(frozen=True)
_DumperConfig. _Dumper.__init__ now takes a single config parameter.

- Runtime mutations (enable via HTTP, lazy partial_name) use
  dataclasses.replace() to swap the frozen config
- _DumperConfig.from_env() centralizes env var parsing with defaults
  matching field defaults (verified by new UT)
- Rename _pending_cleanup to _cleanup_previous_handled for consistency
  with _http_server_handled
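
The frozen-config pattern described in this commit can be sketched as below. The class names `DumperConfigSketch` and `DumperSketch`, the `enable` field, and the `set_enable` method are assumptions for illustration, and only four of the 13 fields are shown; the env var names and defaults are the ones mentioned elsewhere in this PR.

```python
import dataclasses
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DumperConfigSketch:
    enable: bool = False  # hypothetical field, toggled at runtime
    enable_output_file: bool = True
    enable_output_console: bool = True
    cleanup_previous: bool = False
    collective_timeout: int = 60

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "DumperConfigSketch":
        defaults = cls()  # field defaults double as env var defaults

        def flag(name: str, default: bool) -> bool:
            raw = env.get(name)
            return default if raw is None else raw.strip().lower() in ("1", "true", "yes")

        return cls(
            enable_output_file=flag("SGLANG_DUMPER_OUTPUT_FILE", defaults.enable_output_file),
            enable_output_console=flag("SGLANG_DUMPER_OUTPUT_CONSOLE", defaults.enable_output_console),
            cleanup_previous=flag("SGLANG_DUMPER_CLEANUP_PREVIOUS", defaults.cleanup_previous),
        )


class DumperSketch:
    def __init__(self, config: DumperConfigSketch) -> None:
        self._config = config

    def set_enable(self, value: bool) -> None:
        # The config is frozen, so build a modified copy and swap it in.
        self._config = dataclasses.replace(self._config, enable=value)
```

Because the config object is immutable, any code holding a reference to the old config keeps a consistent snapshot while the dumper moves on to the new one.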