Support captured dump output and console output control in dumper #19017

Open
fzyzcjy wants to merge 16 commits into sgl-project:main from fzyzcjy:ac8398/2

Conversation

@fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented Feb 19, 2026

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

When SGLANG_DUMPER_CLEANUP=1, all existing sglang_dump_* directories
under the base dir are removed before the first dump write to disk.
Rank 0 performs the cleanup with a distributed barrier for sync.
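
The cleanup step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `_cleanup_old_dumps`: the function names, the `prefix` parameter, and the injectable `barrier` callable are assumptions made so the example runs in a single process. In a real distributed job the barrier would be something like `torch.distributed.barrier()`.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def cleanup_old_dumps(base_dir: Path, prefix: str = "sglang_dump_") -> List[str]:
    """Remove every directory under base_dir whose name starts with prefix."""
    removed = []
    for entry in sorted(base_dir.glob(f"{prefix}*")):
        if entry.is_dir():
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed


def cleanup_on_rank_zero(
    base_dir: Path,
    rank: int,
    barrier: Callable[[], None] = lambda: None,  # hypothetical stand-in for a distributed barrier
) -> List[str]:
    """Only rank 0 deletes; every rank then waits at the barrier before writing."""
    removed = cleanup_old_dumps(base_dir) if rank == 0 else []
    barrier()
    return removed
```

Running the deletion on a single rank avoids several processes racing to remove the same directory, and the barrier keeps other ranks from writing new dumps before the old ones are gone.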

Collective operations (broadcast_object_list, all_gather_object) hang
silently when not all ranks participate. Add a configurable timeout
(default 60s) that prints a warning if a collective op doesn't complete,
helping users diagnose missing rank participation.
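
One way to build such a watchdog is sketched below. It is an illustration under assumptions, not the PR's `_collective_with_timeout`: the signature, the warning text, and the use of a daemon thread with a `threading.Event` are all choices made for this example. Note that the watchdog only prints a warning; it does not cancel the collective, so a call that truly hangs still hangs.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


def collective_with_timeout(
    fn: Callable[[], T],
    operation_name: str,
    timeout_seconds: float = 60.0,
) -> T:
    """Run fn() and print a warning if it has not finished within the timeout."""
    completed = threading.Event()

    def watchdog() -> None:
        # wait() returns False when the timeout elapses before the event is set.
        if not completed.wait(timeout=timeout_seconds):
            print(
                f"WARNING: {operation_name} has not completed after "
                f"{timeout_seconds}s; some ranks may not be participating.",
                flush=True,
            )

    thread = threading.Thread(target=watchdog, daemon=True)
    thread.start()
    try:
        return fn()
    finally:
        # Wake the watchdog so it exits without printing (or after it has printed).
        completed.set()
        thread.join()
```

A call site would wrap the collective in a closure, for example `collective_with_timeout(lambda: dist.broadcast_object_list(obj_list, src=0), "broadcast_object_list")`, assuming `dist` is `torch.distributed` and `obj_list` is defined by the caller.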

Add unit test (TestCollectiveTimeout) verifying the watchdog fires and
prints a warning when a collective op exceeds the timeout, and a
distributed test in TestDumperDistributed exercising the real
on_forward_pass_start broadcast with staggered rank joins.
… OUTPUT_CONSOLE

Replace single `SGLANG_DUMPER_WRITE_FILE` env var with two independent controls:
- `SGLANG_DUMPER_OUTPUT_FILE` (default 1): controls writing .pt files to disk
- `SGLANG_DUMPER_OUTPUT_CONSOLE` (default 1): controls printing dump info to stdout

This lets users disable console output independently from file output.
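
As a rough sketch of how the two controls could be read independently: the environment variable names and their defaults come from the commit message above, while `OutputConfig`, `output_config_from_env`, and the set of accepted truthy strings are assumptions for illustration (the real code uses sglang's own `get_bool_env_var`).

```python
import os
from typing import Mapping, NamedTuple


class OutputConfig(NamedTuple):
    enable_output_file: bool
    enable_output_console: bool


def _env_flag(env: Mapping[str, str], name: str, default: str) -> bool:
    # Assumption: "1" and "true" (case-insensitive) count as enabled.
    return env.get(name, default).strip().lower() in ("1", "true")


def output_config_from_env(env: Mapping[str, str] = os.environ) -> OutputConfig:
    """Read both flags separately so either output can be turned off alone."""
    return OutputConfig(
        enable_output_file=_env_flag(env, "SGLANG_DUMPER_OUTPUT_FILE", "1"),
        enable_output_console=_env_flag(env, "SGLANG_DUMPER_OUTPUT_CONSOLE", "1"),
    )
```

With this shape, setting only `SGLANG_DUMPER_OUTPUT_CONSOLE=0` silences stdout while `.pt` files are still written.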
@gemini-code-assist
Contributor

Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the dumper utility by providing more granular control over its output behavior and enhancing its robustness in distributed environments. Users can now independently enable or disable writing dumps to files and logging to the console. Additionally, a new feature allows for automatic cleanup of previous dump directories, and a timeout mechanism has been integrated into collective distributed operations to prevent potential deadlocks, making the dumper more reliable and user-friendly.

Highlights

  • Enhanced Output Control: The dumper now offers granular control over its output, allowing users to independently enable or disable writing dump files and logging to the console via new configuration parameters and environment variables.
  • Automatic Dump Cleanup: A new feature has been added to automatically clean up previous dump directories at the start of a new dumper session, preventing accumulation of old debug data. This can be controlled via the SGLANG_DUMPER_CLEANUP_PREVIOUS environment variable.
  • Distributed Operation Timeout: A collective_timeout mechanism has been introduced for distributed operations within the dumper, such as broadcast_object_list and all_gather_object, to prevent indefinite hangs if not all ranks participate.
  • Configuration Refactoring: Internal dumper configuration parameters, such as enable_write_file, have been renamed to enable_output_file for improved clarity and consistency.

Changelog
  • python/sglang/srt/debug_utils/dumper.py
    • Added SGLANG_DUMPER_CLEANUP_PREVIOUS environment variable for automatic cleanup.
    • Renamed enable_write_file to enable_output_file and introduced enable_output_console for finer output control.
    • Included cleanup_previous and collective_timeout parameters in the _Dumper class constructor.
    • Implemented _pending_cleanup state to manage initial dump directory cleanup.
    • Updated _Dumper.from_env to load new configuration from environment variables.
    • Modified _ensure_http_server and _ensure_partial_name to pass collective_timeout.
    • Conditionalized console print statements based on _enable_output_console.
    • Integrated cleanup logic into _dump_single before the first file write.
    • Introduced _collective_with_timeout to add watchdog functionality to distributed operations.
    • Applied _collective_with_timeout to dist.broadcast_object_list in _get_partial_name.
    • Added _cleanup_old_dumps function to remove previous dump directories.
    • Updated _start_maybe_http_server and _create_zmq_rpc_handles to utilize the new timeout_seconds parameter.
    • Wrapped dist.all_gather_object in _create_zmq_rpc_handles with _collective_with_timeout.
  • test/registered/debug_utils/test_dumper.py
    • Imported necessary modules: io, threading, contextlib.
    • Imported _collective_with_timeout for testing.
    • Added _capture_stdout context manager to capture console output in tests.
    • Introduced TestCollectiveTimeout class with a test for the watchdog firing on timeout.
    • Added test_collective_timeout and its static method _test_collective_timeout_func to TestDumperDistributed to verify distributed timeout behavior.
    • Removed the test_write_disabled test case and its associated function.
    • Added TestOutputControl class to test the new enable_output_file and enable_output_console functionalities.
    • Added TestCleanup class to verify the cleanup_previous functionality, ensuring old dumps are removed or retained as expected.
Activity
  • No specific activity (comments, reviews, progress) was provided in the pull request description or context. The description is a template.

… capture

Captures dump data into a dict instead of writing to disk. The capture
happens at the _dump_single save point, so enable/filter checks and
console output still apply normally. Values are cloned to prevent
in-place mutation after capture.
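
The capture behavior can be illustrated with a small stand-in. Everything here is hypothetical: `CapturingDumper`, its constructor arguments, and the console message format are invented for the example, and `copy.deepcopy` stands in for cloning a tensor (for a `torch.Tensor` one would typically call `.clone()`). The sketch only shows the ordering the commit message describes: the enable check and console output run first, then the value is copied into the dict at the point where the file write would otherwise happen.

```python
import copy
from typing import Any, Dict, Optional


class CapturingDumper:
    """Toy dumper: captures values into a dict instead of writing files."""

    def __init__(
        self,
        capture_into: Optional[Dict[str, Any]] = None,
        enable: bool = True,
        enable_output_console: bool = True,
    ) -> None:
        self._capture = capture_into
        self._enable = enable
        self._enable_output_console = enable_output_console

    def dump(self, name: str, value: Any) -> None:
        if not self._enable:
            return  # enable/filter checks still apply before capture
        if self._enable_output_console:
            print(f"[dump] {name}")
        if self._capture is not None:
            # Copy so later in-place changes do not alter the captured value.
            self._capture[name] = copy.deepcopy(value)
            return
        # A real dumper would write the value to disk here.
```

For example, after `d = CapturingDumper(captured)` and `d.dump("x", x)`, appending to or overwriting `x` in place leaves `captured["x"]` unchanged.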
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces several enhancements to the dumper utility. Key additions include options to control console and file output separately, and a feature to automatically clean up previous dump directories. A significant improvement is the new _collective_with_timeout wrapper, which adds a watchdog with a timeout to distributed collective operations, helping to debug hangs. The changes are well-implemented and accompanied by thorough tests for the new functionality. My only suggestion is to make the collective operation timeout configurable via an environment variable for better flexibility.

"SGLANG_ENABLE_DUMPER_HTTP_SERVER", "1"
),
cleanup_previous=get_bool_env_var("SGLANG_DUMPER_CLEANUP_PREVIOUS", "0"),
collective_timeout=60,

medium

The collective_timeout is hardcoded to 60 seconds. While this is a reasonable default, it would be more flexible and consistent with other parameters in this class to make it configurable via an environment variable, for example SGLANG_DUMPER_COLLECTIVE_TIMEOUT. This would allow users to adjust the timeout for different environments or debugging scenarios without changing the code.

Suggested change
- collective_timeout=60,
+ collective_timeout=get_int_env_var("SGLANG_DUMPER_COLLECTIVE_TIMEOUT", 60),
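
For readers unfamiliar with the helper, an integer environment lookup of this kind might look like the sketch below. `get_int_env` is a hypothetical stand-in written for this example, not sglang's `get_int_env_var`, and falling back to the default on a missing, empty, or non-numeric value is an assumption about the desired behavior.

```python
import os
from typing import Mapping


def get_int_env(name: str, default: int, env: Mapping[str, str] = os.environ) -> int:
    """Return env[name] parsed as an int, or default if unset or unparsable."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        # Assumption: a malformed value should not crash startup.
        return default


# The default stays at 60 seconds unless the variable overrides it.
collective_timeout = get_int_env("SGLANG_DUMPER_COLLECTIVE_TIMEOUT", 60)
```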
