Nodes crash monitoring #29
Merged
This pull request updates documentation, configuration files, and scripts to improve usability, clarify execution modes, and simplify running the MMIRAGE pipeline. The most important changes are grouped below by theme.
Documentation and Usage Improvements:
- `README.md`: clearer instructions for running the pipeline, checking status, and retrying failed shards, including new CLI command examples and explanations of retry behavior.
- `README.md`: now covers `state_dir` and `execution_params`, and clarifies how to control sharding and retries.
Nodes crash monitoring:
- `status.json` custom file
- `retry=True`
Configuration Enhancements:
- `configs/config_comprehensive.yaml`: documents all available parameters, with detailed comments for each section.
- Mock configs (`configs/config_mock.yaml`, `configs/config_mock_vision.yaml`): now include `state_dir`, set `num_shards` to 1 for local runs, and add `execution_params` for explicit execution mode and retry control.
Script and CLI Improvements:
- `run.sh`: delegates all execution logic to the Python CLI, removing direct SLURM commands and making the script config-driven.
- `mmirage` CLI entrypoint in `pyproject.toml` for easier command-line usage.
- `run` entrypoint
- `check` entrypoint
Codebase Organization:
- `src/mmirage/__init__.py`: reflects the module restructuring and exposes all relevant configuration classes at the package level.
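To make the retry behavior described above more concrete, here is a minimal sketch of how a `status.json` file under `state_dir` could drive the decision of which shards to run. It is not the code from this pull request: the file schema, the state names (`pending`, `done`, `failed`), and the helper functions `load_status`, `save_status`, and `shards_to_run` are all hypothetical and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical schema (an assumption, not taken from the PR):
# {"shards": {"0": "done", "1": "failed"}}; a missing shard counts as "pending".


def load_status(state_dir: Path) -> dict:
    """Read status.json from state_dir, or return an empty status if absent."""
    path = state_dir / "status.json"
    if not path.exists():
        return {"shards": {}}
    return json.loads(path.read_text())


def save_status(state_dir: Path, status: dict) -> None:
    """Write status.json via a temporary file so a crash mid-write
    does not leave a truncated status file behind."""
    state_dir.mkdir(parents=True, exist_ok=True)
    tmp_path = state_dir / "status.json.tmp"
    tmp_path.write_text(json.dumps(status, indent=2))
    tmp_path.replace(state_dir / "status.json")


def shards_to_run(status: dict, num_shards: int, retry: bool) -> list:
    """Return the shard indices that still need work.

    Shards marked "done" are always skipped. Shards marked "failed"
    are rerun only when retry is True.
    """
    selected = []
    for index in range(num_shards):
        state = status["shards"].get(str(index), "pending")
        if state == "done":
            continue
        if state == "failed" and not retry:
            continue
        selected.append(index)
    return selected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state_dir = Path(tmp) / "state"
        save_status(state_dir, {"shards": {"0": "done", "1": "failed"}})
        status = load_status(state_dir)
        print(shards_to_run(status, num_shards=3, retry=False))  # [2]
        print(shards_to_run(status, num_shards=3, retry=True))   # [1, 2]
```

Keeping the status in a plain JSON file means a separate status check can inspect progress without touching the running jobs. The actual file layout and retry semantics in MMIRAGE may differ from this sketch.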