
Nodes crash monitoring #29

Merged: fabnemEPFL merged 50 commits into main from nodes-crash-monitoring on Mar 25, 2026


@qchapp commented Mar 22, 2026

This pull request updates documentation, configuration files, and scripts to improve usability, clarify execution modes, and simplify running the MMIRAGE pipeline. The most important changes are grouped below by theme.

Documentation and Usage Improvements:

  • Expanded the README.md with clearer instructions for running the pipeline, checking status, and retrying failed shards, including new CLI command examples and explanations of retry behavior.
  • Updated the YAML configuration examples in README.md to include state_dir and execution_params, and clarified how to control sharding and retries.

Nodes crash monitoring:

  • Shard status is now tracked in a custom status.json file.
  • Failed shards can be relaunched automatically by setting the config parameter retry=True.
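
As a rough illustration of the status tracking and retry selection described above, here is a minimal Python sketch. Only the file name status.json and the retry flag come from this PR; the schema (shard index mapped to a state string), the state names "done", "failed" and "running", and the helper names write_status, read_status and shards_to_retry are assumptions made for the example, not the merged implementation.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

STATUS_FILE = "status.json"  # file name from the PR; the schema used here is assumed


def write_status(state_dir: Path, status: Dict[int, str]) -> None:
    """Write shard states to state_dir/status.json via a temp file and rename."""
    state_dir.mkdir(parents=True, exist_ok=True)
    tmp = state_dir / (STATUS_FILE + ".tmp")
    # JSON object keys must be strings, so shard indices are stringified.
    tmp.write_text(json.dumps({str(k): v for k, v in status.items()}, indent=2))
    # Renaming over the target avoids leaving a half-written status file behind.
    tmp.replace(state_dir / STATUS_FILE)


def read_status(state_dir: Path) -> Dict[int, str]:
    """Load shard states; a missing file means nothing has been recorded yet."""
    path = state_dir / STATUS_FILE
    if not path.exists():
        return {}
    return {int(k): v for k, v in json.loads(path.read_text()).items()}


def shards_to_retry(status: Dict[int, str], retry: bool) -> List[int]:
    """Return the indices of shards marked "failed", or nothing if retry is off."""
    if not retry:
        return []
    return sorted(i for i, state in status.items() if state == "failed")


if __name__ == "__main__":
    state_dir = Path(tempfile.mkdtemp())  # throwaway directory for the demo
    write_status(state_dir, {0: "done", 1: "failed", 2: "running"})
    print(shards_to_retry(read_status(state_dir), retry=True))
```

Note that a shard whose node crashed hard may never get the chance to record "failed"; how the merged code detects that case is not stated in this description, so the sketch only handles shards explicitly marked as failed.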

Configuration Enhancements:

  • Added a comprehensive example config file configs/config_comprehensive.yaml documenting all available parameters, including detailed comments for each section.
  • Updated the mock config files (configs/config_mock.yaml, configs/config_mock_vision.yaml) to include state_dir, set num_shards to 1 for local runs, and add execution_params for explicit execution mode and retry control.
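
To make the shape of these settings concrete, the sketch below models them as Python dataclasses. The parameter names state_dir, num_shards, execution_params and retry appear in this PR; everything else is assumed for illustration, including the class names PipelineConfig and ExecutionParams, the mode field and its "local" default, the nesting of retry under execution_params, and the validation rules. The actual configuration classes in src/mmirage may look different.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ExecutionParams:
    mode: str = "local"   # assumed field name and value; real mode names may differ
    retry: bool = False   # relaunch failed shards when True


@dataclass
class PipelineConfig:
    state_dir: str        # directory where shard state is kept
    num_shards: int = 1   # 1 is what the mock configs use for local runs
    execution_params: ExecutionParams = field(default_factory=ExecutionParams)


def config_from_dict(raw: Dict[str, Any]) -> PipelineConfig:
    """Build a config from an already-parsed YAML mapping (hypothetical helper)."""
    if "state_dir" not in raw:
        raise ValueError("state_dir is required")
    num_shards = int(raw.get("num_shards", 1))
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    exec_raw = raw.get("execution_params") or {}
    return PipelineConfig(
        state_dir=str(raw["state_dir"]),
        num_shards=num_shards,
        execution_params=ExecutionParams(
            mode=str(exec_raw.get("mode", "local")),
            retry=bool(exec_raw.get("retry", False)),
        ),
    )
```

Keeping the retry switch next to the execution mode mirrors how the description groups them, but that grouping is a guess from the wording above rather than something read from the config files.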

Script and CLI Improvements:

  • Deleted run.sh and moved all execution logic into the Python CLI, which removes the direct SLURM commands and makes execution config-driven.
  • Registered a mmirage CLI entrypoint in pyproject.toml for easier command-line usage.
  • Jobs can be launched with the run entrypoint.
  • Shard status can be checked with the check entrypoint.
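
For readers unfamiliar with this layout, a CLI with run and check subcommands could be wired up roughly as follows using argparse. The program name mmirage and the two subcommand names come from this PR; the --config flag, the build_parser and main function names, and the placeholder print statements are assumptions for the sketch, not the contents of src/mmirage/cli.py.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Create a parser with the two subcommands named in the PR."""
    parser = argparse.ArgumentParser(prog="mmirage")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, help_text in (
        ("run", "launch the pipeline shards"),
        ("check", "report the status of each shard"),
    ):
        cmd = sub.add_parser(name, help=help_text)
        # --config is a hypothetical flag; the real CLI options may differ.
        cmd.add_argument("--config", required=True, help="path to a YAML config")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    if args.command == "run":
        print(f"would launch shards described by {args.config}")
    else:
        print(f"would report shard status for {args.config}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```

A function like main is the kind of target a console-script entry in pyproject.toml points at, which would make the command available as mmirage after installation; the exact entry used in this PR is not shown in the description.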

Codebase Organization:

  • Updated imports in src/mmirage/__init__.py to reflect module restructuring and ensure all relevant configuration classes are exposed at the package level.

Copilot AI review requested due to automatic review settings March 22, 2026 19:04

Copilot AI left a comment

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 4 comments.



Comment thread src/mmirage/cli.py Outdated
Comment thread src/mmirage/shard_process.py Outdated
Comment thread src/mmirage/config/loading.py
Comment thread src/mmirage/shard_utils.py Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 1 comment.



Comment thread src/mmirage/shard_utils.py Outdated
@fabnemEPFL fabnemEPFL merged commit 158586e into main Mar 25, 2026
0 of 2 checks passed
@fabnemEPFL fabnemEPFL deleted the nodes-crash-monitoring branch March 25, 2026 11:46

Development

Successfully merging this pull request may close these issues.

  • Add a proper CLI for shard_process and merge_inputs utilities
  • Nodes crashing during runtime

3 participants