Nodes crash monitoring #29
Merged
Changes from 37 commits
50 commits
c303192
first version with auto retry to test on real data
qchapp c4f32fe
config is needed as well
qchapp 7e16532
simplified scripts to test for first version
qchapp a62b1ad
remove old dataset if we need to clean up
qchapp a45f795
testing something
qchapp 06ea67c
something missing
qchapp e8aa35a
fixed maybe
qchapp 186e6b5
removing check to test something
qchapp 3925527
fixed number of shards
qchapp 5b849db
working!
qchapp 83096b3
removed wrong part of the doc
qchapp 53d7c04
ready for real medtrinity processing
qchapp 6dbc1cf
full run with retry
qchapp 2534c2a
full run with retry changed
qchapp 3c63206
trying a more robust approach
qchapp 6993969
added state dir in config
qchapp f992911
preparing for PR
qchapp c8b51a9
preparing for PR
qchapp 49cfc2c
maybe now
qchapp 7d0d2c3
working on the test data
qchapp c33cf82
working on the test data
fba95d9
Merge branch 'nodes-crash-monitoring' of https://github.com/EPFLiGHT/…
8ec6613
updated readme with modified changes
qchapp d5a8527
trying with a cli to provide an easier way to use
qchapp e4cb7ac
big refactor for the PR
qchapp 70799ac
fixing number of shards for local configs
qchapp 34f6595
added retry to local mode as well
qchapp 9951259
added config changes to readme as well
qchapp 19bf689
ready for PR
qchapp f613fa1
Update src/mmirage/config/loading.py
qchapp 555405b
Update src/mmirage/shard_process.py
qchapp ce41693
Update src/mmirage/cli_utils/status.py
qchapp c4c23c9
Update src/mmirage/cli_utils/slurm.py
qchapp b5c094f
implemented changes proposed by copilot
qchapp d90abf9
implemented more changes proposed by copilot
qchapp 6eb3d29
deleted run.sh and implemented a default state_dir value
qchapp b64711e
forgot to update the readme
qchapp c966a1b
changes suggested in the PR
qchapp 09708fa
removed a changes to avoid circular imports
qchapp a643c7d
added some logging
qchapp 5dca2b9
changed print to log for job_id information
qchapp 73ffb32
changed logging again
qchapp c2cca43
lambda in __post_init__ to avoid exposing a function used in only one…
fabnemEPFL eaff903
Merge branch 'nodes-crash-monitoring' of github.com:EPFLiGHT/MMIRAGE …
fabnemEPFL 4e00d8b
Update src/mmirage/shard_process.py
qchapp 46aae89
Update src/mmirage/shard_utils.py
qchapp 4be5403
fixed small issue
qchapp 4215283
better style
fabnemEPFL 8f9134d
Merge branch 'nodes-crash-monitoring' of github.com:EPFLiGHT/MMIRAGE …
fabnemEPFL 6b3994a
change proposed by copilot
qchapp
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -168,3 +168,6 @@ else/ | |
| # Test outputs | ||
| tests/mock_data/output/ | ||
| tests/mock_data/shards/ | ||
|
|
||
| # devcontainer | ||
| .devcontainer/ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,200 @@ | ||
| # MMIRAGE Configuration with all parameters | ||
| # | ||
| # This is a comprehensive example showing all available configuration options. | ||
| # You can copy and modify this file for your specific use case. | ||
| # | ||
| # Parameters are organized into sections: | ||
| # 1. processors - LLM and other data transformation processors | ||
| # 2. loading_params - Dataset loading and sharding configuration | ||
| # 3. processing_params - How to transform/process the data | ||
| # 4. execution_params - SLURM, retry, and execution settings | ||
| # | ||
|
|
||
| # ============================================================================ | ||
| # PROCESSORS CONFIGURATION | ||
| # ============================================================================ | ||
| # Define the processors used to transform your data. | ||
| # Common types: llm, vision_llm, etc. | ||
|
|
||
| processors: | ||
| - type: llm | ||
| server_args: | ||
| model_path: Qwen/Qwen3-4B-Instruct-2507 | ||
| tp_size: 1 | ||
| disable_custom_all_reduce: true | ||
| default_sampling_params: | ||
| temperature: 0.1 | ||
| top_p: 0.9 | ||
| max_new_tokens: 1024 | ||
| custom_params: | ||
| chat_template_kwargs: | ||
| enable_thinking: false | ||
|
|
||
|
|
||
| # ============================================================================ | ||
| # LOADING PARAMETERS | ||
| # ============================================================================ | ||
| # Configure how datasets are loaded, sharded, and processed. | ||
|
|
||
| loading_params: | ||
| # Directory to store pipeline state (checkpoints, status, retry tracking) | ||
| # Supports environment variables: $VAR or ${VAR} | ||
| state_dir: tests/output/data/_pipeline_state | ||
|
|
||
| # Dataset configurations to load | ||
| # Each dataset can be separately sharded and output | ||
| datasets: | ||
| - path: tests/mock_data/data.jsonl | ||
| type: JSONL | ||
| output_dir: tests/output/data | ||
| # image_base_path: /path/to/images # Optional, for vision tasks | ||
|
|
||
| # Total number of shards to split datasets into. | ||
| # For SLURM, this determines the array job size. | ||
| num_shards: 4 | ||
|
|
||
| # Shard ID for this process (0-indexed). | ||
| # In SLURM array jobs, this is set automatically. | ||
| shard_id: "$SLURM_ARRAY_TASK_ID" | ||
|
|
||
| # Batch size for processing samples | ||
| batch_size: 64 | ||
|
|
||
|
|
||
| # ============================================================================ | ||
| # PROCESSING PARAMETERS | ||
| # ============================================================================ | ||
| # Define what to extract, transform, and output from each sample. | ||
|
|
||
| processing_params: | ||
| # Input variables to extract from source data | ||
| inputs: | ||
| - name: text | ||
| key: text | ||
| # For vision examples: | ||
| # - name: image | ||
| # key: image_path | ||
| # type: image | ||
|
|
||
| # Output variables generated by processors | ||
| outputs: | ||
| - name: formatted_answer | ||
| type: llm | ||
| output_type: JSON | ||
| output_schema: | ||
| - question | ||
| - answer | ||
| prompt: | | ||
| Generate one question and its corresponding answer using the following text: | ||
| ``` | ||
| {{ text }} | ||
| ``` | ||
|
|
||
| # Whether to remove original columns from the dataset | ||
| remove_columns: true | ||
|
|
||
| # Output schema: how to structure the final dataset | ||
| output_schema: | ||
| conversations: | ||
| - role: "user" | ||
| content: "{{ formatted_answer.question }}" | ||
| - role: "assistant" | ||
| content: "{{ formatted_answer.answer }}" | ||
|
|
||
|
|
||
| # ============================================================================ | ||
| # EXECUTION PARAMETERS | ||
| # ============================================================================ | ||
| # Configure how to execute the pipeline: locally or on SLURM cluster. | ||
| # All parameters here are optional with sensible defaults. | ||
|
|
||
| execution_params: | ||
| # Execution mode: "local" or "slurm" | ||
| # - local: Run directly on this machine | ||
| # - slurm: Submit jobs to SLURM cluster | ||
| mode: slurm | ||
|
|
||
| # Whether the canonical `run` command should automatically retry failed shards. | ||
| # - false: submit one run only | ||
| # - true: submit, wait, and keep retrying failed shards until success or retry budget exhaustion | ||
| retry: true | ||
|
|
||
| # Maximum number of times to retry a failed shard (default: 3) | ||
| max_retries: 3 | ||
|
|
||
| # ========================================================================== | ||
| # SLURM CONFIGURATION (only used when mode: slurm) | ||
| # ========================================================================== | ||
|
|
||
| # HPC account/partition to charge jobs to (REQUIRED for SLURM mode) | ||
| account: a127 | ||
|
|
||
| # SLURM job name (default: "mmirage-sharded") | ||
| job_name: mmirage-sharded | ||
|
|
||
| # Optional SLURM reservation name (omit or leave blank to run without a reservation) | ||
| # reservation: "sai-a127" | ||
|
|
||
| # Number of nodes (default: 1) | ||
| nodes: 1 | ||
|
|
||
| # Number of tasks per node (default: 1) | ||
| ntasks_per_node: 1 | ||
|
|
||
| # Number of GPUs per node (default: 4) | ||
| gpus: 4 | ||
|
|
||
| # Number of CPUs per task (default: 288) | ||
| cpus_per_task: 288 | ||
|
|
||
| # Job time limit in HH:MM:SS format (default: "11:59:59") | ||
| time_limit: "11:59:59" | ||
|
|
||
| # ========================================================================== | ||
| # PATH CONFIGURATION | ||
| # ========================================================================== | ||
| # These support environment variables ($VAR or ${VAR}) and home directory (~) | ||
|
|
||
| # Project root directory (used as base for relative paths) | ||
| # If not set, uses current working directory | ||
| # project_root: "/path/to/project" | ||
|
|
||
| # Directory for SLURM output and error files (default: ~/reports) | ||
| report_dir: "/users/${USER}/reports" | ||
|
|
||
| # HuggingFace cache directory (default: ~/hf) | ||
| hf_home: "/capstor/store/cscs/swissai/a127/homes/${USER}/hf" | ||
|
|
||
| # EDF environment file path for cluster-specific setup | ||
| edf_env: "/users/${USER}/.edf/mmirage.toml" | ||
|
|
||
| # ========================================================================== | ||
| # JOB MONITORING (for "submit" and retry orchestration) | ||
| # ========================================================================== | ||
|
|
||
| # Seconds to wait between job status checks (default: 30) | ||
| poll_interval_seconds: 30 | ||
|
|
||
| # Seconds to wait after the job completes before checking results (default: 60) | ||
| # This gives the filesystem time to settle on distributed systems | ||
| settle_time_seconds: 60 | ||
|
|
||
|
|
||
| # ============================================================================ | ||
| # USAGE EXAMPLES | ||
| # ============================================================================ | ||
| # | ||
| # 1. Canonical entrypoint (local or SLURM; retry controlled by config): | ||
| # python -m mmirage.cli run --config config.yaml | ||
| # | ||
| # 2. Submit a job to SLURM and wait for completion: | ||
| # python -m mmirage.cli submit --config config.yaml --wait | ||
| # | ||
| # 3. Submit a job and capture the job ID (for scripting): | ||
| # JOB_ID=$(python -m mmirage.cli submit --config config.yaml) | ||
| # | ||
| # 4. Run a single shard locally: | ||
| # python -m mmirage.cli run --config config.yaml --shard-id 0 | ||
| # | ||
| # 5. Check status of all shards (and optionally submit retries): | ||
| # python -m mmirage.cli check --config config.yaml |
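
The `retry` and `max_retries` settings in the config describe a loop that reruns failed shards until they succeed or the retry budget runs out. The sketch below illustrates that behavior for a purely local run; it is not the code from this PR. `run_with_retry` and `run_shard` are hypothetical names, and treating `max_retries` as the number of reruns allowed after the first attempt is an assumption.

```python
from typing import Callable, Dict, Iterable, Set


def run_with_retry(
    shard_ids: Iterable[int],
    run_shard: Callable[[int], bool],
    max_retries: int = 3,
) -> Set[int]:
    """Run every shard, rerunning failures; return the shards that never succeeded.

    Hypothetical helper: `run_shard` returns True on success, False on failure.
    Assumption: `max_retries` counts reruns after the first attempt.
    """
    attempts: Dict[int, int] = {}
    pending = set(shard_ids)
    exhausted: Set[int] = set()

    while pending:
        next_round: Set[int] = set()
        for shard in sorted(pending):
            attempts[shard] = attempts.get(shard, 0) + 1
            if run_shard(shard):
                continue  # shard succeeded, nothing more to do
            retries_used = attempts[shard] - 1
            if retries_used < max_retries:
                next_round.add(shard)  # still has retry budget
            else:
                exhausted.add(shard)  # budget spent, give up on this shard
        pending = next_round

    return exhausted


if __name__ == "__main__":
    seen: Dict[int, int] = {}

    def flaky(shard: int) -> bool:
        # Demo only: shard 2 succeeds on its third attempt, shard 3 never does.
        seen[shard] = seen.get(shard, 0) + 1
        if shard == 3:
            return False
        if shard == 2:
            return seen[shard] >= 3
        return True

    print(run_with_retry(range(4), flaky, max_retries=3))
```

In SLURM mode the orchestration would also have to poll the scheduler and wait for outputs to become visible, which is what `poll_interval_seconds` and `settle_time_seconds` are documented to control; the sketch leaves that part out.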
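
The config comments state that `state_dir` and the path settings accept `$VAR`, `${VAR}`, and `~`. A minimal sketch of how such expansion can be done with the Python standard library follows. `expand_path` is a hypothetical helper written for illustration, and the loader in `src/mmirage/config/loading.py` may handle this differently.

```python
import os


def expand_path(raw: str) -> str:
    """Expand $VAR / ${VAR} references and a leading ~ in a config path.

    Hypothetical helper for illustration only.
    """
    # os.path.expandvars leaves unknown variables untouched instead of raising,
    # so a typo in a variable name survives into the resulting path.
    return os.path.expanduser(os.path.expandvars(raw))


if __name__ == "__main__":
    os.environ["USER"] = "alice"  # set explicitly so the demo is deterministic
    print(expand_path("/users/${USER}/reports"))
    print(expand_path("/users/$USER/.edf/mmirage.toml"))
```

Because unset variables are left in place silently, a stricter loader might check the expanded string for a remaining `$` and report it as a configuration error.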