Collaborator

@lebrice lebrice commented Jul 16, 2025

Example of how to use accelerate with SLURM, based on https://github.com/huggingface/accelerate/blob/main/examples/by_feature/checkpointing.py

Compared to that example, here are the differences (benefits):

  • This example demonstrates how to implement code checkpointing: jobs run with the code as it was at the time of job submission, even if the code is updated later.
    • The repository is cloned into $SLURM_TMPDIR at the commit that was checked out when the job was submitted.
    • uv run is used in that directory to recreate the virtualenv in $SLURM_TMPDIR on each node of a multi-node job.
    • This is done by submitting the job with the safe_sbatch script instead of sbatch.
      Unlike sbatch, safe_sbatch refuses to submit a job if there are uncommitted
      changes in the code repository.
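
To make the submission-side check concrete, here is a minimal Python sketch of the idea behind a wrapper like safe_sbatch. It is not the script from this PR: the function names and the GIT_COMMIT variable name are assumptions made for illustration. It only relies on `git status --porcelain`, `git rev-parse HEAD`, and sbatch's `--export` option. The job script would then be responsible for cloning the repository at that commit into $SLURM_TMPDIR, which is not shown here.

```python
import subprocess
import sys


def has_uncommitted_changes(porcelain_output: str) -> bool:
    """Return True if `git status --porcelain` printed any entries.

    Note that untracked files also produce entries, so they block submission too.
    """
    return bool(porcelain_output.strip())


def build_sbatch_command(commit: str, sbatch_args: list[str]) -> list[str]:
    """Build an sbatch call that exposes the commit to the job.

    GIT_COMMIT is a hypothetical variable name chosen for this sketch.
    """
    return ["sbatch", f"--export=ALL,GIT_COMMIT={commit}", *sbatch_args]


def git(*args: str) -> str:
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def main(sbatch_args: list[str]) -> int:
    """Refuse to submit on a dirty working tree, otherwise call sbatch."""
    if has_uncommitted_changes(git("status", "--porcelain")):
        print("Refusing to submit: commit your changes first.", file=sys.stderr)
        return 1
    commit = git("rev-parse", "HEAD").strip()
    return subprocess.run(build_sbatch_command(commit, sbatch_args)).returncode
```

A script entry point could call `sys.exit(main(sys.argv[1:]))` so that the wrapper accepts the same arguments as sbatch. Recording the commit at submission time, rather than when the job starts, is what keeps a queued job tied to the code that was reviewed when it was submitted.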

@lebrice lebrice force-pushed the accelerate_example branch from 2c31075 to 5eb4c79 Compare July 21, 2025 15:55
@lebrice lebrice marked this pull request as ready for review July 29, 2025 17:34
{ url = "https://files.pythonhosted.org/packages/24/fd/725b8e73ac2a50e78a4534ac43c6addf5c1c2d65380dd48a9169cc6739a9/yarl-1.20.1-cp313-cp313t-win32.whl", hash = "sha256:b121ff6a7cbd4abc28985b6028235491941b9fe8fe226e6fdc539c977ea1739d", size = 86591 },
{ url = "https://files.pythonhosted.org/packages/94/c3/b2e9f38bc3e11191981d57ea08cab2166e74ea770024a646617c9cddd9f6/yarl-1.20.1-cp313-cp313t-win_amd64.whl", hash = "sha256:541d050a355bbbc27e55d906bc91cb6fe42f96c01413dd0f4ed5a5240513874f", size = 93003 },
{ url = "https://files.pythonhosted.org/packages/b4/2d/2345fce04cfd4bee161bf1e7d9cdc702e3e16109021035dbb24db654a622/yarl-1.20.1-py3-none-any.whl", hash = "sha256:83b8eb083fe4683c6115795d9fc1cfaf2cbbefb19b3a1cb68f6527460f483a77", size = 46542 },
]
Contributor

Same principle here: are the contents of uv.lock really important to the example?


1. Install UV from https://docs.astral.sh/uv

2. On SLURM clusters where you do not have internet access on compute nodes, you need to first create the virtual environment:
Contributor

As per https://slurm.schedmd.com/faq.html#acronym, use "Slurm", not "SLURM".

Suggested change
- 2. On SLURM clusters where you do not have internet access on compute nodes, you need to first create the virtual environment:
+ 2. On Slurm clusters where you do not have internet access on compute nodes, you need to first create the virtual environment:

lebrice added 18 commits August 28, 2025 15:17
Signed-off-by: Fabrice Normandin <[email protected]>
@lebrice lebrice force-pushed the accelerate_example branch from 56c1a61 to 00ef6dd Compare August 28, 2025 19:20

3 participants