[DOC-27] Add a multi-node example based on Accelerate #280
| { url = "https://files.pythonhosted.org/packages/24/fd/725b8e73ac2a50e78a4534ac43c6addf5c1c2d65380dd48a9169cc6739a9/yarl-1.20.1-cp313-cp313t-win32.whl", hash = "sha256:b121ff6a7cbd4abc28985b6028235491941b9fe8fe226e6fdc539c977ea1739d", size = 86591 }, | ||
| { url = "https://files.pythonhosted.org/packages/94/c3/b2e9f38bc3e11191981d57ea08cab2166e74ea770024a646617c9cddd9f6/yarl-1.20.1-cp313-cp313t-win_amd64.whl", hash = "sha256:541d050a355bbbc27e55d906bc91cb6fe42f96c01413dd0f4ed5a5240513874f", size = 93003 }, | ||
| { url = "https://files.pythonhosted.org/packages/b4/2d/2345fce04cfd4bee161bf1e7d9cdc702e3e16109021035dbb24db654a622/yarl-1.20.1-py3-none-any.whl", hash = "sha256:83b8eb083fe4683c6115795d9fc1cfaf2cbbefb19b3a1cb68f6527460f483a77", size = 46542 }, | ||
| ] |
Same principle here: are the contents of uv.lock really important to the example?
| 1. Install UV from https://docs.astral.sh/uv | ||
| 2. On SLURM clusters where you do not have internet access on compute nodes, you need to first create the virtual environment: |
As per https://slurm.schedmd.com/faq.html#acronym, use "Slurm", not "SLURM".
| 2. On SLURM clusters where you do not have internet access on compute nodes, you need to first create the virtual environment: | |
| 2. On Slurm clusters where you do not have internet access on compute nodes, you need to first create the virtual environment: |
Signed-off-by: Fabrice Normandin <[email protected]>
Example of how to use Accelerate with Slurm, based on https://github.com/huggingface/accelerate/blob/main/examples/by_feature/checkpointing.py
Compared to that example, here are the differences (benefits):
- [...] $SLURM_TMPDIR
- [...] uv run in that directory to recreate the virtualenv in $SLURM_TMPDIR on each node in a multi-node job.
- [...] safe_sbatch script instead of sbatch. Compared to sbatch, safe_sbatch will prevent submitting jobs if there are uncommitted changes in the code repository.
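
For readers unfamiliar with the idea behind safe_sbatch, here is a minimal Python sketch of that kind of guard. It is not the safe_sbatch script from this repository: the function names `is_clean` and `guarded_sbatch` are hypothetical, and the sketch assumes that "uncommitted changes" means any non-empty output from `git status --porcelain` (which also lists untracked files).

```python
import subprocess
import sys


def is_clean(porcelain_output: str) -> bool:
    """Return True when `git status --porcelain` printed nothing.

    Assumption: any listed path (modified, staged, or untracked) counts
    as an uncommitted change.
    """
    return porcelain_output.strip() == ""


def guarded_sbatch(sbatch_args: list[str], repo_dir: str = ".") -> int:
    """Run `sbatch` with the given arguments only if repo_dir is clean.

    Hypothetical helper for illustration; returns the exit code to use.
    """
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    if not is_clean(status.stdout):
        # Refuse to submit and show which paths are dirty.
        print("Refusing to submit: uncommitted changes found:", file=sys.stderr)
        print(status.stdout, file=sys.stderr, end="")
        return 1
    # The working tree is clean, so hand the arguments to sbatch unchanged.
    return subprocess.run(["sbatch", *sbatch_args]).returncode
```

Calling `guarded_sbatch(["job.sh"])` from inside the repository would submit `job.sh` only when the working tree is clean (`job.sh` is a placeholder name). The point of such a check is that every submitted job corresponds to a commit, so a run can be traced back to the exact code that produced it.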