Add an LLM pretraining / finetuning example with Accelerate #290
Draft: lebrice wants to merge 90 commits into mila-iqia:master from lebrice:llm_training
Signed-off-by: Fabrice Normandin <[email protected]>
Add a code checkpointing utility script
Fix bug in code_checkpointing script
Adjust comment text in code_checkpointing.sh
Tweak comments
Call .item() only once per tensor to log
Try new resume_from option from wandb
Make pyproject work for Tamia
Add a job script for Tamia cluster
Fix bug in code_checkpointing script
Add pyproject_drac.toml for imagenet example
Don't make the imagenet example a workspace member
Make the imagenet example a workspace member again
Add a uv.toml file to use on DRAC
Add job_fir.sh for fir cluster
Update lockfile on fir
Tweak job_fir.sh
Try to fix wandb settings issue?
Remove the tamia and fir job scripts for now
Simplify job.sh script
Save initial checkpoint, resume with wandb
Minor tweaks
Fix minor bug in logging frequency
Enable more options for the PyTorch profiler
Minor tweaks
Remove broken launch.json configs
Fix wandb init 409 errors
Minor tweaks in comments / code layout
Fix the `srun + torchrun` comment in job.sh
Make the index.rst a bit better (still huge)
Add missing section for advanced examples
Fix sphinx linting errors
Apply pre-commit fixes
Improve main.py script
Remove the fixme comments
Add awesome cuda stream data transfer thingy
Fix typo
Fix rstcheck error in index.rst
This reverts commit e4d45a9.
This reverts commit 9bf7c58. We can't use conda-pack yet, because we'd have to move the env to the SLURM_TMPDIR of each node, and it just seems dangerous to have "different" environments in each node.
No description provided.