This is the codebase for running Megatron efficiently at scale on Clariden. The codebase and its arguments are more complex than nanotron's, but this is currently the price we pay for efficiency at scale.
- Clone Megatron to your iopsstor.
cd /iopsstor/scratch/cscs/$USER/
git clone https://github.com/swiss-ai/Megatron-LM.git
- Set the Weights & Biases environment variable to your API key. Ideally, add it to ~/.bashrc.
export WANDB_API_KEY=57b215a80f....
Now re-login or run source ~/.bashrc to make sure it is set.
- Start an 8B run by submitting the submit-llama3-8B.sh sbatch script.
sbatch submit-llama3-8B.sh
You can find all available arguments here: https://github.com/swiss-ai/Megatron-LM/blob/main/megatron/training/arguments.py
(We will add 1.5B, 3B, and 70B baseline runs using the llama3 architecture very soon.)
- The main logs will be visible ~2 min after launching the script in the TheMeg-Clariden project on your personal Weights & Biases page.
- The local logs will be at /iopsstor/scratch/cscs/schlag/Megatron-LM/logs with the slurm outputs and error files at /iopsstor/scratch/cscs/schlag/Megatron-LM/logs/slurm/training.
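The setup steps above can fail silently if the key is missing or the job is submitted from the wrong directory. The following is a minimal sketch of a pre-flight check, assuming a bash shell; the `preflight` function is a hypothetical helper and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Megatron-LM): verify the setup before submitting.
preflight() {
    # Default to the 8B submit script named in the steps above.
    local script="${1:-submit-llama3-8B.sh}"

    # The Weights & Biases key must be exported in the current shell.
    if [ -z "${WANDB_API_KEY:-}" ]; then
        echo "WANDB_API_KEY is not set; add it to ~/.bashrc and run: source ~/.bashrc" >&2
        return 1
    fi

    # The sbatch script must exist relative to the current directory.
    if [ ! -f "$script" ]; then
        echo "submit script not found: $script" >&2
        return 1
    fi

    echo "ok: $script"
}

# Usage, from inside the cloned Megatron-LM directory:
#   preflight submit-llama3-8B.sh && sbatch submit-llama3-8B.sh
```

Once the job is submitted, the standard Slurm command `squeue -u $USER` lists it while it is queued or running.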
You can submit issues and create branches on https://github.com/swiss-ai/Megatron-LM. The main branch is protected so you won't be able to directly commit to it.
- Pull the latest version
git pull
- Create a branch with a descriptive name
git checkout -b my-contribution-branch
- Make your changes locally and commit them. Make sure your latest commit works as expected. Do not commit unnecessary files.
- Sync with the latest main
git pull origin main
- Push your commit to your branch
git push origin my-contribution-branch
- Go to https://github.com/swiss-ai/Megatron-LM and create a new issue that references your branch and explains what you did. Add ischlag or TJ-Solergibert for now.
- Use this command to return to the main branch.
git checkout main
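The contribution steps above can be grouped into two small shell functions, one for before and one for after you commit. This is only a sketch: `run`, `start_contribution`, `publish_contribution`, and the `DRY_RUN` variable are hypothetical names introduced here, and the git commands are exactly the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around the steps above; not part of the repository.

# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Before making changes: pull the latest version and create a branch.
start_contribution() {
    local branch="${1:-}"
    if [ -z "$branch" ] || [ "$branch" = "main" ]; then
        echo "usage: start_contribution <branch-name> (main is protected)" >&2
        return 1
    fi
    run git pull &&
    run git checkout -b "$branch"
}

# After committing: sync with main, push the branch, return to main.
publish_contribution() {
    local branch="${1:-}"
    if [ -z "$branch" ]; then
        echo "usage: publish_contribution <branch-name>" >&2
        return 1
    fi
    run git pull origin main &&
    run git push origin "$branch" &&
    run git checkout main
}
```

Setting `DRY_RUN=1` before calling either function prints the git commands without running them, which is a cheap way to check the sequence first.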
We will add detailed steps for specific workflows fairly soon.