lebrice (Collaborator) commented Oct 28, 2025

No description provided.

lebrice and others added 30 commits October 21, 2025 10:23
Signed-off-by: Fabrice Normandin <[email protected]>

Add a code checkpointing utility script

Signed-off-by: Fabrice Normandin <[email protected]>

Fix bug in code_checkpointing script

Signed-off-by: Fabrice Normandin <[email protected]>

Adjust comment text in code_checkpointing.sh

Signed-off-by: Fabrice Normandin <[email protected]>

Tweak comments

Signed-off-by: Fabrice Normandin <[email protected]>

Call .item() only once per tensor to log

Signed-off-by: Fabrice Normandin <[email protected]>

Try new resume_from option from wandb

Signed-off-by: Fabrice Normandin <[email protected]>

Make pyproject work for Tamia

Signed-off-by: Fabrice Normandin <[email protected]>

Add a job script for Tamia cluster

Signed-off-by: Fabrice Normandin <[email protected]>

Fix bug in code_checkpointing script

Signed-off-by: Fabrice Normandin <[email protected]>

Add pyproject_drac.toml for imagenet example

Signed-off-by: Fabrice Normandin <[email protected]>

Don't make the imagenet example a workspace member

Signed-off-by: Fabrice Normandin <[email protected]>

Make the imagenet example a workspace member again

Signed-off-by: Fabrice Normandin <[email protected]>

Add a uv.toml file to use on DRAC

Signed-off-by: Fabrice Normandin <[email protected]>

Add job_fir.sh for fir cluster

Signed-off-by: Fabrice <[email protected]>

Update lockfile on fir

Signed-off-by: Fabrice <[email protected]>

Tweak job_fir.sh

Signed-off-by: Fabrice <[email protected]>

Try to fix wandb settings issue?

Signed-off-by: Fabrice <[email protected]>

Remove the tamia and fir job scripts for now

Signed-off-by: Fabrice Normandin <[email protected]>

Simplify job.sh script

Signed-off-by: Fabrice Normandin <[email protected]>

Save initial checkpoint, resume with wandb

Signed-off-by: Fabrice Normandin <[email protected]>

Minor tweaks

Signed-off-by: Fabrice Normandin <[email protected]>

Fix minor bug in logging frequency

Signed-off-by: Fabrice Normandin <[email protected]>

Enable more options for the PyTorch profiler

Signed-off-by: Fabrice Normandin <[email protected]>

Minor tweaks

Signed-off-by: Fabrice Normandin <[email protected]>

Remove broken launch.json configs

Signed-off-by: Fabrice Normandin <[email protected]>

Fix wandb init 409 errors

Signed-off-by: Fabrice Normandin <[email protected]>

Minor tweaks in comments / code layout

Signed-off-by: Fabrice Normandin <[email protected]>

Fix the `srun + torchrun` comment in job.sh

Signed-off-by: Fabrice Normandin <[email protected]>

Make the index.rst a bit better (still huge)

Signed-off-by: Fabrice Normandin <[email protected]>

Add missing section for advanced examples

Signed-off-by: Fabrice Normandin <[email protected]>

Fix sphinx linting errors

Signed-off-by: Fabrice Normandin <[email protected]>

Apply pre-commit fixes

Signed-off-by: Fabrice Normandin <[email protected]>

Improve main.py script

Signed-off-by: Fabrice Normandin <[email protected]>

Remove the fixme comments

Signed-off-by: Fabrice Normandin <[email protected]>

Add awesome cuda stream data transfer thingy

Signed-off-by: Fabrice Normandin <[email protected]>

Fix typo

Signed-off-by: Fabrice Normandin <[email protected]>

Fix rstcheck error in index.rst

Signed-off-by: Fabrice Normandin <[email protected]>
lebrice and others added 30 commits October 24, 2025 12:59
Signed-off-by: Fabrice Normandin <[email protected]>
This reverts commit 9bf7c58.
We can't use conda-pack yet, because we'd have to move the env to the
SLURM_TMPDIR of each node, and it just seems dangerous to have
"different" environments in each node.

Signed-off-by: Fabrice Normandin <[email protected]>
2 participants