Skip to content

Conversation

@jeffreywpli
Copy link
Collaborator

@jeffreywpli jeffreywpli commented Jul 4, 2024

Three main changes (originally motivated by making it easier to chain together consecutive calls to open_lm.main)

  1. Allow for user to specify their own custom NCCL timeout. The default is 10 minutes which may not be enough to run final remote sync in all scenarios (causing a timeout because all other processes hit a barrier).

  2. Remove unnecessary waits during the initial "test call" of remote sync and final sync by hard-passing in 0 instead of args.sync_every into remote_sync_with_expon_backoff. These are the only instances in main where remote_sync_with_expon_backoff is called directly (as opposed to in a subprocess) and therefore causes whole script to hang when sync_every is not 0.

  3. args.log_local perhaps does not work as originally intended, since many logging calls are still gated behind an if is_master(args) instead of if is_master(args, local=args.log_local).

@jeffreywpli jeffreywpli force-pushed the allow_nccl_timeout_increase branch from 6f8fbab to b2cfded Compare April 27, 2025 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants