Merged

Commits (40)
ed98e2f  Fix wandb tracker AttributeError on non-main processes in SFT (hamishivi, Mar 19, 2026)
eeac0af  Fix UnboundLocalError for beaker_config in tracking setup (hamishivi, Mar 19, 2026)
0491264  Apply ruff formatting (hamishivi, Mar 19, 2026)
81eb363  Update changelog with SFT multi-node fixes (hamishivi, Mar 19, 2026)
15df712  Sync SP changes: direct imports, updated help text, SP test script (hamishivi, Mar 19, 2026)
96f1a7b  Inline attn_impl variable (hamishivi, Mar 19, 2026)
7d3df06  Add PR link to changelog entries (hamishivi, Mar 19, 2026)
9728cd7  Fix import sorting: merge accelerate.utils imports (hamishivi, Mar 19, 2026)
4049841  Require flash attention for sequence parallelism (hamishivi, Mar 19, 2026)
c0f7d76  Fix another unbound beaker_config reference at end of training (hamishivi, Mar 19, 2026)
84802f8  Skip SP setup during dataset caching (hamishivi, Mar 19, 2026)
caf67a1  Set dp_shard_size for ParallelismConfig to match world size (hamishivi, Mar 19, 2026)
c2fd078  Handle UlyssesSPDataLoaderAdapter missing set_epoch (hamishivi, Mar 19, 2026)
dacdc46  Unwrap SP dataloader adapter for set_epoch (hamishivi, Mar 19, 2026)
4ff4c0a  Add comment explaining SP dataloader unwrap (hamishivi, Mar 19, 2026)
dcc9592  Set multinode SFT test to urgent priority (hamishivi, Mar 19, 2026)
c1fe3ed  Fix SP dataloader unwrap: attribute is .dl not .dataloader (hamishivi, Mar 19, 2026)
33c04db  Filter non-2D tensors from batch for SP dataloader compatibility (hamishivi, Mar 19, 2026)
c669383  Log which batch keys are dropped by SP collator filter (hamishivi, Mar 19, 2026)
fe1cc31  Root cause: 1D index column from dataset cache breaks SP adapter (hamishivi, Mar 19, 2026)
a51cf2a  Pad batch seq length to be divisible by SP size (hamishivi, Mar 19, 2026)
0f67778  Handle shift_labels key from UlyssesSPDataLoaderAdapter (hamishivi, Mar 19, 2026)
f05e7fa  Move SP batch tensors to device (adapter returns CPU tensors) (hamishivi, Mar 19, 2026)
1bc7044  Always move batch to device (no-op when already there) (hamishivi, Mar 19, 2026)
c0f7afd  Remove redundant comments (hamishivi, Mar 19, 2026)
1ab980c  Rename shift_labels back to labels before model forward pass (hamishivi, Mar 19, 2026)
497243a  Recreate LR scheduler after prepare when using SP (hamishivi, Mar 19, 2026)
955ae24  Use getattr for set_epoch to avoid SP branch (hamishivi, Mar 19, 2026)
dcfe350  Just pop index column instead of filtering all non-2D tensors (hamishivi, Mar 19, 2026)
1c41439  Always create LR scheduler after prepare for correct step count (hamishivi, Mar 19, 2026)
5859150  Rename shift_labels early, remove duplicate rename before forward (hamishivi, Mar 19, 2026)
42b9844  Restore original LR scheduler for non-SP, fix SP scheduler wrapping (hamishivi, Mar 19, 2026)
575b724  Fix SP scheduler: use post-prepare max_train_steps without num_proces… (hamishivi, Mar 19, 2026)
9e79b1f  Fix SP scheduler: account for micro-batch stepping with grad_accum mu… (hamishivi, Mar 19, 2026)
c02f0c3  Merge main, resolve changelog conflict (hamishivi, Mar 19, 2026)
6b08182  Validate world_size divisible by sequence_parallel_size (hamishivi, Mar 19, 2026)
91655d1  Extract _create_scheduler helper to deduplicate scheduler creation (hamishivi, Mar 19, 2026)
ec32cbc  Fix total_batch_size logging to account for sequence parallelism (hamishivi, Mar 20, 2026)
aca5e1f  Merge remote-tracking branch 'origin/main' into hamishivi/fix-sft-sp-… (hamishivi, Mar 20, 2026)
2afde3a  Add changelog entry for batch size logging fix (hamishivi, Mar 20, 2026)
CHANGELOG.md (1 addition & 0 deletions)
@@ -17,6 +17,7 @@ All notable changes to this project will be documented in this file.
 - OLMo-core GRPO actor with Ray-distributed FSDP2 training (https://github.com/allenai/open-instruct/pull/1398).

 ### Fixed
+- Fix `total_batch_size` logging to account for sequence parallelism (SP ranks share data, not independent) (https://github.com/allenai/open-instruct/pull/1542).
 - Fix `wandb_tracker.run.url` `AttributeError` on non-main processes in multi-node SFT training by guarding accesses with `accelerator.is_main_process` checks (https://github.com/allenai/open-instruct/pull/1539).
 - Fix `UnboundLocalError` for `beaker_config` in SFT tracking setup when `push_to_hub` is disabled (https://github.com/allenai/open-instruct/pull/1539).
 - Pre-download HF model on main process before Ray actors spawn to avoid hitting HuggingFace rate limits (https://github.com/allenai/open-instruct/pull/1528).
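
In concrete terms (illustrative numbers, not from the PR): with 16 processes and `sequence_parallel_size=4`, each group of 4 ranks shards the same sequences rather than reading independent data, so only 16 / 4 = 4 ranks contribute distinct examples. With `per_device_train_batch_size=2` and `gradient_accumulation_steps=8`, the effective total batch is 2 × 4 × 8 = 64 sequences, where the old formula would have reported 2 × 16 × 8 = 256.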
open_instruct/finetune.py (2 additions & 1 deletion)
@@ -730,7 +730,8 @@ def collate_fn(features):
     checkpointing_steps = int(checkpointing_steps)

     # Train!
-    total_batch_size = args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
+    dp_world_size = accelerator.num_processes // args.sequence_parallel_size
Collaborator comment on this line:
Up to you ofc but I think we should refactor this somewhere so we can share it across DPO, GRPO, SFT

+    total_batch_size = args.per_device_train_batch_size * dp_world_size * args.gradient_accumulation_steps
     logger.info("***** Running training *****")
     logger.info(f"  Num examples = {len(train_dataset)}")
     logger.info(f"  Num Epochs = {args.num_train_epochs}")
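
A minimal sketch of the shared helper the reviewer suggests, so DPO, GRPO, and SFT could compute batch sizes the same way; the function names and signatures here are assumptions for illustration, not code from this PR:

```python
# Hypothetical shared helper (illustrative; not part of this PR).
# Under sequence parallelism, the ranks inside one SP group shard the
# same sequences, so only num_processes / sequence_parallel_size ranks
# contribute independent examples to each batch.

def get_dp_world_size(num_processes: int, sequence_parallel_size: int = 1) -> int:
    # Mirrors the "Validate world_size divisible by sequence_parallel_size" commit.
    if num_processes % sequence_parallel_size != 0:
        raise ValueError(
            f"num_processes ({num_processes}) must be divisible by "
            f"sequence_parallel_size ({sequence_parallel_size})"
        )
    return num_processes // sequence_parallel_size


def get_total_batch_size(
    per_device_train_batch_size: int,
    num_processes: int,
    gradient_accumulation_steps: int,
    sequence_parallel_size: int = 1,
) -> int:
    dp_world_size = get_dp_world_size(num_processes, sequence_parallel_size)
    return per_device_train_batch_size * dp_world_size * gradient_accumulation_steps
```

With such a helper, each trainer would call `get_total_batch_size(args.per_device_train_batch_size, accelerator.num_processes, args.gradient_accumulation_steps, args.sequence_parallel_size)`, and the divisibility guard would live in one place instead of being re-implemented per trainer.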