Add OLMo-core based DPO training module #1391

Merged
finbarrtimbers merged 70 commits into main from finbarr/olmo-core-dpo-base on Jan 26, 2026

Conversation

@finbarrtimbers (Collaborator) commented Jan 20, 2026

Adds an OLMo-core based DPO implementation in dpo.py.

Runs:

  • Single GPU DPO: Beaker
  • Multi-node DPO: Beaker
  • Also ran single GPU GRPO based on the result of running single GPU DPO: Beaker

GPU_TESTS=01KFH82YPAC7ECB4XXEHH10TS7

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the training capabilities by integrating Direct Preference Optimization (DPO) directly into the OLMo-core framework. It introduces a dedicated DPO training module that leverages OLMo-core's distributed training infrastructure, including HSDP, to enable scalable and efficient fine-tuning of large language models using preference data. The changes streamline the process of applying DPO by providing specialized utilities for reference log probability caching and adapting model forward passes for the OLMo-core architecture.

Highlights

  • DPO Training Module: Introduced a new dpo.py module for Direct Preference Optimization (DPO) training, fully integrated with OLMo-core's TrainModule and supporting Hybrid Sharded Data Parallel (HSDP).
  • Reference Logprobs Caching: Added build_reference_logprobs_cache_olmo to dpo_utils.py, providing a generic mechanism for efficient caching of reference model log probabilities within the OLMo-core framework.
  • DPO Loss Computation: Implemented compute_loss_olmo in dpo_utils.py as a wrapper for DPO loss calculations, ensuring seamless integration with ExperimentConfig.
  • OLMo-core Forward Functions: Provided concatenated_forward_olmo and separate_forward_olmo within the DPO context to handle model forward passes specifically tailored for OLMo-core models.
  • Integration with Mason: Updated mason.py to include the new open_instruct/dpo.py script in the OPEN_INSTRUCT_COMMANDS list, making the DPO training functionality discoverable and runnable.
  • Debug Script Updates: Modified existing debug scripts (dpo.sh, large_dpo.sh, medium_dpo.sh) to utilize torchrun for launching DPO training with OLMo-core models, replacing the previous accelerate launch method.



@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a new DPO (Direct Preference Optimization) training module that leverages OLMo-core's native training infrastructure, including its TrainModule and HSDP support. The changes are well-structured, integrating new utility functions for reference log-probability caching and loss computation tailored for OLMo-core models. The accompanying debug scripts have been updated to reflect the new torchrun based launch mechanism and OLMo-core specific model configurations. The implementation appears robust and correctly handles distributed training aspects.

Comment thread: open_instruct/dpo.py (outdated)

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a43a450fd7


Comment thread open_instruct/dpo.py
Comment on lines +447 to +448
device_name = utils.get_device_name(torch.cuda.get_device_name(0))
device_peak_flops = int(utils.GPU_SPECS[device_name]["flops"])

P2: Guard GPU-only device name lookup

This module explicitly falls back to CPU (device = "cpu" when CUDA is unavailable), but it later unconditionally calls torch.cuda.get_device_name(0). On a CPU-only host (or any environment where CUDA isn’t initialized), that call raises and the training run crashes before callbacks are built. If CPU fallback is intentional, this needs a CUDA availability guard or to skip the speed monitor setup when CUDA isn’t available.
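A minimal sketch of the suggested guard, with the CUDA query and spec table injected as parameters (`resolve_peak_flops`, `device_name_fn`, and `gpu_specs` are illustrative names, not the module's actual API):

```python
def resolve_peak_flops(gpu_specs, device_name_fn, cuda_available):
    """Return peak FLOPs for the local accelerator, or None on CPU-only hosts.

    gpu_specs maps device names to spec dicts with a "flops" entry;
    device_name_fn stands in for torch.cuda.get_device_name(0) and is only
    called once we know CUDA is available.
    """
    if not cuda_available:
        # Skip the speed-monitor setup entirely instead of crashing.
        return None
    spec = gpu_specs.get(device_name_fn())
    return int(spec["flops"]) if spec else None
```

With this shape, the caller can simply skip building the speed-monitor callback when the function returns None.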


Comment thread open_instruct/dpo.py Outdated
Comment on lines +361 to +364
def make_disable_adapter_context() -> contextlib.AbstractContextManager:
if args.use_lora:
assert isinstance(model, peft.PeftModel)
return model.disable_adapter()

P2: --use_lora crashes without LoRA setup

When --use_lora is enabled, the code asserts that the model is already a peft.PeftModel, but this script never applies any LoRA wrapping to the OLMo-core model. As a result, any run that enables --use_lora will immediately assert and abort during reference logprob caching. Either LoRA needs to be applied before this point or the script should error out earlier with a clear “not supported” message.
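A sketch of the fail-fast alternative the comment suggests; `check_unsupported_flags` is a hypothetical helper, not code from the PR:

```python
def check_unsupported_flags(use_lora: bool) -> None:
    # Surface a clear "not supported" error at argument-validation time
    # instead of an assertion failure deep inside reference-logprob caching.
    if use_lora:
        raise NotImplementedError(
            "--use_lora is not supported by the OLMo-core DPO trainer; "
            "no LoRA wrapping is applied to the model."
        )
```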


Comment thread open_instruct/dpo.py
Comment thread open_instruct/dpo.py Outdated
- Add dpo.py: New DPO training module using OLMo-core's TrainModule with HSDP support
- Add build_reference_logprobs_cache_olmo: Generic reference logprobs caching for OLMo-core
- Add compute_loss_olmo: Wrapper for DPO loss computation with ExperimentConfig
- Add concatenated_forward_olmo and separate_forward_olmo: OLMo-core forward functions
- Update mason.py: Add dpo.py to OPEN_INSTRUCT_COMMANDS
- Update debug scripts to use torchrun with OLMo-core models

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
finbarrtimbers force-pushed the finbarr/olmo-core-dpo-base branch from a43a450 to 69120dd on January 20, 2026 16:22
finbarrtimbers and others added 7 commits January 20, 2026 10:01
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change device_peak_flops_per_second to device_peak_flops to match
the OLMo-core API.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Set default checkpointing_steps to 500 when not specified, since
the OLMo-core API requires save_interval >= 1.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move the default value for checkpointing_steps (500) from dpo.py to the
CheckpointConfig dataclass in dpo_utils.py. This centralizes the default
and removes the conditional logic in the callback setup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The checkpointing_steps field was defined in both CheckpointConfig (the
parent class) and ExperimentConfig. The duplicate field in ExperimentConfig
had default=None, which overrode the parent class's default of 500, causing
a TypeError when int() was called on None in dpo.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Saturn as an alternative cluster to help with multi-node scheduling
reliability.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
finbarrtimbers (Collaborator, Author) commented

Experiment Results

Ran single GPU DPO script (Beaker) successfully.

Multi-node DPO experiments are experiencing Beaker cluster rendezvous timeout issues (infrastructure-related, not code issues). Will re-run when cluster stability improves.

@hamishivi (Collaborator) left a comment

Mostly some comments. Some other things:

  1. could we add a single-gpu script that runs locally? I tried uv run torchrun --standalone --nproc_per_node=1 open_instruct/dpo.py --model_name_or_path allenai/OLMo-2-0425-1B --tokenizer_name allenai/OLMo-2-0425-1B --use_flash_attn false --max_seq_length 1024 --per_device_train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 5e-07 --lr_scheduler_type linear --warmup_ratio 0.1 --weight_decay 0.0 --num_epochs 3 --output_dir output/dpo_olmo_core_debug/ --logging_steps 1 --mixer_list allenai/tulu-3-wildchat-reused-on-policy-8b 100 --chat_template_name olmo --seed 123 --try_launch_beaker_eval_jobs false but it (a) errored initially after building the cache and then (b) hung on further training. Just make a beaker image with beaker://ai2/cuda12.8-dev-ubuntu22.04-notorch and try running with uv to recreate.

  2. It looks like the multi-node job ran okay but exited with an error? Is that fixable?

Comment thread open_instruct/dpo.py Outdated
Comment thread open_instruct/dpo.py Outdated
Comment thread open_instruct/dpo_utils.py
Comment thread scripts/train/debug/dpo/single_gpu.sh
Comment thread open_instruct/dpo.py Outdated
Comment thread open_instruct/dpo.py Outdated
finbarrtimbers and others added 5 commits January 21, 2026 09:30
OLMo-core's prepare_training_environment() handles multi-node setup
internally using Beaker's environment variables. The explicit --nnodes,
--standalone, and --rdzv_backend=c10d arguments interfere with this and
cause RendezvousTimeoutError on multi-node runs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move OLMO_MODEL_CONFIG_MAP and get_transformer_config to olmo_core_utils.py
- Add tensor_parallel_degree, context_parallel_degree, pipeline_parallel_degree
- Replace _apply_hsdp with _apply_parallelism supporting TP/CP/PP
- Fix critical bug: apply HSDP before computing reference logprobs cache
- Add LoRA error check (not supported with OLMo-core)
- Remove unreachable make_disable_adapter_context function
- Reorganize DPO scripts to scripts/train/debug/dpo/
- Add local.sh for testing without Beaker

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Only the main process should create the cache directory and test write
permissions. Other ranks now wait at a barrier until this is complete.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
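The rank-0-creates, everyone-waits pattern from the commit above can be sketched as follows; `barrier` is injected in place of torch.distributed.barrier, and the helper name is illustrative:

```python
import os

def prepare_cache_dir(path, is_main_process, barrier):
    # Rank 0 creates the directory and probes write access; all other ranks
    # wait at the barrier so no rank reads a half-created directory.
    if is_main_process:
        os.makedirs(path, exist_ok=True)
        probe = os.path.join(path, ".write_test")
        with open(probe, "w") as f:
            f.write("ok")
        os.remove(probe)
    barrier()
```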
finbarrtimbers (Collaborator, Author) commented

DPO Experiment Results

Ran single GPU DPO script (Beaker) and multi-node DPO (Beaker) scripts.

Results:

  • Single GPU DPO: ✅ Passed (exit code 0)
  • Multi-node DPO: ✅ Training completed successfully with checkpoints saved. Had a non-critical exit code 1 due to a multi-node barrier failure in post-training cleanup (a known issue with multi-node jobs where one node finishes before the other).

Changes in this commit:

  • Move generic OLMo code to olmo_core_utils.py
  • Add support for tensor, context, and pipeline parallelism (tensor_parallel_degree, context_parallel_degree, pipeline_parallel_degree)
  • Add LoRA error check (not supported with OLMo-core)
  • Fix critical bug: apply HSDP before computing reference logprobs cache
  • Fix race condition in reference logprobs cache directory creation
  • Reorganize DPO scripts to scripts/train/debug/dpo/
  • Add local.sh for testing without Beaker

Two barrier issues caused "Connection closed by peer" gloo errors during
post-training cleanup:

1. Unconditional barrier at start of _handle_post_training called even
   when distributed training wasn't active

2. Asymmetric barrier inside beaker save conditional - only main_process
   reached this code due to is_main_process check, causing non-main
   processes to hang at the barrier while main does file I/O

Fix: Gate the initial barrier on is_distributed() and remove the
asymmetric inner barrier entirely since only main_process enters
that code block anyway.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
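The fixed control flow can be sketched like this, with `dist` standing in for torch.distributed and the helper names invented for illustration:

```python
def handle_post_training(dist, is_main_process, save_to_beaker):
    """Post-training cleanup with the two barrier fixes applied.

    The initial barrier only runs when distributed training is active, and
    there is no barrier inside the main-process-only branch: non-main ranks
    never enter it, so a barrier there would be asymmetric and deadlock.
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
    if is_main_process:
        save_to_beaker()
```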
finbarrtimbers (Collaborator, Author) commented

Mostly some comments. Some other things:

  1. could we add a single-gpu script that runs locally? I tried uv run torchrun --standalone --nproc_per_node=1 open_instruct/dpo.py --model_name_or_path allenai/OLMo-2-0425-1B --tokenizer_name allenai/OLMo-2-0425-1B --use_flash_attn false --max_seq_length 1024 --per_device_train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 5e-07 --lr_scheduler_type linear --warmup_ratio 0.1 --weight_decay 0.0 --num_epochs 3 --output_dir output/dpo_olmo_core_debug/ --logging_steps 1 --mixer_list allenai/tulu-3-wildchat-reused-on-policy-8b 100 --chat_template_name olmo --seed 123 --try_launch_beaker_eval_jobs false but it (a) errored initially after building the cache and then (b) hung on further training. Just make a beaker image with beaker://ai2/cuda12.8-dev-ubuntu22.04-notorch and try running with uv to recreate.
  2. It looks like the multi-node job ran okay but exited with an error? Is that fixable?

Fixed the multi-node job and added a local script!

finbarrtimbers and others added 18 commits January 22, 2026 12:38
Log entry/exit for all ranks and each step in the export process.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The full_tensor() call on DTensors is a collective operation that requires
all ranks to participate. Move the conversion outside the is_main_process
check so all ranks call full_tensor().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
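The collective-safe export pattern from this fix can be sketched as follows; the hasattr check stands in for an isinstance(DTensor) check, and the function name is illustrative:

```python
def gather_full_state_dict(state_dict, is_main_process):
    """All ranks materialize full tensors; only rank 0 keeps the result.

    full_tensor() on a DTensor is a collective op, so every rank must call
    it. Guarding the conversion with is_main_process makes non-main ranks
    skip their side of the collective and hangs the job.
    """
    full = {
        name: (t.full_tensor() if hasattr(t, "full_tensor") else t)
        for name, t in state_dict.items()
    }
    return full if is_main_process else None
```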
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add drop_last parameter to HFDataLoader. When drop_last=False, pad the
remainder with repeated indices to fill a complete batch, ensuring all
dataset indices are processed. Use drop_last=False for the cache-building
dataloader to prevent -inf values in the reference logprobs cache.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
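A minimal sketch of the padding logic described above; `index_batches` is an illustrative stand-in for HFDataLoader's batching, not the actual implementation:

```python
def index_batches(num_examples, batch_size, drop_last=True):
    """Yield batches of dataset indices.

    With drop_last=False, the final partial batch is padded by cycling its
    own indices so every batch is full and every index is visited -- this
    is what keeps -inf placeholders out of the reference logprobs cache.
    """
    full = (num_examples // batch_size) * batch_size
    for start in range(0, full, batch_size):
        yield list(range(start, start + batch_size))
    remainder = list(range(full, num_examples))
    if remainder and not drop_last:
        # Repeat the remainder indices until the batch is full.
        yield (remainder * batch_size)[:batch_size]
```

Duplicated indices simply recompute (and overwrite) the same cache entries, so correctness is unaffected.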
Forward-only cache pass doesn't store activations, so we can use 3x
the training batch size. Also display avg_tok/ex, MFU%, and mem_GB
in the tqdm progress bar during cache building.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The DPO reference logprobs cache is forward-only (no backward pass), so
the full unsharded model may fit in GPU memory and avoids allgather
communication overhead. If it OOMs, we catch the error, clear the CUDA
cache, apply FSDP, and retry.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
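The try-unsharded-then-shard flow can be sketched like this; the injected callables are stand-ins, and MemoryError substitutes for torch.cuda.OutOfMemoryError so the sketch stays framework-free:

```python
def build_cache_with_fallback(run_forward_pass, apply_fsdp, free_gpu_memory,
                              oom_error=MemoryError):
    """Try the forward-only cache pass on the full unsharded model first;
    it avoids allgather overhead. On OOM: free cached memory (e.g.
    torch.cuda.empty_cache()), apply FSDP sharding, and retry once."""
    try:
        return run_forward_pass()
    except oom_error:
        free_gpu_memory()
        apply_fsdp()
        return run_forward_pass()
```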
Fix data loader tests that used single_example_collator with batch_size > 1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The auto-detection was selecting flash_3 for H100 GPUs without checking
if the package is actually installed, causing RuntimeError on startup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
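A sketch of availability-checked backend selection; the package and backend names here are illustrative, not OLMo-core's actual identifiers:

```python
import importlib.util

def pick_attn_backend(device_name):
    # Prefer FlashAttention-3 on H100 GPUs, but only when the package is
    # actually importable; otherwise fall back instead of raising at startup.
    if "H100" in device_name and importlib.util.find_spec("flash_attn_3"):
        return "flash_3"
    return "flash_2"
```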
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use logger instead of print for output
- Remove unused model.load_state_dict() call

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@hamishivi (Collaborator) left a comment

LGTM!

@finbarrtimbers finbarrtimbers added this pull request to the merge queue Jan 26, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to a conflict with the base branch Jan 26, 2026
@finbarrtimbers finbarrtimbers added this pull request to the merge queue Jan 26, 2026
Merged via the queue into main with commit 8befd55 Jan 26, 2026
7 checks passed
@finbarrtimbers finbarrtimbers deleted the finbarr/olmo-core-dpo-base branch January 26, 2026 18:13
lukashelff pushed a commit to lukashelff/open-instruct-slurm that referenced this pull request Feb 19, 2026
* Add OLMo-core based DPO training module

- Add dpo.py: New DPO training module using OLMo-core's TrainModule with HSDP support
- Add build_reference_logprobs_cache_olmo: Generic reference logprobs caching for OLMo-core
- Add compute_loss_olmo: Wrapper for DPO loss computation with ExperimentConfig
- Add concatenated_forward_olmo and separate_forward_olmo: OLMo-core forward functions
- Update mason.py: Add dpo.py to OPEN_INSTRUCT_COMMANDS
- Update debug scripts to use torchrun with OLMo-core models

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Cleaned up PR.

* Add OLMo-core train modules for DPO training

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix SpeedMonitorCallback parameter name

Change device_peak_flops_per_second to device_peak_flops to match
the OLMo-core API.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix CheckpointerCallback save_interval validation

Set default checkpointing_steps to 500 when not specified, since
the OLMo-core API requires save_interval >= 1.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Move checkpointing_steps default value to config class

Move the default value for checkpointing_steps (500) from dpo.py to the
CheckpointConfig dataclass in dpo_utils.py. This centralizes the default
and removes the conditional logic in the callback setup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove duplicate checkpointing_steps field from ExperimentConfig

The checkpointing_steps field was defined in both CheckpointConfig (the
parent class) and ExperimentConfig. The duplicate field in ExperimentConfig
had default=None, which overrode the parent class's default of 500, causing
a TypeError when int() was called on None in dpo.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add Saturn cluster to medium_dpo.sh script

Add Saturn as an alternative cluster to help with multi-node scheduling
reliability.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* updated changelog

* Remove explicit torchrun multi-node args from DPO scripts

OLMo-core's prepare_training_environment() handles multi-node setup
internally using Beaker's environment variables. The explicit --nnodes,
--standalone, and --rdzv_backend=c10d arguments interfere with this and
cause RendezvousTimeoutError on multi-node runs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fixed linter errors

* Refactor DPO OLMo-core: add parallelism support, fix HSDP order

- Move OLMO_MODEL_CONFIG_MAP and get_transformer_config to olmo_core_utils.py
- Add tensor_parallel_degree, context_parallel_degree, pipeline_parallel_degree
- Replace _apply_hsdp with _apply_parallelism supporting TP/CP/PP
- Fix critical bug: apply HSDP before computing reference logprobs cache
- Add LoRA error check (not supported with OLMo-core)
- Remove unreachable make_disable_adapter_context function
- Reorganize DPO scripts to scripts/train/debug/dpo/
- Add local.sh for testing without Beaker

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix race condition in reference logprobs cache directory creation

Only the main process should create the cache directory and test write
permissions. Other ranks now wait at a barrier until this is complete.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix multi-node DPO post-training barrier failures

Two barrier issues caused "Connection closed by peer" gloo errors during
post-training cleanup:

1. Unconditional barrier at start of _handle_post_training called even
   when distributed training wasn't active

2. Asymmetric barrier inside beaker save conditional - only main_process
   reached this code due to is_main_process check, causing non-main
   processes to hang at the barrier while main does file I/O

Fix: Gate the initial barrier on is_distributed() and remove the
asymmetric inner barrier entirely since only main_process enters
that code block anyway.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove redundant compute_loss_olmo wrapper function

ExperimentConfig inherits from DPOConfig, so compute_loss() accepts
ExperimentConfig directly. The wrapper was unnecessarily creating a new
DPOConfig object when one wasn't needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* run urgent tests

* Fix case-insensitive beaker secret lookup

Beaker stores secret names case-insensitively, but Python's `in` operator
is case-sensitive. This caused lookups for `finbarrt_WANDB_API_KEY` to fail
when the secret was stored as `FINBARRT_WANDB_API_KEY`.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
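A minimal sketch of a case-insensitive lookup that matches the fix described above; `find_secret` is an illustrative helper, not mason.py's actual code:

```python
def find_secret(existing_names, wanted):
    """Return the stored spelling of a secret whose name matches `wanted`,
    ignoring case, since Beaker treats secret names case-insensitively."""
    wanted_cf = wanted.casefold()
    for name in existing_names:
        if name.casefold() == wanted_cf:
            return name
    return None
```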

* Updated mason.py

* Add uv run prefix to local DPO script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Save DPO models in HuggingFace format for evals

DPO training was saving models in olmo-core format, but eval jobs
and push_folder_to_hub expect HuggingFace format. Use olmo-core's
save_hf_model() to convert the trained model to HF format in
output_dir/hf_model/ before launching evals or pushing to hub.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix WEKA_CLUSTERS import in submit_eval_jobs.py

WEKA_CLUSTERS is defined in launch_utils, not utils. Import launch_utils
and use launch_utils.WEKA_CLUSTERS instead of utils.WEKA_CLUSTERS.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update GRPO single GPU script to use DPO-trained model

Use the DPO-trained OLMo model from allenai/open_instruct_dev with
revision dpo_olmo_core_debug_test instead of Qwen/Qwen3-1.7B.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add --add_bos flag for OLMo model in GRPO script

OLMo models require the --add_bos flag to be set.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Copy original HF config when saving DPO model

The save_hf_model() function creates an incorrect config.json with
wrong values for num_hidden_layers, eos_token_id, etc. Copy the
original model's config.json to preserve the correct values.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use Weka path directly for DPO model in GRPO test

The HuggingFace model config was still incorrect, so use the
Weka path directly where the model was saved.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add logging for config.json save in DPO

Helps debug issues with model config not being saved correctly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update GRPO script to use new DPO model path

Use the latest DPO model that was saved with correct config.json.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix DPO HF model saving to use correct layer count

The save_hf_model function from olmo-core was creating extra layers
in the output. Instead, use convert_state_to_hf with the original
HuggingFace config and save using transformers' native save_pretrained.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix OLMo-2-0425-1B config mapping to use correct layer count

The olmo2_1B config has 18 layers but the actual HuggingFace model
has 16 layers. Use olmo2_1B_v2 which has the correct 16 layers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix HF model loading to use from_config instead of from_pretrained

Cannot pass state_dict together with a model name. Use from_config
to create the model, then load_state_dict to load the weights.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Revert to using save_hf_model for DPO model saving

The convert_state_to_hf approach doesn't work with DTensors from
distributed training. Use save_hf_model which handles DTensors
properly. The config mapping has been fixed so save_hf_model should
now produce correct layer counts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update GRPO script to use DPO model with correct 16 layers

Use the model saved from the DPO run with fixed config mapping.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Copy original HF config after save_hf_model

The save_hf_model function creates an incomplete config.json that
is missing fields like max_position_embeddings. Copy the original
model's config to ensure vLLM can load the model.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update GRPO script to use DPO model with complete config

Use the model from the DPO run with copied original config
that includes max_position_embeddings.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add OLMo3-7B DPO script using OLMo-core trainer

New script that uses dpo.py (OLMo-core + FSDP) instead of
dpo_tune_cache.py (Accelerate + DeepSpeed) for DPO training.
Configured for 2 nodes with 8k sequence length.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add documentation for adding OLMo-core models

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add --no_auto_dataset_cache to DPO script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix multi-node torchrun configuration for DPO

Add missing torchrun multi-node parameters:
- --nnodes to specify total number of nodes
- --node_rank for each node's rank
- --master_addr for coordinator address
- --master_port for coordinator port

These use Beaker environment variables that get substituted at runtime.
Without these, each node ran independently without distributed communication.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix nnodes to use hardcoded value instead of BEAKER_NUM_REPLICAS

BEAKER_NUM_REPLICAS is not a valid Beaker environment variable.
Use hardcoded value of 2 to match --num_nodes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add torchrun multi-node parameters to debug DPO multi_node.sh

Same fix as 7b_instruct_dpo_olmo_core.sh - add nnodes, node_rank,
master_addr, and master_port for proper multi-node coordination.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add OLMO_SHARED_FS=1 env var for multi-node DPO scripts

OLMo-core's checkpointing code requires this env var to be set when
using a shared filesystem (like Weka) to avoid unnecessary distributed
coordination for filesystem operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add comment about cache cleanup for corrupted dataset cache

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove cache cleanup comment

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Support separate model config and weights for OLMo-core DPO

Allow users to specify a config_name separately from model_name_or_path,
enabling local model paths to work with OLMo-core DPO training.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix save_hf_model for FSDP-wrapped models in DPO

Add export_to_hf() function that builds an unwrapped model from config
and loads the FSDP state dict before saving. This avoids the type check
failure in olmo-core's get_hf_config() for FSDP-wrapped models.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix DTensor to Tensor conversion in export_to_hf

Convert DTensors from FSDP state dict to regular CPU tensors before
loading into the unwrapped model.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix FSDP state_dict collective operation for multi-node export

All ranks must participate in model.state_dict() as it's a collective
operation for FSDP models. Only rank 0 now saves to disk.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add detailed logging to export_to_hf for debugging

Log entry/exit for all ranks and each step in the export process.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix DTensor full_tensor() collective operation in export

The full_tensor() call on DTensors is a collective operation that requires
all ranks to participate. Move the conversion outside the is_main_process
check so all ranks call full_tensor().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Clean up debug logging in export_to_hf

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix missing indices in DPO reference logprobs caching

Add drop_last parameter to HFDataLoader. When drop_last=False, pad the
remainder with repeated indices to fill a complete batch, ensuring all
dataset indices are processed. Use drop_last=False for the cache-building
dataloader to prevent -inf values in the reference logprobs cache.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add MFU/memory/token metrics to cache building + 3x cache batch size

Forward-only cache pass doesn't store activations, so we can use 3x
the training batch size. Also display avg_tok/ex, MFU%, and mem_GB
in the tqdm progress bar during cache building.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add --cache_logprobs_only flag for DPO cache forward-pass benchmarking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update DPO cache benchmark to match production OLMo3-7B config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Now, we avoid the torch warning

* 6x cache batch size + mem% in DPO cache tqdm

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Reduce cache batch multiplier to 4x (6x OOMed)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Try unsharded cache build, fall back to FSDP on OOM

The DPO reference logprobs cache is forward-only (no backward pass), so
the full unsharded model may fit in GPU memory and avoids allgather
communication overhead. If it OOMs, we catch the error, clear the CUDA
cache, apply FSDP, and retry.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix data loader tests that used single_example_collator with batch_size > 1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix attn_backend auto-detection: check flash_attn_3 availability

The auto-detection was selecting flash_3 for H100 GPUs without checking
if the package is actually installed, causing RuntimeError on startup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* added export to HF function

* Added script to convert olmo core to HF format.

* Add example usage to olmo-core to HF conversion script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix code review issues in convert_olmo_core_to_hf.py

- Use logger instead of print for output
- Remove unused model.load_state_dict() call

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>