Add OLMo-core based DPO training module #1391

Merged
finbarrtimbers merged 70 commits into main from finbarr/olmo-core-dpo-base on Jan 26, 2026

Conversation

@finbarrtimbers (Collaborator) commented Jan 20, 2026

Adds an OLMo-core based DPO implementation in dpo.py.

Runs:

  • Single GPU DPO: Beaker
  • Multi-node DPO: Beaker
  • Also ran single GPU GRPO based on the result of running single GPU DPO: Beaker

GPU_TESTS=01KFH82YPAC7ECB4XXEHH10TS7

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the training capabilities by integrating Direct Preference Optimization (DPO) directly into the OLMo-core framework. It introduces a dedicated DPO training module that leverages OLMo-core's distributed training infrastructure, including HSDP, to enable scalable and efficient fine-tuning of large language models using preference data. The changes streamline the process of applying DPO by providing specialized utilities for reference log probability caching and adapting model forward passes for the OLMo-core architecture.

Highlights

  • DPO Training Module: Introduced a new dpo.py module for Direct Preference Optimization (DPO) training, fully integrated with OLMo-core's TrainModule and supporting Hybrid Sharded Data Parallel (HSDP).
  • Reference Logprobs Caching: Added build_reference_logprobs_cache_olmo to dpo_utils.py, providing a generic mechanism for efficient caching of reference model log probabilities within the OLMo-core framework.
  • DPO Loss Computation: Implemented compute_loss_olmo in dpo_utils.py as a wrapper for DPO loss calculations, ensuring seamless integration with ExperimentConfig.
  • OLMo-core Forward Functions: Provided concatenated_forward_olmo and separate_forward_olmo within the DPO context to handle model forward passes specifically tailored for OLMo-core models.
  • Integration with Mason: Updated mason.py to include the new open_instruct/dpo.py script in the OPEN_INSTRUCT_COMMANDS list, making the DPO training functionality discoverable and runnable.
  • Debug Script Updates: Modified existing debug scripts (dpo.sh, large_dpo.sh, medium_dpo.sh) to utilize torchrun for launching DPO training with OLMo-core models, replacing the previous accelerate launch method.



@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a new DPO (Direct Preference Optimization) training module that leverages OLMo-core's native training infrastructure, including its TrainModule and HSDP support. The changes are well-structured, integrating new utility functions for reference log-probability caching and loss computation tailored for OLMo-core models. The accompanying debug scripts have been updated to reflect the new torchrun based launch mechanism and OLMo-core specific model configurations. The implementation appears robust and correctly handles distributed training aspects.

Comment thread: open_instruct/dpo.py (outdated)

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a43a450fd7


Comment thread open_instruct/dpo.py
Comment on lines +447 to +448
device_name = utils.get_device_name(torch.cuda.get_device_name(0))
device_peak_flops = int(utils.GPU_SPECS[device_name]["flops"])

P2: Guard GPU-only device name lookup

This module explicitly falls back to CPU (device = "cpu" when CUDA is unavailable), but it later unconditionally calls torch.cuda.get_device_name(0). On a CPU-only host (or any environment where CUDA isn’t initialized), that call raises and the training run crashes before callbacks are built. If CPU fallback is intentional, this needs a CUDA availability guard or to skip the speed monitor setup when CUDA isn’t available.
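A minimal sketch of the suggested guard, with the CUDA query and spec table injected as parameters (`resolve_peak_flops`, `device_name_fn`, and `gpu_specs` are illustrative names, not the module's actual API):

```python
def resolve_peak_flops(gpu_specs, device_name_fn, cuda_available):
    """Return peak FLOPs for the local accelerator, or None on CPU-only hosts.

    gpu_specs maps device names to spec dicts with a "flops" entry;
    device_name_fn stands in for torch.cuda.get_device_name(0) and is only
    called once we know CUDA is available.
    """
    if not cuda_available:
        # Skip the speed-monitor setup entirely instead of crashing.
        return None
    spec = gpu_specs.get(device_name_fn())
    return int(spec["flops"]) if spec else None
```

With this shape, the caller can simply skip building the speed-monitor callback when the function returns None.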


Comment thread open_instruct/dpo.py Outdated
Comment on lines +361 to +364
def make_disable_adapter_context() -> contextlib.AbstractContextManager:
if args.use_lora:
assert isinstance(model, peft.PeftModel)
return model.disable_adapter()

P2: --use_lora crashes without LoRA setup

When --use_lora is enabled, the code asserts that the model is already a peft.PeftModel, but this script never applies any LoRA wrapping to the OLMo-core model. As a result, any run that enables --use_lora will immediately assert and abort during reference logprob caching. Either LoRA needs to be applied before this point or the script should error out earlier with a clear “not supported” message.
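A sketch of the fail-fast alternative the comment suggests; `check_unsupported_flags` is a hypothetical helper, not code from the PR:

```python
def check_unsupported_flags(use_lora: bool) -> None:
    # Surface a clear "not supported" error at argument-validation time
    # instead of an assertion failure deep inside reference-logprob caching.
    if use_lora:
        raise NotImplementedError(
            "--use_lora is not supported by the OLMo-core DPO trainer; "
            "no LoRA wrapping is applied to the model."
        )
```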


Comment thread open_instruct/dpo.py
Comment thread open_instruct/dpo.py Outdated
- Add dpo.py: New DPO training module using OLMo-core's TrainModule with HSDP support
- Add build_reference_logprobs_cache_olmo: Generic reference logprobs caching for OLMo-core
- Add compute_loss_olmo: Wrapper for DPO loss computation with ExperimentConfig
- Add concatenated_forward_olmo and separate_forward_olmo: OLMo-core forward functions
- Update mason.py: Add dpo.py to OPEN_INSTRUCT_COMMANDS
- Update debug scripts to use torchrun with OLMo-core models

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
finbarrtimbers force-pushed the finbarr/olmo-core-dpo-base branch from a43a450 to 69120dd on January 20, 2026 16:22
finbarrtimbers and others added 7 commits January 20, 2026 10:01
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Change device_peak_flops_per_second to device_peak_flops to match
the OLMo-core API.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Set default checkpointing_steps to 500 when not specified, since
the OLMo-core API requires save_interval >= 1.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Move the default value for checkpointing_steps (500) from dpo.py to the
CheckpointConfig dataclass in dpo_utils.py. This centralizes the default
and removes the conditional logic in the callback setup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The checkpointing_steps field was defined in both CheckpointConfig (the
parent class) and ExperimentConfig. The duplicate field in ExperimentConfig
had default=None, which overrode the parent class's default of 500, causing
a TypeError when int() was called on None in dpo.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Saturn as an alternative cluster to help with multi-node scheduling
reliability.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
finbarrtimbers (Collaborator, Author) commented

Experiment Results

Ran single GPU DPO script (Beaker) successfully.

Multi-node DPO experiments are experiencing Beaker cluster rendezvous timeout issues (infrastructure-related, not code issues). Will re-run when cluster stability improves.

@hamishivi (Collaborator) left a comment

Mostly some comments. Some other things:

  1. could we add a single-gpu script that runs locally? I tried uv run torchrun --standalone --nproc_per_node=1 open_instruct/dpo.py --model_name_or_path allenai/OLMo-2-0425-1B --tokenizer_name allenai/OLMo-2-0425-1B --use_flash_attn false --max_seq_length 1024 --per_device_train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 5e-07 --lr_scheduler_type linear --warmup_ratio 0.1 --weight_decay 0.0 --num_epochs 3 --output_dir output/dpo_olmo_core_debug/ --logging_steps 1 --mixer_list allenai/tulu-3-wildchat-reused-on-policy-8b 100 --chat_template_name olmo --seed 123 --try_launch_beaker_eval_jobs false but it (a) errored initially after building the cache and then (b) hung on further training. Just make a beaker image with beaker://ai2/cuda12.8-dev-ubuntu22.04-notorch and try running with uv to recreate.

  2. It looks like the multi-node job ran okay but exited with an error? Is that fixable?

Comment thread open_instruct/dpo.py Outdated
Comment thread open_instruct/dpo.py Outdated
Comment thread open_instruct/dpo_utils.py
Comment thread scripts/train/debug/dpo/single_gpu.sh
Comment thread open_instruct/dpo.py Outdated
Comment thread open_instruct/dpo.py Outdated
finbarrtimbers and others added 5 commits January 21, 2026 09:30
OLMo-core's prepare_training_environment() handles multi-node setup
internally using Beaker's environment variables. The explicit --nnodes,
--standalone, and --rdzv_backend=c10d arguments interfere with this and
cause RendezvousTimeoutError on multi-node runs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move OLMO_MODEL_CONFIG_MAP and get_transformer_config to olmo_core_utils.py
- Add tensor_parallel_degree, context_parallel_degree, pipeline_parallel_degree
- Replace _apply_hsdp with _apply_parallelism supporting TP/CP/PP
- Fix critical bug: apply HSDP before computing reference logprobs cache
- Add LoRA error check (not supported with OLMo-core)
- Remove unreachable make_disable_adapter_context function
- Reorganize DPO scripts to scripts/train/debug/dpo/
- Add local.sh for testing without Beaker

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Only the main process should create the cache directory and test write
permissions. Other ranks now wait at a barrier until this is complete.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
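The rank-0-creates, everyone-waits pattern from the commit above can be sketched as follows; `barrier` is injected in place of torch.distributed.barrier, and the helper name is illustrative:

```python
import os

def prepare_cache_dir(path, is_main_process, barrier):
    # Rank 0 creates the directory and probes write access; all other ranks
    # wait at the barrier so no rank reads a half-created directory.
    if is_main_process:
        os.makedirs(path, exist_ok=True)
        probe = os.path.join(path, ".write_test")
        with open(probe, "w") as f:
            f.write("ok")
        os.remove(probe)
    barrier()
```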
finbarrtimbers (Collaborator, Author) commented

DPO Experiment Results

Ran single GPU DPO script (Beaker) and multi-node DPO (Beaker) scripts.

Results:

  • Single GPU DPO: ✅ Passed (exit code 0)
  • Multi-node DPO: ✅ Training completed successfully with checkpoints saved. Had a non-critical exit code 1 due to a multi-node barrier failure in post-training cleanup (a known issue with multi-node jobs where one node finishes before the other).

Changes in this commit:

  • Move generic OLMo code to olmo_core_utils.py
  • Add support for tensor, context, and pipeline parallelism (tensor_parallel_degree, context_parallel_degree, pipeline_parallel_degree)
  • Add LoRA error check (not supported with OLMo-core)
  • Fix critical bug: apply HSDP before computing reference logprobs cache
  • Fix race condition in reference logprobs cache directory creation
  • Reorganize DPO scripts to scripts/train/debug/dpo/
  • Add local.sh for testing without Beaker

Two barrier issues caused "Connection closed by peer" gloo errors during
post-training cleanup:

1. Unconditional barrier at start of _handle_post_training called even
   when distributed training wasn't active

2. Asymmetric barrier inside beaker save conditional - only main_process
   reached this code due to is_main_process check, causing non-main
   processes to hang at the barrier while main does file I/O

Fix: Gate the initial barrier on is_distributed() and remove the
asymmetric inner barrier entirely since only main_process enters
that code block anyway.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
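The fixed control flow can be sketched like this, with `dist` standing in for torch.distributed and the helper names invented for illustration:

```python
def handle_post_training(dist, is_main_process, save_to_beaker):
    """Post-training cleanup with the two barrier fixes applied.

    The initial barrier only runs when distributed training is active, and
    there is no barrier inside the main-process-only branch: non-main ranks
    never enter it, so a barrier there would be asymmetric and deadlock.
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
    if is_main_process:
        save_to_beaker()
```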
finbarrtimbers (Collaborator, Author) commented

Mostly some comments. Some other things:

  1. could we add a single-gpu script that runs locally? I tried uv run torchrun --standalone --nproc_per_node=1 open_instruct/dpo.py --model_name_or_path allenai/OLMo-2-0425-1B --tokenizer_name allenai/OLMo-2-0425-1B --use_flash_attn false --max_seq_length 1024 --per_device_train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 5e-07 --lr_scheduler_type linear --warmup_ratio 0.1 --weight_decay 0.0 --num_epochs 3 --output_dir output/dpo_olmo_core_debug/ --logging_steps 1 --mixer_list allenai/tulu-3-wildchat-reused-on-policy-8b 100 --chat_template_name olmo --seed 123 --try_launch_beaker_eval_jobs false but it (a) errored initially after building the cache and then (b) hung on further training. Just make a beaker image with beaker://ai2/cuda12.8-dev-ubuntu22.04-notorch and try running with uv to recreate.
  2. It looks like the multi-node job ran okay but exited with an error? Is that fixable?

Fixed the multi-node job and added a local script!

finbarrtimbers and others added 18 commits January 22, 2026 12:38
Log entry/exit for all ranks and each step in the export process.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The full_tensor() call on DTensors is a collective operation that requires
all ranks to participate. Move the conversion outside the is_main_process
check so all ranks call full_tensor().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
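The collective-safe export pattern from this fix can be sketched as follows; the hasattr check stands in for an isinstance(DTensor) check, and the function name is illustrative:

```python
def gather_full_state_dict(state_dict, is_main_process):
    """All ranks materialize full tensors; only rank 0 keeps the result.

    full_tensor() on a DTensor is a collective op, so every rank must call
    it. Guarding the conversion with is_main_process makes non-main ranks
    skip their side of the collective and hangs the job.
    """
    full = {
        name: (t.full_tensor() if hasattr(t, "full_tensor") else t)
        for name, t in state_dict.items()
    }
    return full if is_main_process else None
```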
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add drop_last parameter to HFDataLoader. When drop_last=False, pad the
remainder with repeated indices to fill a complete batch, ensuring all
dataset indices are processed. Use drop_last=False for the cache-building
dataloader to prevent -inf values in the reference logprobs cache.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
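A minimal sketch of the padding logic described above; `index_batches` is an illustrative stand-in for HFDataLoader's batching, not the actual implementation:

```python
def index_batches(num_examples, batch_size, drop_last=True):
    """Yield batches of dataset indices.

    With drop_last=False, the final partial batch is padded by cycling its
    own indices so every batch is full and every index is visited -- this
    is what keeps -inf placeholders out of the reference logprobs cache.
    """
    full = (num_examples // batch_size) * batch_size
    for start in range(0, full, batch_size):
        yield list(range(start, start + batch_size))
    remainder = list(range(full, num_examples))
    if remainder and not drop_last:
        # Repeat the remainder indices until the batch is full.
        yield (remainder * batch_size)[:batch_size]
```

Duplicated indices simply recompute (and overwrite) the same cache entries, so correctness is unaffected.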
Forward-only cache pass doesn't store activations, so we can use 3x
the training batch size. Also display avg_tok/ex, MFU%, and mem_GB
in the tqdm progress bar during cache building.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The DPO reference logprobs cache is forward-only (no backward pass), so
the full unsharded model may fit in GPU memory and avoids allgather
communication overhead. If it OOMs, we catch the error, clear the CUDA
cache, apply FSDP, and retry.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
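The try-unsharded-then-shard flow can be sketched like this; the injected callables are stand-ins, and MemoryError substitutes for torch.cuda.OutOfMemoryError so the sketch stays framework-free:

```python
def build_cache_with_fallback(run_forward_pass, apply_fsdp, free_gpu_memory,
                              oom_error=MemoryError):
    """Try the forward-only cache pass on the full unsharded model first;
    it avoids allgather overhead. On OOM: free cached memory (e.g.
    torch.cuda.empty_cache()), apply FSDP sharding, and retry once."""
    try:
        return run_forward_pass()
    except oom_error:
        free_gpu_memory()
        apply_fsdp()
        return run_forward_pass()
```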
Fix data loader tests that used single_example_collator with batch_size > 1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The auto-detection was selecting flash_3 for H100 GPUs without checking
if the package is actually installed, causing RuntimeError on startup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
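A sketch of availability-checked backend selection; the package and backend names here are illustrative, not OLMo-core's actual identifiers:

```python
import importlib.util

def pick_attn_backend(device_name):
    # Prefer FlashAttention-3 on H100 GPUs, but only when the package is
    # actually importable; otherwise fall back instead of raising at startup.
    if "H100" in device_name and importlib.util.find_spec("flash_attn_3"):
        return "flash_3"
    return "flash_2"
```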
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use logger instead of print for output
- Remove unused model.load_state_dict() call

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@hamishivi (Collaborator) left a comment

LGTM!

@finbarrtimbers finbarrtimbers added this pull request to the merge queue Jan 26, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to a conflict with the base branch Jan 26, 2026
@finbarrtimbers finbarrtimbers added this pull request to the merge queue Jan 26, 2026
Merged via the queue into main with commit 8befd55 Jan 26, 2026
7 checks passed
@finbarrtimbers finbarrtimbers deleted the finbarr/olmo-core-dpo-base branch January 26, 2026 18:13
lukashelff pushed a commit to lukashelff/open-instruct-slurm that referenced this pull request Feb 19, 2026
* Add OLMo-core based DPO training module

- Add dpo.py: New DPO training module using OLMo-core's TrainModule with HSDP support
- Add build_reference_logprobs_cache_olmo: Generic reference logprobs caching for OLMo-core
- Add compute_loss_olmo: Wrapper for DPO loss computation with ExperimentConfig
- Add concatenated_forward_olmo and separate_forward_olmo: OLMo-core forward functions
- Update mason.py: Add dpo.py to OPEN_INSTRUCT_COMMANDS
- Update debug scripts to use torchrun with OLMo-core models

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Cleaned up PR.

* Add OLMo-core train modules for DPO training

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix SpeedMonitorCallback parameter name

Change device_peak_flops_per_second to device_peak_flops to match
the OLMo-core API.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix CheckpointerCallback save_interval validation

Set default checkpointing_steps to 500 when not specified, since
the OLMo-core API requires save_interval >= 1.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Move checkpointing_steps default value to config class

Move the default value for checkpointing_steps (500) from dpo.py to the
CheckpointConfig dataclass in dpo_utils.py. This centralizes the default
and removes the conditional logic in the callback setup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove duplicate checkpointing_steps field from ExperimentConfig

The checkpointing_steps field was defined in both CheckpointConfig (the
parent class) and ExperimentConfig. The duplicate field in ExperimentConfig
had default=None, which overrode the parent class's default of 500, causing
a TypeError when int() was called on None in dpo.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add Saturn cluster to medium_dpo.sh script

Add Saturn as an alternative cluster to help with multi-node scheduling
reliability.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* updated changelog

* Remove explicit torchrun multi-node args from DPO scripts

OLMo-core's prepare_training_environment() handles multi-node setup
internally using Beaker's environment variables. The explicit --nnodes,
--standalone, and --rdzv_backend=c10d arguments interfere with this and
cause RendezvousTimeoutError on multi-node runs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fixed linter errors

* Refactor DPO OLMo-core: add parallelism support, fix HSDP order

- Move OLMO_MODEL_CONFIG_MAP and get_transformer_config to olmo_core_utils.py
- Add tensor_parallel_degree, context_parallel_degree, pipeline_parallel_degree
- Replace _apply_hsdp with _apply_parallelism supporting TP/CP/PP
- Fix critical bug: apply HSDP before computing reference logprobs cache
- Add LoRA error check (not supported with OLMo-core)
- Remove unreachable make_disable_adapter_context function
- Reorganize DPO scripts to scripts/train/debug/dpo/
- Add local.sh for testing without Beaker

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix race condition in reference logprobs cache directory creation

Only the main process should create the cache directory and test write
permissions. Other ranks now wait at a barrier until this is complete.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix multi-node DPO post-training barrier failures

Two barrier issues caused "Connection closed by peer" gloo errors during
post-training cleanup:

1. Unconditional barrier at start of _handle_post_training called even
   when distributed training wasn't active

2. Asymmetric barrier inside beaker save conditional - only main_process
   reached this code due to is_main_process check, causing non-main
   processes to hang at the barrier while main does file I/O

Fix: Gate the initial barrier on is_distributed() and remove the
asymmetric inner barrier entirely since only main_process enters
that code block anyway.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove redundant compute_loss_olmo wrapper function

ExperimentConfig inherits from DPOConfig, so compute_loss() accepts
ExperimentConfig directly. The wrapper was unnecessarily creating a new
DPOConfig object when one wasn't needed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* run urgent tests

* Fix case-insensitive beaker secret lookup

Beaker stores secret names case-insensitively, but Python's `in` operator
is case-sensitive. This caused lookups for `finbarrt_WANDB_API_KEY` to fail
when the secret was stored as `FINBARRT_WANDB_API_KEY`.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
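A minimal sketch of a case-insensitive lookup that matches the fix described above; `find_secret` is an illustrative helper, not mason.py's actual code:

```python
def find_secret(existing_names, wanted):
    """Return the stored spelling of a secret whose name matches `wanted`,
    ignoring case, since Beaker treats secret names case-insensitively."""
    wanted_cf = wanted.casefold()
    for name in existing_names:
        if name.casefold() == wanted_cf:
            return name
    return None
```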

* Updated mason.py

* Add uv run prefix to local DPO script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Save DPO models in HuggingFace format for evals

DPO training was saving models in olmo-core format, but eval jobs
and push_folder_to_hub expect HuggingFace format. Use olmo-core's
save_hf_model() to convert the trained model to HF format in
output_dir/hf_model/ before launching evals or pushing to hub.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix WEKA_CLUSTERS import in submit_eval_jobs.py

WEKA_CLUSTERS is defined in launch_utils, not utils. Import launch_utils
and use launch_utils.WEKA_CLUSTERS instead of utils.WEKA_CLUSTERS.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update GRPO single GPU script to use DPO-trained model

Use the DPO-trained OLMo model from allenai/open_instruct_dev with
revision dpo_olmo_core_debug_test instead of Qwen/Qwen3-1.7B.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add --add_bos flag for OLMo model in GRPO script

OLMo models require the --add_bos flag to be set.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Copy original HF config when saving DPO model

The save_hf_model() function creates an incorrect config.json with
wrong values for num_hidden_layers, eos_token_id, etc. Copy the
original model's config.json to preserve the correct values.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use Weka path directly for DPO model in GRPO test

The HuggingFace model config was still incorrect, so use the
Weka path directly where the model was saved.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add logging for config.json save in DPO

Helps debug issues with model config not being saved correctly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update GRPO script to use new DPO model path

Use the latest DPO model that was saved with correct config.json.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix DPO HF model saving to use correct layer count

The save_hf_model function from olmo-core was creating extra layers
in the output. Instead, use convert_state_to_hf with the original
HuggingFace config and save using transformers' native save_pretrained.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix OLMo-2-0425-1B config mapping to use correct layer count

The olmo2_1B config has 18 layers but the actual HuggingFace model
has 16 layers. Use olmo2_1B_v2 which has the correct 16 layers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix HF model loading to use from_config instead of from_pretrained

Cannot pass state_dict together with a model name. Use from_config
to create the model, then load_state_dict to load the weights.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Revert to using save_hf_model for DPO model saving

The convert_state_to_hf approach doesn't work with DTensors from
distributed training. Use save_hf_model which handles DTensors
properly. The config mapping has been fixed so save_hf_model should
now produce correct layer counts.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update GRPO script to use DPO model with correct 16 layers

Use the model saved from the DPO run with fixed config mapping.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Copy original HF config after save_hf_model

The save_hf_model function creates an incomplete config.json that
is missing fields like max_position_embeddings. Copy the original
model's config to ensure vLLM can load the model.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update GRPO script to use DPO model with complete config

Use the model from the DPO run with copied original config
that includes max_position_embeddings.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add OLMo3-7B DPO script using OLMo-core trainer

New script that uses dpo.py (OLMo-core + FSDP) instead of
dpo_tune_cache.py (Accelerate + DeepSpeed) for DPO training.
Configured for 2 nodes with 8k sequence length.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add documentation for adding OLMo-core models

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add --no_auto_dataset_cache to DPO script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix multi-node torchrun configuration for DPO

Add missing torchrun multi-node parameters:
- --nnodes to specify total number of nodes
- --node_rank for each node's rank
- --master_addr for coordinator address
- --master_port for coordinator port

These use Beaker environment variables that get substituted at runtime.
Without these, each node ran independently without distributed communication.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix nnodes to use hardcoded value instead of BEAKER_NUM_REPLICAS

BEAKER_NUM_REPLICAS is not a valid Beaker environment variable.
Use hardcoded value of 2 to match --num_nodes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add torchrun multi-node parameters to debug DPO multi_node.sh

Same fix as 7b_instruct_dpo_olmo_core.sh - add nnodes, node_rank,
master_addr, and master_port for proper multi-node coordination.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add OLMO_SHARED_FS=1 env var for multi-node DPO scripts

OLMo-core's checkpointing code requires this env var to be set when
using a shared filesystem (like Weka) to avoid unnecessary distributed
coordination for filesystem operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add comment about cache cleanup for corrupted dataset cache

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove cache cleanup comment

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Support separate model config and weights for OLMo-core DPO

Allow users to specify a config_name separately from model_name_or_path,
enabling local model paths to work with OLMo-core DPO training.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix save_hf_model for FSDP-wrapped models in DPO

Add export_to_hf() function that builds an unwrapped model from config
and loads the FSDP state dict before saving. This avoids the type check
failure in olmo-core's get_hf_config() for FSDP-wrapped models.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix DTensor to Tensor conversion in export_to_hf

Convert DTensors from FSDP state dict to regular CPU tensors before
loading into the unwrapped model.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix FSDP state_dict collective operation for multi-node export

All ranks must participate in model.state_dict() as it's a collective
operation for FSDP models. Only rank 0 now saves to disk.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add detailed logging to export_to_hf for debugging

Log entry/exit for all ranks and each step in the export process.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix DTensor full_tensor() collective operation in export

The full_tensor() call on DTensors is a collective operation that requires
all ranks to participate. Move the conversion outside the is_main_process
check so all ranks call full_tensor().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Clean up debug logging in export_to_hf

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix missing indices in DPO reference logprobs caching

Add drop_last parameter to HFDataLoader. When drop_last=False, pad the
remainder with repeated indices to fill a complete batch, ensuring all
dataset indices are processed. Use drop_last=False for the cache-building
dataloader to prevent -inf values in the reference logprobs cache.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add MFU/memory/token metrics to cache building + 3x cache batch size

Forward-only cache pass doesn't store activations, so we can use 3x
the training batch size. Also display avg_tok/ex, MFU%, and mem_GB
in the tqdm progress bar during cache building.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add --cache_logprobs_only flag for DPO cache forward-pass benchmarking

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update DPO cache benchmark to match production OLMo3-7B config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Now, we avoid the torch warning

* 6x cache batch size + mem% in DPO cache tqdm

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Reduce cache batch multiplier to 4x (6x OOMed)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Try unsharded cache build, fall back to FSDP on OOM

The DPO reference logprobs cache is forward-only (no backward pass), so
the full unsharded model may fit in GPU memory and avoids allgather
communication overhead. If it OOMs, we catch the error, clear the CUDA
cache, apply FSDP, and retry.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix data loader tests that used single_example_collator with batch_size > 1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix attn_backend auto-detection: check flash_attn_3 availability

The auto-detection was selecting flash_3 for H100 GPUs without checking
if the package is actually installed, causing RuntimeError on startup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* added export to HF function

* Added script to convert olmo core to HF format.

* Add example usage to olmo-core to HF conversion script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix code review issues in convert_olmo_core_to_hf.py

- Use logger instead of print for output
- Remove unused model.load_state_dict() call

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>