Add OLMo-core based DPO training module #1391
Conversation
Summary of Changes

Hello @finbarrtimbers, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the training capabilities by integrating Direct Preference Optimization (DPO) directly into the OLMo-core framework. It introduces a dedicated DPO training module that leverages OLMo-core's distributed training infrastructure, including HSDP, to enable scalable and efficient fine-tuning of large language models using preference data. The changes streamline the process of applying DPO by providing specialized utilities for reference log probability caching and adapting model forward passes for the OLMo-core architecture.
Code Review
This pull request introduces a new DPO (Direct Preference Optimization) training module that leverages OLMo-core's native training infrastructure, including its TrainModule and HSDP support. The changes are well-structured, integrating new utility functions for reference log-probability caching and loss computation tailored for OLMo-core models. The accompanying debug scripts have been updated to reflect the new torchrun-based launch mechanism and OLMo-core-specific model configurations. The implementation appears robust and correctly handles distributed training aspects.
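For context, the loss being computed is the standard DPO objective; a generic reference sketch (not this PR's exact signatures), where the reference logprobs are exactly what the cache mentioned above stores:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,    # served from the reference logprobs cache
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```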
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a43a450fd7
device_name = utils.get_device_name(torch.cuda.get_device_name(0))
device_peak_flops = int(utils.GPU_SPECS[device_name]["flops"])
Guard GPU-only device name lookup
This module explicitly falls back to CPU (device = "cpu" when CUDA is unavailable), but it later unconditionally calls torch.cuda.get_device_name(0). On a CPU-only host (or any environment where CUDA isn’t initialized), that call raises and the training run crashes before callbacks are built. If CPU fallback is intentional, this needs a CUDA availability guard or to skip the speed monitor setup when CUDA isn’t available.
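A minimal sketch of the suggested guard, assuming `utils` here is the repo helper module referenced in the snippet above (skipping the speed monitor on CPU is the reviewer's suggestion, not necessarily what the PR adopted):

```python
import torch

def maybe_get_device_peak_flops(utils) -> int | None:
    """Return peak FLOPS for GPU 0, or None on CPU-only hosts."""
    if not torch.cuda.is_available():
        return None  # caller should skip speed-monitor setup entirely
    device_name = utils.get_device_name(torch.cuda.get_device_name(0))
    return int(utils.GPU_SPECS[device_name]["flops"])
```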
def make_disable_adapter_context() -> contextlib.AbstractContextManager:
    if args.use_lora:
        assert isinstance(model, peft.PeftModel)
        return model.disable_adapter()
There was a problem hiding this comment.
--use_lora crashes without LoRA setup
When --use_lora is enabled, the code asserts that the model is already a peft.PeftModel, but this script never applies any LoRA wrapping to the OLMo-core model. As a result, any run that enables --use_lora will immediately assert and abort during reference logprob caching. Either LoRA needs to be applied before this point or the script should error out earlier with a clear “not supported” message.
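A later commit in this PR takes the second route ("Add LoRA error check (not supported with OLMo-core)"); a sketch of such a fail-fast guard, with `args` standing for the parsed CLI namespace:

```python
# Fail fast at startup instead of asserting deep inside reference-logprob
# caching; OLMo-core models are never wrapped in peft.PeftModel here.
if args.use_lora:
    raise ValueError("--use_lora is not supported with the OLMo-core DPO trainer.")
```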
- Add dpo.py: New DPO training module using OLMo-core's TrainModule with HSDP support
- Add build_reference_logprobs_cache_olmo: Generic reference logprobs caching for OLMo-core
- Add compute_loss_olmo: Wrapper for DPO loss computation with ExperimentConfig
- Add concatenated_forward_olmo and separate_forward_olmo: OLMo-core forward functions
- Update mason.py: Add dpo.py to OPEN_INSTRUCT_COMMANDS
- Update debug scripts to use torchrun with OLMo-core models
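For reference, a concatenated forward in DPO stacks the chosen and rejected sequences into one batch so the model runs a single pass; a generic sketch, not the exact concatenated_forward_olmo signature:

```python
import torch

def concatenated_forward(model: torch.nn.Module, chosen_ids: torch.Tensor, rejected_ids: torch.Tensor):
    """One forward over [chosen; rejected] instead of two separate passes.

    Assumes both tensors are already padded to the same length; the real
    functions also thread attention masks and labels through the model.
    """
    batch = torch.cat([chosen_ids, rejected_ids], dim=0)  # (2B, T)
    logits = model(batch)                                 # (2B, T, V)
    return logits.chunk(2, dim=0)                         # chosen, rejected
```

Presumably separate_forward_olmo is the two-pass variant for when the concatenated batch does not fit in memory.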
Force-pushed a43a450 to 69120dd.
Add OLMo-core train modules for DPO training
Change device_peak_flops_per_second to device_peak_flops to match the OLMo-core API.
Set default checkpointing_steps to 500 when not specified, since the OLMo-core API requires save_interval >= 1.
Move the default value for checkpointing_steps (500) from dpo.py to the CheckpointConfig dataclass in dpo_utils.py. This centralizes the default and removes the conditional logic in the callback setup.
The checkpointing_steps field was defined in both CheckpointConfig (the parent class) and ExperimentConfig. The duplicate field in ExperimentConfig had default=None, which overrode the parent class's default of 500, causing a TypeError when int() was called on None in dpo.py.
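The underlying Python behavior in miniature; the class and field names come from the commit message, the bodies are illustrative:

```python
from dataclasses import dataclass

@dataclass
class CheckpointConfig:
    checkpointing_steps: int = 500

@dataclass
class ExperimentConfig(CheckpointConfig):
    # Re-declaring the field shadows the parent's default entirely.
    checkpointing_steps: int | None = None

print(ExperimentConfig().checkpointing_steps)  # None, so int(None) -> TypeError
```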
Add Saturn as an alternative cluster to help with multi-node scheduling reliability.
Experiment Results

Ran single GPU DPO script (Beaker) successfully. Multi-node DPO experiments are experiencing Beaker cluster rendezvous timeout issues (infrastructure-related, not code issues). Will re-run when cluster stability improves.
hamishivi left a comment
Mostly some comments. Some other things:
- Could we add a single-GPU script that runs locally? I tried
  `uv run torchrun --standalone --nproc_per_node=1 open_instruct/dpo.py --model_name_or_path allenai/OLMo-2-0425-1B --tokenizer_name allenai/OLMo-2-0425-1B --use_flash_attn false --max_seq_length 1024 --per_device_train_batch_size 1 --gradient_accumulation_steps 4 --learning_rate 5e-07 --lr_scheduler_type linear --warmup_ratio 0.1 --weight_decay 0.0 --num_epochs 3 --output_dir output/dpo_olmo_core_debug/ --logging_steps 1 --mixer_list allenai/tulu-3-wildchat-reused-on-policy-8b 100 --chat_template_name olmo --seed 123 --try_launch_beaker_eval_jobs false`
  but it (a) errored initially after building the cache and then (b) hung on further training. Just make a beaker image with `beaker://ai2/cuda12.8-dev-ubuntu22.04-notorch` and try running with uv to recreate.
- It looks like the multi-node job ran okay but exited with an error? Is that fixable?
OLMo-core's prepare_training_environment() handles multi-node setup internally using Beaker's environment variables. The explicit --nnodes, --standalone, and --rdzv_backend=c10d arguments interfere with this and cause RendezvousTimeoutError on multi-node runs.
- Move OLMO_MODEL_CONFIG_MAP and get_transformer_config to olmo_core_utils.py
- Add tensor_parallel_degree, context_parallel_degree, pipeline_parallel_degree
- Replace _apply_hsdp with _apply_parallelism supporting TP/CP/PP
- Fix critical bug: apply HSDP before computing reference logprobs cache
- Add LoRA error check (not supported with OLMo-core)
- Remove unreachable make_disable_adapter_context function
- Reorganize DPO scripts to scripts/train/debug/dpo/
- Add local.sh for testing without Beaker
Only the main process should create the cache directory and test write permissions. Other ranks now wait at a barrier until this is complete.
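A sketch of that main-process-creates, everyone-else-waits pattern (helper name and probe file are illustrative):

```python
import os
import torch.distributed as dist

def prepare_cache_dir(cache_dir: str, is_main_process: bool) -> None:
    if is_main_process:
        os.makedirs(cache_dir, exist_ok=True)
        probe = os.path.join(cache_dir, ".write_test")
        with open(probe, "w") as f:
            f.write("ok")  # verify the filesystem is actually writable
        os.remove(probe)
    # All other ranks wait here until rank 0 has finished the setup above.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```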
DPO Experiment Results

Ran single GPU DPO script (Beaker) and multi-node DPO (Beaker) scripts. Results:
Changes in this commit:
Two barrier issues caused "Connection closed by peer" gloo errors during post-training cleanup:

1. Unconditional barrier at the start of _handle_post_training, called even when distributed training wasn't active.
2. Asymmetric barrier inside the beaker save conditional: only main_process reached this code due to the is_main_process check, causing non-main processes to hang at the barrier while main does file I/O.

Fix: Gate the initial barrier on is_distributed() and remove the asymmetric inner barrier entirely, since only main_process enters that code block anyway.
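The shape of the described fix (function and helper names from the commit message; the body is a sketch):

```python
import torch.distributed as dist

def _handle_post_training(is_main_process: bool) -> None:
    # Fix 1: barrier only when a process group actually exists.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
    if is_main_process:
        # Fix 2: no barrier inside this block. Non-main ranks never enter it,
        # so a barrier here would hang them while rank 0 does file I/O.
        pass  # beaker save / upload logic lives here in the real code
```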
Fixed the multi-node job and added a local script!
Log entry/exit for all ranks and each step in the export process.
The full_tensor() call on DTensors is a collective operation that requires all ranks to participate. Move the conversion outside the is_main_process check so all ranks call full_tensor().
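A sketch of the corrected export loop; the key point is that `full_tensor()` stays outside the main-process check (the DTensor import path is `torch.distributed.tensor` on recent PyTorch):

```python
import torch
from torch.distributed.tensor import DTensor  # torch.distributed._tensor on older versions

def gather_full_state(model: torch.nn.Module, is_main_process: bool) -> dict:
    full_state = {}
    for name, param in model.state_dict().items():
        if isinstance(param, DTensor):
            # Collective: every rank must call full_tensor().
            param = param.full_tensor()
        if is_main_process:
            full_state[name] = param.cpu()  # only rank 0 keeps the result
    return full_state
```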
Clean up debug logging in export_to_hf
Add drop_last parameter to HFDataLoader. When drop_last=False, pad the remainder with repeated indices to fill a complete batch, ensuring all dataset indices are processed. Use drop_last=False for the cache-building dataloader to prevent -inf values in the reference logprobs cache.
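The index math described above, in isolation (HFDataLoader internals are assumed; this shows only the drop_last handling):

```python
def build_batches(num_examples: int, batch_size: int, drop_last: bool) -> list[list[int]]:
    indices = list(range(num_examples))
    batches = [indices[i : i + batch_size] for i in range(0, num_examples, batch_size)]
    if batches and len(batches[-1]) < batch_size:
        if drop_last:
            batches.pop()
        else:
            # Pad with repeated indices so every example gets a cached logprob;
            # duplicates just recompute (and overwrite) identical values.
            last = batches[-1]
            last += indices[: batch_size - len(last)]
    return batches

# 10 examples, batch_size=4, drop_last=False:
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 1]] -- indices 8 and 9 are no longer lost
```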
Forward-only cache pass doesn't store activations, so we can use 3x the training batch size. Also display avg_tok/ex, MFU%, and mem_GB in the tqdm progress bar during cache building.
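Why a larger batch is safe here: the cache pass runs under no_grad, so no activations are retained for backward. A minimal sketch with hypothetical loader and field names:

```python
import torch

CACHE_BATCH_MULTIPLIER = 3  # forward-only pass, so ~3x the training batch fits

@torch.no_grad()  # nothing is retained for a backward pass
def build_reference_logprobs(model: torch.nn.Module, cache_loader) -> dict[int, torch.Tensor]:
    """Cache per-example reference logprobs; loader batches are dicts with
    'input_ids' and 'indices' (hypothetical field names)."""
    model.eval()
    cache: dict[int, torch.Tensor] = {}
    for batch in cache_loader:  # batch size = train batch size * multiplier
        logits = model(batch["input_ids"])  # stand-in for the real forward
        logprobs = torch.log_softmax(logits, dim=-1)
        # The real code gathers logprobs of the target tokens; we store a
        # per-example summary to keep the sketch short.
        for idx, lp in zip(batch["indices"], logprobs):
            cache[int(idx)] = lp.sum()
    return cache
```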
Add --cache_logprobs_only flag for DPO cache forward-pass benchmarking
Update DPO cache benchmark to match production OLMo3-7B config
6x cache batch size + mem% in DPO cache tqdm
Reduce cache batch multiplier to 4x (6x OOMed)
The DPO reference logprobs cache is forward-only (no backward pass), so the full unsharded model may fit in GPU memory and avoids allgather communication overhead. If it OOMs, we catch the error, clear the CUDA cache, apply FSDP, and retry.
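The retry logic in outline; `build_cache` and `apply_fsdp` are hypothetical stand-ins for the real helpers:

```python
import torch

def cache_with_oom_fallback(model, build_cache, apply_fsdp):
    try:
        # Forward-only pass: the full unsharded model may fit in memory,
        # and skipping FSDP avoids per-layer allgather traffic.
        return build_cache(model)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release the partial allocation first
        return build_cache(apply_fsdp(model))  # shard parameters, retry once
```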
Fix data loader tests that used single_example_collator with batch_size > 1
The auto-detection was selecting flash_3 for H100 GPUs without checking if the package is actually installed, causing RuntimeError on startup.
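The fixed detection presumably looks something like this; the `flash_attn_3` module name and backend strings are taken from the commit message, and the capability check is an assumption:

```python
import importlib.util
import torch

def pick_attn_backend() -> str:
    if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 9:
        # Hopper (H100): prefer FlashAttention 3, but only when it's installed.
        if importlib.util.find_spec("flash_attn_3") is not None:
            return "flash_3"
    return "flash_2"  # fallback backend name is an assumption
```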
Add example usage to olmo-core to HF conversion script
- Use logger instead of print for output
- Remove unused model.load_state_dict() call
Squashed commit message:

* Add OLMo-core based DPO training module:
  - Add dpo.py: New DPO training module using OLMo-core's TrainModule with HSDP support
  - Add build_reference_logprobs_cache_olmo: Generic reference logprobs caching for OLMo-core
  - Add compute_loss_olmo: Wrapper for DPO loss computation with ExperimentConfig
  - Add concatenated_forward_olmo and separate_forward_olmo: OLMo-core forward functions
  - Update mason.py: Add dpo.py to OPEN_INSTRUCT_COMMANDS
  - Update debug scripts to use torchrun with OLMo-core models
* Cleaned up PR.
* Add OLMo-core train modules for DPO training
* Fix SpeedMonitorCallback parameter name: Change device_peak_flops_per_second to device_peak_flops to match the OLMo-core API.
* Fix CheckpointerCallback save_interval validation: Set default checkpointing_steps to 500 when not specified, since the OLMo-core API requires save_interval >= 1.
* Move checkpointing_steps default value to config class: Move the default value for checkpointing_steps (500) from dpo.py to the CheckpointConfig dataclass in dpo_utils.py. This centralizes the default and removes the conditional logic in the callback setup.
* Remove duplicate checkpointing_steps field from ExperimentConfig: The checkpointing_steps field was defined in both CheckpointConfig (the parent class) and ExperimentConfig. The duplicate field in ExperimentConfig had default=None, which overrode the parent class's default of 500, causing a TypeError when int() was called on None in dpo.py.
* Add Saturn cluster to medium_dpo.sh script: Add Saturn as an alternative cluster to help with multi-node scheduling reliability.
* updated changelog
* Remove explicit torchrun multi-node args from DPO scripts: OLMo-core's prepare_training_environment() handles multi-node setup internally using Beaker's environment variables. The explicit --nnodes, --standalone, and --rdzv_backend=c10d arguments interfere with this and cause RendezvousTimeoutError on multi-node runs.
* fixed linter errors
* Refactor DPO OLMo-core: add parallelism support, fix HSDP order:
  - Move OLMO_MODEL_CONFIG_MAP and get_transformer_config to olmo_core_utils.py
  - Add tensor_parallel_degree, context_parallel_degree, pipeline_parallel_degree
  - Replace _apply_hsdp with _apply_parallelism supporting TP/CP/PP
  - Fix critical bug: apply HSDP before computing reference logprobs cache
  - Add LoRA error check (not supported with OLMo-core)
  - Remove unreachable make_disable_adapter_context function
  - Reorganize DPO scripts to scripts/train/debug/dpo/
  - Add local.sh for testing without Beaker
* Fix race condition in reference logprobs cache directory creation: Only the main process should create the cache directory and test write permissions. Other ranks now wait at a barrier until this is complete.
* Fix multi-node DPO post-training barrier failures: Two barrier issues caused "Connection closed by peer" gloo errors during post-training cleanup: (1) an unconditional barrier at the start of _handle_post_training, called even when distributed training wasn't active; (2) an asymmetric barrier inside the beaker save conditional, where only main_process reached this code due to the is_main_process check, causing non-main processes to hang at the barrier while main does file I/O. Fix: Gate the initial barrier on is_distributed() and remove the asymmetric inner barrier entirely, since only main_process enters that code block anyway.
* Remove redundant compute_loss_olmo wrapper function: ExperimentConfig inherits from DPOConfig, so compute_loss() accepts ExperimentConfig directly. The wrapper was unnecessarily creating a new DPOConfig object when one wasn't needed.
* run urgent tests
* Fix case-insensitive beaker secret lookup: Beaker stores secret names case-insensitively, but Python's `in` operator is case-sensitive. This caused lookups for `finbarrt_WANDB_API_KEY` to fail when the secret was stored as `FINBARRT_WANDB_API_KEY`.
* Updated mason.py
* Add uv run prefix to local DPO script
* Save DPO models in HuggingFace format for evals: DPO training was saving models in olmo-core format, but eval jobs and push_folder_to_hub expect HuggingFace format. Use olmo-core's save_hf_model() to convert the trained model to HF format in output_dir/hf_model/ before launching evals or pushing to hub.
* Fix WEKA_CLUSTERS import in submit_eval_jobs.py: WEKA_CLUSTERS is defined in launch_utils, not utils. Import launch_utils and use launch_utils.WEKA_CLUSTERS instead of utils.WEKA_CLUSTERS.
* Update GRPO single GPU script to use DPO-trained model: Use the DPO-trained OLMo model from allenai/open_instruct_dev with revision dpo_olmo_core_debug_test instead of Qwen/Qwen3-1.7B.
* Add --add_bos flag for OLMo model in GRPO script: OLMo models require the --add_bos flag to be set.
* Copy original HF config when saving DPO model: The save_hf_model() function creates an incorrect config.json with wrong values for num_hidden_layers, eos_token_id, etc. Copy the original model's config.json to preserve the correct values.
* Use Weka path directly for DPO model in GRPO test: The HuggingFace model config was still incorrect, so use the Weka path directly where the model was saved.
* Add logging for config.json save in DPO: Helps debug issues with model config not being saved correctly.
* Update GRPO script to use new DPO model path: Use the latest DPO model that was saved with correct config.json.
* Fix DPO HF model saving to use correct layer count: The save_hf_model function from olmo-core was creating extra layers in the output. Instead, use convert_state_to_hf with the original HuggingFace config and save using transformers' native save_pretrained.
* Fix OLMo-2-0425-1B config mapping to use correct layer count: The olmo2_1B config has 18 layers but the actual HuggingFace model has 16 layers. Use olmo2_1B_v2, which has the correct 16 layers.
* Fix HF model loading to use from_config instead of from_pretrained: Cannot pass state_dict together with a model name. Use from_config to create the model, then load_state_dict to load the weights.
* Revert to using save_hf_model for DPO model saving: The convert_state_to_hf approach doesn't work with DTensors from distributed training. Use save_hf_model, which handles DTensors properly. The config mapping has been fixed, so save_hf_model should now produce correct layer counts.
* Update GRPO script to use DPO model with correct 16 layers: Use the model saved from the DPO run with fixed config mapping.
* Copy original HF config after save_hf_model: The save_hf_model function creates an incomplete config.json that is missing fields like max_position_embeddings. Copy the original model's config to ensure vLLM can load the model.
* Update GRPO script to use DPO model with complete config: Use the model from the DPO run with copied original config that includes max_position_embeddings.
* Add OLMo3-7B DPO script using OLMo-core trainer: New script that uses dpo.py (OLMo-core + FSDP) instead of dpo_tune_cache.py (Accelerate + DeepSpeed) for DPO training. Configured for 2 nodes with 8k sequence length.
* Add documentation for adding OLMo-core models
* Add --no_auto_dataset_cache to DPO script
* Fix multi-node torchrun configuration for DPO: Add missing torchrun multi-node parameters:
  - --nnodes to specify total number of nodes
  - --node_rank for each node's rank
  - --master_addr for coordinator address
  - --master_port for coordinator port
  These use Beaker environment variables that get substituted at runtime. Without these, each node ran independently without distributed communication.
* Fix nnodes to use hardcoded value instead of BEAKER_NUM_REPLICAS: BEAKER_NUM_REPLICAS is not a valid Beaker environment variable. Use a hardcoded value of 2 to match --num_nodes.
* Add torchrun multi-node parameters to debug DPO multi_node.sh: Same fix as 7b_instruct_dpo_olmo_core.sh; add nnodes, node_rank, master_addr, and master_port for proper multi-node coordination.
* Add OLMO_SHARED_FS=1 env var for multi-node DPO scripts: OLMo-core's checkpointing code requires this env var to be set when using a shared filesystem (like Weka) to avoid unnecessary distributed coordination for filesystem operations.
* Add comment about cache cleanup for corrupted dataset cache
* Remove cache cleanup comment
* Support separate model config and weights for OLMo-core DPO: Allow users to specify a config_name separately from model_name_or_path, enabling local model paths to work with OLMo-core DPO training.
* Fix save_hf_model for FSDP-wrapped models in DPO: Add export_to_hf() function that builds an unwrapped model from config and loads the FSDP state dict before saving. This avoids the type check failure in olmo-core's get_hf_config() for FSDP-wrapped models.
* Fix DTensor to Tensor conversion in export_to_hf: Convert DTensors from the FSDP state dict to regular CPU tensors before loading into the unwrapped model.
* Fix FSDP state_dict collective operation for multi-node export: All ranks must participate in model.state_dict() as it's a collective operation for FSDP models. Only rank 0 now saves to disk.
* Add detailed logging to export_to_hf for debugging: Log entry/exit for all ranks and each step in the export process.
* Fix DTensor full_tensor() collective operation in export: The full_tensor() call on DTensors is a collective operation that requires all ranks to participate. Move the conversion outside the is_main_process check so all ranks call full_tensor().
* Clean up debug logging in export_to_hf
* Fix missing indices in DPO reference logprobs caching: Add drop_last parameter to HFDataLoader. When drop_last=False, pad the remainder with repeated indices to fill a complete batch, ensuring all dataset indices are processed. Use drop_last=False for the cache-building dataloader to prevent -inf values in the reference logprobs cache.
* Add MFU/memory/token metrics to cache building + 3x cache batch size: Forward-only cache pass doesn't store activations, so we can use 3x the training batch size. Also display avg_tok/ex, MFU%, and mem_GB in the tqdm progress bar during cache building.
* Add --cache_logprobs_only flag for DPO cache forward-pass benchmarking
* Update DPO cache benchmark to match production OLMo3-7B config
* Now, we avoid the torch warning
* 6x cache batch size + mem% in DPO cache tqdm
* Reduce cache batch multiplier to 4x (6x OOMed)
* Try unsharded cache build, fall back to FSDP on OOM: The DPO reference logprobs cache is forward-only (no backward pass), so the full unsharded model may fit in GPU memory and avoids allgather communication overhead. If it OOMs, we catch the error, clear the CUDA cache, apply FSDP, and retry.
* Fix data loader tests that used single_example_collator with batch_size > 1
* Fix attn_backend auto-detection (check flash_attn_3 availability): The auto-detection was selecting flash_3 for H100 GPUs without checking if the package is actually installed, causing RuntimeError on startup.
* added export to HF function
* Added script to convert olmo core to HF format.
* Add example usage to olmo-core to HF conversion script
* Fix code review issues in convert_olmo_core_to_hf.py:
  - Use logger instead of print for output
  - Remove unused model.load_state_dict() call

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Adds an OLMo-core based DPO implementation in dpo.py.

Runs: GPU_TESTS=01KFH82YPAC7ECB4XXEHH10TS7