This is a semi-automatically generated list of experiment issues and reports. It is periodically updated by a script, and then curated by hand.
This page includes only experiments that have at least one run or report.
- Try deepening the cooldown of "monumental-jellyfish" (tootsie 8b cooldown 1) to see if it improves SFT
    - not-quite-so-deep cooldown (Spoonbill)
        - GitHub Issue #916
        - WandB Report
        - Data Browser
        - Conclusion: Exploding logits in deep parts of the cooldown can be mitigated by z-loss (see the sketch below).
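        Z-loss penalizes the squared log-partition function of the softmax, which keeps the logit scale from drifting upward during long cooldowns. A minimal JAX sketch (names and coefficient are illustrative; Levanter's actual implementation differs in details):

        ```python
        import jax.numpy as jnp
        from jax.scipy.special import logsumexp

        def cross_entropy_with_z_loss(logits, targets, z_loss_weight=1e-4):
            # log Z = logsumexp over the vocabulary; penalizing (log Z)^2 keeps
            # logits near a normalized regime without changing their argmax.
            log_z = logsumexp(logits, axis=-1)            # [batch]
            log_probs = logits - log_z[..., None]         # [batch, vocab]
            nll = -jnp.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
            return jnp.mean(nll + z_loss_weight * jnp.square(log_z))
        ```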
- Tootsie Phoenix Cooldown (sensible-starling)
    - Pick Tokenizer type
        - GitHub Issue #524
        - WandB Report
        - Conclusion: The Llama 3 tokenizer is the best (see the fertility sketch below).
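        One quick proxy for comparing tokenizers is fertility (tokens per byte): fewer tokens per byte means the model sees more text per step. A hypothetical sketch, not the evaluation from the linked report (gated models require HF authentication; `sample.txt` is any text file of your choosing):

        ```python
        from transformers import AutoTokenizer  # pip install transformers

        text = open("sample.txt").read()
        for name in ["meta-llama/Meta-Llama-3-8B", "EleutherAI/gpt-neox-20b"]:
            tok = AutoTokenizer.from_pretrained(name)
            n_tokens = len(tok(text)["input_ids"])
            print(f"{name}: {n_tokens / len(text.encode('utf-8')):.3f} tokens/byte")
        ```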
    - Default z-loss?
        - GitHub Issue #935
        - WandB Report
        - Conclusion: z-loss seems not harmful. We'll use it.
    - Figuring out learning rate schedule!
        - GitHub Issue #764
        - WandB Report
        - Conclusion: Cosine is best, and a high LR is important; WSD isn't terrible. (A schedule sketch follows below.)
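        For reference, a warmup-then-cosine schedule in optax; the values are placeholders, not the run's settings:

        ```python
        import optax

        schedule = optax.warmup_cosine_decay_schedule(
            init_value=0.0,
            peak_value=3e-3,      # placeholder; "high LR is important"
            warmup_steps=1_000,
            decay_steps=100_000,  # total steps; cosine decays toward end_value
            end_value=3e-4,
        )
        optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.1)
        ```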
- Mixture of Experts
    - Hybrid Norm and Input Embedding Norm
    - OLMoE replication - MoE vs dense
        - GitHub Issue #1183
        - WandB Report
        - Conclusion: Despite having a lower MFU, the MoE outperforms a similarly sized dense model in both training and evaluation. (A minimal routing sketch follows below.)
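        The core of an OLMoE-style layer is a learned top-k router that sends each token to a few expert MLPs. A minimal JAX sketch (shapes and parameter names are illustrative):

        ```python
        import jax
        import jax.numpy as jnp

        def moe_layer(params, x, k=2):
            # x: [tokens, d_model]. Score every expert per token, keep the top-k,
            # and combine their outputs with renormalized gate weights.
            gates = jax.nn.softmax(x @ params["router"])   # [tokens, n_experts]
            weights, idx = jax.lax.top_k(gates, k)         # both [tokens, k]
            weights = weights / weights.sum(axis=-1, keepdims=True)

            def token_out(t, ids, ws):
                out = jnp.zeros_like(t)
                for j in range(k):  # unrolled; k is a small constant
                    w_in = params["w_in"][ids[j]]          # [d_model, d_ff]
                    w_out = params["w_out"][ids[j]]        # [d_ff, d_model]
                    out += ws[j] * (jax.nn.gelu(t @ w_in) @ w_out)
                return out

            # vmap over tokens for clarity; real kernels use capacity-based dispatch.
            return jax.vmap(token_out)(x, idx, weights)
        ```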
- INT8 training in Levanter
    - GitHub Issue #620
    - WandB Report
    - Conclusion: Int8 training is much faster on the right hardware, but may yield worse time-to-loss except in the early stages of training. (See the sketch below.)
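    For intuition, the basic int8 trick is to quantize both operands of a matmul, multiply in int8 with int32 accumulation, and dequantize. A minimal sketch of the forward pass only (production implementations, e.g. AQT, also handle gradients and calibration):

    ```python
    import jax
    import jax.numpy as jnp

    def quantize_int8(x, axis):
        # Symmetric per-axis quantization: the max magnitude maps to 127.
        scale = jnp.maximum(jnp.max(jnp.abs(x), axis=axis, keepdims=True), 1e-8) / 127.0
        q = jnp.clip(jnp.round(x / scale), -127, 127).astype(jnp.int8)
        return q, scale

    def int8_matmul(x, w):
        # Quantize activations per-row and weights per-column, accumulate in
        # int32, then rescale back to the original floating-point dtype.
        xq, xs = quantize_int8(x, axis=1)
        wq, ws = quantize_int8(w, axis=0)
        acc = jax.lax.dot(xq, wq, preferred_element_type=jnp.int32)
        return acc.astype(x.dtype) * xs * ws
    ```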
- MuP for scaling laws
    - GitHub Issue #621
    - WandB Report
    - Conclusion: Not worth it compared to our heuristic version. (A toy sketch of the muP rules follows below.)
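    For context, muP transfers hyperparameters tuned at a small base width to larger widths by rescaling per-layer learning rates and multipliers. A toy sketch of the Adam-style rules (illustrative values only; not Levanter's heuristic):

    ```python
    # Tune once at base_width, then transfer to any width.
    base_width, width = 256, 4096
    mult = width / base_width

    base_lr = 1e-2              # tuned at base_width (placeholder value)
    hidden_lr = base_lr / mult  # hidden-layer LR shrinks with width (Adam rule)
    embed_lr = base_lr          # embedding LR stays fixed
    logit_scale = 1.0 / mult    # output logits are scaled down by width
    ```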
- Try out different remat strategies to get the 70b working on fewer slices
    - GitHub Issue #906
    - WandB Report
    - Data Browser
    - Conclusion: Substantial performance hit, but helpful. Still needs iteration. (See the remat sketch below.)
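    Rematerialization (activation checkpointing) recomputes activations during the backward pass instead of storing them, trading FLOPs for memory. In JAX this is `jax.checkpoint`; the block below is a generic illustration, not the 70b configuration:

    ```python
    import jax
    import jax.numpy as jnp

    @jax.checkpoint  # alias: jax.remat; inner activations are recomputed on backward
    def mlp_block(w1, w2, x):
        return jnp.dot(jax.nn.gelu(jnp.dot(x, w1)), w2)

    # Policies control the trade-off more finely, e.g. save only matmul outputs:
    mlp_save_dots = jax.checkpoint(
        mlp_block, policy=jax.checkpoint_policies.checkpoint_dots
    )
    ```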
- Fantastic Pretraining Optimizers and Where to Find Them
- Qwen (QK Norm) Speedruns
- Ablations on Cooldown for Markdownified Wikipedia
    - GitHub Issue #845
    - WandB Report
    - Data Browser
    - Conclusion: No major improvement compared to control.
- Ablations on Cooldown for Markdownified Arxiv
    - GitHub Issue #846
    - WandB Report
    - Data Browser
    - Conclusion: No major improvement compared to control.
- Ablations on Cooldown for Markdownified StackExchange
    - GitHub Issue #847
    - WandB Report
    - Data Browser
    - Conclusion: No major improvement compared to control.
- Mixture of Formats Training on Wikipedia and Arxiv
    - GitHub Issue #818
    - WandB Report
    - Conclusion: No major difference observed; switch to @Helw150's annealing setup for evaluations.
- High Quality Many Epochs vs. Low Quality Few Epochs
    - GitHub Issue #636
    - WandB Report
    - Conclusion: There's no data like more data.
- Add MegaMath data and run ablation
- OpenWebMath Crawl Annealing Experiments
- Fineweb-Edu Crawl Annealing Experiments
- Stack Exchange Quality Classifier
    - GitHub Issue #596
    - WandB Report
    - Conclusion: Seems to lead to better loss than using Reddit ELI5 or OpenHermes. (A classifier sketch follows below.)
    - NOTE: this seems like a loose end; we should pursue it further.
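    The issue does not pin down the classifier, but a common recipe for this kind of quality filter is a fastText model trained with the high-quality source as positives and random web text as negatives; hypothetically:

    ```python
    import fasttext  # pip install fasttext

    # train.txt has one labeled example per line, e.g.
    #   __label__hq <text drawn from Stack Exchange>
    #   __label__lq <text sampled from raw web pages>
    model = fasttext.train_supervised(
        input="train.txt", lr=0.1, epoch=5, wordNgrams=2, minCount=3
    )

    labels, probs = model.predict("How do I invert a matrix in numpy?")
    ```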
- Cascading Quality Filters
- Compare HTML -> text methods
    - GitHub Issue #246
    - WandB Report
    - Data Browser
    - Conclusion: Some amount of format preservation is helpful for loss on Paloma. (See the extraction sketch below.)
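    To make the contrast concrete, here is a sketch of the two ends of the spectrum: plain-text extraction versus Markdown-preserving extraction (library choice is illustrative, not necessarily what the issue compared):

    ```python
    from bs4 import BeautifulSoup        # pip install beautifulsoup4
    from markdownify import markdownify  # pip install markdownify

    html = "<h1>Title</h1><p>Some <b>bold</b> text.</p><ul><li>an item</li></ul>"

    # Plain text: all document structure is lost.
    plain = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

    # Markdown: headings, emphasis, and lists survive ("format preservation").
    md = markdownify(html, heading_style="ATX")
    ```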
- Wikipedia Training Runs with DOLMA source substitution
- Ar5iv Training Runs with DOLMA source substitution
- Markdownification Processing Report
- Reproduce OLMo 2 SFT
- Create Mixture from All SFT Datasets
- Add Llama-Nemotron & OpenThoughts3 to the post-training dataset
- SFT on further cooled-down tootsie checkpoints
- SFT Deeper Starling
- Scaling laws to predict tootsie performance
    - Optimizer Scaling Law Part 1: AdamW
        - GitHub Issue #725
        - WandB Report
        - Conclusion: After sweeping, we found that the (near-)optimal AdamW hyperparameters remain surprisingly stable across three settings. (A sweep sketch follows below.)
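        For illustration, a hyperparameter sweep might enumerate a small grid per model scale; the grid below is hypothetical, not the one from the issue:

        ```python
        import itertools

        grid = {
            "learning_rate": [1e-3, 3e-3, 1e-2],
            "b1": [0.9, 0.95],
            "b2": [0.95, 0.98],
            "weight_decay": [0.0, 0.1],
        }
        # One config dict per grid point; each would be launched as a training run.
        configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
        ```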
    - Verify scaling batch size widens the gap between Muon & AdamW
(This is for experiments that have been added via the script but have not yet been curated.)