This is a semi-automatically generated list of experiment issues and reports. It is periodically updated by a script, and then curated by hand.
This page includes only experiments that have at least one run or report.
- Try deepening the cooldown of "monumental-jellyfish" (tootsie 8b cooldown 1) to see if it improves SFT
    - not-quite-so-deep cooldown (Spoonbill)
        - GitHub Issue #916
        - WandB Report
        - Data Browser
        - Conclusion: Exploding logits in deep parts of the cooldown can be mitigated by z-loss (see the sketch below).
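        Z-loss penalizes the squared log-partition function of the softmax, which keeps the logit scale from drifting upward during long cooldowns. A minimal JAX sketch (names and coefficient are illustrative; Levanter's actual implementation differs in details):

        ```python
        import jax.numpy as jnp
        from jax.scipy.special import logsumexp

        def cross_entropy_with_z_loss(logits, targets, z_loss_weight=1e-4):
            # log Z = logsumexp over the vocabulary; penalizing (log Z)^2 keeps
            # logits near a normalized regime without changing their argmax.
            log_z = logsumexp(logits, axis=-1)            # [batch]
            log_probs = logits - log_z[..., None]         # [batch, vocab]
            nll = -jnp.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
            return jnp.mean(nll + z_loss_weight * jnp.square(log_z))
        ```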
- Tootsie Phoenix Cooldown (sensible-starling)
    - Pick Tokenizer type
        - GitHub Issue #524
        - WandB Report
        - Conclusion: The Llama 3 tokenizer is the best (see the fertility sketch below).
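        One quick proxy for comparing tokenizers is fertility (tokens per byte): fewer tokens per byte means the model sees more text per step. A hypothetical sketch, not the evaluation from the linked report (gated models require HF authentication; `sample.txt` is any text file of your choosing):

        ```python
        from transformers import AutoTokenizer  # pip install transformers

        text = open("sample.txt").read()
        for name in ["meta-llama/Meta-Llama-3-8B", "EleutherAI/gpt-neox-20b"]:
            tok = AutoTokenizer.from_pretrained(name)
            n_tokens = len(tok(text)["input_ids"])
            print(f"{name}: {n_tokens / len(text.encode('utf-8')):.3f} tokens/byte")
        ```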
    - Default z-loss?
        - GitHub Issue #935
        - WandB Report
        - Conclusion: z-loss seems not harmful. We'll use it.
    - Figuring out learning rate schedule!
        - GitHub Issue #764
        - WandB Report
        - Conclusion: Cosine is best, and a high LR is important; WSD isn't terrible. (A schedule sketch follows below.)
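        For reference, a warmup-then-cosine schedule in optax; the values are placeholders, not the run's settings:

        ```python
        import optax

        schedule = optax.warmup_cosine_decay_schedule(
            init_value=0.0,
            peak_value=3e-3,      # placeholder; "high LR is important"
            warmup_steps=1_000,
            decay_steps=100_000,  # total steps; cosine decays toward end_value
            end_value=3e-4,
        )
        optimizer = optax.adamw(learning_rate=schedule, weight_decay=0.1)
        ```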
- Mixture of Experts
    - Hybrid Norm and Input Embedding Norm
    - OLMoE replication - MoE vs dense
        - GitHub Issue #1183
        - WandB Report
        - Conclusion: Despite having a lower MFU, the MoE outperforms a similarly sized dense model in both training and evaluation. (A minimal routing sketch follows below.)
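        The core of an OLMoE-style layer is a learned top-k router that sends each token to a few expert MLPs. A minimal JAX sketch (shapes and parameter names are illustrative):

        ```python
        import jax
        import jax.numpy as jnp

        def moe_layer(params, x, k=2):
            # x: [tokens, d_model]. Score every expert per token, keep the top-k,
            # and combine their outputs with renormalized gate weights.
            gates = jax.nn.softmax(x @ params["router"])   # [tokens, n_experts]
            weights, idx = jax.lax.top_k(gates, k)         # both [tokens, k]
            weights = weights / weights.sum(axis=-1, keepdims=True)

            def token_out(t, ids, ws):
                out = jnp.zeros_like(t)
                for j in range(k):  # unrolled; k is a small constant
                    w_in = params["w_in"][ids[j]]          # [d_model, d_ff]
                    w_out = params["w_out"][ids[j]]        # [d_ff, d_model]
                    out += ws[j] * (jax.nn.gelu(t @ w_in) @ w_out)
                return out

            # vmap over tokens for clarity; real kernels use capacity-based dispatch.
            return jax.vmap(token_out)(x, idx, weights)
        ```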
- INT8 training in Levanter
    - GitHub Issue #620
    - WandB Report
    - Conclusion: Int8 training is much faster on the right hardware, but may yield worse time-to-loss except in the early stages of training. (See the sketch below.)
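    For intuition, the basic int8 trick is to quantize both operands of a matmul, multiply in int8 with int32 accumulation, and dequantize. A minimal sketch of the forward pass only (production implementations, e.g. AQT, also handle gradients and calibration):

    ```python
    import jax
    import jax.numpy as jnp

    def quantize_int8(x, axis):
        # Symmetric per-axis quantization: the max magnitude maps to 127.
        scale = jnp.maximum(jnp.max(jnp.abs(x), axis=axis, keepdims=True), 1e-8) / 127.0
        q = jnp.clip(jnp.round(x / scale), -127, 127).astype(jnp.int8)
        return q, scale

    def int8_matmul(x, w):
        # Quantize activations per-row and weights per-column, accumulate in
        # int32, then rescale back to the original floating-point dtype.
        xq, xs = quantize_int8(x, axis=1)
        wq, ws = quantize_int8(w, axis=0)
        acc = jax.lax.dot(xq, wq, preferred_element_type=jnp.int32)
        return acc.astype(x.dtype) * xs * ws
    ```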
- MuP for scaling laws
    - GitHub Issue #621
    - WandB Report
    - Conclusion: Not worth it compared to our heuristic version. (A toy sketch of the muP rules follows below.)
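    For context, muP transfers hyperparameters tuned at a small base width to larger widths by rescaling per-layer learning rates and multipliers. A toy sketch of the Adam-style rules (illustrative values only; not Levanter's heuristic):

    ```python
    # Tune once at base_width, then transfer to any width.
    base_width, width = 256, 4096
    mult = width / base_width

    base_lr = 1e-2              # tuned at base_width (placeholder value)
    hidden_lr = base_lr / mult  # hidden-layer LR shrinks with width (Adam rule)
    embed_lr = base_lr          # embedding LR stays fixed
    logit_scale = 1.0 / mult    # output logits are scaled down by width
    ```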
- Try out different remat strategies to get the 70b working on fewer slices
    - GitHub Issue #906
    - WandB Report
    - Data Browser
    - Conclusion: Substantial performance hit, but helpful. Still needs iteration. (See the remat sketch below.)
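    Rematerialization (activation checkpointing) recomputes activations during the backward pass instead of storing them, trading FLOPs for memory. In JAX this is `jax.checkpoint`; the block below is a generic illustration, not the 70b configuration:

    ```python
    import jax
    import jax.numpy as jnp

    @jax.checkpoint  # alias: jax.remat; inner activations are recomputed on backward
    def mlp_block(w1, w2, x):
        return jnp.dot(jax.nn.gelu(jnp.dot(x, w1)), w2)

    # Policies control the trade-off more finely, e.g. save only matmul outputs:
    mlp_save_dots = jax.checkpoint(
        mlp_block, policy=jax.checkpoint_policies.checkpoint_dots
    )
    ```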
- Fantastic Pretraining Optimizers and Where to Find Them
- Qwen (QK Norm) Speedruns
- Ablations on Cooldown for Markdownified Wikipedia
    - GitHub Issue #845
    - WandB Report
    - Data Browser
    - Conclusion: No major improvement compared to control.
- Ablations on Cooldown for Markdownified Arxiv
    - GitHub Issue #846
    - WandB Report
    - Data Browser
    - Conclusion: No major improvement compared to control.
- Ablations on Cooldown for Markdownified StackExchange
    - GitHub Issue #847
    - WandB Report
    - Data Browser
    - Conclusion: No major improvement compared to control.
- Mixture of Formats Training on Wikipedia and Arxiv
    - GitHub Issue #818
    - WandB Report
    - Conclusion: No major difference observed; switch to @Helw150's annealing setup for evaluations.
- High Quality Many Epochs vs. Low Quality Few Epochs
    - GitHub Issue #636
    - WandB Report
    - Conclusion: There's no data like more data.
- Add MegaMath data and run ablation
- OpenWebMath Crawl Annealing Experiments
- Fineweb-Edu Crawl Annealing Experiments
- Stack Exchange Quality Classifier
    - GitHub Issue #596
    - WandB Report
    - Conclusion: Seems to lead to better loss than using Reddit ELI5 or OpenHermes. (A classifier sketch follows below.)
    - NOTE: this seems like a loose end; we should pursue it further.
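    The issue does not pin down the classifier, but a common recipe for this kind of quality filter is a fastText model trained with the high-quality source as positives and random web text as negatives; hypothetically:

    ```python
    import fasttext  # pip install fasttext

    # train.txt has one labeled example per line, e.g.
    #   __label__hq <text drawn from Stack Exchange>
    #   __label__lq <text sampled from raw web pages>
    model = fasttext.train_supervised(
        input="train.txt", lr=0.1, epoch=5, wordNgrams=2, minCount=3
    )

    labels, probs = model.predict("How do I invert a matrix in numpy?")
    ```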
- Cascading Quality Filters
- Compare HTML -> text methods
    - GitHub Issue #246
    - WandB Report
    - Data Browser
    - Conclusion: Some amount of format preservation is helpful for loss on Paloma. (See the extraction sketch below.)
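    To make the contrast concrete, here is a sketch of the two ends of the spectrum: plain-text extraction versus Markdown-preserving extraction (library choice is illustrative, not necessarily what the issue compared):

    ```python
    from bs4 import BeautifulSoup        # pip install beautifulsoup4
    from markdownify import markdownify  # pip install markdownify

    html = "<h1>Title</h1><p>Some <b>bold</b> text.</p><ul><li>an item</li></ul>"

    # Plain text: all document structure is lost.
    plain = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

    # Markdown: headings, emphasis, and lists survive ("format preservation").
    md = markdownify(html, heading_style="ATX")
    ```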
- Wikipedia Training Runs with DOLMA source substitution
- Ar5iv Training Runs with DOLMA source substitution
- Markdownification Processing Report
- Reproduce OLMo 2 SFT
- Create Mixture from All SFT Datasets
- Add Llama-Nemotron & OpenThoughts3 to the post-training dataset
- SFT on further cooled-down tootsie checkpoints
- SFT Deeper Starling
- Scaling laws to predict tootsie performance
    - Optimizer Scaling Law Part 1: AdamW
        - GitHub Issue #725
        - WandB Report
        - Conclusion: After sweeping, we found that the (near-)optimal AdamW hyperparameters remain surprisingly stable across three settings. (A sweep sketch follows below.)
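        For illustration, a hyperparameter sweep might enumerate a small grid per model scale; the grid below is hypothetical, not the one from the issue:

        ```python
        import itertools

        grid = {
            "learning_rate": [1e-3, 3e-3, 1e-2],
            "b1": [0.9, 0.95],
            "b2": [0.95, 0.98],
            "weight_decay": [0.0, 0.1],
        }
        # One config dict per grid point; each would be launched as a training run.
        configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
        ```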
    - Verify scaling batch size widens the gap between Muon & AdamW
(This is for experiments that have been added via the script but have not yet been curated.)