❓ The question
Hi! I'm trying to understand the differences between these two models. Looking at the HF configs and model pages, I found the following differences:
- Training process: 1 stage vs. 2 stages, and different Dolma versions, v1_5 vs. v1_7
- Training context length: 2048 vs. 4096
- Embedding weight tying: tied vs. untied
- clip_qkv: null vs. 8.0
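For reference, this is roughly how I compared the configs (a sketch; the model ids in the comment are my assumption about which two checkpoints are being compared, and the helper below just diffs plain config dicts):

```python
def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}

# Example with the fields mentioned above (values taken from the two HF configs):
old = {"max_position_embeddings": 2048, "tie_word_embeddings": True, "clip_qkv": None}
new = {"max_position_embeddings": 4096, "tie_word_embeddings": False, "clip_qkv": 8.0}
print(diff_configs(old, new))

# With transformers installed, the real configs could be loaded with e.g.
# AutoConfig.from_pretrained(model_id).to_dict() for each model and diffed
# the same way.
```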
I also looked at the config at configs/official-0724/OLMo-1B.yaml, but I can't figure out whether it corresponds to the 0724 version or the original one. Are there training runs/exact configs available for these exact models?
Thanks for any info!