
[WIP] Add mixed precision gradient collection#189

Open
luciaquirke wants to merge 2 commits into main from mixed-prec

Conversation


@luciaquirke luciaquirke commented Mar 11, 2026

  • Always leave optimizers and preconditioners in fp32
  • Use mixed precision fwd-bwd
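The two bullets above can be sketched as follows. This is a minimal illustration of the general pattern, not the PR's actual code: parameters and optimizer state stay fp32, and only the forward/backward runs under bf16 autocast (device_type="cpu" here so the sketch runs anywhere; on GPU you would use "cuda").

```python
import torch
import torch.nn as nn

# Assumed sketch of the scheme described above, not the PR's code:
# parameters and AdamW moments stay fp32; only fwd-bwd is bf16.
model = nn.Linear(16, 4)                              # params are fp32
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)  # fp32 exp_avg / exp_avg_sq

x, y = torch.randn(8, 16), torch.randn(8, 4)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), y)        # matmuls run in bf16

loss.backward()                        # grads land in fp32 (the param dtype)
grad_dtype = model.weight.grad.dtype
opt.step()
opt.zero_grad()
```

Because the leaf parameters are fp32, autograd accumulates their gradients in fp32 regardless of the autocast compute dtype, which is what lets the optimizer and preconditioners stay in full precision.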

Coreset regression testing (last regression: #178)

Full dataset, 8 GPUs (really around 4762 items):

[maybe needs redoing for ebs]
Run: SmolLM2-1.7B-Instruct-magpie-ultra-v0.1-trackstar-n=10000-s=42
Filter: trackstar
Num training examples: 10000
Final eval loss: 0.6909899711608887

Random, 8 GPUs (1k, effective batch size = 32):

Run: SmolLM2-1.7B-Instruct-magpie-ultra-v0.1-random-n=1000-s=42
Filter: random
Num training examples: 1000
Final eval loss: 0.8014830350875854

Attribution, 8 GPUs (1k, effective batch size = 32):

Run: SmolLM2-1.7B-Instruct-magpie-ultra-v0.1-attribution-n=1000-s=42
Filter: attribution
Num training examples: 1000
Final eval loss: 0.7693212628364563

FP32 TrackStar, 8 GPUs (1k):

Run: SmolLM2-1.7B-Instruct-magpie-ultra-v0.1-trackstar-n=1000-s=42
Filter: trackstar
Num training examples: 1000
Final eval loss: 0.7991439700126648

BF16 TrackStar, 8 GPUs (1k):

Run: SmolLM2-1.7B-Instruct-magpie-ultra-v0.1-trackstar-n=1000-s=42
Filter: trackstar
Num training examples: 1000
Final eval loss: 0.7762771844863892

The training loss drop on this dataset is underwhelming because the model has already undergone IFT, so I did a second round using Qwen + LoRA + a dataset Qwen is bad at:

 torchrun --nproc_per_node 8 -m examples.filter_data \
    --model Qwen/Qwen2.5-1.5B \
    --dataset sander-wood/irishman \
    --prompt_column "abc notation" \
    --max_samples 10000 \
    --num_examples 1000 \
    --num_epochs 1 \
    --precision fp32 \
    --learning_rate 5e-5 \
    --subset default \
    --filter random 
  • Full dataset, 8 GPUs (1k, effective batch size = 128, lr=5e-5)

Results

[attached plots: Full dataset, Random, Attribution, TrackStar FP32, TrackStar BF16]

TODO

  • check we maintain correct dtype outside trackstar, in normalizers
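The TODO above could be backstopped with a small dtype guard. A sketch under assumed names (`update_normalizer` and `running_sq` are hypothetical, not the repo's API): normalizer state is asserted fp32, and incoming bf16 gradients are upcast before accumulation.

```python
import torch

# Hypothetical guard, not the repo's API: keep normalizer statistics in
# fp32 even when gradients arrive in bf16, upcasting before accumulating.
def update_normalizer(running_sq: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    assert running_sq.dtype == torch.float32, "normalizer state must stay fp32"
    return running_sq + grad.float().pow(2)  # upcast bf16 grads first

state = torch.zeros(4, dtype=torch.float32)
g = torch.ones(4, dtype=torch.bfloat16)
state = update_normalizer(state, g)   # state stays fp32
```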

@luciaquirke luciaquirke requested a review from LouisYRYJ March 11, 2026 04:23
@luciaquirke luciaquirke changed the title Add mixed precision gradient collection [WIP] Add mixed precision gradient collection Mar 11, 2026
@LouisYRYJ
Contributor

Generally, will mixed precision be adjustable in the config or not?
Are there downsides to using it?


luciaquirke commented Mar 13, 2026

Generally, will mixed precision be adjustable in the config or not? Are there downsides to using it?

I think we will not let it be adjustable for now. The attribution accuracy should be all upside because we are more closely matching bf16/fp16 training (pure bf16 or fp16 training is a thing but it's vanishingly rare). The downside is that fitting normalizers and preconditioners will presumably use more VRAM/wall clock time. I think I'm comfortable with this because we can get a good fit for these values in ~10k data points.

