
[WIP] Add mixed precision gradient collection#189

Open
luciaquirke wants to merge 2 commits into main from mixed-prec

Conversation


@luciaquirke luciaquirke commented Mar 11, 2026

  • Always leave optimizers and preconditioners in fp32
  • Use mixed precision fwd-bwd
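The two bullets above can be sketched as follows. This is a minimal illustration of the general pattern, not the PR's actual code: parameters and optimizer state stay fp32, and only the forward/backward runs under bf16 autocast (device_type="cpu" here so the sketch runs anywhere; on GPU you would use "cuda").

```python
import torch
import torch.nn as nn

# Assumed sketch of the scheme described above, not the PR's code:
# parameters and AdamW moments stay fp32; only fwd-bwd is bf16.
model = nn.Linear(16, 4)                              # params are fp32
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)  # fp32 exp_avg / exp_avg_sq

x, y = torch.randn(8, 16), torch.randn(8, 4)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), y)        # matmuls run in bf16

loss.backward()                        # grads land in fp32 (the param dtype)
grad_dtype = model.weight.grad.dtype
opt.step()
opt.zero_grad()
```

Because the leaf parameters are fp32, autograd accumulates their gradients in fp32 regardless of the autocast compute dtype, which is what lets the optimizer and preconditioners stay in full precision.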

Coreset regression testing (last regression: #178)

Full dataset, 8 GPUs (really around 4762 items):

[maybe needs redoing for ebs]
Run: SmolLM2-1.7B-Instruct-magpie-ultra-v0.1-trackstar-n=10000-s=42
Filter: trackstar
Num training examples: 10000
Final eval loss: 0.6909899711608887

Random, 8 GPUs (1k, effective batch size = 32):

Run: SmolLM2-1.7B-Instruct-magpie-ultra-v0.1-random-n=1000-s=42
Filter: random
Num training examples: 1000
Final eval loss: 0.8014830350875854

Attribution, 8 GPUs (1k, effective batch size = 32):

Run: SmolLM2-1.7B-Instruct-magpie-ultra-v0.1-attribution-n=1000-s=42
Filter: attribution
Num training examples: 1000
Final eval loss: 0.7693212628364563

FP32 TrackStar, 8 GPUs (1k):

Run: SmolLM2-1.7B-Instruct-magpie-ultra-v0.1-trackstar-n=1000-s=42
Filter: trackstar
Num training examples: 1000
Final eval loss: 0.7991439700126648

BF16 TrackStar, 8 GPUs (1k):

Run: SmolLM2-1.7B-Instruct-magpie-ultra-v0.1-trackstar-n=1000-s=42
Filter: trackstar
Num training examples: 1000
Final eval loss: 0.7762771844863892

The training loss drop on this dataset is underwhelming because the model has already undergone IFT, so I did a second round using Qwen + LoRA + a dataset Qwen is bad at:

 torchrun --nproc_per_node 8 -m examples.filter_data \
    --model Qwen/Qwen2.5-1.5B \
    --dataset sander-wood/irishman \
    --prompt_column "abc notation" \
    --max_samples 10000 \
    --num_examples 1000 \
    --num_epochs 1 \
    --precision fp32 \
    --learning_rate 5e-5 \
    --subset default \
    --filter random 
  • Full dataset, 8 GPUs (1k, effective batch size = 128, lr=5e-5)

Results

[attached plots: Full dataset, Random, Attribution, TrackStar FP32, TrackStar BF16]

TODO

  • check we maintain correct dtype outside trackstar, in normalizers
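The TODO above could be backstopped with a small dtype guard. A sketch under assumed names (`update_normalizer` and `running_sq` are hypothetical, not the repo's API): normalizer state is asserted fp32, and incoming bf16 gradients are upcast before accumulation.

```python
import torch

# Hypothetical guard, not the repo's API: keep normalizer statistics in
# fp32 even when gradients arrive in bf16, upcasting before accumulating.
def update_normalizer(running_sq: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    assert running_sq.dtype == torch.float32, "normalizer state must stay fp32"
    return running_sq + grad.float().pow(2)  # upcast bf16 grads first

state = torch.zeros(4, dtype=torch.float32)
g = torch.ones(4, dtype=torch.bfloat16)
state = update_normalizer(state, g)   # state stays fp32
```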

@luciaquirke luciaquirke requested a review from LouisYRYJ March 11, 2026 04:23
@luciaquirke luciaquirke changed the title Add mixed precision gradient collection [WIP] Add mixed precision gradient collection Mar 11, 2026
@LouisYRYJ
Contributor

Generally, will mixed precision be adjustable in the config or not?
Are there downsides to using it?


luciaquirke commented Mar 13, 2026

Generally, will mixed precision be adjustable in the config or not? Are there downsides to using it?

I think we will not let it be adjustable for now. The attribution accuracy should be all upside because we are more closely matching bf16/fp16 training (pure bf16 or fp16 training is a thing but it's vanishingly rare). The downside is that fitting normalizers and preconditioners will presumably use more VRAM/wall clock time. I think I'm comfortable with this because we can get a good fit for these values in ~10k data points.

