- Multi-token prediction: train the model to predict MULTIPLE future tokens at each position, not just the next one
- Only pays off at scale - once models get above ~3B parameters, this method beats plain next-token prediction
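One way to picture the objective: instead of a single softmax over the next token, the model produces scores for each of k future positions, and the per-position cross-entropies are combined. A minimal numpy sketch - the `multi_token_loss` helper, the toy vocabulary size, and the choice to average the k losses are illustrative assumptions, not a specific paper's recipe:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_token_loss(logits, targets):
    """Average cross-entropy over k future-token prediction heads.

    logits:  (k, vocab) - one row of scores per predicted offset
    targets: (k,)       - the true token id at each offset
    """
    probs = softmax(logits)
    # negative log-likelihood of the correct token at each offset
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return nll.mean()

# toy example: vocabulary of 5 tokens, predict the next k=3 tokens at once
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([2, 0, 4])
print(multi_token_loss(logits, targets))
```

With standard next-token prediction, k would be 1; here the same context has to carry information about several upcoming tokens, which is the extra pressure that seems to help only at larger scales.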

- Often models are good at one task, e.g. summarizing, translation, or classification

- We want a model that can do everything
- Want them to be able to take any task and do a good job with it
- ALSO want them to be ethical, unbiased, helpful, safe
- Alignment is still not very good right now

- Not that easy in practice
- If a misspelling appears very often in the training data, the model can learn the misspelling
- You must tell the model explicitly: "I don't want this thing, I want this"

- Human feedback: do you like it / do you not like it

- This kind of feedback is not a natural next-word prediction signal
- Will need to tweak the training objective to use it
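The like/dislike signal is commonly turned into a reward model trained on pairwise comparisons; a minimal sketch of the standard pairwise (Bradley-Terry-style) loss, where `r_preferred` and `r_rejected` stand for hypothetical scalar reward scores assigned to two responses:

```python
import math

def preference_loss(r_preferred, r_rejected):
    """Pairwise preference loss: push the reward of the response
    a human liked above the reward of the one they did not like."""
    # -log sigmoid(r_preferred - r_rejected)
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# the loss shrinks as the preferred response's reward pulls ahead
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))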

- Reward system - an agent takes actions based on a reward structure
- Usually there is no reward until the very end of an episode
- At every decision point, there is an enormous (effectively infinite) number of possible action combinations
- Goal: find a policy
- A policy pi is the conditional probability of taking action a at time t given that you are in state s at time t: pi(a|s) = P(A_t = a | S_t = s)
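A tiny sketch of a stochastic policy as a softmax over per-action scores, then sampling an action from pi(.|s); the linear scoring function and the toy sizes (3 actions, 4 state features) are illustrative assumptions:

```python
import numpy as np

def policy(state_features, weights):
    """pi(a | s): a stochastic policy as a softmax over action scores."""
    scores = weights @ state_features      # one scalar score per action
    scores = scores - scores.max()         # numerical stability
    p = np.exp(scores)
    return p / p.sum()

rng = np.random.default_rng(1)
weights = rng.normal(size=(3, 4))          # 3 actions, 4 state features
s = rng.normal(size=4)
probs = policy(s, weights)
action = rng.choice(3, p=probs)            # sample an action from pi(. | s)
print(probs, action)
```

A deterministic policy would instead just take `argmax` of the scores; the stochastic version mirrors sampling a token from the softmax over the vocabulary.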

- Optimize the reward over time
- Can be deterministic or stochastic
- Similar to what we are doing with predicting tokens given the context
- Action-value function: estimate the value Q_pi(s, a) of being in state s, taking action a, and then following policy pi
- Chess analogy: assign a value to a given board position
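A sketch of estimating Q_pi(s, a) by averaging discounted returns over Monte Carlo rollouts. The toy chain MDP below is made up for illustration, but it matches the note above: the reward only arrives at the very end of the episode.

```python
import random

# toy episodic MDP: walk right from state 0 to state 3; reward 1 only at the goal
GOAL = 3
GAMMA = 0.9

def step(state, action):
    """action 0 = stay, action 1 = move right; reward only on reaching the goal."""
    nxt = min(state + action, GOAL)
    reward = 1.0 if nxt == GOAL and state != GOAL else 0.0
    return nxt, reward, nxt == GOAL

def rollout_return(state, first_action, policy, rng):
    """Discounted return of one episode that starts by taking `first_action`."""
    g, discount, action, done = 0.0, 1.0, first_action, False
    while not done:
        state, r, done = step(state, action)
        g += discount * r
        discount *= GAMMA
        action = policy(state, rng)
    return g

def q_estimate(state, action, policy, n=2000, seed=0):
    """Monte Carlo estimate of Q_pi(s, a): average return over n rollouts."""
    rng = random.Random(seed)
    return sum(rollout_return(state, action, policy, rng) for _ in range(n)) / n

always_right = lambda s, rng: 1            # a simple deterministic policy
print(q_estimate(0, 1, always_right))      # = GAMMA**2 = 0.81 for this policy
```

In chess terms, Q_pi(s, a) is the value assigned to "this board position, this candidate move, assuming I keep playing by policy pi afterwards".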