
This Week's News

  • Predicting MULTIPLE future tokens during training
  • Works better with larger models: once you get over ~3B parameters, this method beats plain next-token prediction
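A minimal sketch of what a multi-token prediction loss could look like: one set of logits per future offset, cross-entropy averaged over the heads. All names and shapes here are made up for illustration, not the actual method from the paper.

```python
import numpy as np

def multi_token_loss(logits, targets):
    """Cross-entropy averaged over k future-token prediction heads.

    logits:  (k, vocab) -- one row of logits per predicted offset (hypothetical shape)
    targets: (k,)       -- the k actual future token ids
    """
    k, _ = logits.shape
    # numerically stable softmax per head
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # negative log-likelihood of each true future token, averaged over heads
    return -np.mean(np.log(probs[np.arange(k), targets]))
```

For example, logits that put high mass on the true future tokens give a near-zero loss, while confidently wrong logits give a large one.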

Where are we? Training after Pre-Training & Supervised Fine-Tuning

  • Often models are good at one task, e.g. summarizing, translation, classification
  • We want a model that can do everything
  • Want them to be able to take any task and do a good job with it
  • ALSO want them to be ethical, unbiased, helpful, safe

The Alignment Problem

  • Not very good alignment right now
  • Not that easy
  • If the training data contains a misspelling very often, the model could learn that misspelling
  • You must say... "I don't want this thing, I want this"
  • Do you like it / do you not like it
  • This is not the natural next-word prediction objective.
  • Will need to tweak the training

Alignment and Training

Reinforcement Learning

  • Reward system - take action based on reward structure
  • Usually there is no reward until the very end

  • At every decision, there is an infinite number of combinations
  • Goal: Find a policy -
    • A policy is the conditional probability of taking an action A at time t when you are in state S at time t
    • Optimize the reward over time
    • Can be deterministic or stochastic
    • Similar to what we are doing with predicting tokens given the context
      • Estimate the value of a state if you take an action A given policy, pi.
      • Chess - assign a value for this situation
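The ideas above can be sketched in a few lines of Python: a stochastic policy pi(a | s) as a conditional probability table, sampling an action from it, and a Monte Carlo estimate of a value. The states, actions, and probabilities are toy values invented for illustration.

```python
import random

# pi(a | s): probability of taking action a when in state s (made-up numbers)
policy = {
    "start": {"left": 0.3, "right": 0.7},
}

def sample_action(state):
    """Sample an action from the stochastic policy pi(. | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

def q_estimate(observed_returns):
    """Estimate Q(s, a) -- the value of taking action a in state s under pi --
    as the average of returns observed after taking that action (Monte Carlo style)."""
    return sum(observed_returns) / len(observed_returns)
```

A deterministic policy would simply put probability 1 on one action per state; the chess example corresponds to `q_estimate` assigning a value to a board position after a candidate move.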

Instruction-based Training

Reinforcement Learning from Human Feedback

  • Want to optimize the reward learned from human feedback
  • Can't just optimize for a single reward number. Need to keep the model pretty similar to the original model: you are fine-tuning, not retraining from scratch
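The usual way to keep the fine-tuned model close to the original is a KL penalty on the reward, as in PPO-based RLHF. A minimal per-sample sketch (the `beta` coefficient and function name are assumptions, not from the lecture):

```python
def rlhf_objective(reward, logp_policy, logp_ref, beta=0.1):
    """Reward minus a KL-style penalty that keeps the fine-tuned model close
    to the reference (original) model.

    logp_policy, logp_ref: log-probability of the sampled response under the
    fine-tuned and the frozen reference model, respectively.
    """
    kl_estimate = logp_policy - logp_ref  # per-sample KL estimate
    return reward - beta * kl_estimate
```

When the policy assigns the same log-probability as the reference model, the penalty vanishes; the more the policy drifts away, the more the objective is reduced.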

DPO (Direct Preference Optimization)
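DPO skips the explicit reward model and RL loop: it trains directly on preference pairs, pushing the policy to prefer the chosen response over the rejected one relative to a frozen reference model. A sketch of the loss for a single pair (function name and `beta` are illustrative assumptions):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the log-probability of the chosen / rejected response
    under the policy being trained or the frozen reference model.
    """
    # how much more the policy prefers chosen over rejected, relative to the reference
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log sigmoid(beta * margin): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At a margin of zero the loss is log 2; it shrinks as the policy learns to favor the chosen response, and the reference terms play the same keep-it-close role as the KL penalty in RLHF.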