- Multi-token prediction: train the model to predict MULTIPLE future tokens at each position, not just the next one
- Only pays off at scale - once models get above ~3B parameters, this method beats plain next-token prediction
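One way to picture the objective: instead of a single softmax over the next token, the model produces scores for each of k future positions, and the per-position cross-entropies are combined. A minimal numpy sketch - the `multi_token_loss` helper, the toy vocabulary size, and the choice to average the k losses are illustrative assumptions, not a specific paper's recipe:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_token_loss(logits, targets):
    """Average cross-entropy over k future-token prediction heads.

    logits:  (k, vocab) - one row of scores per predicted offset
    targets: (k,)       - the true token id at each offset
    """
    probs = softmax(logits)
    # negative log-likelihood of the correct token at each offset
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return nll.mean()

# toy example: vocabulary of 5 tokens, predict the next k=3 tokens at once
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([2, 0, 4])
print(multi_token_loss(logits, targets))
```

With standard next-token prediction, k would be 1; here the same context has to carry information about several upcoming tokens, which is the extra pressure that seems to help only at larger scales.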

- Often models are good at one task, e.g. summarizing, translation, or classification

- We want a model that can do everything
- Want them to be able to take any task and do a good job with it
- ALSO want them to be ethical, unbiased, helpful, safe
- Alignment is still not very good right now

- Not that easy in practice
- If a misspelling appears very often in the training data, the model can learn the misspelling
- You must tell the model explicitly: "I don't want this thing, I want this"

- Human feedback: do you like it / do you not like it

- This kind of feedback is not a natural next-word prediction signal
- Will need to tweak the training objective to use it
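The like/dislike signal is commonly turned into a reward model trained on pairwise comparisons; a minimal sketch of the standard pairwise (Bradley-Terry-style) loss, where `r_preferred` and `r_rejected` stand for hypothetical scalar reward scores assigned to two responses:

```python
import math

def preference_loss(r_preferred, r_rejected):
    """Pairwise preference loss: push the reward of the response
    a human liked above the reward of the one they did not like."""
    # -log sigmoid(r_preferred - r_rejected)
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# the loss shrinks as the preferred response's reward pulls ahead
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))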

- Reward system - an agent takes actions based on a reward structure
- Usually there is no reward until the very end of an episode
- At every decision point, there is an enormous (effectively infinite) number of possible action combinations
- Goal: find a policy
- A policy pi is the conditional probability of taking action a at time t given that you are in state s at time t: pi(a|s) = P(A_t = a | S_t = s)
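A tiny sketch of a stochastic policy as a softmax over per-action scores, then sampling an action from pi(.|s); the linear scoring function and the toy sizes (3 actions, 4 state features) are illustrative assumptions:

```python
import numpy as np

def policy(state_features, weights):
    """pi(a | s): a stochastic policy as a softmax over action scores."""
    scores = weights @ state_features      # one scalar score per action
    scores = scores - scores.max()         # numerical stability
    p = np.exp(scores)
    return p / p.sum()

rng = np.random.default_rng(1)
weights = rng.normal(size=(3, 4))          # 3 actions, 4 state features
s = rng.normal(size=4)
probs = policy(s, weights)
action = rng.choice(3, p=probs)            # sample an action from pi(. | s)
print(probs, action)
```

A deterministic policy would instead just take `argmax` of the scores; the stochastic version mirrors sampling a token from the softmax over the vocabulary.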

- Optimize the reward over time
- Can be deterministic or stochastic
- Similar to what we are doing with predicting tokens given the context
- Action-value function: estimate the value Q_pi(s, a) of being in state s, taking action a, and then following policy pi
- Chess analogy: assign a value to a given board position
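A sketch of estimating Q_pi(s, a) by averaging discounted returns over Monte Carlo rollouts. The toy chain MDP below is made up for illustration, but it matches the note above: the reward only arrives at the very end of the episode.

```python
import random

# toy episodic MDP: walk right from state 0 to state 3; reward 1 only at the goal
GOAL = 3
GAMMA = 0.9

def step(state, action):
    """action 0 = stay, action 1 = move right; reward only on reaching the goal."""
    nxt = min(state + action, GOAL)
    reward = 1.0 if nxt == GOAL and state != GOAL else 0.0
    return nxt, reward, nxt == GOAL

def rollout_return(state, first_action, policy, rng):
    """Discounted return of one episode that starts by taking `first_action`."""
    g, discount, action, done = 0.0, 1.0, first_action, False
    while not done:
        state, r, done = step(state, action)
        g += discount * r
        discount *= GAMMA
        action = policy(state, rng)
    return g

def q_estimate(state, action, policy, n=2000, seed=0):
    """Monte Carlo estimate of Q_pi(s, a): average return over n rollouts."""
    rng = random.Random(seed)
    return sum(rollout_return(state, action, policy, rng) for _ in range(n)) / n

always_right = lambda s, rng: 1            # a simple deterministic policy
print(q_estimate(0, 1, always_right))      # = GAMMA**2 = 0.81 for this policy
```

In chess terms, Q_pi(s, a) is the value assigned to "this board position, this candidate move, assuming I keep playing by policy pi afterwards".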