- How to measure how well the model works
- Check similarity (rouge)
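A minimal sketch of the similarity check: ROUGE-1 recall is just unigram overlap between a reference and a candidate. This is a simplification — real ROUGE also reports precision/F1 and higher n-gram variants, and the function name here is my own.

```python
# Minimal ROUGE-1 recall sketch: fraction of reference unigrams that also
# appear in the candidate. No stemming or tokenizer tricks, just whitespace.
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(ref[w], cand[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)
```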
- Human in the loop
- If the model has never been trained on a chocolate chip recipe, will it be able to make that recipe
- Have agent models evaluate the model
- One likes Cakey cookies, another likes crispy, and they all score the model
- Voting scheme to scale the votes of each of the agents
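The agent-judge idea above can be sketched as a weighted vote. The judge names, scores, and weights here are all hypothetical stand-ins — in practice each score would come from a separate model call.

```python
# Hypothetical LLM-as-judge setup: each agent has its own taste (cakey vs
# crispy) and scores the output 1-10; a weighted average combines the votes.
def aggregate(scores, weights):
    # weighted mean of the judges' scores
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

scores  = {"cakey_judge": 8, "crispy_judge": 4, "neutral_judge": 6}
weights = {"cakey_judge": 1.0, "crispy_judge": 1.0, "neutral_judge": 2.0}
overall = aggregate(list(scores.values()), list(weights.values()))
```

Weighting lets you scale how much each agent's vote counts, e.g. trusting a neutral judge twice as much as the opinionated ones.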
- Models that are trained to generate an embedding for a whole sentence, a whole recipe, or just one ingredient
- Getting a fuzzier match, doesn't need to be "cookie" can be "biscuit"
- Not simple like a math problem, many right answers
- Pretty quickly the transformer models were doing better than the humans; earlier models had been jumbling words
- Changing the order of the question and the supporting text would change the response
- Say what you want and say what you don't want
- If you have a fixed set of words, your model is immediately out of date, new words being created
- Break words into parts (tokens)
- A vocabulary of 30,000-50,000 tokens instead of over 100k words
- Temperature at 1 leaves the model's distribution exactly as it came in
- Higher than 1 will be softer (flatter distribution)
- Creative writing (temp: 1.5)
- Lower than 1 will be sharper (more peaked distribution)
- Non-fiction writing (temp: 0.6)
- A super low temp (0.00001) may push it off in a weird direction: if there are 2 tokens that are pretty close to being correct, it will create a huge difference between them even though the real difference is pretty small
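The temperature notes above can be sketched as a softmax with the logits divided by T before exponentiating (the example logits are made up):

```python
import math

# Softmax with temperature: T=1 leaves the distribution as-is,
# T>1 flattens ("softer"), T<1 sharpens ("more peaked").
def softmax_t(logits, temp=1.0):
    scaled = [x / temp for x in logits]
    m = max(scaled)                         # subtract max for numeric stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits  = [2.0, 1.9, 0.5]                   # two tokens nearly tied
p_soft  = softmax_t(logits, 1.5)            # gap between the top two shrinks
p_sharp = softmax_t(logits, 0.00001)        # tiny gap becomes a landslide
```

This shows the super-low-temp effect from the notes: at T=0.00001 the 0.1 logit gap between the top two tokens turns into near-certainty for the first one.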

cos_sim
- Identify words that have similar uses
- Take the dot product of 2 word embeddings, normalized by their lengths, to see how close the words are to each other
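A minimal cos_sim sketch: dot product divided by the vector norms (equivalently, the dot product of unit-normalized vectors). The 3-d "embeddings" here are made up purely for illustration; real embeddings have hundreds of dimensions.

```python
import math

# Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means same direction.
def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

cookie  = [0.9, 0.1, 0.3]    # toy vectors, not real embeddings
biscuit = [0.8, 0.2, 0.3]    # similar use -> similar direction
brick   = [-0.5, 0.9, -0.1]  # different use -> different direction
```

With real embeddings this is what gives the fuzzier match: "biscuit" lands close to "cookie" even though the strings differ.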
Hugging Face
- has a bunch of abstractions
- Large collection of trained models
- Has the weights needed to do the generation


