- x = features
- f(x) = the function that makes a prediction/classification based on the inputs
- Need a loss function
- Need an optimization function
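The four bullets above can be sketched end to end. This is a minimal, hypothetical example (a linear model, squared-error loss, plain gradient descent); the specific shapes and learning rate are assumptions, not from the notes:

```python
import numpy as np

# x = features; f(x) = w.x + b makes the prediction
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])      # assumed "ground truth" for the demo
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
b = 0.0
lr = 0.1                                  # assumed learning rate
for _ in range(200):
    pred = X @ w + b                      # f(x): prediction from the inputs
    err = pred - y
    loss = np.mean(err ** 2)              # the loss function
    grad_w = 2 * X.T @ err / len(y)       # the optimization step:
    grad_b = 2 * err.mean()               # gradient descent on the loss
    w -= lr * grad_w
    b -= lr * grad_b

print(np.round(w, 2))                     # should approach true_w
```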

- Pink: input
- Blue: H1
- Yellow: H2
- Green: Output
- Need to make sure that the shapes of the matrices are correct

- Output matrix of first step is 1x3
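A quick shape check for the layers above. The hidden-layer sizes here are assumptions for illustration (a 1×2 input and a 3-unit first hidden layer, consistent with the 1×3 first-step output in the notes):

```python
import numpy as np

x  = np.ones((1, 2))     # input (pink), assumed shape 1x2
W1 = np.ones((2, 3))     # input -> H1 (blue)
W2 = np.ones((3, 2))     # H1 -> H2 (yellow)
W3 = np.ones((2, 1))     # H2 -> output (green)

h1  = x @ W1             # (1x2)(2x3) -> 1x3, the first-step output
h2  = h1 @ W2            # (1x3)(3x2) -> 1x2
out = h2 @ W3            # (1x2)(2x1) -> 1x1

print(h1.shape, h2.shape, out.shape)
```

If an inner dimension does not match, numpy raises an error immediately, which is exactly the shape check the notes call for.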

- The output layer is usually trying to produce a probability distribution
- Sigmoid will give a value between 0 and 1 for each output independently
- Softmax will give values that all sum to 1
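A small sketch of the difference. Sigmoid squashes each value independently into (0, 1); softmax normalizes the whole vector so it sums to 1:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))      # each value independently in (0, 1)

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()               # all values sum to 1

z = np.array([2.0, 1.0, 0.1])
print(sigmoid(z))                    # three independent values in (0, 1)
print(softmax(z))                    # a probability distribution over 3 classes
```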
- Max of 32k, so English text needs to be broken down into tokens
- This means things will likely run faster, because you are working with the tokenized values instead of raw text
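A toy sketch of tokenization: mapping words to integer IDs under a vocabulary cap. Real models use subword schemes (e.g. BPE) rather than whole words, and the 32k cap here is just the figure from the notes applied as a vocabulary limit, which is an assumption:

```python
# Toy word-level tokenizer (hypothetical; real tokenizers use subwords)
vocab = {}

def tokenize(text, max_vocab=32_000):
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            if len(vocab) >= max_vocab:
                raise ValueError("vocabulary full")
            vocab[word] = len(vocab)  # assign the next free ID
        ids.append(vocab[word])
    return ids

print(tokenize("the model sees the tokens"))  # → [0, 1, 2, 0, 3]
```

The model then operates on these integer IDs, which is why the notes say working with tokenized values tends to be faster than working with raw text.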
- We will be using back propagation
- We will calculate a loss based on how far off our answers were from the correct answer
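Backpropagation at its smallest: compute how far the answer is from the correct one, then push the gradient backwards through the chain rule. This is a hypothetical one-neuron sketch (sigmoid activation, squared-error loss), with a numerical check that the gradient is right:

```python
import numpy as np

x, y_true = 1.5, 1.0                 # assumed input and correct answer
w, b = 0.4, 0.1                      # assumed starting parameters

# Forward pass
z = w * x + b
y = 1 / (1 + np.exp(-z))             # prediction
loss = (y - y_true) ** 2             # how far off we were

# Backward pass (chain rule, one link at a time)
dloss_dy = 2 * (y - y_true)
dy_dz = y * (1 - y)                  # derivative of sigmoid
dz_dw = x
grad_w = dloss_dy * dy_dz * dz_dw    # backpropagated gradient for w

# Sanity check with a finite difference
eps = 1e-6
z2 = (w + eps) * x + b
y2 = 1 / (1 + np.exp(-z2))
num_grad = ((y2 - y_true) ** 2 - loss) / eps

print(grad_w, num_grad)              # the two should agree closely
```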
- Entropy
- Measures uncertainty
- An unfair coin (two heads) has an entropy of 0
- The outcome is certain
- A fair coin has an entropy of 1 bit
- 50/50
- A fair 1000-sided die has much higher entropy (~10 bits)
- The outcome is far less certain: 1 in 1000
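The three entropy examples can be computed directly from the definition H(p) = -Σ p·log2(p):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # treat 0 * log(0) as 0
    return -(p * np.log2(p)).sum()      # entropy in bits

print(entropy([1.0, 0.0]))              # two-headed coin: 0 bits, certain
print(entropy([0.5, 0.5]))              # fair coin: 1 bit, 50/50
print(entropy([1 / 1000] * 1000))       # fair 1000-sided die: ~9.97 bits
```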

- D is the divergence value (KL divergence): how far the predicted distribution is from the true one
- Loss will never be exactly 0 (if it is, you did something wrong)
- There will always be some non-zero ambiguity with text
- "Today will be ______ ___" (could be a day, weather, an event, etc.)
- Useful for sparse categorical targets, where the label is a class index rather than a one-hot vector
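A sketch tying these together. With a one-hot true distribution, cross-entropy equals the KL divergence D, and the "sparse categorical" form computes the same value from just the class index. The particular numbers are assumptions for illustration:

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = H(p) + D_KL(p || q); for one-hot p, H(p) = 0,
    # so this is exactly the divergence D from the true distribution
    p, q = np.asarray(p), np.asarray(q)
    return -(p[p > 0] * np.log(q[p > 0])).sum()

q = np.array([0.7, 0.2, 0.1])   # model's predicted distribution (assumed)
p = np.array([1.0, 0.0, 0.0])   # one-hot truth: class 0

print(cross_entropy(p, q))      # -log(0.7): not 0, since q isn't certain

target = 0                      # sparse categorical form: just the index
print(-np.log(q[target]))       # same value, no one-hot vector needed
```

Note the loss only reaches 0 when the model assigns probability 1 to the correct class; with genuinely ambiguous text that should never happen.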
