Gradient Descent
θj = θj - α * derivative{θj}(J(θ))
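A minimal NumPy-style sketch of this simultaneous update (the function grad_J and the values of alpha/num_iters are illustrative assumptions, not part of the notes):

```python
def gradient_descent(theta, grad_J, alpha=0.01, num_iters=1000):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j for every j at once."""
    for _ in range(num_iters):
        theta = theta - alpha * grad_J(theta)  # simultaneous update of all components
    return theta
```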
Feature scaling:
- Features being on similar scales is better for GD, making it converge faster.
- Rule of thumb: rescale each feature so that its range falls roughly between [-1/3, 1/3] (at the narrowest) and [-3, 3] (at the widest).
Mean normalization: replace xi with xi - μi so that each feature has approximately zero mean.
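A sketch of both steps in NumPy (scaling by the standard deviation here is one choice; the feature's range max - min would also fit the rule of thumb above):

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize and rescale each column of X (one feature per column)."""
    mu = X.mean(axis=0)       # per-feature mean, used for mean normalization
    sigma = X.std(axis=0)     # per-feature spread (the range max - min also works)
    return (X - mu) / sigma, mu, sigma  # keep mu/sigma to transform future inputs the same way
```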
Linear Regression
derivative{θj}(J(θ)) = (1/m) sum{i}{1}{m}((h(i) - y(i)) * xj(i))
θ = θ - (α/m) * (X^T * diff) with diff = X * θ - y.
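A vectorized sketch of this update, assuming X already contains the x0 = 1 column and that alpha/num_iters are chosen by the user:

```python
import numpy as np

def linear_regression_gd(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        diff = X @ theta - y                 # h - y for all m examples
        theta -= (alpha / m) * (X.T @ diff)  # theta = theta - (alpha/m) * X^T * diff
    return theta
```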
Normal equation (analytic method):
- θ = (X^T X)^(-1) X^T y
- Slow if n (# of features) is very large (around 10,000 can still be fine; for more than that, GD can be more suitable).
- X^T X might be noninvertible when some features are linearly dependent, or when there are too many features (n ≥ m).
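A sketch of the analytic method; using the pseudo-inverse instead of a plain inverse is one way to sidestep the noninvertible case mentioned above:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y, with X including the x0 = 1 column."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y  # pinv also handles a singular X^T X
```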
Logistic Regression
- hθ(x) = 1/(1 + e^(-θ^T x)) and θ^T x = ln(h/(1-h)).
- The odds are the ratio between the amounts staked by the parties to a bet. Here the odds are h/(1-h).
- The likelihood of h is a function of θ; it gives the probability, under θ, of observing the classifications in all examples. In the discrete case each example is a binomial count, so the probability of example i is the PMF of getting yi correct classifications out of ni trials, which is C{ni}{yi} * hi^yi * (1-hi)^(ni - yi). The likelihood is the product of all examples' probabilities.
- Or we can use GD by defining the cost function of each example, and then minimizing the total cost function J:
cost(h,y) = [-ln(h) if y=1 and -ln(1-h) if y=0] = -y * ln(h) - (1-y) * ln(1-h) and J(θ) = (1/m) sum{i}{1}{m}(cost(h(i), y(i)))
derivative{θj}(J(θ)) is the same as that of Linear Regression (but h(i) is different).
θ = θ - (α/m) * (X^T * diff) with diff = h - y.
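Putting the pieces together, a sketch of logistic regression trained by gradient descent (alpha and num_iters are illustrative; X includes the x0 = 1 column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent on the cross-entropy cost J."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                  # hypothesis for all m examples
        theta -= (alpha / m) * (X.T @ (h - y))  # same update form as linear regression
    return theta
```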
Maximum Likelihood Estimation
- The likelihood function is the joint probability distribution of observed data expressed as a function of parameters.
- The likelihood function has the same form as the PMF (for discrete data) or the PDF (for continuous data), but it is viewed as a function of the parameters, with the data held fixed.
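For the logistic model above (one Bernoulli trial per example), a sketch of the log-likelihood; maximizing it is equivalent to minimizing the J(θ) defined earlier:

```python
import numpy as np

def log_likelihood(theta, X, y):
    """sum_i [ y_i*ln(h_i) + (1 - y_i)*ln(1 - h_i) ];
    note J(theta) = -(1/m) * log_likelihood."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```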
Regularization
- Used to fight overfitting: keep all features but reduce magnitude/values of parameters θj.
- Underfitting: high bias; overfitting: high variance.
- We add a regularization parameter λ to the cost function so that, when we minimize it, the values of the θj's are pushed down more than usual. It acts like a penalty.
- We should not regularize the parameter θ0.
- Setting λ too large might result in underfitting (all θj may be driven toward 0).
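As a sketch, the regularized cost for linear regression (the 1/(2m) scaling used here is the common convention, an assumption on my part), with θ0 left out of the penalty:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty; theta[0] (theta_0) is not regularized."""
    m = X.shape[0]
    diff = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # skip theta_0
    return (diff @ diff + penalty) / (2 * m)
```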
Use L2 Reg. (penalty on the sum of squared θj's):
- Gradient Descent: the update becomes θj = θj * (1 - α * λ / m) - (α/m) * sum{i}{1}{m}((h(i) - y(i)) * xj(i)). The factor 1 - α * λ / m is usually slightly less than 1, so each step shrinks θj a little before applying the usual update term.
- Linear Regression - Normal Equation:
- θ = (X^T X + λL)^(-1) X^T y with L a diagonal matrix of size (n+1)x(n+1) whose diagonal elements are all 1, except the first one, which is 0.
- X^T X + λL is always invertible when λ > 0.
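A sketch of the regularized normal equation:

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lambda * L)^(-1) X^T y, with L = identity except L[0, 0] = 0."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                                       # do not regularize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)  # solve instead of forming the inverse
```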