Gradient Descent
θj = θj - α * derivative{θj}(J(θ))
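A minimal NumPy-style sketch of this simultaneous update (the function grad_J and the values of alpha/num_iters are illustrative assumptions, not part of the notes):

```python
def gradient_descent(theta, grad_J, alpha=0.01, num_iters=1000):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j for every j at once."""
    for _ in range(num_iters):
        theta = theta - alpha * grad_J(theta)  # simultaneous update of all components
    return theta
```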
Feature scaling:
- Features being on similar scales is better for GD, making it converge faster.
- Rule of thumb: rescale each feature so that its range falls roughly between [-1/3, 1/3] (at the narrowest) and [-3, 3] (at the widest).
Mean normalization: replace xi with xi - μi so that each feature has approximately zero mean.
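A sketch of both steps in NumPy (scaling by the standard deviation here is one choice; the feature's range max - min would also fit the rule of thumb above):

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize and rescale each column of X (one feature per column)."""
    mu = X.mean(axis=0)       # per-feature mean, used for mean normalization
    sigma = X.std(axis=0)     # per-feature spread (the range max - min also works)
    return (X - mu) / sigma, mu, sigma  # keep mu/sigma to transform future inputs the same way
```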
Linear Regression
derivative{θj}(J(θ)) = (1/m) sum{i}{1}{m}((h(i) - y(i)) * xj(i))
θ = θ - (α/m) * (X^T * diff) with diff = X * θ - y.
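A vectorized sketch of this update, assuming X already contains the x0 = 1 column and that alpha/num_iters are chosen by the user:

```python
import numpy as np

def linear_regression_gd(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        diff = X @ theta - y                 # h - y for all m examples
        theta -= (alpha / m) * (X.T @ diff)  # theta = theta - (alpha/m) * X^T * diff
    return theta
```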
Normal equation (analytic method):
- θ = (X^T X)^(-1) X^T y
- Slow if n (# of features) is very large (around 10,000 can still be fine; for more than that, GD can be more suitable).
- X^T X might be noninvertible when some features are linearly dependent, or when there are too many features (n ≥ m).
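A sketch of the analytic method; using the pseudo-inverse instead of a plain inverse is one way to sidestep the noninvertible case mentioned above:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y, with X including the x0 = 1 column."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y  # pinv also handles a singular X^T X
```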
Logistic Regression
- hθ(x) = 1/(1 + e^(-θ^T x)) and θ^T x = ln(h/(1-h)).
- The odds are the ratio between the amounts staked by the parties to a bet. Here the odds are h/(1-h).
- The likelihood of h is a function of θ; it gives the probability, under θ, of observing the classifications in all examples. In the discrete case each example is a binomial count, so the probability of example i is the PMF of getting yi correct classifications out of ni trials, which is C{ni}{yi} * hi^yi * (1-hi)^(ni - yi). The likelihood is the product of all examples' probabilities.
- Or we can use GD by defining the cost function of each example, and then minimizing the total cost function J:
cost(h,y) = [-ln(h) if y=1 and -ln(1-h) if y=0] = -y * ln(h) - (1-y) * ln(1-h) and J(θ) = (1/m) sum{i}{1}{m}(cost(h(i), y(i)))
derivative{θj}(J(θ)) is the same as that of Linear Regression (but h(i) is different).
θ = θ - (α/m) * (X^T * diff) with diff = h - y.
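Putting the pieces together, a sketch of logistic regression trained by gradient descent (alpha and num_iters are illustrative; X includes the x0 = 1 column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent on the cross-entropy cost J."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                  # hypothesis for all m examples
        theta -= (alpha / m) * (X.T @ (h - y))  # same update form as linear regression
    return theta
```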
Maximum Likelihood Estimation
- The likelihood function is the joint probability distribution of observed data expressed as a function of parameters.
- The likelihood function has the same form as the PMF (for discrete data) or the PDF (for continuous data), but it is viewed as a function of the parameters, with the data held fixed.
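For the logistic model above (one Bernoulli trial per example), a sketch of the log-likelihood; maximizing it is equivalent to minimizing the J(θ) defined earlier:

```python
import numpy as np

def log_likelihood(theta, X, y):
    """sum_i [ y_i*ln(h_i) + (1 - y_i)*ln(1 - h_i) ];
    note J(theta) = -(1/m) * log_likelihood."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```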
Regularization
- Used to fight overfitting: keep all features but reduce magnitude/values of parameters θj.
- Underfitting: high bias; overfitting: high variance.
- We add a regularization parameter λ to the cost function so that, when we minimize it, the values of the θj's are pushed down more than usual. It acts like a penalty.
- We should not regularize the parameter θ0.
- Setting λ too large might result in underfitting (all θj may be driven toward 0).
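As a sketch, the regularized cost for linear regression (the 1/(2m) scaling used here is the common convention, an assumption on my part), with θ0 left out of the penalty:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus an L2 penalty; theta[0] (theta_0) is not regularized."""
    m = X.shape[0]
    diff = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # skip theta_0
    return (diff @ diff + penalty) / (2 * m)
```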
Use L2 Reg. (penalty on the sum of squared θj's):
- Gradient Descent: the update becomes θj = θj * (1 - α * λ / m) - (α/m) * sum{i}{1}{m}((h(i) - y(i)) * xj(i)). The factor 1 - α * λ / m is usually slightly less than 1, so each step shrinks θj a little before applying the usual update term.
- Linear Regression - Normal Equation:
- θ = (X^T X + λL)^(-1) X^T y with L a diagonal matrix of size (n+1)x(n+1) whose diagonal elements are all 1, except the first one, which is 0.
- X^T X + λL is always invertible when λ > 0.
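A sketch of the regularized normal equation:

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lambda * L)^(-1) X^T y, with L = identity except L[0, 0] = 0."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0                                       # do not regularize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)  # solve instead of forming the inverse
```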