ML-Coursera

Gradient Descent

θ_j := θ_j - α * ∂J(θ)/∂θ_j

Feature scaling:

  • Gradient descent converges faster when the features are on similar scales.
  • Rule of thumb: rescale each feature so that its range falls roughly between [-1/3, 1/3] and [-3, 3].

Mean normalization: replace x_i with x_i - μ_i so that the feature has approximately zero mean.
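A minimal NumPy sketch of mean normalization combined with scaling by the feature range (the function name and return convention are illustrative, not from the course code):

```python
import numpy as np

def feature_normalize(X):
    """Mean-normalize each column of X and scale it by its range.

    Returns the normalized matrix plus the per-feature mean and range,
    which are needed to apply the same transformation to new examples.
    """
    mu = X.mean(axis=0)                            # per-feature mean
    feature_range = X.max(axis=0) - X.min(axis=0)  # per-feature range
    X_norm = (X - mu) / feature_range
    return X_norm, mu, feature_range
```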

Linear Regression

∂J(θ)/∂θ_j = (1/m) * Σ_{i=1}^{m} ((h(x^(i)) - y^(i)) * x_j^(i))
θ := θ - (α/m) * (X^T * diff), with diff = X*θ - y.
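A minimal sketch of the vectorized update above in NumPy, assuming X already includes a leading column of ones for θ_0 (names are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix with a leading column of ones.
    y: (m,) target vector.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        diff = X @ theta - y                 # h(x^(i)) - y^(i) for all examples
        theta -= (alpha / m) * (X.T @ diff)  # simultaneous update of all θ_j
    return theta
```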

Normal equation (analytic method):

  • θ = (X^T X)^(-1) X^T y
  • Slow if n (the number of features) is very large (n around 10,000 is usually still fine; beyond that, gradient descent tends to be more suitable).
  • X^T X might be non-invertible when some features are linearly dependent, or when there are too many features (n ≥ m).
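A sketch of the normal equation in NumPy; np.linalg.pinv (the pseudo-inverse) is used so the computation still gives a sensible answer when X^T X is non-invertible:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least-squares solution θ = (X^T X)^(-1) X^T y.

    np.linalg.pinv handles the case where X^T X is singular
    (linearly dependent features or n >= m).
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```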

Logistic Regression

  • h_θ(x) = 1/(1 + e^(-θ^T x)), and θ^T x = ln(h/(1-h)).
  • The odds are the ratio between the amounts staked by the parties to a bet. Here the odds are h/(1-h).
  • The likelihood of h is a function of θ: it gives the probability, under θ, that the classification is correct in all examples. In the discrete case each example is a binomial count, so the probability of an example is the PMF of getting y_i correct classifications out of n_i samples, which is C(n_i, y_i) * h_i^(y_i) * (1-h_i)^(n_i - y_i). The likelihood is the product of all the examples' probabilities.
  • Or we can use gradient descent by defining the cost of each example and then minimizing the total cost function J (see the sketch after this list):
    cost(h, y) = -ln(h) if y = 1, -ln(1-h) if y = 0, i.e. cost(h, y) = -y*ln(h) - (1-y)*ln(1-h), and J(θ) = (1/m) * Σ_{i=1}^{m} cost(h(x^(i)), y^(i))
    ∂J(θ)/∂θ_j has the same form as in Linear Regression (but h(x^(i)) is now the sigmoid hypothesis).
    θ := θ - (α/m) * (X^T * diff), with diff = h - y.
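A minimal sketch of this gradient-descent update for logistic regression, assuming y contains 0/1 labels and X includes the bias column (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression.

    The update has the same form as in linear regression,
    but the hypothesis h is the sigmoid of X @ theta.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        diff = h - y
        theta -= (alpha / m) * (X.T @ diff)
    return theta
```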

Maximum Likelihood Estimation

  • The likelihood function is the joint probability distribution of observed data expressed as a function of parameters.
  • The likelihood function has the same form as the PMF (for discrete data) or the PDF (for continuous data), but it is viewed as a function of the parameters rather than of the observed data.
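A small illustrative example (hypothetical, not from the notes): for Bernoulli observations the log-likelihood is a function of the parameter p, and maximizing it over a grid recovers approximately the sample mean:

```python
import numpy as np

def bernoulli_log_likelihood(p, y):
    """Log-likelihood of Bernoulli parameter p given observations y in {0, 1}."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1, 0, 1])
candidates = np.linspace(0.01, 0.99, 99)
best_p = candidates[np.argmax([bernoulli_log_likelihood(p, y) for p in candidates])]
# best_p is close to y.mean(), the maximum-likelihood estimate
```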

Regularization

  • Used to fight overfitting: keep all the features but reduce the magnitudes/values of the parameters θ_j.
  • Underfitting: high bias; overfitting: high variance.
  • A regularization parameter λ is added to the cost function so that minimizing it pushes the values of the θ_j's down more than usual, acting as a penalty on large parameters (see the sketch after this list).
  • We should not regularize the parameter θ_0.
  • Setting λ too large might make the algorithm underfit (all θ_j might be driven to 0).
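A sketch of an L2-regularized squared-error cost in NumPy, with θ_0 excluded from the penalty term (names are illustrative):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """L2-regularized squared-error cost; theta[0] is not penalized."""
    m = len(y)
    diff = X @ theta - y
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return (1 / (2 * m)) * np.sum(diff ** 2) + penalty
```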

Using L2 regularization (a penalty proportional to the sum of the squared parameters):

  • Gradient Descent: the update becomes θ_j := θ_j * (1 - α*λ/m) - (α/m) * Σ_{i=1}^{m} ((h(x^(i)) - y^(i)) * x_j^(i)) for j ≥ 1. The factor 1 - α*λ/m is usually a bit less than 1, so each iteration shrinks θ_j slightly before applying the usual gradient step.
  • Linear Regression - Normal Equation:
    • θ = (X^T X + λL)^(-1) X^T y, with L an (n+1)×(n+1) diagonal matrix whose diagonal elements are all 1 except the first, which is 0.
    • With λ > 0, X^T X + λL is invertible.
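A sketch of the regularized normal equation, building L as the identity matrix with its first diagonal element set to 0 (names are illustrative):

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """θ = (X^T X + λL)^(-1) X^T y with L = identity except L[0, 0] = 0."""
    n = X.shape[1]
    L = np.eye(n)
    L[0, 0] = 0.0  # do not regularize the bias term θ_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```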
