Machine Learning Study Notes (Week 3)


Supervised Learning



Classification Problem



1. Hypothesis Function:
  • "Sigmoid Function" or "Logistic Function":
    $g(z) = \frac{1}{1+e^{-z}}$
  • $h_\theta(x) = g(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}}$
    (an Octave sketch of this hypothesis follows after this list)
    


  • Interpretation:
    $h_\theta(x)$ gives the probability that the output is 1:
    $h_\theta(x) = P(y = 1 \mid x;\theta) = 1 - P(y = 0 \mid x;\theta)$

  • Decision Boundary:
  • when
    $h_\theta(x)\geq 0.5\rightarrow y=1, h_\theta(x) < 0.5\rightarrow y=0$
    means $g(z)\geq 0.5\rightarrow z\geq 0\rightarrow y=1 $
  • $z$ is the input to $g$ (e.g. $z = \theta^Tx$)
  • the decision boundary can be any shape; it is not necessarily a straight line




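    A minimal Octave sketch of the sigmoid hypothesis (the helper name sigmoid is my own choice, not prescribed by the course):
    function g = sigmoid(z)
      g = 1 ./ (1 + exp(-z));    % element-wise, so z can be a scalar, a vector, or a matrix
    end
    h = sigmoid(X * theta);      % h(i) is the predicted probability that y = 1 for example i
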
2. Cost Function:
    $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[y^{(i)}\log\bigl(h_\theta(x^{(i)})\bigr)+(1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr]$

    Vectorized implementation:
    $h = g(X\theta)$
    $J(\theta) = \frac{1}{m}\bigl(-y^T\log(h)-(1-y)^T\log(1-h)\bigr)$
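
    A minimal Octave sketch of this vectorized cost, assuming X (the m x (n+1) design matrix), y (the m x 1 label vector), theta, and the sigmoid helper from above are in scope:
    m = length(y);
    h = sigmoid(X * theta);
    J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));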

3. Gradient Descent:
    Repeat {
    $\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}$
    }
    Vectorized implementation:
    $\theta :=\theta -\frac{\alpha}{m}X^T\bigl(g(X\theta)-\vec{y}\bigr)$
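
    A minimal Octave sketch of this vectorized update, assuming a learning rate alpha and an iteration count num_iters chosen by hand:
    for iter = 1:num_iters
      theta = theta - (alpha / m) * X' * (sigmoid(X * theta) - y);
    end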

    Advanced Optimization


  • Optimization algorithms:
  • Gradient descent
  • Conjugate gradient
  • BFGS
  • L-BFGS


  • Code:
    First, we need to provide a function that evaluates both
    $J(\theta)$ and
    $\frac{\partial}{\partial\theta_j}J(\theta)$:
    function [jVal, gradient] = costFunction(theta)
      jVal = [...code to compute J(theta)...];
      gradient = [...code to compute derivative of J(theta)...];
    end
    

    Then we use the "fminunc()" optimization algorithm along with the "optimset()" function, which creates an object containing the options we want to send to "fminunc()".
    options = optimset('GradObj', 'on', 'MaxIter', 100);
    initialTheta = zeros(2,1);
       [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
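
    As one possible (unofficial) way of filling in the placeholders above, the logistic regression cost and gradient from sections 2 and 3 could be used; X and y would need to be visible inside the function, e.g. passed through an anonymous function such as @(t) costFunction(t, X, y):
    function [jVal, gradient] = costFunction(theta, X, y)
      m = length(y);
      h = sigmoid(X * theta);                                   % sigmoid helper from above
      jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % J(theta)
      gradient = (1 / m) * X' * (h - y);                        % vector of partial derivatives
    end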
    

    Multiclass Classification: One-vs-all



    Multiclass means $y \in \{0, 1, \dots, n\}$. Simply apply the same logistic regression algorithm to each class:

    Train a logistic regression classifier $h_\theta(x)$ for each class to predict the probability that $y = i$.

    To make a prediction on a new $x$, pick the class that maximizes $h_\theta(x)$.
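
    A minimal Octave sketch of the prediction step, assuming all_theta is a K x (n+1) matrix whose i-th row holds the trained parameters for class i (the variable names are my own):
    probs = sigmoid(X * all_theta');            % m x K matrix of per-class probabilities
    [maxProb, prediction] = max(probs, [], 2);  % for each example, pick the class with the largest h_theta(x)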



    The Problem of Overfitting



    Too many features and an overly complicated hypothesis function lead to high variance (overfitting):


    1. Regularized Cost Function:
    Regularize all of the theta parameters in a single summation:
    $\min_\theta J(\theta) = \frac{1}{2m}\Biggl[\sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)^2 + \lambda\sum_{j=1}^n\theta_j^2\Biggr]$

    The $\lambda$, or lambda, is the regularization parameter.
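
    A minimal Octave sketch of this regularized cost for linear regression (theta(1) corresponds to $\theta_0$ and is excluded from the penalty):
    m = length(y);
    h = X * theta;                               % linear hypothesis
    J = (1 / (2 * m)) * sum((h - y) .^ 2) ...
        + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);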

    2. Regularized Gradient Descent:
  • Regularized Linear Regression:
    Repeat {
    $\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_0^{(i)}$
    $\theta_j := \theta_j - \alpha\Biggl[\Bigl(\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}\Bigr)+\frac{\lambda}{m}\theta_j\Biggr]\qquad j\in\{1,2,\dots,n\}$
    }

    $\theta_j$ can also be represented as:
    $\theta_j := \theta_j\bigl(1-\alpha\frac{\lambda}{m}\bigr)-\alpha\frac{1}{m}\sum_{i=1}^m\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}$

    Intuitively, you can see it as reducing the value of $\theta_j$ by some amount on every update.
  • Normal Equation:

  • $\theta = (X^TX+\lambda L)^{-1}X^Ty$
    where $L =
    \begin{bmatrix}
    0 & & & & \\
     & 1 & & & \\
     & & 1 & & \\
     & & & \ddots & \\
     & & & & 1 \\
    \end{bmatrix}$
    is an $(n+1)\times(n+1)$ matrix.
    
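  • A minimal Octave sketch of this equation (eye builds the identity matrix; its (1,1) entry is zeroed so that $\theta_0$ is not regularized):
    L = eye(n + 1);
    L(1, 1) = 0;
    theta = pinv(X' * X + lambda * L) * (X' * y);
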
  • Regularized Logistic Regression:

  • Repeat {
    $\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m \bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_0^{(i)}$
    $\theta_j := \theta_j - \alpha\Biggl[\Bigl(\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)})-y^{(i)}\bigr)x_j^{(i)}\Bigr)+\frac{\lambda}{m}\theta_j\Biggr]\qquad j\in\{1,2,\dots,n\}$
    }
    This looks identical to the regularized linear regression update, but here $h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$.

    Cost function (regularized):
    $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[y^{(i)}\log\bigl(h_\theta(x^{(i)})\bigr)+(1-y^{(i)})\log\bigl(1-h_\theta(x^{(i)})\bigr)\Bigr]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$
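
    A minimal Octave sketch of this regularized cost together with its gradient, in the [jVal, gradient] form that fminunc expects ($\theta_0$, stored in theta(1), is left unregularized; the function name is my own):
    function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
      m = length(y);
      h = sigmoid(X * theta);
      jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
             + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
      gradient = (1 / m) * X' * (h - y);
      gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);
    end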
