Machine Learning Study Notes (Week 1 & 2)


Supervised Learning



Linear Regression Model



1. Hypothesis Function: (the function that best fits the training set)
h_\theta(x) =  \theta_0 + \theta_1 x_1

$\theta_i$'s: Parameters

2. Cost Function: (measures the performance/accuracy of the hypothesis function)
J(\theta_0,\theta_1) = \frac{1}{2m} \sum_{i=1}^m \Bigl( h_\theta(x_i) - y_i \Bigr)^2

$m$ is the number of training examples (the size of the training set)
$(x_i, y_i)$ is called a training example
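
As a small illustration, the cost function in Octave (a sketch only; the helper name computeCost is an assumption, X is taken to be the m-by-(n+1) design matrix with a leading column of ones, y the m-by-1 target vector, and theta the parameter vector):

    % computeCost: mean squared error cost J(theta), vectorized
    function J = computeCost(X, y, theta)
      m = length(y);                          % number of training examples
      h = X * theta;                          % hypothesis h_theta(x) for every example
      J = (1 / (2 * m)) * sum((h - y) .^ 2);  % squared errors, summed and divided by 2m
    end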

3. Gradient Descent: (the algorithm used to find the parameters $\theta$ that minimize the cost function $J$)
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)

$\alpha$ is the learning rate (the step size of descent)
$\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$ is the partial derivative of $J$ with respect to $\theta_j$
  • At each iteration, simultaneously update all parameters $\theta_0, \theta_1, \ldots, \theta_n$: compute every new value first, then assign, so no updated $\theta_j$ is used while computing the others.
  • As gradient descent approaches a local minimum, it automatically takes smaller steps (the gradient itself shrinks), so there is no need to decrease $\alpha$ over time.
  • Note that, while gradient descent can be susceptible to local minima in general, the cost function for linear regression is convex: it has a single global optimum and no other local optima. Thus gradient descent always converges to the global minimum (assuming the learning rate $\alpha$ is not too large).
  • This method looks at every example in the entire training set on every step, and is called batch gradient descent.
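
As a rough illustration, one batch gradient descent step for the two-parameter case in Octave (a sketch only; the names theta0, theta1, alpha and the m-by-1 vectors x, y are assumptions):

    m = length(y);                                         % number of training examples
    h = theta0 + theta1 * x;                               % h_theta(x) for every example
    temp0 = theta0 - alpha * (1 / m) * sum(h - y);         % uses the partial derivative w.r.t. theta_0
    temp1 = theta1 - alpha * (1 / m) * sum((h - y) .* x);  % uses the partial derivative w.r.t. theta_1
    theta0 = temp0;                                        % assign only after BOTH new values
    theta1 = temp1;                                        % are computed: the simultaneous update

Repeating this step until $J(\theta_0, \theta_1)$ stops decreasing is the whole algorithm.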

  • Multivariate Linear Regression



    1. Hypothesis Function:
    \begin{align}
    h_\theta(x) 
    &= \theta_0 x_0 + \theta_1 x_1 + ... +\theta_n x_n \\
    &= \theta^T x
    \end{align}
    

    This is the vectorized form of our hypothesis function, using the convention $x_0 = 1$.
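
    For example (a tiny sketch; the numbers are made up, and $x_0 = 1$ is prepended so the intercept $\theta_0$ fits into the inner product):

      theta = [0.5; 1.2; -3.0];   % example values for theta_0, theta_1, theta_2
      x     = [1; 2.0; 0.7];      % one training example, with x_0 = 1 prepended
      h     = theta' * x;         % h_theta(x) = theta^T x  (a scalar)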

    2. Gradient Descent for Multiple Variables:

    repeat until convergence: {
    \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \Bigl( h_\theta(x^{(i)}) - y^{(i)} \Bigr) \cdot x^{(i)}_j \qquad \text{for } j := 0 \dots n
    }
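
    A vectorized Octave sketch of this loop (the function name gradientDescentMulti is an assumption; it reuses the computeCost helper sketched earlier, and the returned J_history is handy for the convergence plot discussed below):

      % gradientDescentMulti: num_iters steps of batch gradient descent
      function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
        m = length(y);
        J_history = zeros(num_iters, 1);
        for iter = 1:num_iters
          grad  = (1 / m) * (X' * (X * theta - y));    % all partial derivatives at once
          theta = theta - alpha * grad;                % simultaneous update of theta_0 .. theta_n
          J_history(iter) = computeCost(X, y, theta);  % record J(theta) after this step
        end
      end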

    3. Gradient Descent in Practice:
  • Feature Scaling and Mean Normalization:
    x_i := \frac{x_i - \mu_i}{s_i}

    Where
    $\mu_i$ is the average of all the values of feature $i$, and
    $s_i$ is the range of values (max − min), or the standard deviation.

    We can speed up gradient descent by keeping each of our input features in roughly the same range (see the sketch after this list).
    Ideally, $-1 \leq x^{(i)} \leq 1$, or $-0.5 \leq x^{(i)} \leq 0.5$.
  • Debugging gradient descent

  • Plot the cost function $J(\theta)$ against the number of iterations of gradient descent. If $J(\theta)$ ever increases, you probably need to decrease $\alpha$.
    It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration.
    To choose $\alpha$, try
    ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
    
  • Features and Polynomial Regression:
    (Methods to improve the hypothesis function)
  • Combine multiple features into one.
  • Change the behavior or curve by making it a quadratic, cubic, or square root function.
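
    A minimal Octave sketch of mean normalization with scaling by the standard deviation, as referenced in the feature scaling bullet above (the helper name featureNormalize and its return values are assumptions; X here holds only the raw feature columns, without the column of ones):

      function [X_norm, mu, sigma] = featureNormalize(X)
        mu     = mean(X);             % 1 x n row of per-feature averages
        sigma  = std(X);              % 1 x n row of per-feature standard deviations
        X_norm = (X - mu) ./ sigma;   % broadcast: (x_i - mu_i) / s_i for every column
      end

    The returned mu and sigma would also be applied to any new example before making a prediction.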


  • Normal Equation



    An alternative way of minimizing the cost function $J$: solve for $\theta$ analytically in a single step.
    \theta = (X^T X)^{-1} X^T y
    

  • e.g. Octave code: pinv(X'*X)*X'*y
    
  • There is no need to do feature scaling with the normal equation

  • When to use the Normal Equation vs. Gradient Descent (where $n$ is the number of features $x$):
  • if $n$ is fairly small (e.g. $n$ < 1,000), the normal equation is usually the better choice
  • if $n$ is very large (typically $n$ > 10,000), gradient descent is better, because computing $(X^T X)^{-1}$ becomes too slow

  • The normal equation does not apply to more complex algorithms (e.g. classification with logistic regression); those still require an iterative method such as gradient descent.
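
    Putting the pieces together, a small Octave sketch of the normal equation (variable names are illustrative: x1 and x2 stand for two raw feature columns, y for the m-by-1 target vector):

      m = length(y);                   % number of training examples
      X = [ones(m, 1), x1, x2];        % design matrix with a leading column of ones (x_0 = 1)
      theta = pinv(X' * X) * X' * y;   % the normal equation: no feature scaling, no alpha, no iterations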