Machine Learning Study Notes (Week 1 & 2)


Supervised Learning



Linear Regression Model



1. Hypothesis Function: (the function that best fits the training set)
h_\theta(x) =  \theta_0 + \theta_1 x_1

$\theta_i$'s: Parameters

2. Cost Function: (measures the performance/accuracy of the hypothesis function)
J(\theta_0,\theta_1) = \frac{1}{2m} \sum_{i=1}^m \Bigl( h_\theta(x_i) - y_i \Bigr)^2

$m$ is the number of training examples (the size of the training set)
$(x_i, y_i)$ is called a training example
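
As a small illustration, the cost function in Octave (a sketch only; the helper name computeCost is an assumption, X is taken to be the m-by-(n+1) design matrix with a leading column of ones, y the m-by-1 target vector, and theta the parameter vector):

    % computeCost: mean squared error cost J(theta), vectorized
    function J = computeCost(X, y, theta)
      m = length(y);                          % number of training examples
      h = X * theta;                          % hypothesis h_theta(x) for every example
      J = (1 / (2 * m)) * sum((h - y) .^ 2);  % squared errors, summed and divided by 2m
    end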

3. Gradient Descent: (the algorithm used to find the parameters $\theta$ that minimize the cost function $J$)
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)

$\alpha$ is the learning rate (the step size of descent)
$\frac{\partial}{\partial\theta_j}J(\theta_0,\theta_1)$ is the partial derivative of $J$ with respect to $\theta_j$
  • At each iteration, simultaneously update all parameters $\theta_0, \theta_1, \ldots, \theta_n$: compute every new value first, then assign, so no updated $\theta_j$ is used while computing the others.
  • As gradient descent approaches a local minimum, it automatically takes smaller steps (the gradient itself shrinks), so there is no need to decrease $\alpha$ over time.
  • Note that, while gradient descent can be susceptible to local minima in general, the cost function for linear regression is convex: it has a single global optimum and no other local optima. Thus gradient descent always converges to the global minimum (assuming the learning rate $\alpha$ is not too large).
  • This method looks at every example in the entire training set on every step, and is called batch gradient descent.
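
As a rough illustration, one batch gradient descent step for the two-parameter case in Octave (a sketch only; the names theta0, theta1, alpha and the m-by-1 vectors x, y are assumptions):

    m = length(y);                                         % number of training examples
    h = theta0 + theta1 * x;                               % h_theta(x) for every example
    temp0 = theta0 - alpha * (1 / m) * sum(h - y);         % uses the partial derivative w.r.t. theta_0
    temp1 = theta1 - alpha * (1 / m) * sum((h - y) .* x);  % uses the partial derivative w.r.t. theta_1
    theta0 = temp0;                                        % assign only after BOTH new values
    theta1 = temp1;                                        % are computed: the simultaneous update

Repeating this step until $J(\theta_0, \theta_1)$ stops decreasing is the whole algorithm.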

  • Multivariate Linear Regression



    1. Hypothesis Function:
    \begin{align}
    h_\theta(x) 
    &= \theta_0 x_0 + \theta_1 x_1 + ... +\theta_n x_n \\
    &= \theta^T x
    \end{align}
    

    This is the vectorized form of our hypothesis function, using the convention $x_0 = 1$.
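
    For example (a tiny sketch; the numbers are made up, and $x_0 = 1$ is prepended so the intercept $\theta_0$ fits into the inner product):

      theta = [0.5; 1.2; -3.0];   % example values for theta_0, theta_1, theta_2
      x     = [1; 2.0; 0.7];      % one training example, with x_0 = 1 prepended
      h     = theta' * x;         % h_theta(x) = theta^T x  (a scalar)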

    2. Gradient Descent for Multiple Variables:

    repeat until convergence: {
    \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m \Bigl( h_\theta(x^{(i)}) - y^{(i)} \Bigr) \cdot x^{(i)}_j \qquad \text{for } j := 0 \dots n
    }
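
    A vectorized Octave sketch of this loop (the function name gradientDescentMulti is an assumption; it reuses the computeCost helper sketched earlier, and the returned J_history is handy for the convergence plot discussed below):

      % gradientDescentMulti: num_iters steps of batch gradient descent
      function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
        m = length(y);
        J_history = zeros(num_iters, 1);
        for iter = 1:num_iters
          grad  = (1 / m) * (X' * (X * theta - y));    % all partial derivatives at once
          theta = theta - alpha * grad;                % simultaneous update of theta_0 .. theta_n
          J_history(iter) = computeCost(X, y, theta);  % record J(theta) after this step
        end
      end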

    3. Gradient Descent in Practice:
  • Feature Scaling and Mean Normalization:
    x_i := \frac{x_i - \mu_i}{s_i}

    Where
    $\mu_i$ is the average of all the values of feature $i$, and
    $s_i$ is the range of values (max − min), or the standard deviation.

    We can speed up gradient descent by keeping each of our input features in roughly the same range (see the sketch after this list).
    Ideally, $-1 \leq x^{(i)} \leq 1$, or $-0.5 \leq x^{(i)} \leq 0.5$.
  • Debugging gradient descent

  • Plot the cost function $J(\theta)$ against the number of iterations of gradient descent. If $J(\theta)$ ever increases, you probably need to decrease $\alpha$.
    It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration.
    To choose $\alpha$, try
    ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
    
  • Features and Polynomial Regression:
    (Methods to improve the hypothesis function)
  • Combine multiple features into one.
  • Change the behavior or curve by making it a quadratic, cubic, or square root function.
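
    A minimal Octave sketch of mean normalization with scaling by the standard deviation, as referenced in the feature scaling bullet above (the helper name featureNormalize and its return values are assumptions; X here holds only the raw feature columns, without the column of ones):

      function [X_norm, mu, sigma] = featureNormalize(X)
        mu     = mean(X);             % 1 x n row of per-feature averages
        sigma  = std(X);              % 1 x n row of per-feature standard deviations
        X_norm = (X - mu) ./ sigma;   % broadcast: (x_i - mu_i) / s_i for every column
      end

    The returned mu and sigma would also be applied to any new example before making a prediction.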


  • Normal Equation



    An alternative way of minimizing the cost function $J$: solve for $\theta$ analytically in a single step.
    \theta = (X^T X)^{-1} X^T y
    

  • e.g. Octave code: pinv(X'*X)*X'*y
    
  • There is no need to do feature scaling with the normal equation

  • When to use the Normal Equation vs. Gradient Descent (where $n$ is the number of features $x$):
  • if $n$ is fairly small (e.g. $n$ < 1,000), the normal equation is usually the better choice
  • if $n$ is very large (typically $n$ > 10,000), gradient descent is better, because computing $(X^T X)^{-1}$ becomes too slow

  • The normal equation does not apply to more complex algorithms (e.g. classification with logistic regression); those still require an iterative method such as gradient descent.
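
    Putting the pieces together, a small Octave sketch of the normal equation (variable names are illustrative: x1 and x2 stand for two raw feature columns, y for the m-by-1 target vector):

      m = length(y);                   % number of training examples
      X = [ones(m, 1), x1, x2];        % design matrix with a leading column of ones (x_0 = 1)
      theta = pinv(X' * X) * X' * y;   % the normal equation: no feature scaling, no alpha, no iterations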