Introduction to NLP (Wk.6)

Ch.8 Deep Learning

8-1) Perceptron

Introduction to Perceptron


The graph above shows a step function.
We send (input * weight) values to the artificial neuron, and if their sum exceeds the threshold, the artificial neuron at the end returns 1; otherwise it returns 0.

if \sum_{i}^{n} w_{i}x_{i} \geq \theta → y = 1
if \sum_{i}^{n} w_{i}x_{i} < \theta → y = 0
\theta = \text{threshold}

We can move the threshold to the left-hand side and express it as a bias b.
The bias is then treated as another input of the perceptron, as below.

if \sum_{i}^{n} w_{i}x_{i} + b \geq 0 → y = 1
if \sum_{i}^{n} w_{i}x_{i} + b < 0 → y = 0

b is also a parameter whose optimal value deep learning has to find.

Activation Function: the function that determines the output value of a neuron. The following are all activation functions:

  • step function
  • sigmoid function
  • softmax function

The difference between an artificial neuron that performs logistic regression and the perceptron above is the activation function.

Artificial Neuron: a general activation function f

f(\sum_{i}^{n} w_{i}x_{i} + b)

Perceptron (one kind of artificial neuron): f is the step function

f(\sum_{i}^{n} w_{i}x_{i} + b)

Single-Layer Perceptron


A single-layer perceptron has only two stages, input and output. Each stage is called a 'layer'.


Single-layer perceptron for the AND gate. (* other values of w1, w2, and b also work)

def AND_gate(x1, x2):
    # weights and bias chosen so that only (1, 1) yields a positive weighted sum
    w1 = 0.5
    w2 = 0.5
    b = -0.7
    result = x1*w1 + x2*w2 + b
    # step function: output 1 only when the weighted sum plus bias is positive
    if result <= 0:
        return 0
    else:
        return 1


Single-layer perceptron for the NAND gate.

def NAND_gate(x1, x2):
    w1 = -0.5
    w2 = -0.5
    b = 0.7
    result = x1*w1 + x2*w2 + b
    if result <= 0:
        return 0
    else:
        return 1


Single-layer perceptron for the OR gate.

def OR_gate(x1, x2):
    w1 = 0.6
    w2 = 0.6
    b = -0.5
    result = x1*w1 + x2*w2 + b
    if result <= 0:
        return 0
    else:
        return 1


However, a single-layer perceptron can only classify data that is linearly separable.
Therefore, it is impossible to implement the XOR gate with a single-layer perceptron.
Refer to the graphs below.


MultiLayer Perceptron (MLP)

We now add more layers between the input layer and the output layer.
These are called 'hidden layers'.

Below is how to implement the XOR gate with an MLP.
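
A minimal sketch of one standard construction (my own, reusing the gate functions defined above): a hidden layer made of a NAND and an OR perceptron feeds an AND output perceptron.

def XOR_gate(x1, x2):
    # hidden layer: two perceptrons (NAND and OR) applied to the same inputs
    s1 = NAND_gate(x1, x2)
    s2 = OR_gate(x1, x2)
    # output layer: an AND perceptron on the hidden layer's outputs
    return AND_gate(s1, s2)

# XOR_gate(0, 0) -> 0, XOR_gate(0, 1) -> 1, XOR_gate(1, 0) -> 1, XOR_gate(1, 1) -> 0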

Below is an MLP with two hidden layers.

A neural network with two or more hidden layers is called a 'Deep Neural Network (DNN)'. This is not limited to multilayer perceptrons; any neural network with two or more hidden layers counts as a DNN.

In machine learning, we do not set the weights and biases manually.
We automate the process so that the machine finds the optimal values by itself; this is called 'training'.
We use a loss function and an optimizer for this.
If the neural network being trained is a DNN, we call the process 'deep learning'.

8-2) Artificial Neural Network

Feed-Forward Neural Network (FFNN)


Above is a feed-forward neural network.


Above is a recurrent neural network (RNN).
The hidden layer's output can be sent to the output layer, but it can also be fed back as the hidden layer's input.

Fully-Connected Layer (FC)

It is also called a Dense layer.
An FC layer is a layer in which every neuron is connected to every neuron of the previous layer.

A feed-forward neural network that consists only of FC layers is called a 'fully-connected FFNN'.

Activation Function

Nonlinear Function

Adding multiple hidden layers with linear activation functions is meaningless; it has the same effect as adding one.
Therefore, we usually use nonlinear activation functions in hidden layers.
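
A quick numeric check (a sketch with arbitrary small matrices of my choosing): two stacked linear layers collapse into a single linear layer whose weight is the product of the two.

import numpy as np

np.random.seed(0)
x = np.random.randn(4)        # input vector
W1 = np.random.randn(3, 4)    # first linear layer
W2 = np.random.randn(2, 3)    # second linear layer

two_layers = W2 @ (W1 @ x)    # two stacked linear layers, no nonlinearity in between
one_layer = (W2 @ W1) @ x     # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True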

Step Function

It is not used frequently nowadays.

Sigmoid Function


An artificial neural network performs forward propagation on the given input, computes the gradient of the loss value via differentiation, and then performs back propagation.

Vanishing Gradient


When differentiating the sigmoid function in the orange (saturated) regions above, a very small gradient gets multiplied in.
The gradient then cannot be propagated back to the layers near the input.
This means the weights w there are barely updated, so learning does not proceed.

Therefore, using the sigmoid function in hidden layers should be avoided.
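
A small numeric illustration (my own sketch, assuming a chain of 10 sigmoid layers): the derivative of the sigmoid is at most 0.25, so multiplying many such factors during back propagation drives the gradient toward zero.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid_grad(0.0))   # 0.25, the maximum possible value of the derivative
print(0.25 ** 10)          # ~9.5e-07: the factor after passing through 10 sigmoid layers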

Hyperbolic Tangent Function

ReLU Function

Leaky ReLU Function

Softmax Function
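
Since the post lists these functions without formulas, here is a minimal NumPy sketch of the usual definitions for the four functions above (the 0.01 slope for Leaky ReLU is just a common default, my assumption).

import numpy as np

def tanh(x):
    return np.tanh(x)                      # squashes to (-1, 1), zero-centered

def relu(x):
    return np.maximum(0, x)                # 0 for negative inputs, identity otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small slope instead of 0 for negative inputs

def softmax(x):
    e = np.exp(x - np.max(x))              # subtract the max for numerical stability
    return e / e.sum()                     # outputs sum to 1 (a probability distribution)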

8-3) Matrix Multiplication

A layer's computation, and hence its number of parameters (w + b), can be expressed with matrix multiplication (also known as the matrix product), as below.
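
A sketch with assumed sizes (4 inputs, 8 outputs; the numbers are mine): the layer's output is a matrix-vector product, and counting the entries of W and b gives the number of parameters.

import numpy as np

input_dim, output_dim = 4, 8
x = np.random.randn(input_dim)               # input vector
W = np.random.randn(output_dim, input_dim)   # weight matrix
b = np.random.randn(output_dim)              # bias vector

y = W @ x + b                                # one dense layer as a matrix product
num_params = W.size + b.size                 # 4*8 + 8 = 40 parameters
print(y.shape, num_params)                   # (8,) 40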

8-4) Learning Method

Loss Function


We usually use MSE for regression and Cross-Entropy for classification.
The purpose of deep learning is to find the values of w and b that minimize the value of the loss function, so the choice of loss function is very important.

Mean Squared Error (MSE)

The average of the squared errors.
We use it when predicting continuous values.
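
Written out, the standard formula, where y_i is the true value, \hat{y}_i the prediction, and n the number of samples:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2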

Cross-Entropy

If the model assigns a low probability to the correct class, or a high probability to a wrong class, the loss value gets bigger.
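
The standard formula for one sample, where y_i is the true (one-hot) label and \hat{y}_i the predicted probability:

CrossEntropy = -\sum_{i} y_i \log \hat{y}_i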

Optimizer

Batch: the amount of data used for one parameter update. (It can be the entire dataset or a fixed smaller amount.)

Epoch: the number of times the entire training data is passed through the network.

Batch Gradient Descent

The parameters are updated once per epoch, using the entire dataset.
It takes a lot of memory, but it can find the global minimum.

Stochastic Gradient Descent (SGD)


It optimizes over one randomly chosen sample at a time instead of over all the data.
Each update uses less data, so it takes less time, but it is less accurate (noisier).

Mini-Batch Gradient Descent

It optimizes over a pre-determined number of samples instead of over all the data or just one sample.
It is faster than BGD and more accurate (stable) than SGD.
It is the most widely used gradient descent method.
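
A minimal training-loop sketch (the data, learning rate, and batch size are all toy values of my choosing): setting batch_size to the full dataset size gives BGD, setting it to 1 gives SGD, and anything in between is mini-batch gradient descent.

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 3)                 # toy inputs
y = X @ np.array([1.0, -2.0, 0.5])          # toy linear targets
w = np.zeros(3)
lr, batch_size = 0.1, 16                    # batch_size = len(X): BGD, = 1: SGD, otherwise: mini-batch

for epoch in range(20):
    idx = np.random.permutation(len(X))     # shuffle the data once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        grad = 2 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)  # MSE gradient on the batch
        w -= lr * grad                      # one parameter update = one iteration
print(w)                                    # approaches [1.0, -2.0, 0.5]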

Momentum


It gives the updates inertia, which helps the optimizer avoid getting stuck in a local minimum and mistaking it for the global minimum.

Adagrad

Parameters that have changed a lot get a lower learning rate.
Parameters that have changed little get a higher learning rate.

RMSprop

With Adagrad, the learning rate may eventually shrink too much.
This can be improved by replacing the accumulated squared gradients with another function (an exponentially decaying average).

Adam

RMSprop + Momentum.
It adapts not only the update direction (from momentum) but also the learning rate (from RMSprop).
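
A sketch of the standard Adam update rule for a single parameter vector (the hyperparameter values are the common defaults, not from the post): m is the momentum part (direction) and v is the RMSprop part (per-parameter learning rate).

import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # momentum: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # RMSprop: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive step size per parameter
    return w, m, v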

8-5) BackPropagation

refer to 3-4. (In this velog, refer to Wk.4)

8-6) Epochs, Batch Size, and Iteration

Epoch

One epoch is completed when forward and backward propagation have been performed on the entire training data once.
If the number of epochs is too large, overfitting can occur.
If it is too small, underfitting can occur.

Batch Size

The number of data samples used for one parameter update.

Iteration

The number of batches, i.e., the number of parameter updates per epoch.
SGD's batch size is 1, so it picks one sample for gradient descent at every iteration.
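
For example (the numbers here are mine, just for illustration): with 2,000 training samples and a batch size of 200,

\text{iterations per epoch} = \frac{2000}{200} = 10

so one epoch consists of 10 parameter updates, and training for 5 epochs performs 50 updates in total.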

8-7) How to Prevent Overfitting

Introduction to Overfitting

Overfitting means that the model has also learned the noise in the training data.
Therefore, it may perform well on the training data but poorly on test or new data.

Increasing Data

If we provide more data, the model can learn general patterns instead of noise or patterns specific to the given data.

If the dataset is too small, we can slightly modify the existing data and add the modified copies. This is called 'Data Augmentation'.
It is widely and actively used in the image processing field.

Simplifying Model

The complexity of a model is determined by its number of hidden layers and parameters.
We can prevent overfitting by reducing them.

The number of parameters of a model is called the model's 'capacity'.

Applying Weight Regularization

L1 regularization: add the sum of the absolute values of the weights w to the cost function. This penalty term is known as the 'L1 norm'.
L2 regularization: add the sum of the squared values of the weights w to the cost function. This penalty term is known as the 'L2 norm'.
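
Written as formulas (standard form; \lambda is the regularization strength):

L1: \text{Cost} = \text{Loss} + \lambda \sum_{i} |w_i|
L2: \text{Cost} = \text{Loss} + \lambda \sum_{i} w_i^2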

Dropout

At each training step, a certain fraction of neurons is randomly dropped (not used).
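
A minimal NumPy sketch of (inverted) dropout during training, assuming a drop probability of 0.5 (my choice of value): the surviving activations are scaled by 1/(1-p) so no rescaling is needed at test time.

import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                          # use all neurons at test time
    mask = np.random.rand(*activations.shape) > p   # keep each neuron with probability 1-p
    return activations * mask / (1 - p)             # scale kept neurons (inverted dropout)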

8-8) Gradient Vanishing and Exploding

Introduction to Gradient Vanishing and Exploding

If the gradient gradually gets smaller during back propagation, the weights of the layers close to the input layer may not be updated well, so the optimal model cannot be found. This is called 'gradient vanishing'.

The opposite case also exists: the gradient can gradually grow, making the weights excessively large. This is called 'gradient exploding', and it can happen in RNNs.

Using ReLU and its Variations

Use ReLU or one of its variants, such as Leaky ReLU, instead of the sigmoid or hyperbolic tangent function in hidden layers.

Gradient Clipping

We can clip the gradient so that it does not exceed a threshold, which prevents gradient exploding.
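
A minimal sketch of clipping by norm (the threshold value is arbitrary): if the gradient's norm exceeds the threshold, the gradient is rescaled to have exactly that norm.

import numpy as np

def clip_by_norm(grad, threshold=1.0):
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * threshold / norm   # rescale so the norm equals the threshold
    return grad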

Weight Initialization

A model's training result can differ depending on the initial values of the weights.
Therefore, proper weight initialization can mitigate gradient vanishing and exploding.

Xavier Initialization

It is also known as 'Glorot Initialization'.
http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
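
The commonly used uniform form, where n_{in} and n_{out} are the numbers of input and output units of the layer:

W \sim U\left(-\sqrt{\frac{6}{n_{in} + n_{out}}},\ +\sqrt{\frac{6}{n_{in} + n_{out}}}\right)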

He Initialization

https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf
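
The commonly used normal form, which depends only on the number of input units n_{in} (well suited to ReLU):

W \sim N(0, \sigma^2), \quad \sigma = \sqrt{\frac{2}{n_{in}}}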

Batch Normalization

It normalizes each layer's inputs using their mean and variance.
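
The standard per-mini-batch transformation, where \mu_B and \sigma_B^2 are the batch mean and variance, and \gamma, \beta are learnable scale and shift parameters:

\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta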

Internal Covariate Shift

Batch Normalization

Limitations

  • It is too dependent on the size of the mini-batch.
  • It is difficult to apply to RNNs.

Layer Normalization

  • Batch Normalization

  • Layer Normalization
