Machine Learning Study Notes (Week 9)


Anomaly Detection



Algorithm (based on Gaussian Distribution)



1. Choose features $x_i$ that you think might be indicative of anomalous examples

2. Fit parameters $\mu_1,...,\mu_n,\sigma_1^2,...,\sigma_n^2$:
$$\mu_j = \frac{1}{m}\sum_{i=1}^mx_j^{(i)}\qquad
\sigma_j^2 = \frac{1}{m}\sum_{i=1}^m(x_j^{(i)}-\mu_j)^2$$

3. Given a new example $x$, compute $p(x)$:
$$p(x) = \prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2) = \prod_{j=1}^n \frac{1}{\sqrt{2\pi}\sigma_j}\exp\Bigl(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\Bigr)$$

Anomaly if $p(x)<\varepsilon$
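
Below is a minimal NumPy sketch of these three steps, assuming `X` is an $m\times n$ matrix with one training example per row (the function names and layout are my own):

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature Gaussian parameters from the training set X (m x n)."""
    mu = X.mean(axis=0)       # mu_j = (1/m) * sum_i x_j^(i)
    sigma2 = X.var(axis=0)    # sigma_j^2 = (1/m) * sum_i (x_j^(i) - mu_j)^2
    return mu, sigma2

def gaussian_p(X, mu, sigma2):
    """p(x) = prod_j N(x_j; mu_j, sigma_j^2), evaluated for every row of X."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    densities = coef * np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return densities.prod(axis=1)

# Usage: flag x as anomalous when p(x) < epsilon
# mu, sigma2 = fit_gaussian(X_train)             # X_train: normal examples only
# is_anomaly = gaussian_p(X_new, mu, sigma2) < epsilon
```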

Development and Evaluation



Assume we have some labeled data of anomalous and non-anomalous examples ($y=0$ if normal, $y=1$ if anomalous).

Training set (60%): $x^{(1)},x^{(2)},...,x^{(m)}$ (assumed to be normal, i.e. not anomalous)

Cross validation set (20%): $(x_{cv}^{(1)},y_{cv}^{(1)}),...,(x_{cv}^{(m_{cv})},y_{cv}^{(m_{cv})})$

Test set (20%): $(x_{test}^{(1)},y_{test}^{(1)}),...,(x_{test}^{(m_{test})},y_{test}^{(m_{test})})$

Aircraft engines example

10000 good (normal) engines, 20 flawed engines (anomalous)

Training set: 6000 good engines
CV set: 2000 good engines, 10 anomalous
Test set: 2000 good engines, 10 anomalous

Evaluation



Fit model $p(x)$ on the training set
On a cross validation set, predict:
- $y=1$ if $p(x)<\varepsilon$ (anomaly)
- $y=0$ if $p(x)\geq\varepsilon$ (normal)
Possible evaluation metrics:
- Precision/Recall
- F1 Score
Can also use cross validation set to choose parameter ε
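
A sketch of that last step: scan candidate thresholds and keep the one with the best F1 score on the cross validation set (`p_cv`, holding $p(x)$ for each CV example, and the labels `y_cv` are assumed inputs):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Choose epsilon by maximizing F1 on the cross validation set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)        # predict y=1 when p(x) < epsilon
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue                           # precision/recall undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```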

Anomaly Detection vs. Supervised Learning




| Anomaly Detection | Supervised Learning |
| --- | --- |
| Very small number of positive examples ($y=1$); 0-20 is common. Large number of negative examples ($y=0$). | Large number of positive and negative examples. |
| Many different "types" of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far. | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to ones in the training set. |


Non-Gaussian Features



Transform the data: $\log(x),\,\log(x+c),\,x^{\frac{1}{2}},\,x^{\frac{1}{3}},\,\ldots$
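
For instance, a quick way to try these transforms on a skewed feature (the sample data and constant `c` here are made up for illustration):

```python
import numpy as np

x = np.random.exponential(scale=2.0, size=1000)  # a right-skewed feature
c = 1.0                                          # constant chosen by inspecting histograms

# Candidate transforms; keep whichever makes the histogram look most Gaussian.
x_log  = np.log(x + c)       # log(x + c)
x_sqrt = x ** 0.5            # x^(1/2)
x_cbrt = x ** (1.0 / 3.0)    # x^(1/3)
```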



Multivariate Gaussian Distribution (Optional)



Purpose: capture correlations between different features, so that an unusual *combination* of feature values can be flagged even when each feature on its own falls within its normal range.

Anomaly detection with the multivariate Gaussian



1. Fit model $p(x)$ by setting
$$\mu = \frac{1}{m}\sum_{i=1}^mx^{(i)}\qquad
\Sigma = \frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)(x^{(i)}-\mu)^T$$
$$(\mu \in R^n,\ \Sigma\in R^{n\times n})$$

2. Given a new example $x$, compute
$$p(x) = \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\exp\Bigl(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\Bigr)$$
$$(|\Sigma| = \text{the determinant of }\Sigma)$$

Flag an anomaly if $p(x) <\varepsilon$
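
A minimal NumPy sketch of both steps, assuming `X` is the $m\times n$ training matrix (function names are my own):

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Estimate mu (n,) and the full covariance matrix Sigma (n, n)."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / m      # (1/m) * sum_i (x - mu)(x - mu)^T
    return mu, Sigma

def multivariate_p(X, mu, Sigma):
    """Multivariate Gaussian density p(x), evaluated for every row of X."""
    n = mu.size
    diff = X - mu
    Sigma_inv = np.linalg.inv(Sigma)
    norm = 1.0 / ((2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(Sigma)))
    # (x - mu)^T Sigma^{-1} (x - mu), computed for every row at once
    quad = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)
    return norm * np.exp(-0.5 * quad)

# is_anomaly = multivariate_p(X_new, *fit_multivariate_gaussian(X_train)) < epsilon
```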

Relationship with original model


  • $\Sigma$ measures the correlations between features. If $\Sigma$ is all 0s except the diagonal, the model is identical to the original per-feature Gaussian model.
  • Changing an off-diagonal 0 to a positive number models a positive correlation between the corresponding features.
  • Similarly, changing it to a negative number models a negative correlation.

  • Original Model vs. Multivariate Gaussian

| Original Model | Multivariate Gaussian |
| --- | --- |
| Manually create extra features to capture unusual combinations of values. | Automatically captures correlations between features. |
| Computationally cheaper; scales to large $n$. | Computationally more expensive ($\Sigma$ is $n\times n$ and must be inverted). |
| OK even if the training set size $m$ is small. | Must have $m>n$ (e.g. $m\geq 10n$), or else $\Sigma$ is non-invertible. |

Recommender Systems





- $r(i,j)=1$ if user $j$ has rated movie $i$ (0 otherwise)
- $y^{(i,j)}=$ rating given by user $j$ to movie $i$ (defined only if $r(i,j)=1$)
- $\theta^{(j)}=$ parameter vector for user $j$
- $x^{(i)}=$ feature vector for movie $i$
- For user $j$ and movie $i$, the predicted rating is $(\theta^{(j)})^T(x^{(i)})$
- $n_u=$ no. of users
- $n_m=$ no. of movies
- $m^{(j)}=$ no. of movies rated by user $j$

Content-based Recommendations



We know what the content is: each movie comes with a feature vector describing it. We are trying to learn each user's preferences (the parameter vectors $\theta^{(j)}$).

Optimization objective:

$$\min_{\theta^{(1)},...,\theta^{(n_u)}} \frac{1}{2}\sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} \Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2$$
$$(\theta^{(j)} \in R^{n+1})$$

Gradient descent update:

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\sum_{i:r(i,j)=1}\Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)x_k^{(i)}\quad(\text{for }k=0)$$
$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\Bigl(\sum_{i:r(i,j)=1}\Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)x_k^{(i)}+\lambda\theta_k^{(j)}\Bigr)\quad(\text{for }k\neq0)$$

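A vectorized sketch of one such update for the content-based setting, where the movie features `X` are known and fixed (matrix shapes and names are my own):

```python
import numpy as np

def content_based_gradient_step(Theta, X, Y, R, alpha, lam):
    """One gradient descent step on the user parameters Theta.

    X: (n_m, n+1) movie features, column 0 being the intercept x_0 = 1.
    Theta: (n_u, n+1) user parameters. Y: (n_m, n_u) ratings.
    R: (n_m, n_u) indicator matrix with r(i,j) in {0, 1}.
    """
    err = (X @ Theta.T - Y) * R       # prediction errors, zeroed where r(i,j)=0
    grad = err.T @ X + lam * Theta    # regularize every component...
    grad[:, 0] -= lam * Theta[:, 0]   # ...then undo it for theta_0 (k = 0)
    return Theta - alpha * grad
```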

Collaborative Filtering



We know neither the movie content (features) nor the users' preferences in advance. Given only the ratings, the algorithm learns both at the same time:

- $x\in R^n$
- $\theta\in R^n$
- Don't need $x_0$ and $\theta_0$

Collaborative filtering algorithm

1. Initialize $x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}$ to small random values.
2. Minimize the joint cost $J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm), as sketched below:
$$J = \frac{1}{2}\sum_{(i,j):r(i,j)=1}\Bigl((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\Bigr)^2+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n(x_k^{(i)})^2+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n(\theta_k^{(j)})^2$$
3. For a user with parameters $\theta$ and a movie with learned features $x$, predict a rating of $\theta^Tx$.

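A vectorized NumPy sketch of that cost function and its gradients (shapes and names are my own; there are no intercept terms, so every component is regularized):

```python
import numpy as np

def cofi_cost_and_grads(X, Theta, Y, R, lam):
    """Collaborative filtering cost J(X, Theta) and its gradients.

    X: (n_m, n) movie features. Theta: (n_u, n) user parameters.
    Y: (n_m, n_u) ratings. R: (n_m, n_u) indicator r(i,j).
    """
    err = (X @ Theta.T - Y) * R        # only rated entries contribute
    J = 0.5 * np.sum(err ** 2) \
        + 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = err @ Theta + lam * X         # dJ/dX
    Theta_grad = err.T @ X + lam * Theta   # dJ/dTheta
    return J, X_grad, Theta_grad
```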

Vectorization: Low rank matrix factorization

Stack the movie feature vectors as the rows of a matrix $X$ and the user parameter vectors as the rows of a matrix $\Theta$. The full matrix of predicted ratings is then the single product $X\Theta^T$, whose $(i,j)$ entry is $(\theta^{(j)})^Tx^{(i)}$; since $n$ is small, this matrix has low rank, hence the name.

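In code this prediction step is one line (the sizes below are a made-up toy example):

```python
import numpy as np

X = np.random.randn(5, 3)      # 5 movies, n = 3 learned features per movie
Theta = np.random.randn(4, 3)  # 4 users, n = 3 parameters per user

# Entry (i, j) is the predicted rating (theta^(j))^T x^(i); rank <= 3.
predictions = X @ Theta.T      # shape (5, 4)
```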

Find related products (movies)

For each product $i$, we learn a feature vector $x^{(i)}\in R^n$.

Find the products (movies) $j$ with the smallest $||x^{(i)}-x^{(j)}||$.
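
For example (a small sketch; the helper name is my own):

```python
import numpy as np

def most_related(i, X, k=5):
    """Indices of the k movies whose learned features are closest to movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)  # ||x^(i) - x^(j)|| for every j
    order = np.argsort(dists)
    return order[order != i][:k]              # drop movie i itself
```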

Mean Normalization



For a user who has not rated any movies, minimizing the regularized cost drives $\theta$ to all 0s, so every predicted rating $(\theta^{(j)})^Tx^{(i)}$ would be 0. Mean normalization fixes this: subtract each movie's average rating $\mu_i$ from $Y$ before learning, then predict $(\theta^{(j)})^Tx^{(i)}+\mu_i$, so a brand-new user defaults to each movie's average rating.


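A sketch of the normalization step (means are taken over rated entries only; the function name is my own):

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    counts = np.maximum(R.sum(axis=1), 1)     # avoid division by zero
    mu = (Y * R).sum(axis=1) / counts         # per-movie average rating
    Y_norm = (Y - mu[:, None]) * R            # normalize only rated entries
    return Y_norm, mu

# Learn X, Theta on Y_norm, then predict X @ Theta.T + mu[:, None].
```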