PCA (Principal Component Analysis) (feat. sklearn)

Prerequisite

Singular value decomposition

  • Singular value decomposition (SVD) of $X\in\mathbb{R}^{n\times p}$
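    • Written out (the shapes below follow the full-SVD convention, which is also what np.linalg.svd uses by default):
      $$X = UDV^{\top},\qquad U\in\mathbb{R}^{n\times n},\ D\in\mathbb{R}^{n\times p},\ V\in\mathbb{R}^{p\times p}$$
      where $U$ and $V$ are orthogonal and $D$ carries the singular values $d_1\ge d_2\ge\dots\ge 0$ on its diagonal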

sklearn

  • breast cancer data
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
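  • The snippets below use X_scaled, which is not defined here; a minimal sketch, assuming the features are standardized (zero mean, unit variance) before PCA, as is conventional:
import numpy as np
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
X_scaled = StandardScaler().fit_transform(X)  # columns now have zero mean and unit variance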

Eigenvectors

  • SVD and Eigendecomposition
    • Note that the columns of $V$ are the eigenvectors of $X^{\top}X$ (with eigenvalues $d_i^2$, the squared singular values), since
      $$\begin{aligned} X^{\top}X &= (UDV^{\top})^{\top}(UDV^{\top}) \\ &= VD^{\top}U^{\top}UDV^{\top} \\ &= VD^{\top}DV^{\top} \quad (\because U \text{ is orthogonal}) \end{aligned}$$
  • In sklearn,
from sklearn.decomposition import PCA
n_comp=3
pca=PCA(n_components=n_comp)
pca.fit(X_scaled)
  • Then, pca.components_ gives the first n_comp rows of $V^{\top}$, i.e. the leading eigenvectors of $X^{\top}X$
  • Using svd in numpy,
import numpy as np
U, S, VT = np.linalg.svd(X_scaled)
  • Check that the two results below give the same values up to sign! (each singular vector is only determined up to a sign flip, so sklearn and numpy may disagree on some rows)
pca.components_ # sklearn
VT[:n_comp] # svd
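  • A sign-aware way to check this programmatically (a sketch: each numpy row is flipped to match sklearn's sign choice before comparing):
signs = np.sign(np.sum(VT[:n_comp] * pca.components_, axis=1, keepdims=True))
print(np.allclose(VT[:n_comp] * signs, pca.components_))  # True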

Principal components

  • The columns of $UD$ are called the principal components of $X$
  • The principal components of a collection of points in a real coordinate space are a sequence of $p$ unit vectors, where the $i$-th vector is the direction of a line that best fits the data while being orthogonal to the first $i-1$ vectors
  • In sklearn,
pca_fit_transform = pca.fit_transform(X_scaled)
  • Note that
    $$\begin{aligned} XV &= UDV^{\top}V \\ &= UD \quad (\because V \text{ is orthogonal}) \end{aligned}$$
  • So, the SVD in numpy also gives the principal components!
(X_scaled).dot(pca.components_.T)
  • Check that the two results below give the same values!
pca_fit_transform # sklearn
(X_scaled).dot(pca.components_.T) # svd
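  • The same matrix can also be built directly from numpy's SVD output as $UD$; a sketch (columns are sign-aligned before comparing, for the same reason as above):
UD = U[:, :n_comp] * S[:n_comp]  # first n_comp columns of U, scaled by the singular values
signs = np.sign(np.sum(UD * pca_fit_transform, axis=0))
print(np.allclose(UD * signs, pca_fit_transform))  # True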

Projection of data onto the principal components

  • In sklearn,
pca_inverse_transform = pca.inverse_transform(pca_fit_transform)
  • Note that $XVV^{\top}$ gives the projection of $X$ onto the span of the columns of $V$
    • Let the $i$-th column vector of $V$ be $v_i\in\mathbb{R}^{p}$
    • $VV^{\top}$ is the sum of the projection matrices onto the eigenvectors $v_1,\dots,v_p$, i.e. $VV^{\top}=\sum_{i=1}^{p}v_iv_i^{\top}$ (here $V$ is truncated to its first n_comp columns, so this is a rank-n_comp projection rather than the identity)
  • Using svd in numpy,
pca_fit_transform.dot(pca.components_)

or

(X_scaled).dot(pca.components_.T).dot(pca.components_)
  • Check that the three results below give the same values!
pca_inverse_transform # sklearn
pca_fit_transform.dot(pca.components_) # svd
(X_scaled).dot(pca.components_.T).dot(pca.components_) # svd
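  • One detail behind this equality: pca.inverse_transform re-adds the per-feature means, so the exact identity is the one below; since X_scaled is standardized, pca.mean_ is numerically zero and all three expressions agree:
print(np.allclose(pca_inverse_transform,
                  pca_fit_transform.dot(pca.components_) + pca.mean_))  # True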

  • Wrap-up
    • Given that
    from sklearn.decomposition import PCA
    n_comp=3
    pca=PCA(n_components=n_comp)
    pca.fit(X_scaled)
| Item | PCA in sklearn | SVD in numpy |
| --- | --- | --- |
| Eigenvectors | pca.components_ | VT[:n_comp] from U, S, VT = np.linalg.svd(X_scaled) |
| Principal components | pca.fit_transform(X_scaled) | (X_scaled).dot(pca.components_.T) |
| Projection onto the principal components | pca.inverse_transform(pca_fit_transform) | (X_scaled).dot(pca.components_.T).dot(pca.components_) |
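  • A self-contained sketch that verifies every row of the table in one run (sign-aware; assumes the breast cancer setup from the top of the post):
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)
n_comp = 3
pca = PCA(n_components=n_comp)
pca_fit_transform = pca.fit_transform(X_scaled)
U, S, VT = np.linalg.svd(X_scaled)

# Eigenvectors agree up to sign
signs = np.sign(np.sum(VT[:n_comp] * pca.components_, axis=1, keepdims=True))
assert np.allclose(VT[:n_comp] * signs, pca.components_)
# Principal components: XV = UD
assert np.allclose(pca_fit_transform, X_scaled.dot(pca.components_.T))
# Projection onto the principal components (pca.mean_ is ~0 after standardization)
assert np.allclose(pca.inverse_transform(pca_fit_transform),
                   X_scaled.dot(pca.components_.T).dot(pca.components_))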

PCA projection recovery process

from sklearn.decomposition import PCA

n_comp = 330
pca = PCA(n_components=n_comp)
pca_fit_transform = pca.fit_transform(R.T)  # R: data matrix defined elsewhere, transposed so rows are samples
pca_inverse_transform = pca.inverse_transform(pca_fit_transform)
  • Additional eigenvalues
    $$\tilde{e}\sim N(\mu_{\mathsf{W}}, \Sigma_{\mathsf{W}})$$
    where $\hat{\mu}_{\mathsf{W}}=\frac{1}{S}\sum_{s=1}^{S}e_s$ and, as in the code below, $\hat{\Sigma}_{\mathsf{W}}$ is the sample covariance of the $e_s$
mu_hat_for_EV = list(map(lambda x: np.mean(x), COMPONENTS))  # mean of each row of COMPONENTS
Sigma_hat_for_EV = np.cov(COMPONENTS)  # sample covariance (rows as variables, columns as observations)

S_new = 500
W_prime = np.random.multivariate_normal(mu_hat_for_EV, Sigma_hat_for_EV, S_new)  # draw S_new samples from the fitted Gaussian
generated = np.matmul(pca_inverse_transform, W_prime.T)
