<Course> Machine Learning, Chapter 4: Principal Component Analysis

<Course> Machine Learning

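Table of Contents
Chapter 1: Linear Regression Models
Chapter 2: Nonlinear Regression Models
Chapter 3: Logistic Regression Models
Chapter 4: Principal Component Analysis
Chapter 5: Algorithm 1 (k-Nearest Neighbors (kNN))
Chapter 6: Algorithm 2 (k-means)
Chapter 7: Support Vector Machines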

Chapter 4: Principal Component Analysis

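What is principal component analysis?

It compresses the structure of multivariate data into a smaller number of indicators.

When reducing the number of variables, we want to keep the loss of information as small as possible.

Working with only a few variables makes analysis and visualization (in the 2- and 3-dimensional cases) feasible.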

If the coefficient vector changes, the value after the linear transformation changes as well.
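As a minimal sketch of this linear transformation, the snippet below projects a small made-up dataset onto two different unit-length coefficient vectors. The data values and the helper name `project` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 variables (illustrative values only).
X = np.array([[2.0, 1.0],
              [3.0, 4.0],
              [5.0, 4.0],
              [6.0, 7.0],
              [9.0, 9.0]])

# Center each variable so that the projection is taken around the mean.
X_centered = X - X.mean(axis=0)


def project(X_centered, a):
    """Return the 1-D values s = X_centered @ a for a coefficient vector a scaled to norm 1."""
    a = np.asarray(a, dtype=float)
    a = a / np.linalg.norm(a)
    return X_centered @ a


s_x = project(X_centered, [1.0, 0.0])     # axis of the first variable
s_diag = project(X_centered, [1.0, 1.0])  # diagonal direction

print(s_x)
print(s_diag)
```

The same five samples receive different one-dimensional values depending on the coefficient vector, which is why the choice of axis matters.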

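The amount of information is measured by the size of the variance.

Search for the projection axis that maximizes the variance of the variable after the linear transformation.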
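The search for a variance-maximizing axis can be illustrated numerically. The sketch below, which assumes synthetic 2-dimensional data generated with NumPy, sweeps unit vectors over a half circle, measures the variance of the projected values, and compares the best direction with the leading eigenvector of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2-D data (not the breast cancer data).
z = rng.normal(size=(500, 2))
X = z @ np.array([[2.0, 0.0],
                  [1.5, 0.5]])
X_centered = X - X.mean(axis=0)

# Sweep unit vectors a(theta) = (cos theta, sin theta) in 1-degree steps
# and record the variance of the projection onto each one.
thetas = np.linspace(0.0, np.pi, 181)
directions = np.column_stack([np.cos(thetas), np.sin(thetas)])
variances = (X_centered @ directions.T).var(axis=0, ddof=1)

best = directions[np.argmax(variances)]

# Reference: eigenvector of the sample covariance with the largest eigenvalue.
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
top = eigvecs[:, -1]

print('best direction from sweep:', best)
print('leading eigenvector      :', top)
print('max variance from sweep  :', variances.max())
print('largest eigenvalue       :', eigvals[-1])
```

Up to sign and the resolution of the sweep, the two directions should coincide, and the maximum projected variance should approach the largest eigenvalue.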

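The axis is obtained by solving a constrained optimization problem.

The norm of the coefficient vector is constrained to 1 (without this constraint there are infinitely many solutions, because scaling the vector up increases the variance without bound).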
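In commonly used notation, the problem and its solution can be sketched as follows. The symbols are assumptions introduced here: \bar{X} is the centered data matrix, \mathrm{Var}(\bar{X}) its covariance matrix, \mathbf{a}_j the coefficient vector, and \lambda a Lagrange multiplier.

```latex
\begin{aligned}
&\max_{\mathbf{a}_j}\ \mathbf{a}_j^{\top}\,\mathrm{Var}(\bar{X})\,\mathbf{a}_j
\quad \text{subject to} \quad \mathbf{a}_j^{\top}\mathbf{a}_j = 1, \\
&E(\mathbf{a}_j) = \mathbf{a}_j^{\top}\,\mathrm{Var}(\bar{X})\,\mathbf{a}_j
 - \lambda\left(\mathbf{a}_j^{\top}\mathbf{a}_j - 1\right), \\
&\frac{\partial E}{\partial \mathbf{a}_j}
 = 2\,\mathrm{Var}(\bar{X})\,\mathbf{a}_j - 2\lambda\,\mathbf{a}_j = 0
\ \Longrightarrow\
\mathrm{Var}(\bar{X})\,\mathbf{a}_j = \lambda\,\mathbf{a}_j .
\end{aligned}
```

The stationary points are therefore eigenvectors of the covariance matrix, and the variance along each one, \mathbf{a}_j^{\top}\mathrm{Var}(\bar{X})\mathbf{a}_j = \lambda, equals its eigenvalue. The first principal component is the eigenvector with the largest eigenvalue.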

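(Exercise 4) Dimensionality reduction of breast cancer screening data with scikit-learn

+ Setup
+ Build a logistic regression model using the breast cancer screening data
+ Compress the data into a 2-dimensional space using principal components
+ 569 records, 33 columns
+ Task
+ Check whether the data can still be classified well when it is compressed from 30 dimensions (the explanatory variables) to 2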

Mount Google Drive to get started

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
%matplotlib inline

Path setup

In what follows, a study_ai_ml folder is assumed to have been created directly under My Drive in Google Drive.

cancer_df = pd.read_csv('/content/drive/My Drive/study_ai_ml/data/cancer.csv')
print('cancer df shape: {}'.format(cancer_df.shape))

Result

cancer df shape: (569, 33)

cancer_df

cancer_df.drop('Unnamed: 32', axis=1, inplace=True)
cancer_df

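・diagnosis: diagnosis result (B = benign, M = malignant)
・The explanatory variables are the 3rd column onward and the target variable is the 2nd column; classification uses logistic regression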

# Extract the target variable (1 = malignant, 0 = benign)
y = cancer_df.diagnosis.apply(lambda d: 1 if d == 'M' else 0)
# Extract the explanatory variables
X = cancer_df.loc[:, 'radius_mean':]
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize (fit on the training data only, then apply to the test data)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression
logistic = LogisticRegressionCV(cv=10, random_state=0)
logistic.fit(X_train_scaled, y_train)

# Evaluate
print('Train score: {:.3f}'.format(logistic.score(X_train_scaled, y_train)))
print('Test score: {:.3f}'.format(logistic.score(X_test_scaled, y_test)))
print('Confusion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic.predict(X_test_scaled))))

Result

Train score: 0.988
Test score: 0.972
Confusion matrix:
[[89  1]
 [ 3 50]]

· Confirmed that classification achieves a test score of 97%.

# Fit PCA with all 30 components and plot the explained variance ratio of each component
pca = PCA(n_components=30)
pca.fit(X_train_scaled)
plt.bar(list(range(1, len(pca.explained_variance_ratio_) + 1)), pca.explained_variance_ratio_)
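A bar chart of explained variance ratios is often read together with the cumulative ratio. The sketch below is not the article's code: it assumes synthetic standardized data and computes the ratios directly from the eigenvalues of the covariance matrix, which should correspond to what `explained_variance_ratio_` reports when all components are kept. The 0.95 threshold is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for standardized training data:
# 200 samples, 5 features driven by 2 latent factors plus a little noise.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the covariance matrix are the variances along the principal axes.
eigvals = np.linalg.eigvalsh(np.cov(X_scaled, rowvar=False))[::-1]  # descending order

ratio = eigvals / eigvals.sum()   # explained variance ratio per component
cumulative = np.cumsum(ratio)     # cumulative explained variance ratio

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # illustrative value
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print('ratio     :', ratio.round(3))
print('cumulative:', cumulative.round(3))
print('components needed for {:.0%}: {}'.format(threshold, n_needed))
```

Applied to the article's data, the same cumulative sum over `pca.explained_variance_ratio_` would show how much of the total variance the first two components retain.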

# PCA
# Compress to 2 dimensions
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
print('X_train_pca shape: {}'.format(X_train_pca.shape))
# X_train_pca shape: (426, 2)

# Explained variance ratio
print('explained variance ratio: {}'.format(pca.explained_variance_ratio_))
# explained variance ratio: [ 0.43315126  0.19586506]

# Draw a scatter plot
temp = pd.DataFrame(X_train_pca)
temp['Outcome'] = y_train.values
b = temp[temp['Outcome'] == 0]
m = temp[temp['Outcome'] == 1]
plt.scatter(x=b[0], y=b[1], marker='o') # benign samples are marked with circles
plt.scatter(x=m[0], y=m[1], marker='^') # malignant samples are marked with triangles
plt.xlabel('PC 1') # first principal component on the x-axis
plt.ylabel('PC 2') # second principal component on the y-axis

# Compress the test data with the PCA fitted on the training data
X_test_pca = pca.transform(X_test_scaled)

# Train logistic regression on the 2 principal components
logistic = LogisticRegressionCV(cv=10, random_state=0)
logistic.fit(X_train_pca, y_train)

# Evaluate
print('Train score: {:.3f}'.format(logistic.score(X_train_pca, y_train)))
print('Test score: {:.3f}'.format(logistic.score(X_test_pca, y_test)))
print('Confusion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic.predict(X_test_pca))))

Result

Train score: 0.927
Test score: 0.944
Confusion matrix:
[[87  3]
 [ 5 48]]

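· Confirmed that classification achieves a test score of 94%.
Even with the number of dimensions reduced to 2, the test score dropped only from 97% to 94%, so the dimensionality was reduced while accuracy was largely maintained.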
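For readers without the cancer.csv file, the comparison can be approximated with the breast cancer dataset bundled with scikit-learn. This is a sketch under assumptions, not the article's exact setup: it uses `load_breast_cancer` in place of the CSV, plain `LogisticRegression` instead of `LogisticRegressionCV`, and pipelines to chain the steps, so the scores need not match the figures above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for cancer.csv.
# Its labels are encoded 0 = malignant, 1 = benign, the reverse of the article's coding.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Model on all 30 standardized features.
full_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model on the first 2 principal components of the standardized features.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=2),
                          LogisticRegression(max_iter=1000))

full_model.fit(X_train, y_train)
pca_model.fit(X_train, y_train)

full_score = full_model.score(X_test, y_test)
pca_score = pca_model.score(X_test, y_test)

print('Test score (30 features): {:.3f}'.format(full_score))
print('Test score (2 components): {:.3f}'.format(pca_score))
```

Wrapping the scaler and PCA in a pipeline ensures both are fitted on the training data only and then applied unchanged to the test data.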


Reference

The original article for this text (Machine Learning, Chapter 4: Principal Component Analysis) is available at https://qiita.com/matsukura04583/items/b3b5d2d22189afc9c81c
