[Kaggle] Titanic - Machine Learning from Disaster
The Titanic competition is usually the first, and the most approachable, competition you run into when starting out on Kaggle. It was also the first competition I tried when I discovered Kaggle as a data analysis site, and since I have recently picked Kaggle up again, I decided to work through it once more and write it up.
In the end I reached 0.79904 on the Kaggle Public Leaderboard, placing 2613th, which I estimate is roughly the top 5% by Public Leaderboard standing. Dacon hosts the same Titanic competition, so I checked there as well and saw an accuracy of 0.778870.
My modeling is still quite weak, so I consulted several notebooks; in particular, this write-up draws heavily on the A Data Science Framework: To Achieve 99% Accuracy notebook.
import os, sys
import glob
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline
plt.style.use('seaborn') # switch plots to the seaborn style
sns.set(rc={'figure.figsize' : (15,7)})
plt.rc('font', family='AppleGothic')
plt.rc('axes', unicode_minus=False)
warnings.filterwarnings('ignore')
0. Competition Overview
- Competition: https://www.kaggle.com/c/titanic
- Topic: predict which passengers survived the Titanic shipwreck
- Problem statement: which characteristics make a passenger more likely to survive?
- Data Description
- survival: survival status (0 = No, 1 = Yes)
- pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- sex: sex
- Age: age in years
- sibsp: number of siblings / spouses aboard
- parch: number of parents / children aboard
- ticket: ticket number
- fare: passenger fare
- cabin: cabin number
- embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
1. Data Load
!kaggle competitions download -c titanic
titanic.zip: Skipping, found more recently modified local copy (use --force to force download)
os.listdir()
['.DS_Store',
'Titanic.png',
'titanic.zip',
'.ipynb_checkpoints',
'data',
'Titanic.ipynb']
unzip = zipfile.ZipFile('titanic.zip')
unzip.extractall(path = 'data')
os.listdir('./data/')
['test.csv',
'submission_soft.csv',
'train.csv',
'gender_submission.csv',
'submission_hard.csv']
train = pd.read_csv(os.path.join('data', 'train.csv'))
test = pd.read_csv(os.path.join('data', 'test.csv'))
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
train.nunique()
PassengerId 891
Survived 2
Pclass 3
Name 891
Sex 2
Age 88
SibSp 7
Parch 7
Ticket 681
Fare 248
Cabin 147
Embarked 3
dtype: int64
train.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
2. EDA
2-1. label - Survived
# label - Survived: ratio of dead (0) vs. survived (1)
# 0, Dead / 1, Survived
f, ax = plt.subplots(1, 2, figsize=(15,8))
train['Survived'].value_counts().plot.pie(rot = 0, ax = ax[0])
ax[0].legend(['Dead', 'Survived'])
train['Survived'].value_counts().plot.bar(rot = 0, ax = ax[1])
ax[1].set_xticklabels(labels = ['Dead', 'Survived'])
plt.show()
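The charts show the class balance; as a quick numeric check (a minimal sketch on the same train DataFrame), the counts and proportions below line up with the Survived mean of about 0.3838 seen in train.describe().
# Survival ratio as numbers rather than a chart
print(train['Survived'].value_counts())
print(train['Survived'].value_counts(normalize = True).round(4))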
2-2. Feature distribution
# countplot for each categorical feature
f, ax = plt.subplots(2,3, figsize = (20, 15))
columns = ['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']
q = 0
for i in range(2):
    for j in range(3):
        fig = sns.countplot(x = train[columns[q]], ax = ax[i][j])
        q += 1
# histograms for the continuous features
f, ax = plt.subplots(2,1, figsize = (15, 10))
continuous_columns = ['Age', 'Fare']
train.Age.hist(bins = 70, ax = ax[0])
ax[0].set_title('Age distribution')
train.Fare.hist(bins = 70, ax = ax[1])
ax[1].set_title('Fare distribution')
plt.show()
2-3. Sex
# Dead vs. survived counts by sex
f, ax = plt.subplots(1, 2, figsize=(15,8))
train.loc[train['Sex'] == 'male', 'Survived'].value_counts().sort_index().plot.bar(rot = 0, ax = ax[0], color = ['tab:blue', 'tab:orange'])
ax[0].set_title('male')
ax[0].set_xticklabels(['Dead', 'Survived'])
train.loc[train['Sex'] == 'female', 'Survived'].value_counts().sort_index().plot.bar(rot = 0, ax = ax[1], color = ['tab:blue', 'tab:orange'])
ax[1].set_title('female')
ax[1].set_xticklabels(['Dead', 'Survived'])
plt.show()
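To put numbers on the bar charts above, a quick crosstab (a minimal sketch; only the charts were used in the original analysis) gives the survival rate per sex.
# Row-normalized crosstab: share of dead (0) vs. survived (1) within each sex
pd.crosstab(train['Sex'], train['Survived'], normalize = 'index')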
2-4. Pclass
# Survival counts by Pclass
pd.pivot_table(train, index = 'Pclass', columns = 'Survived', values = 'Name', aggfunc='count', fill_value=0)
# pd.crosstab(train['Pclass'], train['Survived']) # same result
Survived | 0 | 1 |
---|---|---|
Pclass | ||
1 | 80 | 136 |
2 | 97 | 87 |
3 | 372 | 119 |
# For Pclass 3, both the number and the share of deaths are very high
# Pclass 3 is mostly male, but the survival rate is higher among women
sns.countplot(data = train.loc[train['Pclass'] == 3], x = 'Sex', hue = 'Survived')
plt.show()
# Survived distribution by Pclass & Sex
# Moving from Pclass 3 to 1, the share of men who survive increases,
# and in Pclass 1 women almost never die
# Overall, Pclass reads like a social-class indicator and seems to affect Survived (0/1) more than expected
sns.catplot(x = "Pclass", y = "Survived", hue = "Sex", row = "Sex", data = train,
kind = "violin", split = True, height = 3, aspect = 4)
plt.show()
2-5. Age
# Check the differences by Pclass
# Age distribution by Pclass & Sex
# Moving from Pclass 1 to 3, the age distribution gradually shifts younger
f, ax = plt.subplots(3,2, figsize = (20, 15))
train.loc[(train['Pclass'] == 3) & (train['Sex'] == 'male'), 'Age'].hist(bins = 30, ax = ax[0][0])
train.loc[(train['Pclass'] == 3) & (train['Sex'] == 'female'), 'Age'].hist(bins = 30, ax = ax[0][1])
ax[0][0].set_title('Pclass 3 & male')
ax[0][1].set_title('Pclass 3 & female')
train.loc[(train['Pclass'] == 2) & (train['Sex'] == 'male'), 'Age'].hist(bins = 30, ax = ax[1][0])
train.loc[(train['Pclass'] == 2) & (train['Sex'] == 'female'), 'Age'].hist(bins = 30, ax = ax[1][1])
ax[1][0].set_title('Pclass 2 & male')
ax[1][1].set_title('Pclass 2 & female')
train.loc[(train['Pclass'] == 1) & (train['Sex'] == 'male'), 'Age'].hist(bins = 30, ax = ax[2][0])
train.loc[(train['Pclass'] == 1) & (train['Sex'] == 'female'), 'Age'].hist(bins = 30, ax = ax[2][1])
ax[2][0].set_title('Pclass 1 & male')
ax[2][1].set_title('Pclass 1 & female')
plt.suptitle('Pclass and Sex Age Distribution', fontsize = 20)
plt.show()
# The boxplot confirms it: the lower the Pclass number (the higher the class), the older the passengers
sns.boxplot(x="Pclass", y="Age", data=train, whis=np.inf)
plt.show()
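These per-class, per-sex medians are exactly what the preprocessing step in section 3 later uses to fill in missing ages; a minimal sketch of the same pivot on train:
# Median age by Pclass and Sex; the preprocessing class reuses this table to impute missing Age values
pd.pivot_table(train, index = 'Pclass', columns = 'Sex', values = 'Age', aggfunc = np.median)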
2-6. Cabin
# Cabin: cabin number (a small room where you sleep in a ship)
# It seems to describe the type of stateroom, so it is worth looking at alongside Pclass
train['Cabin'].fillna('X').apply(lambda x : x[:1]).value_counts().plot.bar(rot = 0)
plt.show()
# Cabin distribution excluding NaN
data = []
train.loc[train['Cabin'].notnull(), 'Cabin'].apply(lambda x : data.extend(x[:1]))
pd.Series(data).value_counts().sort_index().plot.bar(rot = 0)
plt.show()
Pclass_cabin = train.loc[train['Cabin'].notnull(), ['Survived', 'Pclass', 'Cabin', 'Fare']]
Pclass_cabin['Cabin'] = Pclass_cabin['Cabin'].apply(lambda x : x[:1])
Pclass_cabin.head()
Survived | Pclass | Cabin | Fare | |
---|---|---|---|---|
1 | 1 | 1 | C | 71.2833 |
3 | 1 | 1 | C | 53.1000 |
6 | 0 | 1 | E | 51.8625 |
10 | 1 | 3 | G | 16.7000 |
11 | 1 | 1 | C | 26.5500 |
# Hmm.. the sample is too small to draw a firm conclusion, but..
# for now, Pclass 1 passengers barely appear in F, G and T
pd.pivot_table(Pclass_cabin, index = 'Pclass', columns = 'Cabin', values = 'Survived', aggfunc = 'count')
Cabin | A | B | C | D | E | F | G | T |
---|---|---|---|---|---|---|---|---|
Pclass | ||||||||
1 | 15.0 | 47.0 | 59.0 | 29.0 | 25.0 | NaN | NaN | 1.0 |
2 | NaN | NaN | NaN | 4.0 | 4.0 | 8.0 | NaN | NaN |
3 | NaN | NaN | NaN | NaN | 3.0 | 5.0 | 4.0 | NaN |
# The hypothesis that a higher class means a better chance of survival does seem... to hold...
# So how should the missing Cabin values be handled?
# If Cabin really describes the stateroom, shouldn't it be related to Fare?
pd.pivot_table(Pclass_cabin, index = 'Survived', columns = 'Cabin', values = 'Pclass', aggfunc = 'count')
Cabin | A | B | C | D | E | F | G | T |
---|---|---|---|---|---|---|---|---|
Survived | ||||||||
0 | 8.0 | 12.0 | 24.0 | 8.0 | 8.0 | 5.0 | 2.0 | 1.0 |
1 | 7.0 | 35.0 | 35.0 | 25.0 | 24.0 | 8.0 | 2.0 | NaN |
# Indeed, the fares for Pclass 1 cabins are clearly higher
pd.pivot_table(Pclass_cabin, index = 'Pclass', columns = 'Cabin', values = 'Fare', aggfunc = np.mean)
Cabin | A | B | C | D | E | F | G | T |
---|---|---|---|---|---|---|---|---|
Pclass | ||||||||
1 | 39.623887 | 113.505764 | 100.151341 | 63.324286 | 55.740168 | NaN | NaN | 35.5 |
2 | NaN | NaN | NaN | 13.166675 | 11.587500 | 23.75000 | NaN | NaN |
3 | NaN | NaN | NaN | NaN | 11.000000 | 10.61166 | 13.58125 | NaN |
pd.pivot_table(Pclass_cabin, index = 'Survived', columns = 'Cabin', values = 'Fare', aggfunc = np.median)
Cabin | A | B | C | D | E | F | G | T |
---|---|---|---|---|---|---|---|---|
Survived | ||||||||
0 | 37.3896 | 42.7500 | 81.1625 | 43.5604 | 45.18125 | 7.65000 | 10.4625 | 35.5 |
1 | 35.5000 | 91.0792 | 89.1042 | 63.3583 | 39.82500 | 24.17915 | 16.7000 | NaN |
2-7. Fare
# If Fare is 10 or less, assign F or G at random
# If Fare is above 10 and up to 50, assign A, D, E or T
# If Fare is above 50, assign B or C
sns.boxplot(x = "Cabin", y = "Fare", data = Pclass_cabin.sort_values('Cabin'), whis = np.inf)
plt.show()
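As a standalone sketch of the fill rule described above (the preprocessing class in section 3 applies the same fare bands row by row; the function name here is just for illustration):
# Minimal sketch: draw a plausible Cabin letter for a missing value based on the fare band
def fill_cabin_from_fare(fare):
    if fare <= 10:
        return np.random.choice(['F', 'G'])
    elif fare < 50:
        return np.random.choice(['A', 'D', 'E', 'T'])
    return np.random.choice(['B', 'C'])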
2-8. Name
# Can the name tell us anything about survival?
train.loc[(train['Name'].str.contains('Mr')) & (train['Name'].str.contains('Mrs') == False)]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S |
884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
518 rows × 12 columns
# The name (title) reveals sex and marital status
# Men in the 'Mr' group more often did not survive
f, ax = plt.subplots(4,1, figsize = (17, 10))
train.loc[(train['Name'].str.contains('Mr')) & (train['Name'].str.contains('Mrs') == False), 'Survived'].value_counts().sort_index().plot.bar(ax = ax[0])
ax[0].set_title('Name(Mr) Survived')
train.loc[train['Name'].str.contains('Mrs'), 'Survived'].value_counts().sort_index().plot.bar(ax = ax[1])
ax[1].set_title('Name(Mrs) Survived')
train.loc[train['Name'].str.contains('Miss'), 'Survived'].value_counts().sort_index().plot.bar(ax = ax[2])
ax[2].set_title('Name(Miss) Survived')
train.loc[~train['Name'].str.contains('Mr|Miss|Mrs'), 'Survived'].value_counts().sort_index().plot.bar(ax = ax[3])
ax[3].set_title('Name(Not) Survived')
plt.show()
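The str.contains checks above are a bit loose (for example, 'Mrs' also contains 'Mr', which is why the Mrs rows have to be excluded explicitly). A regex that pulls out the actual title token is a common alternative; a minimal sketch, not what the rest of this notebook uses:
# Extract the title between the comma and the period, e.g. 'Braund, Mr. Owen Harris' -> 'Mr'
titles = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand = False)
print(titles.value_counts().head(10))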
# Bucket Age into coarse groups; missing ages fall through every comparison and end up as 'Unknown'
train['Agegroup'] = train['Age'].apply(lambda x : 'baby' if (x > 0) & (x < 10) else (
                                                  'Child' if (x > 10) & (x <= 20) else(
                                                  'Teenager' if (x > 20) & (x <= 40) else(
                                                  'Young' if (x > 40) & (x <= 50) else(
                                                  'Adult' if (x > 50) & (x <= 60) else(
                                                  'Senior' if x > 60 else 'Unknown'
                                                  ))))))
pd.pivot_table(train, index = 'Survived', columns = 'Agegroup', values = 'Fare', aggfunc = 'count')
Agegroup | Adult | Child | Senior | Teenager | Unknown | Young | baby |
---|---|---|---|---|---|---|---|
Survived | |||||||
0 | 25 | 71 | 17 | 232 | 127 | 53 | 24 |
1 | 17 | 44 | 5 | 153 | 52 | 33 | 38 |
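The nested lambda works, but the same bands can be written more directly with pd.cut (a minimal sketch with the same edges; Agegroup_cut is just an illustrative column name and is not used later). Note that pd.cut leaves missing ages as NaN instead of 'Unknown', and its bin edges are right-inclusive, so boundary ages can land slightly differently than the lambda.
# Same age bands via pd.cut
bins = [0, 10, 20, 40, 50, 60, np.inf]
labels = ['baby', 'Child', 'Teenager', 'Young', 'Adult', 'Senior']
train['Agegroup_cut'] = pd.cut(train['Age'], bins = bins, labels = labels)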
2-9. SibSp & Parch
# SibSp = siblings + spouse aboard
# Parch = parents + children aboard
# Together, SibSp and Parch give the number of accompanying family members
train['family_cnt'] = train.apply(lambda x : x['SibSp'] + x['Parch'], axis = 1)
pd.pivot_table(train, index = 'Survived', columns = 'Sex', values = 'family_cnt', aggfunc = np.mean)
Sex | female | male |
---|---|---|
Survived | ||
0 | 2.246914 | 0.647436 |
1 | 1.030043 | 0.743119 |
sns.boxplot(x = "Survived", y = "family_cnt", data = train, hue = 'Sex')
plt.show()
train.loc[train['family_cnt'] > 4, 'Survived'].value_counts()
0 40
1 7
Name: Survived, dtype: int64
2-10. Embarked
sns.countplot(data = train, x = 'Embarked', hue = 'Survived')
plt.show()
pd.pivot_table(train, index = 'Survived', columns = 'Embarked', values = 'family_cnt', aggfunc = 'count')
Embarked | C | Q | S |
---|---|---|---|
Survived | |||
0 | 75 | 47 | 427 |
1 | 93 | 30 | 217 |
3. Preprocessing
from sklearn.base import BaseEstimator, TransformerMixin
class preprocessing(BaseEstimator, TransformerMixin):
    def fit(self, X, y = None):
        return self

    def transform(self, X, y = None):
        # Fill missing Age values with the median age per Pclass & Sex
        temp = pd.pivot_table(X, index = 'Pclass', columns = 'Sex', values = 'Age', aggfunc = np.median)
        for pclass, sex in X.loc[X['Age'].isnull(), ['Pclass', 'Sex']].drop_duplicates().values:
            X.loc[(X['Age'].isnull()) & (X['Pclass'] == pclass) & (X['Sex'] == sex), 'Age'] = temp.loc[pclass, sex]
        # Create the age-group feature
        X['Agegroup'] = X['Age'].apply(lambda x : 'baby' if (x > 0) & (x < 10) else (
                                                  'Child' if (x > 10) & (x <= 20) else(
                                                  'Teenager' if (x > 20) & (x <= 40) else(
                                                  'Young' if (x > 40) & (x <= 50) else(
                                                  'Adult' if (x > 50) & (x <= 60) else(
                                                  'Senior' if x > 60 else 'Unknown'
                                                  ))))))
        # Preprocess the Cabin feature: keep the deck letter, fill missing values by fare band, then ordinally encode
        X['Cabin'] = X['Cabin'].fillna('X').apply(lambda x : x[:1])
        X.loc[X['Cabin'] == 'X', 'Cabin'] = (X.loc[X['Cabin'] == 'X'].apply(lambda x: np.random.choice(['F', 'G']) if x['Fare'] <= 10 else (
                                             np.random.choice(['A', 'D', 'E', 'T']) if x['Fare'] > 10 and x['Fare'] < 50 else
                                             np.random.choice(['B', 'C'])
                                             ), axis = 1))
        X['Cabin'] = X['Cabin'].apply(lambda x : 1 if x in ['F', 'G'] else ( 2 if x in ['A', 'D', 'E', 'T'] else ( 3 if x in ['B', 'C'] else 4)))
        # Fare qcut
        X['Fare_qcut'] = pd.qcut(X['Fare'], 5, labels = False)
        # Name: 0 = Mrs/Miss, 1 = Mr, 3 = everything else
        X['Name'] = X['Name'].apply(lambda x : 0 if 'Mrs' in x or 'Miss' in x else (1 if 'Mr' in x else 3)).astype(str)
        # SibSp & Parch
        X['family_cnt'] = X.apply(lambda x : x['SibSp'] + x['Parch'], axis = 1)
        X['family_YN'] = X['family_cnt'].apply(lambda x : 1 if x >= 4 else 0)
        # Drop Columns
        DROP = ['SibSp', 'Parch', 'Ticket']
        X = X.drop(DROP, axis = 1)
        # Assemble model input: one-hot encoded categoricals + continuous columns
        INDEX = ['PassengerId']
        Y = ['Survived']
        CONTINUOUS = ['Age', 'Fare', 'Fare_qcut']
        CATEGORICAL = ['Cabin', 'Pclass', 'Name', 'Sex', 'Agegroup', 'Embarked']
        INPUT = pd.concat([pd.get_dummies(X[CATEGORICAL]), X[CONTINUOUS]], axis = 1)
        try:
            OUTPUT = X[Y]
        except:
            OUTPUT = None  # the test set has no Survived column
        return INPUT, OUTPUT
preprocessing = preprocessing()
X, Y = preprocessing.fit_transform(train)
X.head()
Cabin | Pclass | Name_0 | Name_1 | Name_3 | Sex_female | Sex_male | Agegroup_Adult | Agegroup_Child | Agegroup_Senior | Agegroup_Teenager | Agegroup_Unknown | Agegroup_Young | Agegroup_baby | Embarked_C | Embarked_Q | Embarked_S | Age | Fare | Fare_qcut | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 22.0 | 7.2500 | 0 |
1 | 3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 38.0 | 71.2833 | 4 |
2 | 1 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 26.0 | 7.9250 | 1 |
3 | 3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 35.0 | 53.1000 | 4 |
4 | 1 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 35.0 | 8.0500 | 1 |
plt.figure(figsize = (25, 25))
sns.heatmap(X.corr(), annot = True)
plt.show()
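One caveat about calling get_dummies separately on train and test inside transform: if some category value appears in only one of the two sets, the dummy columns will not line up. Here the shapes happen to match (both come out with 20 columns, as checked in section 5), but a defensive reindex is cheap insurance. A minimal sketch, with align_columns as an illustrative helper name:
# Align the test dummy columns to the train columns; anything missing in test is added and filled with 0
def align_columns(train_X, test_X):
    return test_X.reindex(columns = train_X.columns, fill_value = 0)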
4. Model
4-1. Baseline
from sklearn import model_selection
from sklearn import ensemble, gaussian_process, linear_model, naive_bayes, neighbors, svm, tree, discriminant_analysis
from xgboost import XGBClassifier
# Baseline models
MODELS = [
    # Ensemble models
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(),
    # Gaussian process
    gaussian_process.GaussianProcessClassifier(),
    # Linear models
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    # Naive Bayes models
    naive_bayes.BernoulliNB(),
    naive_bayes.GaussianNB(),
    # Nearest-neighbor model
    neighbors.KNeighborsClassifier(),
    # SVM
    svm.SVC(probability = True),
    svm.NuSVC(probability = True),
    svm.LinearSVC(),
    # Tree models
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),
    # Discriminant analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),
    # xgboost
    XGBClassifier()
]
# cross validation
cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = 0.2, train_size = 0.8, random_state = 42 ) # run each model 10x on random 80/20 train/test splits
# DataFrame for comparing the models
Model_columns = ['Model Name', 'Model Parameters', 'Model Train Accuracy Mean', 'Model Test Accuracy Mean', 'Model Test Accuracy 3*STD' ,'Model Time']
Model_compare = pd.DataFrame(columns = Model_columns)
# Store each model's predictions
Model_predict = Y.copy()
# Save each model's results into the Model_compare DataFrame
row_index = 0
for alg in MODELS:
    # Record the model name and its default parameters
    Model_name = alg.__class__.__name__
    Model_compare.loc[row_index, 'Model Name'] = Model_name
    Model_compare.loc[row_index, 'Model Parameters'] = str(alg.get_params())
    cv_results = model_selection.cross_validate(alg, X = X, y = Y, cv = cv_split, return_train_score = True)
    Model_compare.loc[row_index, 'Model Time'] = cv_results['fit_time'].mean()
    Model_compare.loc[row_index, 'Model Train Accuracy Mean'] = cv_results['train_score'].mean() # available because return_train_score = True above
    Model_compare.loc[row_index, 'Model Test Accuracy Mean'] = cv_results['test_score'].mean()
    Model_compare.loc[row_index, 'Model Test Accuracy 3*STD'] = cv_results['test_score'].std()*3
    # Store each model's in-sample predictions
    alg.fit(X, Y)
    Model_predict[Model_name] = alg.predict(X)
    row_index += 1
Model_compare = Model_compare.sort_values('Model Test Accuracy Mean', ascending = False).reset_index(drop = True)
Model_compare
Model Name | Model Parameters | Model Train Accuracy Mean | Model Test Accuracy Mean | Model Test Accuracy 3*STD | Model Time | |
---|---|---|---|---|---|---|
0 | GradientBoostingClassifier | {'ccp_alpha': 0.0, 'criterion': 'friedman_mse'... | 0.903652 | 0.830168 | 0.075772 | 0.0869468 |
1 | XGBClassifier | {'base_score': 0.5, 'booster': 'gbtree', 'cols... | 0.883567 | 0.828492 | 0.0674776 | 0.0911861 |
2 | RandomForestClassifier | {'bootstrap': True, 'ccp_alpha': 0.0, 'class_w... | 0.984551 | 0.818436 | 0.0835471 | 0.135578 |
3 | AdaBoostClassifier | {'algorithm': 'SAMME.R', 'base_estimator': Non... | 0.83427 | 0.815642 | 0.0782522 | 0.0718477 |
4 | BaggingClassifier | {'base_estimator': None, 'bootstrap': True, 'b... | 0.968118 | 0.808939 | 0.0895669 | 0.0257989 |
5 | RidgeClassifierCV | {'alphas': array([ 0.1, 1. , 10. ]), 'class_w... | 0.806039 | 0.807263 | 0.0845497 | 0.0103013 |
6 | ExtraTreesClassifier | {'bootstrap': False, 'ccp_alpha': 0.0, 'class_... | 0.984551 | 0.805028 | 0.0798336 | 0.114957 |
7 | LinearDiscriminantAnalysis | {'n_components': None, 'priors': None, 'shrink... | 0.807584 | 0.805028 | 0.0780545 | 0.00644715 |
8 | LogisticRegressionCV | {'Cs': 10, 'class_weight': None, 'cv': None, '... | 0.809831 | 0.803911 | 0.0801846 | 0.984159 |
9 | BernoulliNB | {'alpha': 1.0, 'binarize': 0.0, 'class_prior':... | 0.786376 | 0.793855 | 0.0819174 | 0.00365911 |
10 | NuSVC | {'break_ties': False, 'cache_size': 200, 'clas... | 0.795646 | 0.789385 | 0.104678 | 0.09748 |
11 | DecisionTreeClassifier | {'ccp_alpha': 0.0, 'class_weight': None, 'crit... | 0.984551 | 0.788268 | 0.0907043 | 0.00587251 |
12 | ExtraTreeClassifier | {'ccp_alpha': 0.0, 'class_weight': None, 'crit... | 0.984551 | 0.780447 | 0.071911 | 0.00419157 |
13 | GaussianNB | {'priors': None, 'var_smoothing': 1e-09} | 0.760393 | 0.758659 | 0.0741229 | 0.00369005 |
14 | LinearSVC | {'C': 1.0, 'class_weight': None, 'dual': True,... | 0.72809 | 0.741899 | 0.314066 | 0.0319866 |
15 | GaussianProcessClassifier | {'copy_X_train': True, 'kernel': None, 'max_it... | 0.956601 | 0.726816 | 0.11279 | 0.159213 |
16 | KNeighborsClassifier | {'algorithm': 'auto', 'leaf_size': 30, 'metric... | 0.805197 | 0.722905 | 0.0536313 | 0.00471177 |
17 | SGDClassifier | {'alpha': 0.0001, 'average': False, 'class_wei... | 0.699719 | 0.714525 | 0.0903941 | 0.00653226 |
18 | PassiveAggressiveClassifier | {'C': 1.0, 'average': False, 'class_weight': N... | 0.684129 | 0.672067 | 0.282446 | 0.00496163 |
19 | SVC | {'C': 1.0, 'break_ties': False, 'cache_size': ... | 0.682022 | 0.667598 | 0.0700109 | 0.0661206 |
20 | Perceptron | {'alpha': 0.0001, 'class_weight': None, 'early... | 0.65618 | 0.651955 | 0.40516 | 0.00473375 |
21 | QuadraticDiscriminantAnalysis | {'priors': None, 'reg_param': 0.0, 'store_cova... | 0.569101 | 0.556425 | 0.305672 | 0.0052588 |
sns.barplot(x = 'Model Test Accuracy Mean', y = 'Model Name', data = Model_compare, color = 'm')
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score (%)')
plt.ylabel('Algorithm')
plt.show()
4-2. Ensemble
# Collect candidate models in accuracy order (the voting ensemble below keeps only the top ones)
TOP = []
for name in Model_compare['Model Name'].values:
    for alg in MODELS:
        if name in str(alg):  # substring match, so e.g. 'SVC' also picks up NuSVC (see the duplicate entry below)
            try:  # keep only models that implement predict_proba (needed for soft voting)
                alg.predict_proba
                v = (name, alg)
                TOP.append(v)
            except:
                pass
TOP
[('GradientBoostingClassifier', GradientBoostingClassifier()),
('XGBClassifier', XGBClassifier()),
('RandomForestClassifier', RandomForestClassifier()),
('AdaBoostClassifier', AdaBoostClassifier()),
('BaggingClassifier', BaggingClassifier()),
('ExtraTreesClassifier', ExtraTreesClassifier()),
('LinearDiscriminantAnalysis', LinearDiscriminantAnalysis()),
('LogisticRegressionCV', LogisticRegressionCV()),
('BernoulliNB', BernoulliNB()),
('NuSVC', NuSVC(probability=True)),
('DecisionTreeClassifier', DecisionTreeClassifier()),
('ExtraTreeClassifier', ExtraTreeClassifier()),
('GaussianNB', GaussianNB()),
('GaussianProcessClassifier', GaussianProcessClassifier()),
('KNeighborsClassifier', KNeighborsClassifier()),
('SVC', SVC(probability=True)),
('SVC', NuSVC(probability=True)),
('QuadraticDiscriminantAnalysis', QuadraticDiscriminantAnalysis())]
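Note that the substring match (name in str(alg)) also pairs 'SVC' with NuSVC, which is why ('SVC', NuSVC(probability=True)) appears twice in the list above; it does not matter here because only the top entries are kept. A stricter selection could match on the exact class name and check for predict_proba directly (a minimal sketch, not what was run; TOP_strict is just an illustrative name):
# Exact class-name match plus an explicit predict_proba check
TOP_strict = [(name, alg) for name in Model_compare['Model Name']
              for alg in MODELS
              if alg.__class__.__name__ == name and hasattr(alg, 'predict_proba')]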
vote_est = TOP[:9]
vote_est
[('GradientBoostingClassifier', GradientBoostingClassifier()),
('XGBClassifier', XGBClassifier()),
('RandomForestClassifier', RandomForestClassifier()),
('AdaBoostClassifier', AdaBoostClassifier()),
('BaggingClassifier', BaggingClassifier()),
('ExtraTreesClassifier', ExtraTreesClassifier()),
('LinearDiscriminantAnalysis', LinearDiscriminantAnalysis()),
('LogisticRegressionCV', LogisticRegressionCV()),
('BernoulliNB', BernoulliNB())]
def voting(model_candidates):
    # Evaluate hard and soft voting ensembles built from the top-i candidates, for i = N down to 2
    N = len(model_candidates)
    history = []
    for i in reversed(range(2, N+1)):
        vote_est = model_candidates[:i]
        print('=' * 15, f'voting {i} Model', '=' * 15)
        # Hard Vote
        vote_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
        vote_hard_cv = model_selection.cross_validate(vote_hard, X, Y, cv = cv_split)
        # print("Hard Voting Test w/bin score mean: {:.2f}". format(vote_hard_cv['test_score'].mean()*100))
        # print("Hard Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_hard_cv['test_score'].std()*100*3))
        print('-' * 40)
        # Soft Vote
        vote_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
        vote_soft_cv = model_selection.cross_validate(vote_soft, X, Y, cv = cv_split)
        # print("Soft Voting Test w/bin score mean: {:.2f}". format(vote_soft_cv['test_score'].mean()*100))
        # print("Soft Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_soft_cv['test_score'].std()*100*3))
        value = [i, vote_hard_cv['test_score'].mean(), vote_soft_cv['test_score'].mean()]
        history.append(value)
        print('=' * 40)
    return history
history = voting(vote_est)
=============== voting 9 Model ===============
----------------------------------------
========================================
=============== voting 8 Model ===============
----------------------------------------
========================================
=============== voting 7 Model ===============
----------------------------------------
========================================
=============== voting 6 Model ===============
----------------------------------------
========================================
=============== voting 5 Model ===============
----------------------------------------
========================================
=============== voting 4 Model ===============
----------------------------------------
========================================
=============== voting 3 Model ===============
----------------------------------------
========================================
=============== voting 2 Model ===============
----------------------------------------
========================================
pd.DataFrame(history, columns = ['model_cnt', 'hard_vote_score', 'soft_vote_score'])
model_cnt | hard_vote_score | soft_vote_score | |
---|---|---|---|
0 | 9 | 0.836313 | 0.843017 |
1 | 8 | 0.836313 | 0.836872 |
2 | 7 | 0.837989 | 0.840782 |
3 | 6 | 0.829609 | 0.830168 |
4 | 5 | 0.834078 | 0.840782 |
5 | 4 | 0.829050 | 0.839106 |
6 | 3 | 0.835196 | 0.837430 |
7 | 2 | 0.831285 | 0.832961 |
4-3. HyperParameter Tuning
grid_n_estimator = [10, 50, 100, 300]
grid_ratio = [.1, .25, .5, .75, 1.0]
grid_learn = [.01, .03, .05, .1, .25]
grid_max_depth = [2, 4, 6, 8, 10, None]
grid_min_samples = [5, 10, .03, .05, .10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [0]
grid_params = {
'RandomForestClassifier' : {
'n_estimators' : grid_n_estimator,
'criterion': grid_criterion,
'max_depth': grid_max_depth,
'oob_score': [True],
'random_state': grid_seed
},
'XGBClassifier' : {
'learning_rate': grid_learn,
'max_depth': [1,2,4,6,8,10],
'n_estimators': grid_n_estimator,
'seed': grid_seed
},
'GradientBoostingClassifier' : {
'learning_rate': [.05],
'n_estimators': [300],
'max_depth': grid_max_depth, #default=3
'random_state': grid_seed
},
'BaggingClassifier' : {
'n_estimators': grid_n_estimator,
'max_samples': grid_ratio,
'random_state': grid_seed
},
'LinearDiscriminantAnalysis' : {
'solver' : ['svd', 'lsqr', 'eigen']
},
'LogisticRegressionCV' : {
'fit_intercept': grid_bool,
'penalty': ['l1','l2'],
'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
'random_state': grid_seed
},
'AdaBoostClassifier' : {
'n_estimators': grid_n_estimator,
'learning_rate': grid_learn,
'random_state': grid_seed
},
'ExtraTreesClassifier' : {
'n_estimators': grid_n_estimator,
'criterion': grid_criterion,
'max_depth': grid_max_depth,
'random_state': grid_seed
},
'NuSVC' : {
'gamma': grid_ratio,
'decision_function_shape': ['ovo', 'ovr'],
'probability': [True],
'random_state': grid_seed
}
}
import time
vote_est[:6]
[('GradientBoostingClassifier', GradientBoostingClassifier()),
('XGBClassifier', XGBClassifier()),
('RandomForestClassifier', RandomForestClassifier()),
('AdaBoostClassifier', AdaBoostClassifier()),
('BaggingClassifier', BaggingClassifier()),
('ExtraTreesClassifier', ExtraTreesClassifier())]
start_total = time.perf_counter()
i = int(input())  # number of models to tune, entered interactively (6 was used here)
MODELS = vote_est[:i]
for name, model in MODELS:
    start = time.perf_counter()
    best_search = model_selection.GridSearchCV(estimator = model, param_grid = grid_params[name], cv = cv_split, scoring = 'roc_auc')
    best_search.fit(X, Y)
    run = time.perf_counter() - start
    best_param = best_search.best_params_
    print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(name, best_param, run))
    model.set_params(**best_param)
run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total/60))
6
The best parameter for GradientBoostingClassifier is {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 300, 'random_state': 0} with a runtime of 54.52 seconds.
The best parameter for XGBClassifier is {'learning_rate': 0.03, 'max_depth': 4, 'n_estimators': 300, 'seed': 0} with a runtime of 159.25 seconds.
The best parameter for RandomForestClassifier is {'criterion': 'gini', 'max_depth': 8, 'n_estimators': 300, 'oob_score': True, 'random_state': 0} with a runtime of 88.92 seconds.
The best parameter for AdaBoostClassifier is {'learning_rate': 0.1, 'n_estimators': 300, 'random_state': 0} with a runtime of 33.23 seconds.
The best parameter for BaggingClassifier is {'max_samples': 0.25, 'n_estimators': 300, 'random_state': 0} with a runtime of 41.43 seconds.
The best parameter for ExtraTreesClassifier is {'criterion': 'gini', 'max_depth': 6, 'n_estimators': 300, 'random_state': 0} with a runtime of 57.15 seconds.
Total optimization time was 12.13 minutes.
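Tuning six models this way took about 12 minutes. If that gets painful, RandomizedSearchCV is a near drop-in way to trade exhaustiveness for speed by sampling parameter combinations instead of trying them all; a sketch of the substitution inside the loop above (not what was actually run, and n_iter is an arbitrary example):
# Randomized search over the same grids; n_iter caps how many parameter combinations are tried
best_search = model_selection.RandomizedSearchCV(estimator = model, param_distributions = grid_params[name],
                                                 n_iter = 20, cv = cv_split, scoring = 'roc_auc', random_state = 0)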
history = voting(vote_est)
=============== voting 9 Model ===============
----------------------------------------
========================================
=============== voting 8 Model ===============
----------------------------------------
========================================
=============== voting 7 Model ===============
----------------------------------------
========================================
=============== voting 6 Model ===============
----------------------------------------
========================================
=============== voting 5 Model ===============
----------------------------------------
========================================
=============== voting 4 Model ===============
----------------------------------------
========================================
=============== voting 3 Model ===============
----------------------------------------
========================================
=============== voting 2 Model ===============
----------------------------------------
========================================
pd.DataFrame(history, columns = ['model_cnt', 'hard_vote_score', 'soft_vote_score'])
model_cnt | hard_vote_score | soft_vote_score | |
---|---|---|---|
0 | 9 | 0.827933 | 0.837430 |
1 | 8 | 0.836872 | 0.840782 |
2 | 7 | 0.838547 | 0.843017 |
3 | 6 | 0.843017 | 0.845251 |
4 | 5 | 0.844134 | 0.845810 |
5 | 4 | 0.839106 | 0.848045 |
6 | 3 | 0.848603 | 0.848045 |
7 | 2 | 0.844693 | 0.846927 |
i = 6
MODELS = vote_est[:i]
vote_hard = ensemble.VotingClassifier(estimators = MODELS , voting = 'hard')
vote_hard_cv = model_selection.cross_validate(vote_hard, X, Y, cv = cv_split)
vote_hard.fit(X, Y)
print("Hard Voting Test w/bin score mean: {:.2f}". format(vote_hard_cv['test_score'].mean()*100))
print("Hard Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_hard_cv['test_score'].std()*100*3))
print('-' * 40)
# Soft Vote
vote_soft = ensemble.VotingClassifier(estimators = MODELS , voting = 'soft')
vote_soft_cv = model_selection.cross_validate(vote_soft, X, Y, cv = cv_split)
vote_soft.fit(X, Y)
print("Soft Voting Test w/bin score mean: {:.2f}". format(vote_soft_cv['test_score'].mean()*100))
print("Soft Voting Test w/bin score 3*std: +/- {:.2f}". format(vote_soft_cv['test_score'].std()*100*3))
print('=' * 40)
6
Hard Voting Test w/bin score mean: 84.30
Hard Voting Test w/bin score 3*std: +/- 6.89
----------------------------------------
Soft Voting Test w/bin score mean: 84.53
Soft Voting Test w/bin score 3*std: +/- 7.07
========================================
5. submission
# Preprocess the test set
X_test, _ = preprocessing.transform(test)
X_test.head()
Cabin | Pclass | Name_0 | Name_1 | Name_3 | Sex_female | Sex_male | Agegroup_Adult | Agegroup_Child | Agegroup_Senior | Agegroup_Teenager | Agegroup_Unknown | Agegroup_Young | Agegroup_baby | Embarked_C | Embarked_Q | Embarked_S | Age | Fare | Fare_qcut | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 34.5 | 7.8292 | 1.0 |
1 | 1 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 47.0 | 7.0000 | 0.0 |
2 | 1 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 62.0 | 9.6875 | 1.0 |
3 | 1 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 27.0 | 8.6625 | 1.0 |
4 | 2 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 22.0 | 12.2875 | 2.0 |
X_test.isnull().sum()
Cabin 0
Pclass 0
Name_0 0
Name_1 0
Name_3 0
Sex_female 0
Sex_male 0
Agegroup_Adult 0
Agegroup_Child 0
Agegroup_Senior 0
Agegroup_Teenager 0
Agegroup_Unknown 0
Agegroup_Young 0
Agegroup_baby 0
Embarked_C 0
Embarked_Q 0
Embarked_S 0
Age 0
Fare 1
Fare_qcut 1
dtype: int64
X_test = X_test.fillna(0)
X.shape, X_test.shape
((891, 20), (418, 20))
5-1. prediction
sub = pd.read_csv(os.path.join('data', 'gender_submission.csv'))
sub.head()
PassengerId | Survived | |
---|---|---|
0 | 892 | 0 |
1 | 893 | 1 |
2 | 894 | 0 |
3 | 895 | 0 |
4 | 896 | 1 |
pred_vote_hard = vote_hard.predict(X_test)
pred_vote_soft = vote_soft.predict(X_test)
for md, pred in zip(['hard', 'soft'], [pred_vote_hard, pred_vote_soft]):
    sub['Survived'] = pred
    sub.to_csv(os.path.join('data', 'submission_{}.csv'.format(md)), index = False)
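The two submission files can then be uploaded from the command line with the same kaggle CLI used for the download earlier; a sketch, where the file path matches the soft-voting output above and the message is just an example:
# Submit one of the generated files to the competition
!kaggle competitions submit -c titanic -f data/submission_soft.csv -m "soft voting ensemble"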
Author And Source
Original post ([Kaggle]Titanic - Machine Learning from Disaster): https://velog.io/@jinseock95/KaggleTitanic-Machine-Learning-from-Disaster
Attribution: author information is included in the original URL above, and the copyright belongs to the original author.
(Collection and Share based on the CC Protocol.)