machine learning/Concrete Compressive Strength Dataset

Concrete Dataset Challenge

1. Data Exploration and Preprocessing

  • Load the data
import pandas as pd
concrete = pd.read_excel('datasets/Concrete_Data.xls')
concrete
Cement (component 1)(kg in a m^3 mixture) Blast Furnace Slag (component 2)(kg in a m^3 mixture) Fly Ash (component 3)(kg in a m^3 mixture) Water (component 4)(kg in a m^3 mixture) Superplasticizer (component 5)(kg in a m^3 mixture) Coarse Aggregate (component 6)(kg in a m^3 mixture) Fine Aggregate (component 7)(kg in a m^3 mixture) Age (day) Concrete compressive strength(MPa, megapascals)
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28 79.986111
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.887366
2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270 40.269535
3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365 41.052780
4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360 44.296075
... ... ... ... ... ... ... ... ... ...
1025 276.4 116.0 90.3 179.6 8.9 870.1 768.3 28 44.284354
1026 322.2 0.0 115.6 196.0 10.4 817.9 813.4 28 31.178794
1027 148.5 139.4 108.6 192.7 6.1 892.4 780.0 28 23.696601
1028 159.1 186.7 0.0 175.6 11.3 989.6 788.9 28 32.768036
1029 260.9 100.5 78.3 200.6 8.6 864.5 761.5 28 32.401235

1030 rows × 9 columns

  • Cement (component 1) -- quantitative -- kg in a m3 mixture -- Input Variable

  • Blast Furnace Slag (component 2) -- quantitative -- kg in a m3 mixture -- Input Variable

  • Fly Ash (component 3) -- quantitative -- kg in a m3 mixture -- Input Variable

  • Water (component 4) -- quantitative -- kg in a m3 mixture -- Input Variable

  • Superplasticizer (component 5) -- quantitative -- kg in a m3 mixture -- Input Variable

  • Coarse Aggregate (component 6) -- quantitative -- kg in a m3 mixture -- Input Variable

  • Fine Aggregate (component 7) -- quantitative -- kg in a m3 mixture -- Input Variable

  • Age -- quantitative -- Day (1~365) -- Input Variable

  • Concrete compressive strength -- quantitative -- MPa -- Output Variable

  • Rename the columns to simpler names

concrete.columns = ['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg', 'fineagg', 'age', 'strength']
  • Check the first 2 rows
concrete.head(2)
cement slag ash water superplastic coarseagg fineagg age strength
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28 79.986111
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.887366
  • Check the last 2 rows
concrete.tail(2)
cement slag ash water superplastic coarseagg fineagg age strength
1028 159.1 186.7 0.0 175.6 11.3 989.6 788.9 28 32.768036
1029 260.9 100.5 78.3 200.6 8.6 864.5 761.5 28 32.401235
  • Look at the summary statistics
concrete.describe()
cement slag ash water superplastic coarseagg fineagg age strength
count 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000
mean 281.165631 73.895485 54.187136 181.566359 6.203112 972.918592 773.578883 45.662136 35.817836
std 104.507142 86.279104 63.996469 21.355567 5.973492 77.753818 80.175427 63.169912 16.705679
min 102.000000 0.000000 0.000000 121.750000 0.000000 801.000000 594.000000 1.000000 2.331808
25% 192.375000 0.000000 0.000000 164.900000 0.000000 932.000000 730.950000 7.000000 23.707115
50% 272.900000 22.000000 0.000000 185.000000 6.350000 968.000000 779.510000 28.000000 34.442774
75% 350.000000 142.950000 118.270000 192.000000 10.160000 1029.400000 824.000000 56.000000 46.136287
max 540.000000 359.400000 200.100000 247.000000 32.200000 1145.000000 992.600000 365.000000 82.599225
  • Check for missing values
concrete.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cement        1030 non-null   float64
 1   slag          1030 non-null   float64
 2   ash           1030 non-null   float64
 3   water         1030 non-null   float64
 4   superplastic  1030 non-null   float64
 5   coarseagg     1030 non-null   float64
 6   fineagg       1030 non-null   float64
 7   age           1030 non-null   int64  
 8   strength      1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
concrete.isnull().sum(0)
cement          0
slag            0
ash             0
water           0
superplastic    0
coarseagg       0
fineagg         0
age             0
strength        0
dtype: int64
  • Check the mean of each column
concrete.mean()
cement          281.165631
slag             73.895485
ash              54.187136
water           181.566359
superplastic      6.203112
coarseagg       972.918592
fineagg         773.578883
age              45.662136
strength         35.817836
dtype: float64
  • Check the distribution of the slag column
concrete["slag"] #concrete.slag
0         0.0
1         0.0
2       142.5
3       142.5
4       132.4
        ...  
1025    116.0
1026      0.0
1027    139.4
1028    186.7
1029    100.5
Name: slag, Length: 1030, dtype: float64
  • Since outliers are hard to spot at a glance from the raw numbers above, check with a plot
concrete["slag"].plot()
<AxesSubplot:>

  • Some slag values exceed 300
  • value_counts shows there are 10 such values
concrete["slag"]>300
0       False
1       False
2       False
3       False
4       False
        ...  
1025    False
1026    False
1027    False
1028    False
1029    False
Name: slag, Length: 1030, dtype: bool
(concrete["slag"]>300).value_counts()
False    1020
True       10
Name: slag, dtype: int64
  • Check outliers more conveniently with a boxplot
from IPython.display import Image
Image(filename='boxplot.png', width=400)

  • The boxplot() function detects outliers (called fliers) and does not include them in the whiskers
  • If the first quartile is Q1 and the third quartile is Q3, the interquartile range is IQR = Q3 − Q1 (this is the height of the box)
  • Values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are treated as outliers
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.boxplot(concrete["slag"])
{'whiskers': [<matplotlib.lines.Line2D at 0x202cd7c23a0>,
  <matplotlib.lines.Line2D at 0x202cd7c2700>],
 'caps': [<matplotlib.lines.Line2D at 0x202cd7c2a60>,
  <matplotlib.lines.Line2D at 0x202cd7c2dc0>],
 'boxes': [<matplotlib.lines.Line2D at 0x202cd7c2040>],
 'medians': [<matplotlib.lines.Line2D at 0x202cd7ce160>],
 'fliers': [<matplotlib.lines.Line2D at 0x202cd7ce4c0>],
 'means': []}

  • Check outliers for all features (using seaborn's boxplot)
concrete.columns
Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength'],
      dtype='object')
import seaborn as sns
for i in concrete.columns:
    plt.figure(figsize=(5,3))
    sns.boxplot(data=concrete[i])
    plt.title("{} distribution".format(i))

names = list(concrete.columns)

fig, axes = plt.subplots(len(names), 1, figsize=(4, 30), constrained_layout=True)

for i in range(len(names)):
    sns.boxplot(data = concrete[names[i]], ax=axes[i]).set_title(names[i] + ' distribution')

  • Use the IQR rule to select the slag outlier rows, as shown below (these are the rows that would be excluded)
q1 = concrete['slag'].quantile(0.25)
q3 = concrete['slag'].quantile(0.75)
iqr = q3-q1
n=(concrete['slag']<q1-1.5*iqr)|(concrete['slag']>q3+1.5*iqr)
concrete[n]
cement slag ash water superplastic coarseagg fineagg age strength
553 239.6 359.4 0.0 185.7 0.0 941.6 664.3 7 25.422359
559 239.6 359.4 0.0 185.7 0.0 941.6 664.3 28 39.435269
  • Keep only the non-outlier portion for all features, as shown below
x = concrete.iloc[:,:-1]

q1 = x.quantile(0.25)
q3 = x.quantile(0.75)

iqr = q3 - q1

concrete.iloc[:,:-1] = x[(x >= q1-1.5*iqr) & (x <= q3+iqr*1.5)]
#concrete = concrete.fillna(concrete.median())  # use this instead if you prefer to impute rather than drop
concrete.dropna(inplace=True)
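  • The masking step above sets out-of-fence values to NaN and dropna() then removes any row containing one. An equivalent, perhaps more direct, one-step formulation is sketched below for illustration only (it assumes it is run before the masking/dropna cell, on the original un-masked DataFrame; outlier_rows is an illustrative name, not from the original notebook):
# Equivalent one-step row filter (illustration only): flag rows where any
# feature falls outside its 1.5*IQR fences, then keep the remaining rows.
outlier_rows = ((x < q1 - 1.5*iqr) | (x > q3 + 1.5*iqr)).any(axis=1)
concrete = concrete[~outlier_rows]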
  • Confirm the outliers were removed
concrete.describe()
cement slag ash water superplastic coarseagg fineagg age strength
count 945.000000 945.000000 945.00000 945.000000 945.000000 945.000000 945.000000 945.000000 945.000000
mean 276.255714 74.192011 57.77291 180.526561 6.262228 974.975291 776.264868 32.453968 35.083287
std 102.121900 86.580718 64.33716 18.789734 5.356791 77.688240 75.092246 28.572875 16.827756
min 102.000000 0.000000 0.00000 127.000000 0.000000 801.000000 594.000000 1.000000 2.331808
25% 190.340000 0.000000 0.00000 164.900000 0.000000 932.000000 739.300000 7.000000 22.718234
50% 260.000000 20.000000 0.00000 183.800000 7.000000 968.000000 779.700000 28.000000 33.418902
75% 349.000000 145.000000 118.31000 192.000000 10.300000 1033.000000 822.200000 28.000000 45.084836
max 540.000000 342.100000 200.10000 228.000000 22.100000 1145.000000 945.000000 120.000000 82.599225
  • Check for and remove duplicate rows
concrete.duplicated().sum()
25
concrete.duplicated().value_counts()
False    920
True      25
dtype: int64
concrete.drop_duplicates(inplace=True)
  • Confirm that 25 rows were removed
concrete.describe()
cement slag ash water superplastic coarseagg fineagg age strength
count 920.000000 920.000000 920.000000 920.000000 920.000000 920.000000 920.000000 920.000000 920.000000
mean 273.351359 72.176576 59.342826 181.053261 6.076527 976.623750 775.363152 32.307609 34.443325
std 101.730901 86.475855 64.487271 18.726740 5.274854 77.450242 75.175200 28.447534 16.344159
min 102.000000 0.000000 0.000000 127.000000 0.000000 801.000000 594.000000 1.000000 2.331808
25% 190.340000 0.000000 0.000000 166.610000 0.000000 932.000000 736.900000 7.000000 22.435549
50% 253.250000 19.000000 0.000000 185.000000 6.710000 971.800000 780.000000 28.000000 33.202062
75% 337.900000 143.250000 118.700000 192.000000 10.000000 1040.000000 821.550000 28.000000 44.305728
max 540.000000 342.100000 200.100000 228.000000 22.100000 1145.000000 945.000000 120.000000 82.599225
  • Plot only the feature data, excluding the target
  • (the plot shows that each feature has a different range)
cnum = concrete.iloc[:,:-1]
cnum.plot()
<AxesSubplot:>

  • Normalization and standardization
import numpy as np
def minmax_normalize(arr):
    # rescale values into the [0, 1] range
    return (arr - arr.min()) / (arr.max() - arr.min())

def zscore_standize(arr):
    # center at mean 0 and scale to unit standard deviation (z-score)
    return (arr - arr.mean()) / arr.std()
  • Normalize via the DataFrame apply function
  • Confirm the data now falls in the 0 to 1 range
concrete_mm = cnum.apply(minmax_normalize)
concrete_mm.plot()
<AxesSubplot:>

  • Standardize via the DataFrame apply function
  • Confirm the data is now centered at mean 0
concrete_zs = cnum.apply(zscore_standize)
concrete_zs.plot()
<AxesSubplot:>
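  • For reference, the same two rescalings are available as scikit-learn transformers. The sketch below is not part of the original flow and produces essentially the same result, except that StandardScaler divides by the population standard deviation (ddof=0) while the pandas std() used above defaults to ddof=1, so the z-scores differ very slightly.
# Sketch: equivalent rescaling with scikit-learn transformers,
# reusing the cnum feature DataFrame defined above.
from sklearn.preprocessing import MinMaxScaler, StandardScaler

cnum_mm = pd.DataFrame(MinMaxScaler().fit_transform(cnum),
                       columns=cnum.columns, index=cnum.index)
cnum_zs = pd.DataFrame(StandardScaler().fit_transform(cnum),
                       columns=cnum.columns, index=cnum.index)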

2. Model Training

  1. Prepare the features and the label

  2. Split into training and test sets

  3. Train a LinearRegression model (additionally train DecisionTree and RandomForest)

  4. Evaluate performance on the training data - RMSE

  5. Cross-validation (compare performance across models)

  6. Grid search for RandomForest: search over the n_estimators parameter (50, 100, 200)

  7. Predict with the best model - test set

from sklearn.model_selection import train_test_split
X = concrete_zs
y = concrete.strength
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
  • Train LinearRegression
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
LinearRegression()
predict = lin_reg.predict(X_train)
from sklearn.metrics import mean_squared_error
lin_rmse = mean_squared_error(y_train, predict, squared=False)
lin_rmse
7.8871439080093575
  • Train DecisionTree
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state = 42)
tree_reg.fit(X_train, y_train)
DecisionTreeRegressor(random_state=42)
predict = tree_reg.predict(X_train)
tree_rmse = mean_squared_error(y_train, predict, squared=False)
tree_rmse
0.8704077845616891
  • Train RandomForest
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(X_train, y_train)
RandomForestRegressor(random_state=42)
predict = forest_reg.predict(X_train)
forest_rmse = mean_squared_error(y_train, predict, squared=False)
forest_rmse
2.103342396019283
  • Cross-validation
from sklearn.model_selection import cross_val_score
lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=5)
lin_scores = np.sqrt(-lin_scores)
lin_scores.mean()
7.942955496320373
tree_scores = cross_val_score(tree_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=5)
tree_scores = np.sqrt(-tree_scores)
tree_scores.mean()
7.492408361697325
forest_scores = cross_val_score(forest_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=5)
forest_scores = np.sqrt(-forest_scores)
forest_scores.mean()
5.605067301171005
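  • To make the step-5 comparison easier to read, the three sets of CV RMSE scores computed above can be summarized side by side. This is a small convenience sketch, not part of the original notebook:
# Summarize each model's cross-validated RMSE as mean and standard deviation.
for name, scores in [("linear", lin_scores), ("tree", tree_scores), ("forest", forest_scores)]:
    print(f"{name:>7}: mean RMSE = {scores.mean():.3f}, std = {scores.std():.3f}")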
  • Grid search (fine-tuning)
from sklearn.model_selection import GridSearchCV
param_grid = [{'n_estimators' : [50, 100, 200]}]
forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, scoring='neg_mean_squared_error', cv=5, n_jobs = -1)  # 3 candidate values x 5 folds = 15 fits
grid_search.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), n_jobs=-1,
             param_grid=[{'n_estimators': [50, 100, 200]}],
             scoring='neg_mean_squared_error')
grid_search.best_params_
{'n_estimators': 200}
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
5.6762297701241184 {'n_estimators': 50}
5.653618133722385 {'n_estimators': 100}
5.638049536036248 {'n_estimators': 200}
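  • The best cross-validated score is also exposed directly on the fitted search object; converting it back to an RMSE (sketch below) should match the n_estimators=200 entry above:
# best_score_ is the (negative) MSE of the best parameter combination.
np.sqrt(-grid_search.best_score_)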
  • Predict with the final model on the test set
final_model = grid_search.best_estimator_
final_predictions = final_model.predict(X_test)
final_rmse = mean_squared_error(y_test, final_predictions, squared=False)
final_rmse
4.953541332926305
final_model.score(X_test, y_test)
0.9041542560167558
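  • Note that score() on a scikit-learn regressor returns the R² coefficient of determination; as a quick sanity check (not in the original notebook), the same value can be recomputed from the test predictions already made above:
# R^2 computed explicitly from the test-set predictions;
# should equal final_model.score(X_test, y_test).
from sklearn.metrics import r2_score
r2_score(y_test, final_predictions)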
