machine learning/Concrete Compressive Strength Dataset

콘크리트 데이터셋 도전

UCI Machine Learning Repository의
Concrete Compressive Strength Data Set에서 Concrete_Data.xls를 다운로드

1. 데이터 탐색 및 전처리

데이터 적재

import pandas as pd
concrete = pd.read_excel('datasets/Concrete_Data.xls')
concrete

	Cement (component 1)(kg in a m^3 mixture)	Blast Furnace Slag (component 2)(kg in a m^3 mixture)	Fly Ash (component 3)(kg in a m^3 mixture)	Water (component 4)(kg in a m^3 mixture)	Superplasticizer (component 5)(kg in a m^3 mixture)	Coarse Aggregate (component 6)(kg in a m^3 mixture)	Fine Aggregate (component 7)(kg in a m^3 mixture)	Age (day)	Concrete compressive strength(MPa, megapascals)
0	540.0	0.0	0.0	162.0	2.5	1040.0	676.0	28	79.986111
1	540.0	0.0	0.0	162.0	2.5	1055.0	676.0	28	61.887366
2	332.5	142.5	0.0	228.0	0.0	932.0	594.0	270	40.269535
3	332.5	142.5	0.0	228.0	0.0	932.0	594.0	365	41.052780
4	198.6	132.4	0.0	192.0	0.0	978.4	825.5	360	44.296075
...	...	...	...	...	...	...	...	...	...
1025	276.4	116.0	90.3	179.6	8.9	870.1	768.3	28	44.284354
1026	322.2	0.0	115.6	196.0	10.4	817.9	813.4	28	31.178794
1027	148.5	139.4	108.6	192.7	6.1	892.4	780.0	28	23.696601
1028	159.1	186.7	0.0	175.6	11.3	989.6	788.9	28	32.768036
1029	260.9	100.5	78.3	200.6	8.6	864.5	761.5	28	32.401235

1030 rows × 9 columns

Cement (component 1) -- quantitative -- kg in a m3 mixture -- Input Variable
Blast Furnace Slag (component 2) -- quantitative -- kg in a m3 mixture -- Input Variable
Fly Ash (component 3) -- quantitative -- kg in a m3 mixture -- Input Variable
Water (component 4) -- quantitative -- kg in a m3 mixture -- Input Variable
Superplasticizer (component 5) -- quantitative -- kg in a m3 mixture -- Input Variable
Coarse Aggregate (component 6) -- quantitative -- kg in a m3 mixture -- Input Variable
Fine Aggregate (component 7) -- quantitative -- kg in a m3 mixture -- Input Variable
Age -- quantitative -- Day (1~365) -- Input Variable
Concrete compressive strength -- quantitative -- MPa -- Output Variable
columns를 단순한 이름으로 변경

concrete.columns = ['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg', 'fineagg', 'age', 'strength']

앞의 2행 확인하기

concrete.head(2)

	cement	slag	ash	water	superplastic	coarseagg	fineagg	age	strength
0	540.0	0.0	0.0	162.0	2.5	1040.0	676.0	28	79.986111
1	540.0	0.0	0.0	162.0	2.5	1055.0	676.0	28	61.887366

뒤에 2행 확인하기

concrete.tail(2)

	cement	slag	ash	water	superplastic	coarseagg	fineagg	age	strength
1028	159.1	186.7	0.0	175.6	11.3	989.6	788.9	28	32.768036
1029	260.9	100.5	78.3	200.6	8.6	864.5	761.5	28	32.401235

통계치 살펴보기

concrete.describe()

	cement	slag	ash	water	superplastic	coarseagg	fineagg	age	strength
count	1030.000000	1030.000000	1030.000000	1030.000000	1030.000000	1030.000000	1030.000000	1030.000000	1030.000000
mean	281.165631	73.895485	54.187136	181.566359	6.203112	972.918592	773.578883	45.662136	35.817836
std	104.507142	86.279104	63.996469	21.355567	5.973492	77.753818	80.175427	63.169912	16.705679
min	102.000000	0.000000	0.000000	121.750000	0.000000	801.000000	594.000000	1.000000	2.331808
25%	192.375000	0.000000	0.000000	164.900000	0.000000	932.000000	730.950000	7.000000	23.707115
50%	272.900000	22.000000	0.000000	185.000000	6.350000	968.000000	779.510000	28.000000	34.442774
75%	350.000000	142.950000	118.270000	192.000000	10.160000	1029.400000	824.000000	56.000000	46.136287
max	540.000000	359.400000	200.100000	247.000000	32.200000	1145.000000	992.600000	365.000000	82.599225

누락 정보 확인하기

concrete.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cement        1030 non-null   float64
 1   slag          1030 non-null   float64
 2   ash           1030 non-null   float64
 3   water         1030 non-null   float64
 4   superplastic  1030 non-null   float64
 5   coarseagg     1030 non-null   float64
 6   fineagg       1030 non-null   float64
 7   age           1030 non-null   int64  
 8   strength      1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB

concrete.isnull().sum(0)

cement          0
slag            0
ash             0
water           0
superplastic    0
coarseagg       0
fineagg         0
age             0
strength        0
dtype: int64

각 열의 평균값 확인하기

concrete.mean()

cement          281.165631
slag             73.895485
ash              54.187136
water           181.566359
superplastic      6.203112
coarseagg       972.918592
fineagg         773.578883
age              45.662136
strength         35.817836
dtype: float64

slag 열의 데이터 분포 확인하기

concrete["slag"] #concrete.slag

0         0.0
1         0.0
2       142.5
3       142.5
4       132.4
        ...  
1025    116.0
1026      0.0
1027    139.4
1028    186.7
1029    100.5
Name: slag, Length: 1030, dtype: float64

위의 수치 데이터로는 특잇값을 한눈에 알아보기 어려우므로 plot으로 확인

concrete["slag"].plot()

<AxesSubplot:>

slag 데이터 값 중 300을 넘어가는 값이 확인됨
value_counts를 통해 초과되는값이 10개 있음을 알 수 있음

concrete["slag"]>300

0       False
1       False
2       False
3       False
4       False
        ...  
1025    False
1026    False
1027    False
1028    False
1029    False
Name: slag, Length: 1030, dtype: bool

(concrete["slag"]>300).value_counts()

False    1020
True       10
Name: slag, dtype: int64

boxplot으로 보다 편리하게 outlier 확인하기

from IPython.display import Image
Image(filename='boxplot.png', width=400)

boxplot() 함수는 이상치(플라이어(flier)라고 부름)를 감지하고 수염 부분에 이를 포함시키지 않음
1사분위가 𝑄1 이고 3사분위가 𝑄3 이라면 사분위수 범위는 𝐼𝑄𝑅=𝑄3−𝑄1 (이 값이 박스의 높이)
𝑄1−1.5×𝐼𝑄𝑅 보다 낮거나 𝑄3+1.5×𝐼𝑄𝑅 보다 높은 점수는 이상치로 간주

import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
plt.boxplot(concrete["slag"])

{'whiskers': [<matplotlib.lines.Line2D at 0x202cd7c23a0>,
  <matplotlib.lines.Line2D at 0x202cd7c2700>],
 'caps': [<matplotlib.lines.Line2D at 0x202cd7c2a60>,
  <matplotlib.lines.Line2D at 0x202cd7c2dc0>],
 'boxes': [<matplotlib.lines.Line2D at 0x202cd7c2040>],
 'medians': [<matplotlib.lines.Line2D at 0x202cd7ce160>],
 'fliers': [<matplotlib.lines.Line2D at 0x202cd7ce4c0>],
 'means': []}

모든 특성에 대해 outlier 확인 (seaborn 의 boxplot 이용)

concrete.columns

Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength'],
      dtype='object')

import seaborn as sns

for i in concrete.columns:
    plt.figure(figsize=(5,3))
    sns.boxplot(data=concrete[i])
    plt.title("{} distribution".format(i))

names = list(concrete.columns)

fig, axes = plt.subplots(len(names), 1, figsize=(4, 30), constrained_layout=True)

for i in range(len(names)):
    sns.boxplot(data = concrete[names[i]], ax=axes[i]).set_title(names[i] + ' distribution')

아래와 같이 outlier를 제외한 부분만 슬라이싱

q1 = concrete['slag'].quantile(0.25)
q3 = concrete['slag'].quantile(0.75)
iqr = q3-q1
n=(concrete['slag']<q1-1.5*iqr)|(concrete['slag']>q3+1.5*iqr)
concrete[n]

	cement	slag	ash	water	superplastic	coarseagg	fineagg	age	strength
553	239.6	359.4	0.0	185.7	0.0	941.6	664.3	7	25.422359
559	239.6	359.4	0.0	185.7	0.0	941.6	664.3	28	39.435269

concrete[n]

	cement	slag	ash	water	superplastic	coarseagg	fineagg	age	strength
553	239.6	359.4	0.0	185.7	0.0	941.6	664.3	7	25.422359
559	239.6	359.4	0.0	185.7	0.0	941.6	664.3	28	39.435269

아래와 같이 outlier를 제외한 부분만 슬라이싱 (전체 특성)

x = concrete.iloc[:,:-1]

q1 = x.quantile(0.25)
q3 = x.quantile(0.75)

iqr = q3 - q1

concrete.iloc[:,:-1] = x[(x >= q1-1.5*iqr) & (x <= q3+iqr*1.5)]
#concrete= concrete.fillna(concrete.median()) # 대체하고 싶은 경우

concrete.dropna(inplace=True)

아웃라이어들이 제거 되었음을 확인

concrete.describe()

	cement	slag	ash	water	superplastic	coarseagg	fineagg	age	strength
count	945.000000	945.000000	945.00000	945.000000	945.000000	945.000000	945.000000	945.000000	945.000000
mean	276.255714	74.192011	57.77291	180.526561	6.262228	974.975291	776.264868	32.453968	35.083287
std	102.121900	86.580718	64.33716	18.789734	5.356791	77.688240	75.092246	28.572875	16.827756
min	102.000000	0.000000	0.00000	127.000000	0.000000	801.000000	594.000000	1.000000	2.331808
25%	190.340000	0.000000	0.00000	164.900000	0.000000	932.000000	739.300000	7.000000	22.718234
50%	260.000000	20.000000	0.00000	183.800000	7.000000	968.000000	779.700000	28.000000	33.418902
75%	349.000000	145.000000	118.31000	192.000000	10.300000	1033.000000	822.200000	28.000000	45.084836
max	540.000000	342.100000	200.10000	228.000000	22.100000	1145.000000	945.000000	120.000000	82.599225

중복 데이터 확인하고 삭제하기

concrete.duplicated().sum()

concrete.duplicated().value_counts()

False    920
True      25
dtype: int64

concrete.drop_duplicates(inplace=True)

25개 삭제 되었음을 확인

concrete.describe()

	cement	slag	ash	water	superplastic	coarseagg	fineagg	age	strength
count	920.000000	920.000000	920.000000	920.000000	920.000000	920.000000	920.000000	920.000000	920.000000
mean	273.351359	72.176576	59.342826	181.053261	6.076527	976.623750	775.363152	32.307609	34.443325
std	101.730901	86.475855	64.487271	18.726740	5.274854	77.450242	75.175200	28.447534	16.344159
min	102.000000	0.000000	0.000000	127.000000	0.000000	801.000000	594.000000	1.000000	2.331808
25%	190.340000	0.000000	0.000000	166.610000	0.000000	932.000000	736.900000	7.000000	22.435549
50%	253.250000	19.000000	0.000000	185.000000	6.710000	971.800000	780.000000	28.000000	33.202062
75%	337.900000	143.250000	118.700000	192.000000	10.000000	1040.000000	821.550000	28.000000	44.305728
max	540.000000	342.100000	200.100000	228.000000	22.100000	1145.000000	945.000000	120.000000	82.599225

타겟 데이터를 제외한 특성 데이터만으로 plot 만들기
(그래프를 통해 각 특성들의 범위가 다름을 알 수 있음)

cnum = concrete.iloc[:,:-1]

cnum.plot()

<AxesSubplot:>

정규화, 표준화 작업하기

import numpy as np

def minmax_normalize(arr):
    return (arr-arr.min())/(arr.max()-arr.min())

def zscore_standize(arr):
    return (arr - arr.mean())/arr.std()

DataFrame의 apply 함수를 통해 정규화(normalize)
데이터가 0과 1사이의 범위로 변경되었음을 확인

concrete_mm = cnum.apply(minmax_normalize)

concrete_mm.plot()

<AxesSubplot:>

DataFrame의 apply 함수를 통해 표준화(standarize)
데이터의 평균이 0으로 맞추어졌음을 확인

concrete_zs = cnum.apply(zscore_standize)

concrete_zs.plot()

<AxesSubplot:>

2. 모델 훈련

특성과 레이블 준비
훈련세트와 테스트세트 분리
linearRegresssion 모델 훈련 (추가적으로 DecisionTree, RandomForest 훈련)
훈련 데이터로 성능 평가 - RMSE
교차검증 (모델간 성능 비교)
그리드 탐색 RandomForest : n_estimators (50, 100,200) 파라미터 탐색
최적 모델로 예측 - 테스트 세트

cross_val_score?

from sklearn.model_selection import train_test_split

X = concrete_zs
y = concrete.strength

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

LinearRegression 훈련

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

LinearRegression()

predict = lin_reg.predict(X_train)

from sklearn.metrics import mean_squared_error

lin_rmse = mean_squared_error(y_train, predict, squared=False)
lin_rmse

7.8871439080093575

DecisionTree 훈련

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state = 42)
tree_reg.fit(X_train, y_train)

DecisionTreeRegressor(random_state=42)

predict = tree_reg.predict(X_train)
tree_rmse = mean_squared_error(y_train, predict, squared=False)
tree_rmse

0.8704077845616891

RandomForest 훈련

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(X_train, y_train)

RandomForestRegressor(random_state=42)

predict = forest_reg.predict(X_train)
forest_rmse = mean_squared_error(y_train, predict, squared=False)
forest_rmse

2.103342396019283

교차 검증

from sklearn.model_selection import cross_val_score

lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=5)
lin_scores = np.sqrt(-lin_scores)
lin_scores.mean()

7.942955496320373

tree_scores = cross_val_score(tree_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=5)
tree_scores = np.sqrt(-tree_scores)
tree_scores.mean()

7.492408361697325

forest_scores = cross_val_score(forest_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=5)
forest_scores = np.sqrt(-forest_scores)
forest_scores.mean()

5.605067301171005

그리드 탐색 (세부튜닝)

from sklearn.model_selection import GridSearchCV

param_grid = [{'n_estimators' : [50, 100, 200]}]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, scoring='neg_mean_squared_error', cv=5, n_jobs = -1) # 3 * 5

grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42), n_jobs=-1,
             param_grid=[{'n_estimators': [50, 100, 200]}],
             scoring='neg_mean_squared_error')

grid_search.best_params_

{'n_estimators': 200}

cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

5.6762297701241184 {'n_estimators': 50}
5.653618133722385 {'n_estimators': 100}
5.638049536036248 {'n_estimators': 200}

최종 모델 예측

final_model = grid_search.best_estimator_

final_predictions = final_model.predict(X_test)

final_rmse = mean_squared_error(y_test, final_predictions, squared=False)
final_rmse

4.953541332926305

final_model.score(X_test, y_test)

0.9041542560167558


                
                    
        
    
    
    
    
    
                
                

                
                
                    
                        
                            

                            
                            Author And Source
                            


                            이 문제에 관하여(machine learning/Concrete Compressive Strength Dataset), 우리는 이곳에서 더 많은 자료를 발견하고 링크를 클릭하여 보았다
                                
                                https://velog.io/@bbkyoo/machine-learningConcrete-Compressive-Strength-Dataset
                            

                            
                            
                                저자 귀속: 원작자 정보가 원작자 URL에 포함되어 있으며 저작권은 원작자 소유입니다.
                            
                            
                                
                                
                                

                                
                                

                                우수한 개발자 콘텐츠 발견에 전념
                                (Collection and Share based on the CC Protocol.)

machine learning/Concrete Compressive Strength Dataset

콘크리트 데이터셋 도전

1. 데이터 탐색 및 전처리

2. 모델 훈련

Author And Source

좋은 웹페이지 즐겨찾기