⌛1st Try

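When I first looked at the data, I replaced the outliers with the median, as I usually do. The features were highly correlated and mostly continuous, so I thought PCA would be a good fit. The data was also heavily skewed, and I assumed the standardization done before PCA would compensate for that to some extent. About three principal components came out, explaining more than 90% of the variance, so I thought it had gone well. For the model I tried a single XGB model.
However, the performance was very poor...
👉0.1959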
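
The first attempt can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the original notebook: the feature values, the 1.5 IQR rule for flagging outliers, and the helper name `replace_outliers_with_median` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def replace_outliers_with_median(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the column median."""
    out = df.copy()
    for col in out.columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (out[col] < q1 - k * iqr) | (out[col] > q3 + k * iqr)
        out.loc[mask, col] = out[col].median()
    return out


# Synthetic, strongly correlated features standing in for the size/weight columns.
rng = np.random.default_rng(0)
size = rng.gamma(shape=2.0, scale=1.0, size=500)
X = pd.DataFrame({
    "a": size + rng.normal(0, 0.05, 500),
    "b": 0.8 * size + rng.normal(0, 0.05, 500),
    "c": size ** 2 + rng.normal(0, 0.05, 500),
})

X_clean = replace_outliers_with_median(X)
X_scaled = StandardScaler().fit_transform(X_clean)  # PCA is scale-sensitive

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.cumsum())
```

PCA picks directions that maximize variance in the features without looking at the target, so a high explained-variance ratio does not by itself mean the components are good predictors.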

⌛2nd Try

So this time I dropped PCA and ran it again. The score improved considerably (lower is better).
👉0.1664

⌛3rd Try

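I set a goal of finishing in the top 10% and wrote down what I needed to do.

Remove the outliers again, referring to other participants' notebooks on Dacon
Ensemble several models with PyCaret
Switch to label encoding
Gather insights about the data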

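Looking at the data

id : sample ID
👉 Not needed.

Gender : sex of the abalone
👉 The values are m, f, and i; i seems to mean the sex is not yet known.

Lenght : abalone length

Diameter : abalone diameter

Height : abalone height

Whole Weight : total weight of the abalone

Shucked Weight : weight without the shell

Viscra Weight : viscera weight

Shell Weight : shell weight
👉 Shell weight plus shucked weight should come out close to the whole weight, but the gap is large. As someone on Dacon suggested, it may be because water drained out.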
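
The comparison described above can be checked with a couple of column operations. The rows below are made-up numbers; only the column names come from the dataset, and the `Gap` and `Gap Ratio` columns are hypothetical names introduced for this sketch.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names; the values are made up.
sample = pd.DataFrame({
    "Whole Weight": [0.60, 0.90, 0.35],
    "Shucked Weight": [0.25, 0.40, 0.15],
    "Shell Weight": [0.18, 0.25, 0.10],
})

# Gap between the whole weight and the sum of the two parts compared in the text.
sample["Gap"] = sample["Whole Weight"] - (sample["Shucked Weight"] + sample["Shell Weight"])
# Gap as a share of the whole weight, to judge how large the mismatch is.
sample["Gap Ratio"] = sample["Gap"] / sample["Whole Weight"]
print(sample)
```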

Target : abalone age

Renaming columns

Length was misspelled as Lenght, so I fixed it.

aba_train.rename(columns = {"Lenght":"Length"}, inplace = True)
aba_test.rename(columns = {"Lenght":"Length"}, inplace = True)

EDA

aba_df.hist(figsize = (14,10),bins =20)


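xcopy = aba_df.copy()
xcopy.drop(['id'],axis=1,inplace=True)
xcopy.dropna(inplace=True)

sns.set(font_scale = 1, rc = {'figure.figsize':(12,12)})
# Correlation is only defined for numeric columns, so the string column Gender is excluded.
ax = sns.heatmap(xcopy.select_dtypes("number").corr(), annot = True, fmt = ".2f", linewidths = 1, cmap="crest")
# Workaround for the top and bottom heatmap rows being cut off in some matplotlib versions.
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)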

sns.pairplot(aba_df, vars = ["Lenght","Diameter","Height","Whole Weight","Shucked Weight","Viscra Weight", "Shell Weight"])

fig, ax = plt.subplots(3, 3, figsize = (14, 14))

plt.suptitle("feature / target", fontsize=40)

feature = ["Length", "Diameter", "Height", "Whole Weight", "Shucked Weight", "Viscra Weight", "Shell Weight", "Gender"]

# Draw each feature against Target on its own subplot, in row-major order.
for xstr, axis in zip(feature, ax.flat):
    if xstr == "Gender":
        # Pass ax explicitly; otherwise the violin plot lands on the last subplot.
        sns.violinplot(x = xstr, y = 'Target', data = aba_train, ax = axis)
    else:
        sns.scatterplot(x = xstr, y = 'Target', data = aba_train, ax = axis)

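👉 The features are highly correlated with each other
The skewness is severe
The features are strongly related to the target
There is one outlier that stands out on its own
The outlier has to be removed. I had tried standardization and PCA to deal with the high correlation, but performance got worse, so this time I decided to proceed without them.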
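
One point worth checking about skewness: standardization is a linear rescaling, so it leaves skewness unchanged. The sketch below shows this on a synthetic right-skewed series. The `log1p` transform is included only as a common alternative for comparison; it is an assumption of this example and was not used in the attempts described here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed feature standing in for one of the weight columns.
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

standardized = (x - x.mean()) / x.std()  # same shape, only shifted and rescaled
logged = np.log1p(x)                     # compresses the long right tail

skew_raw = x.skew()
skew_std = standardized.skew()
skew_log = logged.skew()
print(f"raw={skew_raw:.3f} standardized={skew_std:.3f} log1p={skew_log:.3f}")
```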

Removing outliers

I removed the outliers by referring to a Dacon code share.

aba_train[aba_train["Target"]>20]
aba_train = aba_train.drop(index = [762], axis = 0)
aba_train = aba_train.drop(index = [47,382,435,847,1078], axis = 0)
# The diameter cannot be smaller than the length or the height.
aba_train[aba_train["Length"] > aba_train["Diameter"]]
aba_train[aba_train["Height"] > aba_train["Diameter"]]
# The viscera cannot weigh more than the abalone's body (shucked weight).
aba_train[aba_train["Shucked Weight"] < aba_train["Viscra Weight"]]
aba_train = aba_train.drop(index = [465], axis = 0)
# Given the shape, the height cannot be greater than the length.
aba_train[aba_train["Length"] < aba_train["Height"]]
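
The sanity checks above can also be collected into boolean masks, which avoids hard-coding row indices. This is a sketch on made-up rows; the function name `flag_inconsistent_rows` and the rule names are hypothetical, and which flagged rows to actually drop remains a manual decision.

```python
import pandas as pd


def flag_inconsistent_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per sanity rule; True marks a suspicious row."""
    return pd.DataFrame({
        "target_over_20": df["Target"] > 20,
        "viscera_heavier_than_meat": df["Viscra Weight"] > df["Shucked Weight"],
        "height_over_length": df["Height"] > df["Length"],
    }, index=df.index)


# Made-up rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "Length": [0.50, 0.45, 0.60],
    "Height": [0.15, 0.50, 0.20],
    "Shucked Weight": [0.30, 0.25, 0.10],
    "Viscra Weight": [0.10, 0.08, 0.20],
    "Target": [9, 11, 23],
})

flags = flag_inconsistent_rows(toy)
suspicious = toy[flags.any(axis=1)]   # rows that break at least one rule
print(flags.sum())                    # number of violations per rule
print(suspicious.index.tolist())
```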

Encoding

Someone in my study group said that one-hot encoding lowered performance, so I switched to label encoding.

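from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(aba_train["Gender"])
aba_train["Gender"] = le.transform(aba_train["Gender"])
# Apply the same fitted mapping to the test set so both frames use identical codes.
aba_test["Gender"] = le.transform(aba_test["Gender"])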
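
To make the difference between the two encodings concrete, here is a small sketch on made-up values (the series below are not the competition data). `LabelEncoder` stores its classes in sorted order, so the mapping can be read off `classes_`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up gender values for illustration.
train_gender = pd.Series(["M", "F", "I", "M"])
test_gender = pd.Series(["I", "F"])

le = LabelEncoder()
le.fit(train_gender)

# classes_ is sorted, so the mapping is F -> 0, I -> 1, M -> 2.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
train_encoded = le.transform(train_gender)
test_encoded = le.transform(test_gender)  # reuse the fitted encoder, do not refit

# One-hot encoding produces one column per category instead of a single integer column.
one_hot = pd.get_dummies(train_gender)
print(mapping, train_encoded, test_encoded, one_hot.shape)
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, which tree-based models usually tolerate better than linear ones.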

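Modeling - trying PyCaret

A tip when using PyCaret: catboost is not included by default, so you are supposed to install with !pip install pycaret[full]. That gave me an error for some reason, so I just installed catboost separately.
When PyCaret throws an error, running #!pip install delayed and then restarting worked for me.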

#!pip install pycaret
#!pip install catboost
#!pip install delayed
# drop() returns a new DataFrame, so the result has to be assigned back.
aba_train = aba_train.drop('id',axis=1)
aba_test = aba_test.drop('id',axis=1)
from pycaret.regression import *

reg = setup(aba_train, target = 'Target',numeric_features=list(aba_train.drop(columns = ['Target']).columns))
best = compare_models(sort = 'MAE')
# Top 3 models only
lar = create_model('lar', cross_validation = False)
lr = create_model('lr', cross_validation = False)
et = create_model('et', cross_validation = False)
# Hyperparameter tuning
tuned_lar = tune_model(lar, optimize = 'MAE', n_iter = 10)
tuned_lr = tune_model(lr, optimize = 'MAE', n_iter = 10)
tuned_et = tune_model(et, optimize = 'MAE', n_iter = 10)
# Blending
blend = blend_models(estimator_list = [tuned_lar,tuned_lr,tuned_et], optimize = 'MAE')
voting = finalize_model(blend)
sample_submission = pd.read_csv('/content/drive/MyDrive/전복나이예측/sample_submission.csv')
pred = voting.predict(aba_test)
pred = pred.astype(int)
sample_submission['Target'] = pred
sample_submission.to_csv('try5.csv', index=False)

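Oh, and at the very end you absolutely must convert the predictions to integers. The target is an integer, and when I submitted raw floats the error was huge... I think this is why the score had been so poor.

👉0.1524
With this, the score finally landed in a normal range.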
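
Note that `astype(int)` truncates toward zero rather than rounding. The sketch below compares the two on made-up numbers. The metric is an assumption: the post does not name it, so a normalized MAE (the hypothetical `nmae` helper) is used here purely for illustration.

```python
import numpy as np


def nmae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error normalized by the true value (assumed metric)."""
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))


# Made-up integer ages and continuous model outputs.
y_true = np.array([9, 10, 7, 12])
raw_pred = np.array([8.8, 10.6, 6.9, 11.7])

truncated = raw_pred.astype(int)          # drops the fraction: 8, 10, 6, 11
rounded = np.rint(raw_pred).astype(int)   # nearest integer: 9, 11, 7, 12

print(nmae(y_true, raw_pred), nmae(y_true, truncated), nmae(y_true, rounded))
```

In this toy example rounding scores lower than truncation; on real predictions it is worth comparing both on a validation split before submitting.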

⌛4th Try

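Oversampling
Split the target into ranges
Add features

From another participant's code share, I learned that whole weight = meat weight + viscera weight + shell weight. Based on that, I handled the outliers and also changed the feature, computing weight of water or blood = whole weight - (meat + viscera + shell).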

# Add a new feature
aba_train["Shell Water"] = aba_train["Whole Weight"] - (aba_train["Shucked Weight"] + aba_train["Shell Weight"]+aba_train["Viscra Weight"])
aba_train["Whole Weight"] = aba_train["Whole Weight"] - aba_train["Shell Water"]
aba_train = aba_train.drop(columns = ['Shell Water', 'id'], axis = 1)

aba_test["Shell Water"] = aba_test["Whole Weight"] - (aba_test["Shucked Weight"] + aba_test["Shell Weight"]+aba_test["Viscra Weight"])
aba_test["Whole Weight"] = aba_test["Whole Weight"] - aba_test["Shell Water"]
aba_test = aba_test.drop(columns = ['Shell Water', 'id'], axis = 1)
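
Since the same steps are applied to both frames, they could be wrapped in one helper. This is a sketch on made-up rows; the function name `rebuild_whole_weight` and the constant `PART_COLUMNS` are hypothetical.

```python
import pandas as pd

PART_COLUMNS = ["Shucked Weight", "Viscra Weight", "Shell Weight"]


def rebuild_whole_weight(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where 'Whole Weight' equals the sum of the three parts."""
    out = df.copy()
    # Residual attributed to water or blood in the text.
    residual = out["Whole Weight"] - out[PART_COLUMNS].sum(axis=1)
    out["Whole Weight"] = out["Whole Weight"] - residual
    return out.drop(columns=["id"], errors="ignore")


# Made-up rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "id": [1, 2],
    "Whole Weight": [0.80, 0.50],
    "Shucked Weight": [0.30, 0.20],
    "Viscra Weight": [0.15, 0.10],
    "Shell Weight": [0.25, 0.15],
})
result = rebuild_whole_weight(toy)
print(result)
```

After this step the residual itself is discarded, so `Whole Weight` becomes exactly the sum of the three part weights.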

I split the age into ranges and oversampled with SMOTE, but performance got even worse, so I removed it.

from imblearn.over_sampling import SMOTE
aba_train_T=aba_train['Target']
target=[]
for i in range(len(aba_train_T)):
    if aba_train_T.iloc[i] < 10:
        target.append(1)
    elif aba_train_T.iloc[i] < 15:
        target.append(2)
    else:
        target.append(3)
        
aba_train["Range"] = target
sns.countplot(x = aba_train["Range"])
aba_train_rd = aba_train.drop(columns = ['Range'], axis = 1)
aba_train_rd.head()
aba_train_r = aba_train['Range']
aba_train_r.head()
oversample = SMOTE(random_state=123)
x_over, y_over = oversample.fit_resample(aba_train_rd, aba_train_r)
print(y_over.value_counts())
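
The binning loop above can also be written with `pd.cut`. This is a sketch on made-up target values, meant to produce the same three ranges as the `< 10`, `< 15`, else branches.

```python
import pandas as pd

# Made-up ages, including the boundary values 10 and 15.
target = pd.Series([4, 9, 10, 14, 15, 22], name="Target")

# right=False makes the bins [-inf, 10), [10, 15), [15, inf),
# matching the `< 10`, `< 15`, else branches of the loop.
age_range = pd.cut(
    target,
    bins=[-float("inf"), 10, 15, float("inf")],
    right=False,
    labels=[1, 2, 3],
).astype(int)

print(age_range.tolist())
print(age_range.value_counts().sort_index())
```

Because `aba_train_rd` still contains `Target` when it is passed to SMOTE, the synthetic rows also get interpolated target values, which are generally not integers.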

👉0.1485
This brought the score down into the 0.14 range.

Reference

https://dacon.io/competitions/official/235877/codeshare/4711?page=1&dtype=recent
https://today-1.tistory.com/17
https://dacon.io/competitions/official/235745/codeshare/2958
https://dacon.io/competitions/official/235877/codeshare/4714?page=1&dtype=recent

Author And Source

For more on this topic ([Dacon] Abalone Age Prediction Competition), see the original post: https://velog.io/@macang15/데이콘-전복나이예측경진대회
