Porto Seguro Exploratory Analysis and Prediction - Prepare the Model


What I learned

Stacking models on top of each other can improve performance, because the ensemble combines several strong base models. Be aware, though, that the amount of computation grows accordingly.

Reference: link text
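As a quick illustration of the idea (separate from this kernel's custom Ensemble class below), scikit-learn's StackingClassifier expresses the same pattern: base models feed their out-of-fold predictions into a meta-model. A minimal sketch on synthetic data, not this kernel's actual pipeline:

# Minimal stacking illustration with scikit-learn's StackingClassifier (synthetic data, illustration only)
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=314)

# Base models produce out-of-fold predictions; the final_estimator (stacker) learns from them
stack_demo = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=100, random_state=314)),
                ('gb', GradientBoostingClassifier(random_state=314))],
    final_estimator=LogisticRegression(),
    cv=3)

print(cross_val_score(stack_demo, X_demo, y_demo, cv=3, scoring='roc_auc').mean())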

Code

# Initialize the ensembling object: very interesting point to me

stack = Ensemble(n_splits=3,
        stacker = log_model,
        base_models = (lgb_model1, lgb_model2, lgb_model3, xgb_model))  

Prepare the model

Ensemble class for validation and ensembling

Split data in KFolds

  • train the models

  • ensemble the results

  • init method parameters

    • self: the object to be initialized

    • n_splits: the number of cross-validation splits to be used

    • stacker: the model used for stacking the prediction results from the trained base models

    • base_models: the list of base models used in training


  • fit_predict performs four steps:

    • split the training data in n_splits folds;

    • run the base models

    • perform prediction using each model;

    • ensemble the results using the stacker.


Code

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

class Ensemble(object):
    def __init__(self, n_splits, stacker, base_models):
        self.n_splits = n_splits
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)

        folds = list(StratifiedKFold(n_splits=self.n_splits, shuffle=True, random_state=314).split(X,y))

        # Out-of-fold train predictions and fold-averaged test predictions, one column per base model
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):

            S_test_i = np.zeros((T.shape[0], self.n_splits))

            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]


                print("Base model %d: fit %s model | fold %d" % (i+1, str(clf).split('(')[0], j+1))
                clf.fit(X_train, y_train)
                cross_score = cross_val_score(clf, X_train, y_train, cv=3, scoring='roc_auc')
                print("cross_score[roc-auc]: %.5f [gini]: %.5f" % (cross_score.mean(), 2*cross_score.mean()-1))
                y_pred = clf.predict_proba(X_holdout)[:,1]

                S_train[test_idx, i] = y_pred
                S_test_i[:, j] = clf.predict_proba(T)[:,1]
            S_test[:, i] = S_test_i.mean(axis=1)

        results = cross_val_score(self.stacker, S_train, y, cv=3, scoring='roc_auc')
        #Calculate gini factor as 2 * AUC -1
        print("Stacker score[gini]: %.5f" % (2 * results.mean() -1))

        self.stacker.fit(S_train, y)
        res = self.stacker.predict_proba(S_test)[:,1]
        return res
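The printed [gini] values use the identity normalized Gini = 2*AUC - 1, which holds for a binary target with untied scores. Below is a small sanity-check sketch comparing that shortcut against the usual Kaggle-style normalized Gini implementation on random data (illustration only):

# Sanity check: normalized Gini equals 2*AUC - 1 for a binary target (random data, illustration only)
import numpy as np
from sklearn.metrics import roc_auc_score

def gini(actual, pred):
    # Area-based Gini: sort by prediction (descending) and accumulate the actual positives
    a = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=np.float64)
    a = a[np.lexsort((a[:, 2], -1 * a[:, 1]))]
    gini_sum = a[:, 0].cumsum().sum() / a[:, 0].sum()
    gini_sum -= (len(actual) + 1) / 2.0
    return gini_sum / len(actual)

def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

rng = np.random.RandomState(314)
y_true = rng.randint(0, 2, 1000)
y_score = rng.rand(1000)

auc = roc_auc_score(y_true, y_score)
print("2*AUC - 1      : %.5f" % (2 * auc - 1))
print("normalized Gini: %.5f" % gini_normalized(y_true, y_score))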

Preparing the stacking models

Parameters for the base models

Three different LightGBM models and one XGBoost model

Train data: cross-validated with 3 folds

#LightGBM params
#lgb_1

lgb_params1 = {}
lgb_params1['learning_rate'] = 0.02
lgb_params1['n_estimators'] = 650
lgb_params1['max_bin'] = 10
lgb_params1['subsample'] = 0.8
lgb_params1['subsample_freq'] = 10
lgb_params1['colsample_bytree'] = 0.8
lgb_params1['min_child_samples'] = 500
lgb_params1['seed'] = 314
lgb_params1['num_threads'] = 4

#lgb_2
lgb_params2 = {}

lgb_params2['n_estimators'] = 1000
lgb_params2['learning_rate'] = 0.02
lgb_params2['colsample_bytree'] = 0.3
lgb_params2['subsample'] = 0.7
lgb_params2['subsample_freq'] = 2
lgb_params2['num_leaves'] = 16
lgb_params2['seed'] = 314
lgb_params2['num_threads'] = 4

#lgb_3
lgb_params3 = {}
lgb_params3['n_estimators'] = 100
lgb_params3['max_depth'] = 4
lgb_params3['learning_rate'] = 0.02
lgb_params3['seed'] = 314
lgb_params3['num_threads'] = 4

#XGBoost params

xgb_params = {}
xgb_params['objective'] = 'binary:logistic'
xgb_params['learning_rate'] = 0.04
xgb_params['n_estimators'] = 490
xgb_params['max_depth'] = 4
xgb_params['subsample'] = 0.9
xgb_params['colsample_bytree'] = 0.9
xgb_params['min_child_weight'] = 10
xgb_params['n_jobs'] = 4  # XGBClassifier expects n_jobs; num_threads is a LightGBM parameter name
# Initialize the models with parameters

# 3 base models and the stacking model
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# Base models
lgb_model1 = LGBMClassifier(**lgb_params1)

lgb_model2 = LGBMClassifier(**lgb_params2)
       
lgb_model3 = LGBMClassifier(**lgb_params3)

xgb_model = XGBClassifier(**xgb_params)

# Stacking model
log_model = LogisticRegression()
# Run the predictive models
# fit_predict method of the stack object:
# - predict the target with each base model
# - ensemble the results using the stacker model and output the stacked result

y_prediction = stack.fit_predict(trainset, target_train, testset)
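y_prediction holds the stacked probabilities for the test set. A typical last step in a Kaggle kernel would be to write them into a submission file; a minimal sketch, assuming the original test DataFrame (called test_df here, a hypothetical name) still carries the id column:

# Write the stacked predictions to a submission file (test_df and its id column are assumed names)
import pandas as pd

submission = pd.DataFrame({'id': test_df['id'], 'target': y_prediction})
submission.to_csv('stacked_submission.csv', index=False)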
