[Coursera] How to win a data science competition - Week 4, Lecture 3
1. Ensemble Method
- Combining multiple machine learning models to obtain stronger predictions.
- Methods range from simple averaging to various weighted-averaging schemes.
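A minimal sketch of both schemes on toy data (the model choices and the 0.7/0.3 weights are arbitrary illustrations, not from the lecture):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# toy regression data, purely for illustration
X, y_toy = make_regression(n_samples=500, noise=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y_toy, random_state=1)

preds1 = RandomForestRegressor(random_state=1).fit(X_train, y_train).predict(X_test)
preds2 = LinearRegression().fit(X_train, y_train).predict(X_test)

simple_avg = (preds1 + preds2) / 2           # simple averaging
weighted_avg = 0.7 * preds1 + 0.3 * preds2   # weighted averaging (arbitrary weights)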
2. Bagging
Means averaging slightly different versions of the same model to improve accuracy
(1) Why Bagging?
: Prediction errors come from bias (underfitting) and variance (overfitting); averaging many slightly different models reduces the variance part.
(2) Parameters that control bagging
- Changing the seed
- Row sampling or bootstrapping
- Shuffling
- Column sampling
- Model-specific parameters
- Number of models (bags)
- Parallelism
(3) Example of bagging
# train is the training data, test is the test data, y is the target variable
import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
bags = 10
seed = 1
bagged_prediction = np.zeros(test.shape[0])
for n in range(bags):
    model.set_params(random_state=seed + n)  # update the seed so each bag differs
    model.fit(train, y)
    preds = model.predict(test)
    bagged_prediction += preds
# take the average of the predictions
bagged_prediction /= bags
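The loop above varies only the seed. A sketch of the row-sampling (bootstrapping) knob from the parameter list, continuing the same example, assuming train and y are pandas objects and using an arbitrary 80% sampling fraction:
rng = np.random.RandomState(seed)
bagged_prediction = np.zeros(test.shape[0])
for n in range(bags):
    # bootstrap: sample rows with replacement so every bag sees different data
    idx = rng.choice(len(train), size=int(0.8 * len(train)), replace=True)
    model.fit(train.iloc[idx], y.iloc[idx])
    bagged_prediction += model.predict(test)
bagged_prediction /= bags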
3. Boosting
A form of weighted averaging of models where each model is built sequentially, taking the performance of the previous models into account.
(1) Weight based boosting
Create weights according to a certain rule and add the weight as one of the features (a minimal sketch follows this list).
- Learning rate
- Number of estimators
- Input model - can be anything that accepts weights
- Sub boosting type : AdaBoost, LogitBoost
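A minimal weight-based boosting sketch using scikit-learn's AdaBoostRegressor (its default base model is a shallow decision tree); the hyperparameter values are arbitrary, and X_train/y_train/X_test are the toy data from the averaging sketch above:
from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor(
    n_estimators=100,    # number of estimators
    learning_rate=0.1,   # learning rate
    random_state=1,
)
# AdaBoost re-weights samples after each round: points with larger errors get
# larger weights, so the next estimator focuses on them
model.fit(X_train, y_train)
preds = model.predict(X_test)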
(2) Residual based boosting
Compute the error according to a certain rule and relabel y based on the old predictions, i.e., fit each new model to the residuals (see the sketch below).
- Learning rate
- Number of estimators
- Row sampling
- Column sampling
- Input model - better be trees
- Sub boosting type : Fully gradient based, Dart
The approach used by the dominant implementations: XGBoost, LightGBM, H2O's GBM, CatBoost, and so on!
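A hand-rolled sketch of the residual idea under squared loss (in practice you would just use one of the libraries above; the depth and learning rate are arbitrary, and X_train/y_train/X_test are again the toy data from earlier):
from sklearn.tree import DecisionTreeRegressor

learning_rate = 0.1
n_estimators = 100
prediction = np.zeros(len(y_train))  # start from an all-zero prediction
trees = []
for _ in range(n_estimators):
    residuals = y_train - prediction           # the "error" under squared loss
    tree = DecisionTreeRegressor(max_depth=3)  # input model: better be trees
    tree.fit(X_train, residuals)               # new y label = residual of old predictions
    prediction += learning_rate * tree.predict(X_train)
    trees.append(tree)

# at test time, sum every tree's (scaled) contribution
test_prediction = sum(learning_rate * t.predict(X_test) for t in trees)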
4. Stacking
Means making predictions with a number of models on a hold-out set and then training a different meta model on these predictions.
The most popular form of ensembling in predictive modeling, usually applied at the final stage.
(1) Stacking Example
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# split the training data so the meta model is trained on predictions it has not seen
training, valid, ytraining, yvalid = train_test_split(train, y, test_size=0.5)
model1 = RandomForestRegressor()
model2 = LinearRegression()
model1.fit(training, ytraining)
model2.fit(training, ytraining)
preds1 = model1.predict(valid)
preds2 = model2.predict(valid)
test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))
# specify the meta model
meta_model = LinearRegression()
# fit the meta model on the stacked predictions
meta_model.fit(stacked_predictions, yvalid)
# make predictions on the stacked predictions of the test data
final_predictions = meta_model.predict(stacked_test_predictions)
(2) Things to consider
- With time-sensitive data, respect time
- Diversity as important as performance
- Diversity may come from different algorithms or different input features
- Performance plateauing after N models
- Meta model is normally modest
5. StackNet
A scalable meta-modelling methodology that utilizes stacking to combine multiple models in a neural-network-like architecture of multiple levels.
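The original StackNet is a Java library; a rough Python sketch of the same multi-level idea with out-of-fold level-1 predictions (model choices arbitrary, toy X_train/y_train/X_test reused from above):
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

# level 1: out-of-fold predictions of several base models become new features
level1_models = [RandomForestRegressor(random_state=1), Ridge()]
level1_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=5) for m in level1_models])
level1_test = np.column_stack(
    [m.fit(X_train, y_train).predict(X_test) for m in level1_models])

# level 2: a meta model trained on the level-1 outputs; in StackNet every node
# of every level is a model, analogous to neurons in a neural network
meta = Ridge().fit(level1_train, y_train)
final = meta.predict(level1_test)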
6. Tips and Tricks
(1) 1st level tips
- 2-3 gradient boosted trees (lightgbm, xgboost, catboost)
- 2-3 Neural Net (keras, pytorch)
- 1-2 ExtraTrees/Random Forest
- 1-2 linear models, SVM
- 1-2 KNN
- 1 factorization machine (libfm)
- 1 SVM with nonlinear kernel if size/memory allows
(2) subsequent level tips
1) simpler algorithms
- gradient boosted trees with small depth like 2-3
- linear models with high regularization
- extra trees
- shallow networks
- KNN with Bray-Curtis distance
- Brute forcing a search for the best linear weights based on CV (see the sketch at the end of this section)
2) Feature engineering
- pairwise differences between meta features
- row-wise statistics like avg or std
- standard feature selection techniques
3) Be mindful of target leakage!
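As referenced in the list above, a hedged sketch of brute-forcing linear blend weights on out-of-fold predictions (reusing level1_train/level1_test/y_train from the StackNet sketch; the 0.01 grid step is arbitrary):
best_w, best_score = 0.0, float("inf")
for w in np.arange(0.0, 1.01, 0.01):         # grid over the first model's weight
    blend = w * level1_train[:, 0] + (1 - w) * level1_train[:, 1]
    score = np.mean((y_train - blend) ** 2)  # out-of-fold mean squared error
    if score < best_score:
        best_w, best_score = w, score

final = best_w * level1_test[:, 0] + (1 - best_w) * level1_test[:, 1]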