[Coursera] How to win a data science competition - Week 4, Lecture 3
1. Ensemble Method
- Combining multiple machine learning models to obtain stronger predictions.
- Methods range from simple averaging to various weighted-averaging schemes.
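A minimal sketch of both schemes on toy data (the model choices and the 0.7/0.3 weights are arbitrary illustrations, not from the lecture):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# toy regression data, purely for illustration
X, y_toy = make_regression(n_samples=500, noise=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y_toy, random_state=1)

preds1 = RandomForestRegressor(random_state=1).fit(X_train, y_train).predict(X_test)
preds2 = LinearRegression().fit(X_train, y_train).predict(X_test)

simple_avg = (preds1 + preds2) / 2           # simple averaging
weighted_avg = 0.7 * preds1 + 0.3 * preds2   # weighted averaging (arbitrary weights)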
2. Bagging
Means averaging slightly different versions of the same model to improve accuracy
(1) Why Bagging?
: Prediction errors come from bias (underfitting) and variance (overfitting); averaging many slightly different models reduces the variance part.
(2) Parameters that control bagging
- Changing the seed
- Row sampling or bootstrapping
- Shuffling
- Column sampling
- Model-specific parameters
- Number of models (bags)
- Parallelism
(3) Example of bagging
# train is the training data, test is the test data, y is the target variable
import numpy as np
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
bags = 10
seed = 1
bagged_prediction = np.zeros(test.shape[0])
for n in range(bags):
    model.set_params(random_state=seed + n)  # update the seed so each bag differs
    model.fit(train, y)
    preds = model.predict(test)
    bagged_prediction += preds
# take the average of the predictions
bagged_prediction /= bags
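The loop above varies only the seed. A sketch of the row-sampling (bootstrapping) knob from the parameter list, continuing the same example, assuming train and y are pandas objects and using an arbitrary 80% sampling fraction:
rng = np.random.RandomState(seed)
bagged_prediction = np.zeros(test.shape[0])
for n in range(bags):
    # bootstrap: sample rows with replacement so every bag sees different data
    idx = rng.choice(len(train), size=int(0.8 * len(train)), replace=True)
    model.fit(train.iloc[idx], y.iloc[idx])
    bagged_prediction += model.predict(test)
bagged_prediction /= bags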
3. Boosting
A form of weighted averaging of models where each model is built sequentially, taking the performance of the previous models into account.
(1) Weight based boosting
Create weights according to a certain rule and add the weight as one of the features (a minimal sketch follows this list).
- Learning rate
- Number of estimators
- Input model - can be anything that accepts weights
- Sub boosting type : AdaBoost, LogitBoost
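A minimal weight-based boosting sketch using scikit-learn's AdaBoostRegressor (its default base model is a shallow decision tree); the hyperparameter values are arbitrary, and X_train/y_train/X_test are the toy data from the averaging sketch above:
from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor(
    n_estimators=100,    # number of estimators
    learning_rate=0.1,   # learning rate
    random_state=1,
)
# AdaBoost re-weights samples after each round: points with larger errors get
# larger weights, so the next estimator focuses on them
model.fit(X_train, y_train)
preds = model.predict(X_test)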
(2) Residual based boosting
Compute the error according to a certain rule and relabel y based on the old predictions, i.e., fit each new model to the residuals (see the sketch below).
- Learning rate
- Number of estimators
- Row sampling
- Column sampling
- Input model - better be trees
- Sub boosting type : Fully gradient based, Dart
The approach used by the dominant implementations: XGBoost, LightGBM, H2O's GBM, CatBoost, and so on!
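A hand-rolled sketch of the residual idea under squared loss (in practice you would just use one of the libraries above; the depth and learning rate are arbitrary, and X_train/y_train/X_test are again the toy data from earlier):
from sklearn.tree import DecisionTreeRegressor

learning_rate = 0.1
n_estimators = 100
prediction = np.zeros(len(y_train))  # start from an all-zero prediction
trees = []
for _ in range(n_estimators):
    residuals = y_train - prediction           # the "error" under squared loss
    tree = DecisionTreeRegressor(max_depth=3)  # input model: better be trees
    tree.fit(X_train, residuals)               # new y label = residual of old predictions
    prediction += learning_rate * tree.predict(X_train)
    trees.append(tree)

# at test time, sum every tree's (scaled) contribution
test_prediction = sum(learning_rate * t.predict(X_test) for t in trees)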
4. Stacking
Means making predictions with a number of models on a hold-out set and then training a different meta model on these predictions.
The most popular form of ensembling in predictive modeling, usually applied at the final stage.
(1) Stacking Example
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# split the training data so the meta model is trained on predictions it has not seen
training, valid, ytraining, yvalid = train_test_split(train, y, test_size=0.5)
model1 = RandomForestRegressor()
model2 = LinearRegression()
model1.fit(training, ytraining)
model2.fit(training, ytraining)
preds1 = model1.predict(valid)
preds2 = model2.predict(valid)
test_preds1 = model1.predict(test)
test_preds2 = model2.predict(test)
stacked_predictions = np.column_stack((preds1, preds2))
stacked_test_predictions = np.column_stack((test_preds1, test_preds2))
# specify the meta model
meta_model = LinearRegression()
# fit the meta model on the stacked predictions
meta_model.fit(stacked_predictions, yvalid)
# make predictions on the stacked predictions of the test data
final_predictions = meta_model.predict(stacked_test_predictions)
(2) Things to consider
- With time-sensitive data, respect time
- Diversity as important as performance
- Diversity may come from different algorithms or different input features
- Performance plateauing after N models
- Meta model is normally modest
5. StackNet
A scalable meta-modelling methodology that utilizes stacking to combine multiple models in a neural-network-like architecture of multiple levels.
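The original StackNet is a Java library; a rough Python sketch of the same multi-level idea with out-of-fold level-1 predictions (model choices arbitrary, toy X_train/y_train/X_test reused from above):
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

# level 1: out-of-fold predictions of several base models become new features
level1_models = [RandomForestRegressor(random_state=1), Ridge()]
level1_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=5) for m in level1_models])
level1_test = np.column_stack(
    [m.fit(X_train, y_train).predict(X_test) for m in level1_models])

# level 2: a meta model trained on the level-1 outputs; in StackNet every node
# of every level is a model, analogous to neurons in a neural network
meta = Ridge().fit(level1_train, y_train)
final = meta.predict(level1_test)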
6. Tips and Tricks
(1) 1st level tips
- 2-3 gradient boosted trees (lightgbm, xgboost, catboost)
- 2-3 Neural Net (keras, pytorch)
- 1-2 ExtraTrees/Random Forest
- 1-2 linear models, SVM
- 1-2 KNN
- 1 factorization machine (libfm)
- 1 SVM with nonlinear kernel if size/memory allows
(2) subsequent level tips
1) simpler algorithms
- gradient boosted trees with small depth like 2-3
- linear models with high regularization
- extra trees
- shallow networks
- KNN with Bray-Curtis distance
- Brute forcing a search for the best linear weights based on CV (see the sketch at the end of this section)
2) Feature engineering
- pairwise differences between meta features
- row-wise statistics like avg or std
- standard feature selection techniques
3) Be mindful of target leakage!
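As referenced in the list above, a hedged sketch of brute-forcing linear blend weights on out-of-fold predictions (reusing level1_train/level1_test/y_train from the StackNet sketch; the 0.01 grid step is arbitrary):
best_w, best_score = 0.0, float("inf")
for w in np.arange(0.0, 1.01, 0.01):         # grid over the first model's weight
    blend = w * level1_train[:, 0] + (1 - w) * level1_train[:, 1]
    score = np.mean((y_train - blend) ** 2)  # out-of-fold mean squared error
    if score < best_score:
        best_w, best_score = w, score

final = best_w * level1_test[:, 0] + (1 - best_w) * level1_test[:, 1]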