Kaggle Learn: Intro to Machine Learning Notes

These are my notes on Intro to Machine Learning from the Learn section of Kaggle, part of a sequence of three ML courses. It comes with exercises, so it suits beginners studying on their own. https://www.kaggle.com/learn/intro-to-machine-learning
import pandas as pd
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns
#dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
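
To see what dropna(axis=0) does without the Kaggle file, here is a minimal sketch on a made-up three-row table. The column values are hypothetical toy data, not rows from melb_data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Melbourne dataset
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1035000.0, np.nan, 1600000.0],
})

# axis=0 drops every row that contains at least one missing value
cleaned = toy.dropna(axis=0)

print(len(toy), "rows before,", len(cleaned), "rows after")
print(list(cleaned.index))  # surviving rows keep their original index labels
```

Because the surviving rows keep their original labels, the index of a cleaned table can have gaps. That is likely why the predictions output later in these notes shows row labels 1, 2, 4, 6, 7 rather than 0 to 4.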

We’ll use the dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y. So the code we need to save the house prices in the Melbourne data is:
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

The prediction target is Price. By convention, the data for the selected feature columns is called X.
X = melbourne_data[melbourne_features]
X.describe()
X.head()  # shows the first n rows; n defaults to 5
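
The two selection styles return different types, which is easy to miss. A small sketch on hypothetical toy data (the values are made up; only the column names echo the notes):

```python
import pandas as pd

# Hypothetical toy data standing in for the Melbourne table
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

y = toy.Price                   # dot notation returns a one-dimensional Series
X = toy[["Rooms", "Bathroom"]]  # a list of column names returns a DataFrame

print(type(y).__name__, y.shape)
print(type(X).__name__, X.shape)
```

scikit-learn estimators expect X to be two-dimensional (rows by features) and y to be one-dimensional, which is what this combination provides.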

The steps to building and using a model are:
  • Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
  • Fit: Capture patterns from provided data. This is the heart of modeling.
  • Predict: Just what it sounds like.
  • Evaluate: Determine how accurate the model’s predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable:
    from sklearn.tree import DecisionTreeRegressor
    #Define model. Specify a number for random_state to ensure same results each run
    melbourne_model = DecisionTreeRegressor(random_state=1)
    #Fit model
    melbourne_model.fit(X, y)
    
    print("Making predictions for the following 5 houses:")
    print(X.head())
    print("The predictions are")
    print(melbourne_model.predict(X.head()))
    

    Making predictions for the following 5 houses:
       Rooms  Bathroom  Landsize  Lattitude  Longtitude
    1      2       1.0     156.0   -37.8079    144.9934
    2      3       2.0     134.0   -37.8093    144.9944
    4      4       1.0     120.0   -37.8072    144.9941
    6      3       2.0     245.0   -37.8024    144.9993
    7      2       1.0     256.0   -37.8060    144.9954
    The predictions are
    [1035000. 1465000. 1600000. 1876000. 1636000.]

    MAE (Mean Absolute Error): check how accurate the model is.
    from sklearn.metrics import mean_absolute_error
    predicted_home_prices = melbourne_model.predict(X)
    mean_absolute_error(y, predicted_home_prices)
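
MAE is the average of the absolute differences between actual and predicted values. As a sanity check, the sketch below computes it by hand with NumPy and compares it with scikit-learn's mean_absolute_error. The prices are hypothetical numbers chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted prices
actual = np.array([150000.0, 200000.0, 250000.0])
predicted = np.array([160000.0, 190000.0, 280000.0])

# Absolute errors are 10000, 10000 and 30000; their mean is about 16666.67
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)

print(manual_mae, sklearn_mae)
```

Note that the snippet in the notes above scores the model on the same X it was fitted on, so that number is an in-sample score and tends to look better than performance on unseen houses.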
    

    Splitting the data with scikit-learn's train_test_split() (see https://www.cnblogs.com/bonelee/p/8036024.html):
    from sklearn.model_selection import train_test_split
    
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
    #Define model
    melbourne_model = DecisionTreeRegressor()
    #Fit model
    melbourne_model.fit(train_X, train_y)
    #get predicted prices on validation data
    val_predictions = melbourne_model.predict(val_X)
    print(mean_absolute_error(val_y, val_predictions))
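
When no test_size is given, train_test_split holds out 25% of the rows for validation. A minimal sketch with hypothetical placeholder arrays shows the resulting sizes and that the two parts do not share rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical placeholder data: 100 rows, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# random_state fixes the shuffle so the split is reproducible
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

print(len(train_X), len(val_X))  # 75 25

# No row lands in both the training and the validation part
overlap = set(train_y) & set(val_y)
print(len(overlap))  # 0
```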
    

    Fit on the training data and predict on the validation data.
    When we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes’ actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).
    This is a phenomenon called overfitting, where a model matches the training data almost perfectly, but does poorly in validation and other new data. On the flip side, if we make our tree very shallow, it doesn’t divide up the houses into very distinct groups.
    At an extreme, if a tree divides houses into only 2 or 4, each group still has a wide variety of houses. Resulting predictions may be far off for most houses, even in the training data (and it will be bad in validation too for the same reason). When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data, that is called underfitting.
    Here’s the takeaway: Models can suffer from either:
  • Overfitting: capturing spurious patterns that won’t recur in the future, leading to less accurate predictions, or
  • Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

  • We use validation data, which isn’t used in model training, to measure a candidate model’s accuracy. This lets us try many candidate models and keep the best one.
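
One way to see both failure modes is to vary the tree size and compare training and validation MAE. The sketch below is an illustration under assumptions: the data is synthetic (a noisy sine curve, not the Melbourne data), train_and_val_mae is a hypothetical helper, and tree size is controlled through DecisionTreeRegressor's max_leaf_nodes parameter.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=500)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def train_and_val_mae(max_leaf_nodes):
    """Hypothetical helper: fit a tree of limited size, return (train MAE, validation MAE)."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    return train_mae, val_mae

# None means no limit on the number of leaves
candidates = [2, 5, 25, 100, None]
results = {leaves: train_and_val_mae(leaves) for leaves in candidates}

for leaves, (train_mae, val_mae) in results.items():
    print(f"max_leaf_nodes={leaves}: train MAE={train_mae:.3f}, validation MAE={val_mae:.3f}")

# Keep the candidate with the lowest validation MAE
best = min(candidates, key=lambda leaves: results[leaves][1])
print("best max_leaf_nodes:", best)
```

A very small tree should show high error on both sets (underfitting), while the unlimited tree reaches near-zero training error without a matching gain on validation data (overfitting). Selecting by validation MAE rather than training MAE is what guards against the second case.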
