Decision Tree to Classify Human Activity

Today, we will be using decision tree to perform classification prediction on UCI Machine Learning Repository's Human Activity Recognition dataset. The dataset is a compilation of various feature data collected after installing smartphone sensor to 30-participants. Based on the collected feature-set we will classify the type of action by each participant.

UCI Repository

After clicking Data Folder download UCI HAR Dataset.zip and change its name to human_activity.

Now that we have downloaded the dataset, we will begin the preprocessing of the data.

Input

# UCI Human Activity Recognition Using Smartphones
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

features_name_df = pd.read_csv('./human_activity/features.txt', sep='\s+',
                               header=None, names=['column_index', 'column_name'])

feature_name = feature_name_df.iloc[:, 1].values.tolist()

print('Extract 10 Feature Names : \n', feature_name[:10])

Output

Extract 10 Feature Names : 
 ['tBodyAcc-mean()-X', 'tBodyAcc-mean()-Y', 'tBodyAcc-mean()-Z', 'tBodyAcc-std()-X', 'tBodyAcc-std()-Y', 'tBodyAcc-std()-Z', 'tBodyAcc-mad()-X', 'tBodyAcc-mad()-Y', 'tBodyAcc-mad()-Z', 'tBodyAcc-max()-X']

Now let us check whether or not there are duplicate column names.

Input

feature_dup_df = feature_name_df.groupby('column_name').count()

print(feature_dup_df[feature_dup_df['column_index'] > 1].count())
feature_dup_df[feature_dup_df['column_index'] > 1].head()

Output

column_index    42
dtype: int64

There are total 42 duplicate feature-names, we will simply add _1 or _2 to initial feature names to create distinctive values.

Input

def get_new_feature_name_df(old_feature_name_df):
    feature_dup_df = pd.DataFrame(data=old_feature_name_df.groupby('column_name').cumcount(),
                                     columns=['dup_cnt'])

    feature_dup_df = feature_dup_df.reset_index()

    new_feature_name_df = pd.merge(old_feature_name_df.reset_index(), feature_dup_df, how='outer')
    new_feature_name_df['column_name'] = new_feature_name_df[['column_name', 'dup_cnt']].apply(lambda x: x[0]+'_'+str(x[1])
                                                                                                  if x[1] > 0 else x[0], axis=1)

    new_feature_name_df = new_feature_name_df.drop(['index'], axis=1)
    
    return new_feature_name_df

Now, we will create a function to manage dataset.

Input

import pandas as pd

def get_human_dataset( ):
    
    feature_name_df = pd.read_csv('./human_activity/features.txt',sep='\s+',
                        header=None,names=['column_index','column_name'])
    
    new_feature_name_df = get_new_feature_name_df(feature_name_df)
    
    feature_name = new_feature_name_df.iloc[:, 1].values.tolist()
    
    X_train = pd.read_csv('./human_activity/train/X_train.txt',sep='\s+', names=feature_name )
    X_test = pd.read_csv('./human_activity/test/X_test.txt',sep='\s+', names=feature_name)
    
    y_train = pd.read_csv('./human_activity/train/y_train.txt',sep='\s+',header=None,names=['action'])
    y_test = pd.read_csv('./human_activity/test/y_test.txt',sep='\s+',header=None,names=['action'])
    
    return X_train, X_test, y_train, y_test


X_train, X_test, y_train, y_test = get_human_dataset()

Input

print('### Train Dataset Info \n')
print(X_train.info())

Output

### Train Dataset Info 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7352 entries, 0 to 7351
Columns: 561 entries, tBodyAcc-mean()-X to angle(Z,gravityMean)
dtypes: float64(561)
memory usage: 31.5 MB
None

There are total 561 float-type feature data which do not need any categorical encoding.

Input

print(y_train['action'].value_counts())

Output

6    1407
5    1374
4    1286
1    1226
2    1073
3     986
Name: action, dtype: int64

Label data are fairly distributed without any bias.

Now, we will use DecisionTreeClassifier to classify the activity. We will set the hyper-parameters to default value and extract those afterwards.

Input

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt_clf = DecisionTreeClassifier(random_state=156)
dt_clf.fit(X_train, y_train)

pred = dt_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print('Accuracy of DecisionTreeClassifier : {0:4f} \n'.format(accuracy))

# extract hyper parameter of DecisionTreeClassifier
print('DecisionTreeClassifier Hyper Parameter : \n', dt_clf.get_params())

Output

Accuracy of DecisionTreeClassifier : 0.854768 

DecisionTreeClassifier Hyper Parameter : 
 {'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'presort': False, 'random_state': 156, 'splitter': 'best'}

The result has 85.48% of accuracy. Now we will see how the tree-depth of Decision Tree influences the accuracy of the model. We studied previously that the tree extends its depth until the tree satisfies the condition. Hence, we will gradually change the maximum depth of the tree to and observe the shift in accuracy.

Input

from sklearn.model_selection import GridSearchCV

params = {
    'max_depth' : [6,8,10,12,16,20,24]
}

grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1)
grid_cv.fit(X_train, y_train)

print('GridSearchCV Best Accuracy Score : {0:4f}'.format(grid_cv.best_score_))
print('GridSearchCV Best Hyperparamter : ', grid_cv.best_params_)

Output

GridSearchCV Best Accuracy Score : 0.852557
GridSearchCV Best Hyperparamter :  {'max_depth': 8}

GridSearchCV returns that the best hyper parameter is at maximum depth of 8 with accuracy of 85.26%. We will see how the accuracy changed based on the change in CV set.

Using GridSearchCV's cvresults, we will extract the mean test score of each steps.

Input

cv_results_df = pd.DataFrame(grid_cv.cv_results_)

cv_results_df[['param_max_depth', 'mean_test_score']]

Using the result from GridSearchCV, we will estimate the accuracy of the Decision Tree in separate train-set.

Input

max_depths = [6,8,10,12,16,20,24]

for depth in max_depths:
    dt_clf = DecisionTreeClassifier(max_depth=depth, random_state=156)
    dt_clf.fit(X_train, y_train)
    
    pred = dt_clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    
    print('{0} Max Depth Accuracy {1:4f}'.format(depth, accuracy))

Output

6 Max Depth Accuracy 0.855786
8 Max Depth Accuracy 0.870716
10 Max Depth Accuracy 0.867323
12 Max Depth Accuracy 0.864608
16 Max Depth Accuracy 0.857482
20 Max Depth Accuracy 0.854768
24 Max Depth Accuracy 0.854768

As we have expected, maximum depth at 8 returns the highest accuracy score of 87.07%.

Now, let us also try tuning the accuracy by changing minimum amount of samples to split the leaf-node.

Input

params = {
    'max_depth' : [8,12,16,20],
    'min_samples_split' : [16,24],
}

grid_cv = GridSearchCV(dt_clf, param_grid=params, scoring='accuracy', cv=5, verbose=1)
grid_cv.fit(X_train, y_train)

print('Grid Search CV Best Accuracy : {0:4f}'.format(grid_cv.best_score_))
print('Grid Search Best Hyper Parameter : ', grid_cv.best_params_)

Output

Grid Search CV Best Accuracy : 0.855005
Grid Search Best Hyper Parameter :  {'max_depth': 8, 'min_samples_split': 16}

Now, we will use identical trained estimator to perform classification in separate test-set.

Input

best_dt_clf = grid_cv.best_estimator_
pred1 = best_dt_clf.predict(X_test)
accuracy = accuracy_score(y_test, pred1)

print('Decision Tree Classifier Accuracy {0:4f}'.format(accuracy))

Output

Decision Tree Classifier Accuracy 0.871734

Now using barplot, we will visualize the importances of feature in the model.

Input

import seaborn as sns

ftr_importances_values = best_df_clf.feature_importances_

ftr_importances = pd.Series(ftr_importances_values, index=X_train.columns)
ftr_top20 = ftr_importances.sort_values(ascending=False)[:20]

plt.figure(figsize=(8,6))
plt.title('Feature Importances Top 20')

sns.barplot(x=ftr_top20, y=ftr_top20.index)
plt.show()

Output

좋은 웹페이지 즐겨찾기