Barbell-Lifts-Classification/Machine-Learning-on-Barbell-Lifts.Rmd at gh-pages · darkychin/Barbell-Lifts-Classification · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
---
title: "Machine Learning on Correct and Incorrect Barbell Lifts"
author: "Darky"
date: "1/6/2020"
output: html_document
---

## Executive Summary
This project is to create a model on correct and incorrect manner of barbell lifts of the participants, and then use the model to predict the actions from the test data set.

These manner is grouped under variable name *classe*.

Since this project's main focus is machine learning,
hence process such as getting and cleaning data is sourced to other files
which can be found in this directory.

These files are:

- DownloadData.R
- LoadData.R
- TidyData.R

Data source can be found under section **Citation**.

## Prepare Data
Download data to local.
```{r eval=FALSE}
## RUN Once only after you set up your working directory
source("DownloadData.R")
```

Load data **training** and **testing** into environment.
```{r cache=TRUE,warning=FALSE}
source("LoadData.R")
```

Tidy data, and save them as **tidyTraining** and **tidyTesting**.
```{r cache=TRUE,message=FALSE,warning=FALSE}
source("TidyData.R")
```

## Import Library
```{r warning=FALSE, message=FALSE}
library(dplyr)
library(caret)
```

## Data Exploration
```{r cache=TRUE}
head(tidyTraining,3)
```

## Feature Selection
After exploring the data, column *X* and *user_name* are not confounding factors for *classe*. Hence we will drop them from **tidyTraining** and then save it as **fsTraining**.
```{r results = 'hide'}
dropCol <- c("X","user_name")
fsTraining <- tidyTraining %>% select(-one_of(dropCol))
```

According to the [research paper](http://groupware.les.inf.puc-rio.br/public/papers/2013.Velloso.QAR-WLE.pdf), **Section 5.1 Feature extraction and selection** mentioned that features such as *mean*, *variance*,*standard deviation*, *max*, *min*, *amplitude*, *kurtosis* and *skewness* are calculated from the euler angles of the four sensors. Hence, these fields can be safely dropped as they represent as the summary of the raw measurements.
```{r results = 'hide'}
pattern <- "avg_|var_|stddev_|max_|min_|amplitude_|kurtosis_|skewness_"
fsTraining <- fsTraining %>% select(-matches(pattern, ignore.case = TRUE))
```

## Splitting Data
In order to do model selection and out of sample error calculation,
we will be splitting fsTraining into 60% and 40% as **fsTrainingSub** and **fsTestingSub** respectively.
```{r}
set.seed(1337)
trainIndex <- createDataPartition(fsTraining$classe, p = .6,
                                  list = FALSE,
                                  times = 1)
fsTrainingSub <- fsTraining[trainIndex,]
fsTestingSub <- fsTraining[-trainIndex,]
```

## Model Selection
We will be using a few model to decide which model to use.

### Rpart
A simple classification tree is used to create a model.
```{r cache=TRUE}
modelRpart <- train(classe~., data = fsTrainingSub, method = "rpart")
confusionMatrix(predict(modelRpart,fsTestingSub), fsTestingSub$classe)
```
The accuracy for this model is too low, hence we will avoid using this.

### Random Forest
```{r cache=TRUE}
modelRf <- train(classe~., data = fsTrainingSub, method = "rf")
confusionMatrix(predict(modelRf,fsTestingSub), fsTestingSub$classe)
```

Since the model random forest has higher accuracy,
we will be select it as our project model.

## Out of Sample Error Rate
The out of sample error rate can be calculated as below:
```{r}
cm <- confusionMatrix(predict(modelRf,fsTestingSub), fsTestingSub$classe)
errorRate <- 1 - cm$overall["Accuracy"]
errorRate
```

## Cross Validation
Method k-fold cross validation is used to validate the built model accuracy and mitigate overfitting.

The number of folds selected are 5.

Parallel processing is also turned on in this setting to minimize model building time.

These settings are inserted into **trainControlSett** variable.
```{r}
# Configure train control setting
trainControlSett <- trainControl(method = "cv",
                              number = 5,
                              allowParallel = TRUE)
```

## Parallel Processing
Set up parallel processing.
```{r warning=FALSE, message=FALSE}
# Configure parallel processing
# Note for parallel processing: https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
```

## Build Model
To have an accurate prediction for this project, we will required at least 99.9% accuracy.

Hence, the classification model selected will be randomForest,
which provides high prediction accuracy.

The data we will be using to train our model will be **fsTraining**.

The **trainControlSett** is applied in here along in the model building.

A seed is also set here to ensure reproducibility.
```{r cache=TRUE}
set.seed(1337)
model <- train(classe~., data = fsTraining, method = "rf",
               trControl = trainControlSett)
print(model)
```
From the model result, we can see that highest accuracy is achieved by having only 29 variables splitting at each tree node.

## Turn Off Parallel Processing
Deregister parallel processing cluster with code below.
```{r}
stopCluster(cluster)
registerDoSEQ()
```

## Predict Result
**tidyTesting** data set will be used to predict the 20 results of *classe* with the created model.
```{r}
predict(model,newdata = tidyTesting)
```

## Citation
Data source:

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H.

[Qualitative Activity Recognition of Weight Lifting Exercises](http://groupware.les.inf.puc-rio.br/work.jsf?p1=11201). Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13) . Stuttgart, Germany: ACM SIGCHI, 2013.