Introduction
Our goal is to predict the depth and size of a pothole based on features collecting by vehicle running over potholes such as vehicle speed,acceleration ,direction ect. The number of distinct labels Label_Combined B1,..B5 represents various increasing size and depth of potholes.
Load Required Packages
The pacman package provides a convenient way to load packages. It installs the package before loading if it not already installed.One of my favorite themes that I use with ggplot is the theme_pubclean. Here I set all themes with ggplot by it.
rm(list=ls())
#set.seed(4)
set.seed(7)
pacman::p_load(tidyverse,janitor,DataExplorer,skimr,ggpubr,viridis,kableExtra,caret,lightgbm,recipes,rsample,yardstick,pROC,xgboost,doParallel,mlr,readxl,stringr,parallel,VIM,GGally,DALEX,MLmetrics)
#Load variable importance plot
source("Varplot.R")
source("EvaluationMetrics.R")
theme_set(theme_pubclean())
Equivalently the rio package provides a unified approach to importing a variety of file formats into R. The date variable is not in a standard date format so we convert to a standard format thus separating year, month and day by ‘-‘. We can take a look at the first few rows of the data afterwards. The str_c function from the stringr package works similarly to the paste function
potholedata<- rio::import("/POTHOLEDETECTION/pothole.xlsx",sheet="Sheet6")
potholedata<-potholedata%>%mutate_if(is.character,as_factor)%>%rename(Label_Combined=Feature,Label=Feature2)
Date=paste(substr(20180709, 1, 4),substr(20180709, 5, 6),substr(20180709, 7, 8),sep = "-")
#Equivalently
Date<-str_c(substr(20180709, 1, 4),"-",substr(20180709, 5, 6),"-",substr(20180709, 7, 8))
extracdate<-function(x){
d<-str_c(str_sub(x, 1, 4),str_sub(x, 5, 6),str_sub(x, 7, 8),sep = "-")
return(d)
}
potholedata$Date<-as.Date(as_vector(purrr::map(potholedata$Date,extracdate)) )
#sapply(potholedata$Date, extracdate)
potholedata%>%head()%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Label_Combined | Label | Peak.Duration | Vehicle.Speed..mph. | Peak.Z.Accel | Peak.Lat.Accel | Peak.Long.Accel | Front.Slip | Right.Slip | Left.Slip | Direction | MLT1.Feature | MLT2.Feature | File | Date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
B1 | P1 | 0.544 | 11.93148 | 1.327386 | 1.1773803 | 0.4910319 | 0.4385430 | 0.1112423 | 0.3469242 | Left3 | Type1 | N/A | Cal_EL1000Avg_1014 | 2018-07-09 |
B1 | P1 | 0.534 | 12.45331 | 1.335112 | 1.1289830 | 0.8495341 | 0.5327276 | 0.1025403 | 0.4539765 | Left1 | Type1 | N/A | Cal_EL1000Avg_1014 | 2018-07-09 |
B1 | P1 | 0.470 | 14.43407 | 1.383979 | 1.4389222 | 1.2037202 | 0.7060992 | 0.1216020 | 0.6253896 | Left1 | Type1 | N/A | Cal_EL1000Avg_1014 | 2018-07-09 |
B1 | P1 | 0.464 | 14.97096 | 1.317055 | 1.0312775 | 0.9508938 | 0.7458671 | 0.1410580 | 0.7275356 | Left1 | Type1 | N/A | Cal_EL1000Avg_1519 | 2018-07-09 |
B1 | P1 | 0.416 | 16.11388 | 1.325739 | 1.1956568 | 1.1412229 | 0.5629772 | 0.1215295 | 0.3926258 | Left3 | Type1 | N/A | Cal_EL1000Avg_1519 | 2018-07-09 |
B1 | P1 | 0.184 | 24.47751 | 1.639389 | 0.4963087 | 0.1579519 | 0.3131939 | 0.1071209 | 0.4625270 | Left3 | Hump | Crest | Cal_EL1000Avg_2024 | 2018-07-09 |
Exploratory Data Analysis
We can take a pictorial look at the variables in the dataset courtesy the plot_str from the DataExplorer package.Alternatively the glimpse function in tidyverse can also be used to inspect the variables.
plot_str(potholedata)
glimpse(potholedata)
## Observations: 280
## Variables: 15
## $ Label_Combined <fct> B1, B1, B1, B1, B1, B1, B1, B1, B1, B1, B1...
## $ Label <fct> P1, P1, P1, P1, P1, P1, P1, P1, P1, P1, P1...
## $ Peak.Duration <dbl> 0.544, 0.534, 0.470, 0.464, 0.416, 0.184, ...
## $ Vehicle.Speed..mph. <dbl> 11.93148, 12.45331, 14.43408, 14.97096, 16...
## $ Peak.Z.Accel <dbl> 1.327386, 1.335112, 1.383979, 1.317055, 1....
## $ Peak.Lat.Accel <dbl> 1.1773803, 1.1289830, 1.4389222, 1.0312775...
## $ Peak.Long.Accel <dbl> 0.4910319, 0.8495341, 1.2037202, 0.9508938...
## $ Front.Slip <dbl> 0.4385430, 0.5327276, 0.7060992, 0.7458671...
## $ Right.Slip <dbl> 0.11124231, 0.10254033, 0.12160195, 0.1410...
## $ Left.Slip <dbl> 0.3469242, 0.4539765, 0.6253896, 0.7275356...
## $ Direction <fct> Left3, Left1, Left1, Left1, Left3, Left3, ...
## $ MLT1.Feature <fct> Type1, Type1, Type1, Type1, Type1, Hump, T...
## $ MLT2.Feature <fct> N/A, N/A, N/A, N/A, N/A, Crest, N/A, N/A, ...
## $ File <fct> Cal_EL1000Avg_1014, Cal_EL1000Avg_1014, Ca...
## $ Date <date> 2018-07-09, 2018-07-09, 2018-07-09, 2018-...
The labels distribution is not severely imbalanced to create concern in evaluation of the models which shall be built later.
table(potholedata$Label)
##
## P1 P2 P3 P4 P5 P6 P7 P8 P9
## 33 57 55 32 33 22 22 13 13
#check classes distribution
prop.table(table(potholedata$Label))*100
##
## P1 P2 P3 P4 P5 P6 P7
## 11.785714 20.357143 19.642857 11.428571 11.785714 7.857143 7.857143
## P8 P9
## 4.642857 4.642857
There are no missing observations in the data. Handling missing data in machine learning is very important. Several ways of dealing with missing values include mean and median imputation for continous and catergorical variables respectively. Several supervised learning algorthms such as random forest are also commonly used to impute missing values. Missing value imputation is a topic that would be looked at in depth later on.
aggr(potholedata , col=c('navyblue','yellow'),
numbers=TRUE, sortVars=TRUE,
labels=names(potholedata), cex.axis=.7,
gap=3, ylab=c("Missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Label_Combined 0
## Label 0
## Peak.Duration 0
## Vehicle.Speed..mph. 0
## Peak.Z.Accel 0
## Peak.Lat.Accel 0
## Peak.Long.Accel 0
## Front.Slip 0
## Right.Slip 0
## Left.Slip 0
## Direction 0
## MLT1.Feature 0
## MLT2.Feature 0
## File 0
## Date 0
plot_missing(potholedata)
There exist significant correlation between some of the variables. This is not suprising because some of the features were engineered from others already present.
plot_correlation(potholedata,type = "continuous",theme_config = list(legend.position = "bottom", axis.text.x =
element_text(angle = 90)))
potholedata%>%select(Label_Combined ,Peak.Duration, Vehicle.Speed..mph., Peak.Z.Accel,Peak.Lat.Accel, Peak.Long.Accel,Front.Slip, Right.Slip, Left.Slip, Direction)%>%ggpairs(title = "")+
theme(legend.position = "top")
pn<-potholedata%>%select(Label_Combined ,Peak.Duration, Vehicle.Speed..mph., Peak.Z.Accel,Peak.Lat.Accel, Peak.Long.Accel,Front.Slip, Right.Slip, Left.Slip, Direction)%>%ggpairs(title = "",mapping = aes(color = Label_Combined ))+
theme(legend.position = "top")
pn
We can select the elements in the matrix plot with their indices.
pn[1,1]
pn[1,2]
pn[1,3]
pn[1,4]
pn[1,5]
pn[1,6]
pn[1,7]
pn[1,8]
pn[1,9]
pn[1,10]
pn[3,2]
pn[6,5]
pn[5,3]
pn[6,4]
pn[6,5]
pn[6,3]
The summary statistics for each variable in the data
skimmed <-skim_to_wide(potholedata)
skimmed%>%
kable() %>%
kable_styling()
type | variable | missing | complete | n | min | max | median | n_unique | top_counts | ordered | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date | Date | 0 | 280 | 280 | 2018-07-09 | 2018-07-16 | 2018-07-09 | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
factor | Direction | 0 | 280 | 280 | NA | NA | NA | 3 | Lef: 211, Lef: 54, Lef: 15, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
factor | File | 0 | 280 | 280 | NA | NA | NA | 12 | Cal: 44, Cal: 43, Cal: 37, Cal: 32 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
factor | Label | 0 | 280 | 280 | NA | NA | NA | 9 | P2: 57, P3: 55, P1: 33, P5: 33 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
factor | Label_Combined | 0 | 280 | 280 | NA | NA | NA | 5 | B3: 87, B4: 77, B2: 57, B1: 33 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
factor | MLT1.Feature | 0 | 280 | 280 | NA | NA | NA | 3 | Typ: 166, Typ: 113, Hum: 1, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
factor | MLT2.Feature | 0 | 280 | 280 | NA | NA | NA | 2 | N/A: 279, Cre: 1, NA: 0 | FALSE | NA | NA | NA | NA | NA | NA | NA | NA |
numeric | Front.Slip | 0 | 280 | 280 | NA | NA | NA | NA | NA | NA | 1.76 | 0.84 | 0.31 | 1.02 | 1.62 | 2.34 | 3.95 | ▅▇▇▅▇▂▂▁ |
numeric | Left.Slip | 0 | 280 | 280 | NA | NA | NA | NA | NA | NA | 2.12 | 1.08 | 0.35 | 1.24 | 2.08 | 2.85 | 4.93 | ▅▇▆▇▆▃▁▂ |
numeric | Peak.Duration | 0 | 280 | 280 | NA | NA | NA | NA | NA | NA | 0.68 | 0.51 | 0.1 | 0.32 | 0.5 | 0.92 | 2.6 | ▇▇▂▂▂▁▁▁ |
numeric | Peak.Lat.Accel | 0 | 280 | 280 | NA | NA | NA | NA | NA | NA | 2.88 | 1.41 | 0.22 | 1.74 | 2.66 | 3.77 | 6.22 | ▂▇▇▇▆▃▃▂ |
numeric | Peak.Long.Accel | 0 | 280 | 280 | NA | NA | NA | NA | NA | NA | 2.46 | 1.17 | 0.16 | 1.5 | 2.33 | 3.46 | 5.32 | ▂▇▇▇▆▆▃▁ |
numeric | Peak.Z.Accel | 0 | 280 | 280 | NA | NA | NA | NA | NA | NA | 3.42 | 1.29 | 1.32 | 2.44 | 3.12 | 4.25 | 7.09 | ▅▇▇▅▃▃▂▁ |
numeric | Right.Slip | 0 | 280 | 280 | NA | NA | NA | NA | NA | NA | 0.34 | 0.18 | 0.071 | 0.19 | 0.33 | 0.46 | 0.86 | ▇▇▆▇▅▃▁▁ |
numeric | Vehicle.Speed..mph. | 0 | 280 | 280 | NA | NA | NA | NA | NA | NA | 25.36 | 15.58 | 3.64 | 12.45 | 22.55 | 36.22 | 61.73 | ▇▇▇▅▃▂▃▂ |
#skimr::skim(potholedata)
mlr::summarizeColumns(potholedata)%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
name | type | na | mean | disp | median | mad | min | max | nlevs |
---|---|---|---|---|---|---|---|---|---|
Label_Combined | factor | 0 | NA | 0.6892857 | NA | NA | 26.0000000 | 87.0000000 | 5 |
Label | factor | 0 | NA | 0.7964286 | NA | NA | 13.0000000 | 57.0000000 | 9 |
Peak.Duration | numeric | 0 | 0.6762429 | 0.5070269 | 0.4985000 | 0.3484110 | 0.1040000 | 2.6050000 | 0 |
Vehicle.Speed..mph. | numeric | 0 | 25.3645046 | 15.5784650 | 22.5494956 | 16.2697198 | 3.6364981 | 61.7288470 | 0 |
Peak.Z.Accel | numeric | 0 | 3.4220157 | 1.2917025 | 3.1230914 | 1.2859451 | 1.3170555 | 7.0852490 | 0 |
Peak.Lat.Accel | numeric | 0 | 2.8756696 | 1.4096700 | 2.6602507 | 1.4832631 | 0.2202256 | 6.2201228 | 0 |
Peak.Long.Accel | numeric | 0 | 2.4605413 | 1.1737102 | 2.3336269 | 1.3882734 | 0.1579519 | 5.3205001 | 0 |
Front.Slip | numeric | 0 | 1.7559611 | 0.8388320 | 1.6233722 | 0.9514858 | 0.3131939 | 3.9547484 | 0 |
Right.Slip | numeric | 0 | 0.3431604 | 0.1764695 | 0.3345577 | 0.2042645 | 0.0711265 | 0.8599148 | 0 |
Left.Slip | numeric | 0 | 2.1245021 | 1.0781293 | 2.0771931 | 1.1810550 | 0.3469242 | 4.9332445 | 0 |
Direction | factor | 0 | NA | 0.2464286 | NA | NA | 15.0000000 | 211.0000000 | 3 |
MLT1.Feature | factor | 0 | NA | 0.4071429 | NA | NA | 1.0000000 | 166.0000000 | 3 |
MLT2.Feature | factor | 0 | NA | 0.0035714 | NA | NA | 1.0000000 | 279.0000000 | 2 |
File | factor | 0 | NA | 0.8428571 | NA | NA | 4.0000000 | 44.0000000 | 12 |
Date | Date | 0 | NA | NA | NA | NA | 4.0000000 | 276.0000000 | 2 |
Model Building
The rsample package can be used to split the data into training and test set.
library(rsample)
pothole_data<-potholedata%>%dplyr::select(-Date,-Label,-File,-Date,-MLT2.Feature)
data_split <- initial_split(pothole_data, strata = "Label_Combined", prop = 0.8)
pothole_train <- training(data_split)
pothole_test <- testing(data_split)
The recipes package makes prepocessing step in machine learning very convenient. Most preprocessing steps involves one single line of code. The first step is to apply the Yeo-Johnson Power transformation. It is similar to the popular Box-Cox Transformation but includes transformation family of variables that include negative values. We do this transformation to stabilize variance and make the data more normal distribution-like. Next is to center and standardize all numeric variables in the data to a mean of zero and a unit variance. With step-other function we lump factor levels that occur in <= 10% of data as “other”.step-dummy createsone-hot encoding or dummy variables for all nominal predictor factor variables except the response One-hot encoding.
pothole_recipe2 <- recipe(Label_Combined ~ ., data = pothole_train ) %>%
#Transform numeric skewed predictors
step_YeoJohnson(all_numeric()) %>%
# standardize the data
step_center(all_numeric(), -all_outcomes()) %>%
#scale the data
step_scale(all_numeric(), -all_outcomes()) %>%
#step_kpca a specification of a recipe step that will convert numeric data into one or more principal components using a kernel basis expansion.
#step_kpca(all_numeric(), num=6)%>%
#step_log(Label, base = 10)
# Lump factor levels that occur in <= 10% of data as "other"
step_other(Direction, MLT1.Feature, threshold = 0.1) %>%
# Create dummy variables for all nominal predictor factor variables except the response
step_dummy(all_nominal(), -all_outcomes())%>%
prep(data = pothole_train,retain = TRUE )
# split data into training and test set
test_tbl2 <- bake(pothole_recipe2, newdata = pothole_test)
train_tbl2 <- bake(pothole_recipe2, newdata = pothole_train)
glimpse(train_tbl2)
## Observations: 226
## Variables: 13
## $ Label_Combined <fct> B1, B1, B1, B1, B1, B1, B1, B1, B1, B1, B1...
## $ Peak.Duration <dbl> 0.006525519, -0.207802332, -0.226370118, -...
## $ Vehicle.Speed..mph. <dbl> -0.8249478, -0.5814521, -0.5330513, -0.433...
## $ Peak.Z.Accel <dbl> -2.1870155, -2.0950290, -2.2040901, -2.189...
## $ Peak.Lat.Accel <dbl> -1.3096776, -1.0552851, -1.4598802, -1.291...
## $ Peak.Long.Accel <dbl> -1.85372136, -1.04917684, -1.31762792, -1....
## $ Front.Slip <dbl> -2.0399125, -1.4628490, -1.3852557, -1.758...
## $ Right.Slip <dbl> -1.5197233, -1.4290844, -1.2639294, -1.429...
## $ Left.Slip <dbl> -2.18627766, -1.68674759, -1.52218284, -2....
## $ Direction_Left4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Direction_other <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, ...
## $ MLT1.Feature_Type2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ MLT1.Feature_other <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
Multinomial Logistic Regression
The first model we consider is the multinomial logistic regression. Multinomial logistic regression is used to model nominal outcome variables, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables.
library(ggrepel)
# Fit the multinomial logistic model
model_multlog <- nnet::multinom(Label_Combined ~., data = train_tbl2)
## # weights: 70 (52 variable)
## initial value 363.732968
## iter 10 value 186.989385
## iter 20 value 93.855296
## iter 30 value 35.866528
## iter 40 value 32.017762
## iter 50 value 30.831075
## iter 60 value 29.827671
## iter 70 value 29.455249
## iter 80 value 29.228431
## iter 90 value 28.858957
## iter 100 value 28.619521
## final value 28.619521
## stopped after 100 iterations
#summary(model_multlog )
tidy(model_multlog)%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
y.level | term | estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
B2 | (Intercept) | 1.006932e+43 | 123.808721 | 7.997665e-01 | 0.4238461 |
B2 | Peak.Duration | 6.468647e+18 | 72.453981 | 5.978070e-01 | 0.5499687 |
B2 | Vehicle.Speed..mph. | 1.104725e+07 | 32.174210 | 5.040587e-01 | 0.6142201 |
B2 | Peak.Z.Accel | 5.335858e+07 | 164.997240 | 1.078354e-01 | 0.9141263 |
B2 | Peak.Lat.Accel | 1.119356e+02 | 53.964076 | 8.742710e-02 | 0.9303320 |
B2 | Peak.Long.Accel | 1.311299e+00 | 20.762958 | 1.305300e-02 | 0.9895855 |
B2 | Front.Slip | 3.094576e+10 | 134.148466 | 1.800654e-01 | 0.8571012 |
B2 | Right.Slip | 4.387685e+14 | 107.542443 | 3.135041e-01 | 0.7538977 |
B2 | Left.Slip | 0.000000e+00 | 147.569783 | -1.478489e-01 | 0.8824620 |
B2 | Direction_Left4 | 0.000000e+00 | 0.000000 | -1.970507e+09 | 0.0000000 |
B2 | Direction_other | 5.796511e-01 | 152.479942 | -3.576400e-03 | 0.9971465 |
B2 | MLT1.Feature_Type2 | 6.878763e+05 | 7.952293 | 1.690250e+00 | 0.0909801 |
B2 | MLT1.Feature_other | 0.000000e+00 | 0.000000 | -7.021216e+09 | 0.0000000 |
B3 | (Intercept) | 4.040040e+51 | 123.983569 | 9.584181e-01 | 0.3378520 |
B3 | Peak.Duration | 1.743407e+18 | 72.251109 | 5.813388e-01 | 0.5610121 |
B3 | Vehicle.Speed..mph. | 1.023150e-02 | 32.485612 | -1.410558e-01 | 0.8878259 |
B3 | Peak.Z.Accel | 1.497944e+21 | 165.843708 | 2.940020e-01 | 0.7687564 |
B3 | Peak.Lat.Accel | 2.493845e+03 | 54.157950 | 1.444217e-01 | 0.8851675 |
B3 | Peak.Long.Accel | 3.748920e-01 | 20.817928 | -4.712850e-02 | 0.9624108 |
B3 | Front.Slip | 1.146640e+09 | 134.198452 | 1.554422e-01 | 0.8764727 |
B3 | Right.Slip | 6.288793e+15 | 107.505841 | 3.383774e-01 | 0.7350788 |
B3 | Left.Slip | 0.000000e+00 | 147.712431 | -1.929784e-01 | 0.8469758 |
B3 | Direction_Left4 | 0.000000e+00 | 15.058656 | -3.318957e+00 | 0.0009035 |
B3 | Direction_other | 0.000000e+00 | 0.000000 | -6.555174e+10 | 0.0000000 |
B3 | MLT1.Feature_Type2 | 3.297745e+07 | 6.873321 | 2.518627e+00 | 0.0117813 |
B3 | MLT1.Feature_other | 6.279000e-04 | 0.000000 | -8.836191e+10 | 0.0000000 |
B4 | (Intercept) | 2.363141e+49 | 124.004944 | 9.167914e-01 | 0.3592520 |
B4 | Peak.Duration | 4.233409e+16 | 72.255228 | 5.298491e-01 | 0.5962165 |
B4 | Vehicle.Speed..mph. | 8.000000e-07 | 32.559928 | -4.321252e-01 | 0.6656504 |
B4 | Peak.Z.Accel | 1.716274e+24 | 165.857570 | 3.364465e-01 | 0.7365342 |
B4 | Peak.Lat.Accel | 1.288872e+04 | 54.178156 | 1.746849e-01 | 0.8613272 |
B4 | Peak.Long.Accel | 1.273826e-01 | 20.849234 | -9.883150e-02 | 0.9212721 |
B4 | Front.Slip | 3.612412e+11 | 134.234403 | 1.982563e-01 | 0.8428446 |
B4 | Right.Slip | 3.315180e+18 | 107.530926 | 3.965840e-01 | 0.6916743 |
B4 | Left.Slip | 0.000000e+00 | 147.782400 | -2.586204e-01 | 0.7959281 |
B4 | Direction_Left4 | 0.000000e+00 | 14.977781 | -3.823875e+00 | 0.0001314 |
B4 | Direction_other | 3.188615e+11 | 441.523491 | 5.999230e-02 | 0.9521617 |
B4 | MLT1.Feature_Type2 | 2.136756e+08 | 6.871591 | 2.791198e+00 | 0.0052513 |
B4 | MLT1.Feature_other | 1.104226e+02 | 0.000000 | 1.040894e+13 | 0.0000000 |
B5 | (Intercept) | 4.434352e+15 | 147.413209 | 2.444025e-01 | 0.8069191 |
B5 | Peak.Duration | 2.046723e+27 | 75.591690 | 8.319173e-01 | 0.4054556 |
B5 | Vehicle.Speed..mph. | 0.000000e+00 | 56.546553 | -7.853381e-01 | 0.4322553 |
B5 | Peak.Z.Accel | 4.554239e+48 | 176.256593 | 6.356650e-01 | 0.5249948 |
B5 | Peak.Lat.Accel | 6.726265e+02 | 62.987158 | 1.033733e-01 | 0.9176667 |
B5 | Peak.Long.Accel | 2.950000e-04 | 26.937573 | -3.017524e-01 | 0.7628408 |
B5 | Front.Slip | 5.853313e+04 | 145.595906 | 7.539600e-02 | 0.9398996 |
B5 | Right.Slip | 3.327703e+21 | 108.832751 | 4.553461e-01 | 0.6488603 |
B5 | Left.Slip | 0.000000e+00 | 162.114774 | -1.253797e-01 | 0.9002230 |
B5 | Direction_Left4 | 0.000000e+00 | 29.915098 | -2.621409e+00 | 0.0087567 |
B5 | Direction_other | 5.659578e+34 | 0.000000 | 4.724513e+54 | 0.0000000 |
B5 | MLT1.Feature_Type2 | 5.638244e-01 | 20.036849 | -2.859790e-02 | 0.9771853 |
B5 | MLT1.Feature_other | 8.524256e+02 | 0.000000 | 4.544929e+88 | 0.0000000 |
z<-summary(model_multlog)$coefficients/summary(model_multlog)$standard.errors
# 2-tailed z test
((1 - pnorm(abs(z), 0, 1)) * 2) %>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
(Intercept) | Peak.Duration | Vehicle.Speed..mph. | Peak.Z.Accel | Peak.Lat.Accel | Peak.Long.Accel | Front.Slip | Right.Slip | Left.Slip | Direction_Left4 | Direction_other | MLT1.Feature_Type2 | MLT1.Feature_other | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
B2 | 0.4238461 | 0.5499687 | 0.6142201 | 0.9141263 | 0.9303320 | 0.9895855 | 0.8571012 | 0.7538977 | 0.8824620 | 0.0000000 | 0.9971465 | 0.0909801 | 0 |
B3 | 0.3378520 | 0.5610121 | 0.8878259 | 0.7687564 | 0.8851675 | 0.9624108 | 0.8764727 | 0.7350788 | 0.8469758 | 0.0009035 | 0.0000000 | 0.0117813 | 0 |
B4 | 0.3592520 | 0.5962165 | 0.6656504 | 0.7365342 | 0.8613272 | 0.9212721 | 0.8428446 | 0.6916743 | 0.7959281 | 0.0001314 | 0.9521617 | 0.0052513 | 0 |
B5 | 0.8069191 | 0.4054556 | 0.4322553 | 0.5249948 | 0.9176667 | 0.7628408 | 0.9398996 | 0.6488603 | 0.9002230 | 0.0087567 | 0.0000000 | 0.9771853 | 0 |
glance(model_multlog) %>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
edf | deviance | AIC |
---|---|---|
52 | 57.23904 | 161.239 |
# Make predictions
pred3 = predict(model_multlog , newdata = test_tbl2, type="class")
data1=data.frame(test_tbl2["Label_Combined"],predicted=pred3)
#check accuracy
yardstick::metrics(data1,truth = Label_Combined, estimate = predicted)
## # A tibble: 1 x 1
## accuracy
## <dbl>
## 1 0.926
# Make predictions
predicted.classes <- model_multlog %>% predict(test_tbl2)
head(predicted.classes)
## [1] B1 B1 B2 B1 B1 B3
## Levels: B1 B2 B3 B4 B5
#Equivalently Model accuracy
mean(predicted.classes == test_tbl2$Label_Combined)
## [1] 0.9259259
#plot variable importance
feature=factor(rownames(varImp(model_multlog ,scale=T)))
Importance=varImp(model_multlog ,scale=T)$Overall
Varplot(feature,Importance)
ggsave("/POTHOLEDETECTION/variable.png")
#confusion matrix
table(predicted.classes,test_tbl2$Label_Combined)
##
## predicted.classes B1 B2 B3 B4 B5
## B1 4 0 0 0 0
## B2 1 14 0 0 0
## B3 0 1 15 0 0
## B4 0 0 2 14 0
## B5 0 0 0 0 3
Multi-class Evaluation Metrics
Macro-Averaging
The overall classification accuracy is defined as the fraction of instances that are correctly classified. In order to assess the performance of a classification,these metrics are calculated from the model predictions. In cases where the class labels are not uniformly distributed or imbalance class labels, using the accuracy metric alone may be misleading as one could predict the dominant class most of the time and still achieve a relatively high overall accuracy but very low precision or recall for other classes.
Specificity or TNR (True Negative Rate): Number of items correctly identified as negative out of total negatives- TN/(TN+FP)
Recall or Sensitivity or TPR (True Positive Rate): Number of items correctly identified as positive out of total true positives- TP/(TP+FN) Precision: Number of items correctly identified as positive out of total items identified as positive- TP/(TP+FP)
**F1 Score:** It is a harmonic mean of precision and recall given by- $F1 = \frac{2*Precision*Recall}{(Precision + Recall)} $
Accuracy: Percentage of total items classified correctly- (TP+TN)/(N+P)
Macro-averaged Metrics
The per-class metrics can be averaged over all the classes resulting in macro-averaged precision, recall and F-1.
model_multlog_pred<-predict(model_multlog , newdata = test_tbl2, type="probs")
#pred <- predict(model_multlog, newdata = test_tbl2, probability = TRUE)
#MLmetrics::MultiLogLoss(y_true = test_tbl2$Label_Combined, y_pred = attr(model_multlog_pred, "probabilities"))
Overall Classification Accuracy
l=caret::confusionMatrix(predicted.classes,test_tbl2$Label_Combined)
overall_classification_accuracy<-sum(diag(l$table))/sum(l$table)
l$overall%>%data.frame()%>%
rename(`Macro Measure`=".")%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Macro Measure | |
---|---|
Accuracy | 0.9259259 |
Kappa | 0.9002770 |
AccuracyLower | 0.8210666 |
AccuracyUpper | 0.9794490 |
AccuracyNull | 0.3148148 |
AccuracyPValue | 0.0000000 |
McnemarPValue | NaN |
Per-class Precision, Recall, and F-1
l$byClass%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy | |
---|---|---|---|---|---|---|---|---|---|---|---|
Class: B1 | 0.8000000 | 1.000000 | 1.0000000 | 0.9800000 | 1.0000000 | 0.8000000 | 0.8888889 | 0.0925926 | 0.0740741 | 0.0740741 | 0.9000000 |
Class: B2 | 0.9333333 | 0.974359 | 0.9333333 | 0.9743590 | 0.9333333 | 0.9333333 | 0.9333333 | 0.2777778 | 0.2592593 | 0.2777778 | 0.9538462 |
Class: B3 | 0.8823529 | 0.972973 | 0.9375000 | 0.9473684 | 0.9375000 | 0.8823529 | 0.9090909 | 0.3148148 | 0.2777778 | 0.2962963 | 0.9276630 |
Class: B4 | 1.0000000 | 0.950000 | 0.8750000 | 1.0000000 | 0.8750000 | 1.0000000 | 0.9333333 | 0.2592593 | 0.2592593 | 0.2962963 | 0.9750000 |
Class: B5 | 1.0000000 | 1.000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 0.0555556 | 0.0555556 | 0.0555556 | 1.0000000 |
Macro-averaged Metrics
macro_meaure=l$byClass%>%apply(.,2,mean)%>%
data.frame()%>%
rename(`Macro Measure`=".")
macro_meaure%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Macro Measure | |
---|---|
Sensitivity | 0.9231373 |
Specificity | 0.9794664 |
Pos Pred Value | 0.9491667 |
Neg Pred Value | 0.9803455 |
Precision | 0.9491667 |
Recall | 0.9231373 |
F1 | 0.9329293 |
Prevalence | 0.2000000 |
Detection Rate | 0.1851852 |
Detection Prevalence | 0.2000000 |
Balanced Accuracy | 0.9513018 |
The confusion matrix is given as:
l$table%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
B1 | B2 | B3 | B4 | B5 | |
---|---|---|---|---|---|
B1 | 4 | 0 | 0 | 0 | 0 |
B2 | 1 | 14 | 0 | 0 | 0 |
B3 | 0 | 1 | 15 | 0 | 0 |
B4 | 0 | 0 | 2 | 14 | 0 |
B5 | 0 | 0 | 0 | 0 | 3 |
The function below can be used to find the multi-class logloss. Logloss is among the numerous evaluation metric for multi-class classfication models.
pred_probs1<-predict(model_multlog , newdata = test_tbl2, type="probs")
#truth: integer vector with truth labels, values range from 0 to n - 1 classes
# prob_matrix: predicted probs: column 1 => label 0, column 2 => label 1 and so on
Multiclasslogloss = function(truth, pred_prob_matrix, eps = 1e-15){
if(is.double(truth)=="FALSE"){
truth=as.numeric(truth)-1
}
if(max(truth) >= ncol(pred_prob_matrix) || min(truth) < 0){
stop(cat('True labels should range from 0 to', ncol(pred_prob_matrix) - 1, '\n'))
}
pred_prob_matrix[pred_prob_matrix > 1 - eps] = 1 - eps
pred_prob_matrix[pred_prob_matrix< eps] = eps
pred_prob_matrix = t(apply(pred_prob_matrix, 1, function(r)r/sum(r)))
truth_matrix = matrix(0, nrow = nrow(pred_prob_matrix), ncol = ncol(pred_prob_matrix))
truth_matrix[matrix(c(1:nrow(pred_prob_matrix), truth + 1), ncol = 2)] = 1
-sum(truth_matrix * log(pred_prob_matrix))/nrow(pred_prob_matrix)
}
Multiclasslogloss(test_tbl2$Label_Combined, pred_probs1)
## [1] 0.3282304
Micro-averaged Metrics
The micro-averaged precision, recall, and F-1 can also be computed from the matrix above. Compared to unweighted macro-averaging, micro-averaging favors classes with a larger number of instances. Because the sum of the one-vs-all matrices is a symmetric matrix, the micro-averaged precision, recall, and F-1 wil be the same.
n = sum(l$table) # number of instances
nc = nrow(l$table) # number of classes
diag = diag(l$table) # number of correctly classified instances per class
rowsums = apply(l$table, 1, sum) # number of instances per class
colsums = apply(l$table, 2, sum) # number of predictions per class
p = rowsums / n # distribution of instances over the actual classes
q = colsums / n # distribution of instances over the predicted classes
precision = diag / colsums
recall = diag / rowsums
f1 = 2 * precision * recall / (precision + recall)
data.frame(precision, recall, f1)
## precision recall f1
## B1 0.8000000 1.0000000 0.8888889
## B2 0.9333333 0.9333333 0.9333333
## B3 0.8823529 0.9375000 0.9090909
## B4 1.0000000 0.8750000 0.9333333
## B5 1.0000000 1.0000000 1.0000000
One-vs-all
For micro-averaging metrics , we can examine the performance of the classifier one class at a time. This results in 5 binary classification task such that the class considered is the positive class and all others constitute the negative class.
pacman::p_load(furrr)
### faster version using furrr package
oneVsAll =furrr::future_map(1 : nc,
function(i){
m = c(l$table[i,i],
rowsums[i] - l$table[i,i],
colsums[i] - l$table[i,i],
n-rowsums[i] - colsums[i] + l$table[i,i]);
return(matrix(m, nrow = 2, byrow = T))})
#sum matrices element wise
#s=rowSums(c, dims = 2)
#equivalenlty
#apply(c, c(1,2), sum)
## A general-purpose adder:
#this function adds the elements of the matrix
#in the list elementwise
add <- function(x) Reduce("+", x)
sum_all<-add(oneVsAll)
Average Accuracy
Similar to the overall accuracy, the average accuracy is defined as the fraction of correctly classified instances in the sum of one-vs-all matrices matrix.
Average_Accuracy <- sum(diag(sum_all)) / sum(sum_all)
Average_Accuracy
## [1] 0.9703704
Micro-averaged Metrics
The micro-averaged precision, recall, and F-1 can also be computed from the matrix above. Compared to unweighted macro-averaging, micro-averaging favors classes with a larger number of instances. Because the sum of the one-vs-all matrices is a symmetric matrix, the micro-averaged precision, recall, and F-1 wil be the same.
Micro_Accuracy<-(diag(sum_all) / apply(sum_all,1, sum))[1];
Micro_Accuracy
## [1] 0.9259259
Weighted k-Nearest Neighbor Classifier
The Weighted k-Nearest Neighbor method expands k-nearest neighbor in several directions.It is used for both classification,ordinal classfication and regression. It uses kernel functions to weight the neighbors according to their distances.
fitControl <- trainControl(method="repeatedcv",
#For repeated k-fold cross-validation only: the number of complete sets of folds to compute
repeats = 10,
savePredictions = "final",
classProbs=TRUE,
#Either the number of folds or number of resampling iterations
number = 10,
summaryFunction=multiClassSummary,
allowParallel=TRUE)
knn_grid <- expand.grid(
kmax = 1:9,
distance = 1:3,
kernel = c("rectangular", "triangular", "gaussian")
)
knnFit <- caret::train(Label_Combined ~ .,
data = train_tbl2,
method = "kknn",
trControl = fitControl,
tuneGrid = knn_grid,
summaryFunction=multiClassSummary,
## Specify which metric to optimize
metric = "Accuracy")
#Output of kNN fit
knnFit
## k-Nearest Neighbors
##
## 226 samples
## 12 predictor
## 5 classes: 'B1', 'B2', 'B3', 'B4', 'B5'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 204, 205, 203, 204, 204, 202, ...
## Resampling results across tuning parameters:
##
## kmax distance kernel logLoss AUC prAUC Accuracy
## 1 1 rectangular 4.548771 0.9162104 0.06699458 0.8682996
## 1 1 triangular 4.548771 0.9162104 0.06699458 0.8682996
## 1 1 gaussian 4.548771 0.9162104 0.06699458 0.8682996
## 1 2 rectangular 5.815951 0.8920005 0.08356107 0.8316110
## 1 2 triangular 5.815951 0.8920005 0.08356107 0.8316110
## 1 2 gaussian 5.815951 0.8920005 0.08356107 0.8316110
## 1 3 rectangular 6.352123 0.8835768 0.08867452 0.8160872
## 1 3 triangular 6.352123 0.8835768 0.08867452 0.8160872
## 1 3 gaussian 6.352123 0.8835768 0.08867452 0.8160872
## 2 1 rectangular 4.548771 0.9162104 0.06699458 0.8682996
## Kappa Mean_F1 Mean_Sensitivity Mean_Specificity
## 0.8277033 0.8707264 0.8669524 0.9654684
## 0.8277033 0.8707264 0.8669524 0.9654684
## 0.8277033 0.8707264 0.8669524 0.9654684
## 0.7792944 0.8375603 0.8288476 0.9551534
## 0.7792944 0.8375603 0.8288476 0.9551534
## 0.7792944 0.8375603 0.8288476 0.9551534
## 0.7592663 0.8232414 0.8160048 0.9511489
## 0.7592663 0.8232414 0.8160048 0.9511489
## 0.7592663 0.8232414 0.8160048 0.9511489
## 0.8277033 0.8707264 0.8669524 0.9654684
## Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision Mean_Recall
## 0.8911250 0.9669165 0.8911250 0.8669524
## 0.8911250 0.9669165 0.8911250 0.8669524
## 0.8911250 0.9669165 0.8911250 0.8669524
## 0.8664536 0.9575990 0.8664536 0.8288476
## 0.8664536 0.9575990 0.8664536 0.8288476
## 0.8664536 0.9575990 0.8664536 0.8288476
## 0.8534597 0.9538081 0.8534597 0.8160048
## 0.8534597 0.9538081 0.8534597 0.8160048
## 0.8534597 0.9538081 0.8534597 0.8160048
## 0.8911250 0.9669165 0.8911250 0.8669524
## Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.1736599 0.9162104
## 0.1736599 0.9162104
## 0.1736599 0.9162104
## 0.1663222 0.8920005
## 0.1663222 0.8920005
## 0.1663222 0.8920005
## 0.1632174 0.8835768
## 0.1632174 0.8835768
## 0.1632174 0.8835768
## 0.1736599 0.9162104
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were kmax = 3, distance = 1 and
## kernel = triangular.
getTrainPerf(knnFit)
## TrainlogLoss TrainAUC TrainprAUC TrainAccuracy TrainKappa TrainMean_F1
## 1 2.647393 0.9457703 0.2740201 0.8757911 0.8374018 0.8769454
## TrainMean_Sensitivity TrainMean_Specificity TrainMean_Pos_Pred_Value
## 1 0.8727143 0.9674464 0.897
## TrainMean_Neg_Pred_Value TrainMean_Precision TrainMean_Recall
## 1 0.9689589 0.897 0.8727143
## TrainMean_Detection_Rate TrainMean_Balanced_Accuracy method
## 1 0.1751582 0.9200803 kknn
ggplot(knnFit)
saveRDS(knnFit , "/POTHOLEDETECTION/knnFit.rds")
knnFit <- readRDS("/POTHOLEDETECTION/knnFit.rds")
kknnpred1 = predict(knnFit, newdata = test_tbl2, type="prob")
kknnpred = predict(knnFit, newdata = test_tbl2, type="raw")
library(yardstick)
# Compute the summary statistics
d2=data.frame(test_tbl2["Label_Combined"],predicted=kknnpred)
yardstick::metrics(d2,truth = Label_Combined, estimate = predicted)
## # A tibble: 1 x 1
## accuracy
## <dbl>
## 1 0.833
yardstick::conf_mat(d2,truth = Label_Combined, estimate = predicted)
## Truth
## Prediction B1 B2 B3 B4 B5
## B1 5 2 0 0 0
## B2 0 10 1 0 0
## B3 0 3 15 1 0
## B4 0 0 1 12 0
## B5 0 0 0 1 3
l=caret::confusionMatrix(kknnpred,test_tbl2$Label_Combined)
l$overall%>%data.frame()%>%
rename(`Macro Measure`=".")%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Macro Measure | |
---|---|
Accuracy | 0.8333333 |
Kappa | 0.7789905 |
AccuracyLower | 0.7070588 |
AccuracyUpper | 0.9208456 |
AccuracyNull | 0.3148148 |
AccuracyPValue | 0.0000000 |
McnemarPValue | NaN |
l$byClass%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy | |
---|---|---|---|---|---|---|---|---|---|---|---|
Class: B1 | 1.0000000 | 0.9591837 | 0.7142857 | 1.0000000 | 0.7142857 | 1.0000000 | 0.8333333 | 0.0925926 | 0.0925926 | 0.1296296 | 0.9795918 |
Class: B2 | 0.6666667 | 0.9743590 | 0.9090909 | 0.8837209 | 0.9090909 | 0.6666667 | 0.7692308 | 0.2777778 | 0.1851852 | 0.2037037 | 0.8205128 |
Class: B3 | 0.8823529 | 0.8918919 | 0.7894737 | 0.9428571 | 0.7894737 | 0.8823529 | 0.8333333 | 0.3148148 | 0.2777778 | 0.3518519 | 0.8871224 |
Class: B4 | 0.8571429 | 0.9750000 | 0.9230769 | 0.9512195 | 0.9230769 | 0.8571429 | 0.8888889 | 0.2592593 | 0.2222222 | 0.2407407 | 0.9160714 |
Class: B5 | 1.0000000 | 0.9803922 | 0.7500000 | 1.0000000 | 0.7500000 | 1.0000000 | 0.8571429 | 0.0555556 | 0.0555556 | 0.0740741 | 0.9901961 |
macro_meaure=l$byClass%>%apply(.,2,mean)%>%
data.frame()%>%
rename(`Macro Measure`=".")
macro_meaure%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Macro Measure | |
---|---|
Sensitivity | 0.8812325 |
Specificity | 0.9561653 |
Pos Pred Value | 0.8171854 |
Neg Pred Value | 0.9555595 |
Precision | 0.8171854 |
Recall | 0.8812325 |
F1 | 0.8363858 |
Prevalence | 0.2000000 |
Detection Rate | 0.1666667 |
Detection Prevalence | 0.2000000 |
Balanced Accuracy | 0.9186989 |
Multiclasslogloss(test_tbl2$Label_Combined, kknnpred1)
## [1] 2.122451
Multinomial Logistic Regression with H20
The h2o.glm function can be used to fit multinomial logistic regression and a host of other generalized linear regression models including poisson, ordinal ,ridge and LASSO.
library(h2o)
h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 days 13 hours
## H2O cluster timezone: America/Detroit
## H2O data parsing timezone: UTC
## H2O cluster version: 3.20.0.2
## H2O cluster version age: 2 months and 5 days
## H2O cluster name: H2O_started_from_R_nanaakwasiabayieboateng_par282
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.54 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.5.0 (2018-04-23)
pf=as.h2o(train_tbl2)
##
|
| | 0%
|
|=================================================================| 100%
pftest=as.h2o(test_tbl2)
##
|
| | 0%
|
|=================================================================| 100%
# Run GLM
myX = setdiff(colnames(train_tbl2), "Label_Combined")
h2o_mult<-h2o.glm(y = "Label_Combined", x = myX, training_frame = pf, family = "multinomial",
nfolds = 0, alpha = 0, lambda_search = FALSE)
##
|
| | 0%
|
|=================================================================| 100%
h2o.performance(h2o_mult)
## H2OMultinomialMetrics: glm
## ** Reported on training data. **
##
## Training Set Metrics:
## =====================
##
## Extract training frame with `h2o.getFrame("train_tbl2")`
## MSE: (Extract with `h2o.mse`) 0.05720497
## RMSE: (Extract with `h2o.rmse`) 0.2391756
## Logloss: (Extract with `h2o.logloss`) 0.2008426
## Mean Per-Class Error: 0.06190476
## Null Deviance: (Extract with `h2o.nulldeviance`) 688.4571
## Residual Deviance: (Extract with `h2o.residual_deviance`) 90.78083
## R^2: (Extract with `h2o.r2`) 0.9580881
## AIC: (Extract with `h2o.aic`) NaN
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
## B1 B2 B3 B4 B5 Error Rate
## B1 28 0 0 0 0 0.0000 = 0 / 28
## B2 1 39 2 0 0 0.0714 = 3 / 42
## B3 0 3 60 7 0 0.1429 = 10 / 70
## B4 0 0 6 57 0 0.0952 = 6 / 63
## B5 0 0 0 0 23 0.0000 = 0 / 23
## Totals 29 42 68 64 23 0.0841 = 19 / 226
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-5 Hit Ratios:
## k hit_ratio
## 1 1 0.915929
## 2 2 1.000000
## 3 3 1.000000
## 4 4 1.000000
## 5 5 1.000000
pftest=as.h2o(test_tbl2)
##
|
| | 0%
|
|=================================================================| 100%
predh20mult=h2o.predict(h2o_mult,pftest)
##
|
| | 0%
|
|=================================================================| 100%
predict_model.H2OMultinomialModel <- function(x, newdata, type, ...) {
# Function performs prediction and returns dataframe with Response
# x is h2o model
# newdata is data frame
# type is only setup for data frame
pred <- h2o.predict(x, as.h2o(newdata))
# return classification probabilities only
return(as.data.frame(pred[,-1]))
}
h2ologisticpred=predict_model.H2OMultinomialModel(h2o_mult,pftest)
##
|
| | 0%
|
|=================================================================| 100%
h2o_multdata<-data.frame(h2o.varimp(h2o_mult))
Varplot(feature=h2o_multdata$names,Importance=h2o_multdata$coefficients)
ggsave("/POTHOLEDETECTION/h2o_multcombined.png")
Metric Evaluation
# predicted probabilities for each class
head(h2ologisticpred)
## B1 B2 B3 B4 B5
## 1 0.97217723 2.776202e-02 6.074508e-05 4.476462e-09 5.842084e-21
## 2 0.99993751 6.248919e-05 1.404794e-13 3.751218e-19 2.523806e-33
## 3 0.15851847 8.404513e-01 1.030224e-03 2.189931e-09 5.605275e-23
## 4 0.98302422 1.697135e-02 4.427456e-06 4.436548e-14 7.376240e-28
## 5 0.98546664 1.453336e-02 1.128159e-09 2.050341e-15 4.547862e-28
## 6 0.02611215 4.125397e-01 5.542656e-01 7.082515e-03 2.472140e-10
#select the indices of the column on rows of the class with the highest predicted probability as the class label prediction
x=apply(h2ologisticpred, 1, which.max)
prob=apply(h2ologisticpred,1,max)
## Equivalent way to convert numeric values to factor variables with dplyr
# recode works better, it preserves levels in the factors
y=recode_factor(x, `1` = "B1", `2` = "B2", `3` = "B3",`4` = "B4", `5` = "B5")
probdataframe<-data.frame(x=x,prob=prob,predclass=y)
#accuracy
mean(test_tbl2$Label_Combined==probdataframe$predclass)
## [1] 0.8703704
l=caret::confusionMatrix(probdataframe$predclass,test_tbl2$Label_Combined)
l$overall%>%data.frame()%>%
rename(`Macro Measure`=".")%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Macro Measure | |
---|---|
Accuracy | 0.8703704 |
Kappa | 0.8269231 |
AccuracyLower | 0.7509878 |
AccuracyUpper | 0.9462570 |
AccuracyNull | 0.3148148 |
AccuracyPValue | 0.0000000 |
McnemarPValue | NaN |
l$byClass%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy | |
---|---|---|---|---|---|---|---|---|---|---|---|
Class: B1 | 0.8000000 | 0.9591837 | 0.6666667 | 0.9791667 | 0.6666667 | 0.8000000 | 0.7272727 | 0.0925926 | 0.0740741 | 0.1111111 | 0.8795918 |
Class: B2 | 0.7333333 | 0.9743590 | 0.9166667 | 0.9047619 | 0.9166667 | 0.7333333 | 0.8148148 | 0.2777778 | 0.2037037 | 0.2222222 | 0.8538462 |
Class: B3 | 0.8823529 | 0.9459459 | 0.8823529 | 0.9459459 | 0.8823529 | 0.8823529 | 0.8823529 | 0.3148148 | 0.2777778 | 0.3148148 | 0.9141494 |
Class: B4 | 1.0000000 | 0.9500000 | 0.8750000 | 1.0000000 | 0.8750000 | 1.0000000 | 0.9333333 | 0.2592593 | 0.2592593 | 0.2962963 | 0.9750000 |
Class: B5 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 0.0555556 | 0.0555556 | 0.0555556 | 1.0000000 |
macro_meaure=l$byClass%>%apply(.,2,mean)%>%
data.frame()%>%
rename(`Macro Measure`=".")
macro_meaure%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Macro Measure | |
---|---|
Sensitivity | 0.8831373 |
Specificity | 0.9658977 |
Pos Pred Value | 0.8681373 |
Neg Pred Value | 0.9659749 |
Precision | 0.8681373 |
Recall | 0.8831373 |
F1 | 0.8715548 |
Prevalence | 0.2000000 |
Detection Rate | 0.1740741 |
Detection Prevalence | 0.2000000 |
Balanced Accuracy | 0.9245175 |
Multiclasslogloss(test_tbl2$Label_Combined, as.matrix(predh20mult[,-1]))
## [1] 0.2677523
Extreme Gradient Boosting Machines
Gradient boosting is a popular machine learning lagorithm for regression and classification problems. It produces a prediction model in the form of an ensemble of weak prediction models using a baselearner such as decision tree. It builds the model in a stage-wise fashion similar to other boosting methods, and then generalizes them by allowing optimization of an arbitrary differentiable loss function. The XGBoost implementation of Gradient Boosting uses a more regularized model formalization to control over-fitting. This gives it a better performance and reduces overfitting.
xgbmGrid <- expand.grid(max_depth = c(1, 5, 8,10),
nrounds=300,
colsample_bytree = c(0.3,0.4,0.5,0.6,0.8,1),
eta = seq.int(from=0.01, to=0.3, length.out=5),
min_child_weight = 1:3,
subsample=1,
#rate_drop=0.1,
#skip_drop=0.1
gamma=0.001
)
xgbm <-caret::train(Label_Combined ~ ., data = train_tbl2,
method = "xgbTree",
trControl = fitControl,
verbose = FALSE,
## Now specify the exact models
## to evaluate:
#summaryFunction=multiClassSummary,
tuneGrid = xgbmGrid)
saveRDS(xgbm , "/POTHOLEDETECTION/xgbm.rds")
xgbm <- readRDS("/POTHOLEDETECTION/xgbm.rds")
getTrainPerf(xgbm)
## TrainlogLoss TrainAUC TrainprAUC TrainAccuracy TrainKappa TrainMean_F1
## 1 0.3601013 0.9821397 0.6814514 0.868236 0.8277019 0.8726167
## TrainMean_Sensitivity TrainMean_Specificity TrainMean_Pos_Pred_Value
## 1 0.8734095 0.9649483 0.8925939
## TrainMean_Neg_Pred_Value TrainMean_Precision TrainMean_Recall
## 1 0.9663713 0.8925939 0.8734095
## TrainMean_Detection_Rate TrainMean_Balanced_Accuracy method
## 1 0.1736472 0.9191789 xgbTree
ggplot(xgbm)
xgbm$bestTune
## nrounds max_depth eta gamma colsample_bytree min_child_weight
## 122 300 8 0.0825 0.001 0.8 2
## subsample
## 122 1
xgbm_pred = predict(xgbm, newdata = test_tbl2, type="raw")
xgbm_pred1 = predict(xgbm, newdata = test_tbl2, type="prob")
xgbmd=data.frame(test_tbl2["Label_Combined"],predicted=xgbm_pred)
yardstick::metrics(xgbmd,truth = Label_Combined, estimate = predicted)
## # A tibble: 1 x 1
## accuracy
## <dbl>
## 1 0.907
l=caret::confusionMatrix(xgbm_pred,test_tbl2$Label_Combined)
l$overall%>%data.frame()%>%
rename(`Macro Measure`=".")%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Macro Measure | |
---|---|
Accuracy | 0.9074074 |
Kappa | 0.8756333 |
AccuracyLower | 0.7969985 |
AccuracyUpper | 0.9692472 |
AccuracyNull | 0.3148148 |
AccuracyPValue | 0.0000000 |
McnemarPValue | NaN |
l$byClass%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy | |
---|---|---|---|---|---|---|---|---|---|---|---|
Class: B1 | 0.8000000 | 0.9795918 | 0.8000000 | 0.9795918 | 0.8000000 | 0.8000000 | 0.8000000 | 0.0925926 | 0.0740741 | 0.0925926 | 0.8897959 |
Class: B2 | 0.8000000 | 0.9743590 | 0.9230769 | 0.9268293 | 0.9230769 | 0.8000000 | 0.8571429 | 0.2777778 | 0.2222222 | 0.2407407 | 0.8871795 |
Class: B3 | 0.9411765 | 0.9459459 | 0.8888889 | 0.9722222 | 0.8888889 | 0.9411765 | 0.9142857 | 0.3148148 | 0.2962963 | 0.3333333 | 0.9435612 |
Class: B4 | 1.0000000 | 0.9750000 | 0.9333333 | 1.0000000 | 0.9333333 | 1.0000000 | 0.9655172 | 0.2592593 | 0.2592593 | 0.2777778 | 0.9875000 |
Class: B5 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 1.0000000 | 0.0555556 | 0.0555556 | 0.0555556 | 1.0000000 |
macro_meaure=l$byClass%>%apply(.,2,mean)%>%
data.frame()%>%
rename(`Macro Measure`=".")
macro_meaure%>%
kable(escape = F, align = "c") %>%
kable_styling(c("striped", "condensed"), full_width = F)
Macro Measure | |
---|---|
Sensitivity | 0.9082353 |
Specificity | 0.9749794 |
Pos Pred Value | 0.9090598 |
Neg Pred Value | 0.9757287 |
Precision | 0.9090598 |
Recall | 0.9082353 |
F1 | 0.9073892 |
Prevalence | 0.2000000 |
Detection Rate | 0.1814815 |
Detection Prevalence | 0.2000000 |
Balanced Accuracy | 0.9416073 |
Multiclasslogloss(test_tbl2$Label_Combined, xgbm_pred1)
## [1] 0.263521
Conclusion
The multinomial logistic regression has superior performance over the machine learning algorithms evaluated in this project. It’s evaluation metrics including accuracy, F-1 score, sensitivity, specificity and precision is higher than for the other models considered.