rpart decision tree via caret gives a ROC of 0.5
March 7, 2019

If I understand ROC correctly, 0.5 corresponds to a null model with no predictive power. Fitting a logistic regression to the same data gives a ROC of 0.64, so I assume the data carry some predictive signal.
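
As a sanity check on that reading of ROC, here is a minimal base-R sketch. The helper `auc_rank` is a hypothetical name, not a caret function; it computes the ROC AUC from the Mann-Whitney rank statistic. The class counts are the ones reported by summary(x) further down. A model that assigns the same score to every row comes out at exactly 0.5:

```r
# ROC AUC via the Mann-Whitney rank statistic; tied scores get average ranks.
auc_rank <- function(scores, labels, positive) {
  is_pos <- labels == positive
  n_pos <- as.numeric(sum(is_pos))
  n_neg <- as.numeric(sum(!is_pos))
  r <- rank(scores)  # ties.method = "average" is the default
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Same class balance as in the question: 6446 X0 rows and 116 X1 rows
labels <- factor(c(rep("X0", 6446), rep("X1", 116)), levels = c("X0", "X1"))

# A "null" model: every row receives the same predicted probability
constant_scores <- rep(116 / 6562, length(labels))
null_auc <- auc_rank(constant_scores, labels, positive = "X1")
null_auc
```

Any scoring rule that cannot rank one row above another ends up at this value, whatever the class balance.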

I'm wondering whether I have misconfigured something:

## tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = TRUE,
  verboseIter = TRUE,
  classProbs = TRUE,
  summaryFunction = my_summary
)

linear_model = train(
  x = training_data %>% select(-Avg_Load_Time),
  y = target,
  trControl = train_control,
  method = "glm", # logistic regression
  family = "binomial",
  metric = "ROC"
)

This gives a ROC of 0.64.

Then I tried a tree:

tree_model = train(
  x = training_data %>% select(-Avg_Load_Time),
  y = target,
  trControl = train_control,
  method = "rpart", # decision tree
  metric = "ROC",
  tuneLength = 20
)

This gives a ROC of 0.5.
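
One thing that would produce a ROC of exactly 0.5 in every resample is a tree that never splits, since it assigns the same class probability to every row. The sketch below shows one way to check for that. It uses simulated data: the data frame `sim`, its columns and the coefficients are all made up for illustration and are not the data from this question. It fits rpart directly and counts the distinct predicted probabilities:

```r
library(rpart)

set.seed(123)
n <- 6562
# Hypothetical predictors, loosely shaped like the ones in the question
sim <- data.frame(
  new_visitor   = rbinom(n, 1, 0.35),
  log_load_time = rnorm(n, mean = 1.56, sd = 1)
)
# Rare positive class (roughly 2%) with only weak dependence on the predictors
p <- plogis(-4.2 + 0.4 * sim$new_visitor + 0.1 * sim$log_load_time)
sim$target <- factor(ifelse(runif(n) < p, "X1", "X0"), levels = c("X0", "X1"))

fit <- rpart(target ~ ., data = sim, method = "class")
probs <- predict(fit, sim, type = "prob")[, "X1"]

# A count of 1 means the tree is root-only and cannot rank any row above another
n_distinct_probs <- length(unique(probs))
n_distinct_probs
```

For the model above, printing tree_model$finalModel should show whether the fitted tree has any splits. If it does not, the complexity parameter cp (see rpart.control) may be pruning every candidate split, which is only a guess until the fitted tree has been inspected.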

Here are all the evaluation metrics for the two models:

> summary(results)

Call:
summary.resamples(object = results)

Models: logit, tree 
Number of resamples: 100 

Accuracy 
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit 0.9817212 0.9824695 0.9824695 0.9823225 0.9824695 0.9824829    0
tree  0.9817352 0.9824695 0.9824695 0.9823226 0.9824695 0.9824695    0

AUC 
           Min.   1st Qu.   Median      Mean   3rd Qu.     Max. NA's
logit 0.9867658 0.9888867 0.990663 0.9896725 0.9907191 0.991328    0
tree  0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.000000    0

F 
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit 0.9907763 0.9911572 0.9911572 0.9910824 0.9911572 0.9911640    0
tree  0.9907834 0.9911572 0.9911572 0.9910825 0.9911572 0.9911572    0

Kappa 
      Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit    0       0      0    0       0    0    0
tree     0       0      0    0       0    0    0

Precision 
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit 0.9817212 0.9824695 0.9824695 0.9823225 0.9824695 0.9824829    0
tree  0.9817352 0.9824695 0.9824695 0.9823226 0.9824695 0.9824695    0

Recall 
      Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit    1       1      1    1       1    1    0
tree     1       1      1    1       1    1    0

ROC 
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit 0.5741854 0.6315647 0.6589653 0.6448492 0.6685837 0.6909468    0
tree  0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000    0

Sens 
      Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit    1       1      1    1       1    1    0
tree     1       1      1    1       1    1    0

Spec 
      Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
logit    0       0      0    0       0    0    0
tree     0       0      0    0       0    0    0

AUC is prAUC, and ROC is ROC AUC. The numbers for the logistic regression are in line with expectations, but something seems off with the tree, since its ROC is 0.5. Is something wrong with my train() configuration?

Some more details about my data:

x is a data frame with the target joined to the training data

summary(x)
 userTypeNewVisitor deviceCategorydesktop Traffic_TypePaidTraffic Log_Avg_Load_Time Avg_Load_Time    target   
 Min.   :0.0000     Min.   :0.0000        Min.   :0.0000          Min.   :-1.4271   Min.   :  0.24   X0:6446  
 1st Qu.:0.0000     1st Qu.:0.0000        1st Qu.:0.0000          1st Qu.: 0.8416   1st Qu.:  2.32   X1: 116  
 Median :0.0000     Median :0.0000        Median :0.0000          Median : 1.4516   Median :  4.27            
 Mean   :0.3478     Mean   :0.2138        Mean   :0.4139          Mean   : 1.5607   Mean   : 10.18            
 3rd Qu.:1.0000     3rd Qu.:0.0000        3rd Qu.:1.0000          3rd Qu.: 2.1668   3rd Qu.:  8.73            
 Max.   :1.0000     Max.   :1.0000        Max.   :1.0000          Max.   : 6.1834   Max.   :484.62    
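
With 6446 X0 rows and 116 X1 rows, the Accuracy, Sens and Spec values reported above for both models are what a classifier that always predicts X0 would produce. Here is a minimal base-R check of that arithmetic. It assumes that X0, the first factor level, is treated as the event class, which is consistent with Sens = 1 and Spec = 0 in the output:

```r
n_x0 <- 6446  # X0 count from summary(x)
n_x1 <- 116   # X1 count from summary(x)

truth <- factor(c(rep("X0", n_x0), rep("X1", n_x1)), levels = c("X0", "X1"))
# A classifier that ignores the predictors and always predicts the majority class
pred <- factor(rep("X0", n_x0 + n_x1), levels = c("X0", "X1"))

cm <- table(pred, truth)
accuracy <- sum(diag(cm)) / sum(cm)
sens <- cm["X0", "X0"] / sum(cm[, "X0"])  # share of X0 rows predicted as X0
spec <- cm["X1", "X1"] / sum(cm[, "X1"])  # share of X1 rows predicted as X1

round(c(accuracy = accuracy, sens = sens, spec = spec), 4)
```

The accuracy works out to 6446 / 6562, about 0.9823, which matches the mean Accuracy in the first results table.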

And here is the dput for the true target class only:

> dput(glimpse(x %>% filter(target == "X1")))
Observations: 116
Variables: 6
$ userTypeNewVisitor      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, …
$ deviceCategorydesktop   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Traffic_TypePaidTraffic <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, …
$ Log_Avg_Load_Time       <dbl> 0.58221562, 0.97077892, 0.98954119, 1.80500470, 1.37371558, 2.38508631, 2.47232787, 2.00417906, 1.43270073, 1.19694819, 0.44468582, 1.68824909, 1.34025042, 1.06815308, 1.28923265, 1.6…
$ Avg_Load_Time           <dbl> 1.79, 2.64, 2.69, 6.08, 3.95, 10.86, 11.85, 7.42, 4.19, 3.31, 1.56, 5.41, 3.82, 2.91, 3.63, 4.99, 1.29, 4.60, 8.98, 2.59, 3.01, 5.18, 4.73, 3.75, 3.40, 5.46, 4.65, 3.10, 5.78, 5.81, 1…
$ target                  <fct> X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1, X1,…
structure(list(userTypeNewVisitor = c(1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
1, 1, 0), deviceCategorydesktop = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0), Traffic_TypePaidTraffic = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 
0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 
0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 
0, 0), Log_Avg_Load_Time = c(0.582215619852664, 0.970778917158225, 
0.989541193613748, 1.80500469597808, 1.37371557891303, 2.38508631450579, 
2.47232786758114, 2.00417905717929, 1.43270073393405, 1.19694818938897, 
0.444685821261446, 1.68824909285839, 1.34025042261848, 1.0681530811834, 
1.28923264827676, 1.60743590976343, 0.254642218373581, 1.52605630349505, 
2.19499988231411, 0.951657875711446, 1.10194007876078, 1.64480505627139, 
1.55392520250384, 1.32175583998232, 1.22377543162212, 1.69744878975681, 
1.53686721959926, 1.1314021114911, 1.75440368268429, 1.75958057086382, 
2.46640317822344, 1.12167756159911, 1.41827740697294, -0.0202027073175195, 
1.25276296849537, 1.43508452528932, 2.75110969056266, 0.741937344729377, 
0.405465108108164, 0.78845736036427, 1.45161382724053, 2.00552585872967, 
1.47704872438835, 2.797890905102, 1.11841491596429, 0.86288995514704, 
1.9473377010465, 0.662687973075237, 0.392042087776024, 1.0952733874026, 
0.978326122793608, 1.66770682055808, 1.52822785700856, 1.34807314829969, 
1.51512723296286, 1.3609765531356, 0.85015092936961, 1.41098697371026, 
0.824175442966349, 0.854415328156068, 1.20896034583698, 0.524728528934982, 
1.07840958135059, -0.2484613592985, 0.641853886172395, 1.68824909285839, 
1.29198368164865, 0.751416088683921, 1.16627093714192, 1.83098018238134, 
1.45161382724053, 1.5953389880546, 0.802001585472027, 1.58719230348678, 
1.34025042261848, 1.25561603747777, 1.56024766824333, 0.828551817566148, 
0.582215619852664, 2.23964529322017, 0.871293365943419, 1.87793716546911, 
1.10856261952128, 1.69193913394584, 1.880990602956, 1.35066718347674, 
0.774727167552368, 1.36609165380237, 2.10169215061466, 1.24126858906963, 
0.904218150639886, 1.26412672714568, 1.67896397508271, 0.350656871613169, 
0.431782416425538, 1.54115907168081, 1.45161382724053, 1.34286480319255, 
1.25276296849537, 1.4747630091075, 1.51072193949494, 1.10194007876078, 
0.908258560176891, 2.36273901581379, 1.42791603581071, 2.10778601468898, 
0.615185639090233, 1.24703229378638, 0.810930216216329, 1.19392246847243, 
1.37371557891303, 1.56653041142282, 1.07840958135059, 0.27002713721306, 
1.55180879959746, 0.797507195884188), Avg_Load_Time = c(1.79, 
2.64, 2.69, 6.08, 3.95, 10.86, 11.85, 7.42, 4.19, 3.31, 1.56, 
5.41, 3.82, 2.91, 3.63, 4.99, 1.29, 4.6, 8.98, 2.59, 3.01, 5.18, 
4.73, 3.75, 3.4, 5.46, 4.65, 3.1, 5.78, 5.81, 11.78, 3.07, 4.13, 
0.98, 3.5, 4.2, 15.66, 2.1, 1.5, 2.2, 4.27, 7.43, 4.38, 16.41, 
3.06, 2.37, 7.01, 1.94, 1.48, 2.99, 2.66, 5.3, 4.61, 3.85, 4.55, 
3.9, 2.34, 4.1, 2.28, 2.35, 3.35, 1.69, 2.94, 0.78, 1.9, 5.41, 
3.64, 2.12, 3.21, 6.24, 4.27, 4.93, 2.23, 4.89, 3.82, 3.51, 4.76, 
2.29, 1.79, 9.39, 2.39, 6.54, 3.03, 5.43, 6.56, 3.86, 2.17, 3.92, 
8.18, 3.46, 2.47, 3.54, 5.36, 1.42, 1.54, 4.67, 4.27, 3.83, 3.5, 
4.37, 4.53, 3.01, 2.48, 10.62, 4.17, 8.23, 1.85, 3.48, 2.25, 
3.3, 3.95, 4.79, 2.94, 1.31, 4.72, 2.22), target = structure(c(2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L), .Label = c("X0", "X1"), class = "factor")), row.names = c(NA, 
-116L), class = "data.frame")

I added the parameter sampling = "up" to trainControl; here are the results after that:

> summary(results)

Call:
summary.resamples(object = results)

Models: logit, tree 
Number of resamples: 100 

Accuracy 
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit 0.5907012 0.6009139 0.6051829 0.6071289 0.6051829 0.6336634    0
tree  0.9207317 0.9329779 0.9375476 0.9352332 0.9420732 0.9428354    0

AUC 
            Min.    1st Qu.     Median       Mean    3rd Qu.       Max. NA's
logit 0.98673234 0.98959375 0.98968944 0.98978234 0.99070965 0.99218652    0
tree  0.04041049 0.04273307 0.04570514 0.04832824 0.05028082 0.06251166    0

F 
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit 0.7399516 0.7473481 0.7502411 0.7522966 0.7507218 0.7732202    0
tree  0.9586974 0.9652723 0.9677419 0.9664735 0.9701023 0.9705536    0

Kappa 
              Min.      1st Qu.      Median       Mean    3rd Qu.       Max. NA's
logit  0.005937197  0.013162973 0.017745938 0.01612017 0.01831303 0.02544174    0
tree  -0.008827835 -0.001674637 0.001420743 0.01140350 0.01691454 0.04918470    0

Precision 
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit 0.9845361 0.9855769 0.9885204 0.9876619 0.9885932 0.9910828    0
tree  0.9820993 0.9823293 0.9824281 0.9826814 0.9825119 0.9840383    0

Recall 
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit 0.5927075 0.6007752 0.6035687 0.6076647 0.6051202 0.6361521    0
tree  0.9363848 0.9487975 0.9534884 0.9508218 0.9565555 0.9588829    0

ROC 
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit 0.5772759 0.6391583 0.6480283 0.6462557 0.6620063 0.7048099    0
tree  0.4912470 0.4990226 0.5019395 0.5105331 0.5160169 0.5444396    0

Sens 
           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
logit 0.5927075 0.6007752 0.6035687 0.6076647 0.6051202 0.6361521    0
tree  0.9363848 0.9487975 0.9534884 0.9508218 0.9565555 0.9588829    0

Spec 
            Min.    1st Qu.     Median       Mean    3rd Qu.      Max. NA's
logit 0.47826087 0.50000000 0.60869565 0.57826087 0.60869565 0.6956522    0
tree  0.04347826 0.04347826 0.04347826 0.06884058 0.08333333 0.1304348    0

Was adding sampling = "up" the appropriate course of action?
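
For context, up-sampling resamples the minority class with replacement until the classes are balanced. The sketch below is a rough base-R illustration of the idea on the class counts from summary(x). It is not caret's actual implementation, and the variable names are hypothetical:

```r
set.seed(123)
target <- factor(c(rep("X0", 6446), rep("X1", 116)), levels = c("X0", "X1"))

# Row indices grouped by class
idx_by_class <- split(seq_along(target), target)
n_max <- max(lengths(idx_by_class))

# Keep every original row; top up smaller classes by sampling with replacement
up_idx <- unlist(lapply(idx_by_class, function(idx) {
  if (length(idx) < n_max) {
    c(idx, sample(idx, n_max - length(idx), replace = TRUE))
  } else {
    idx
  }
}), use.names = FALSE)

balanced_counts <- table(target[up_idx])
balanced_counts
```

As far as I understand, caret applies the sampling only to the training portion of each resample, so the held-out folds keep the original class balance. That would explain why the ROC barely moves while Accuracy, Sens and Spec change.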

...