Handling NAs in a data frame with glmnet (caret)
June 21, 2019

I am currently stuck.

I have a data frame with categorical data, and I am trying to perform feature selection with glmnet (via the R caret package). However, every row of my data frame contains at least one NA.

These are the steps I had in mind:

### Reproducible example data frame
set.seed(123)

library(earth)
library(RANN)
library(caret)
library(tidyverse)

data(etitanic)
df <- etitanic[,-4]
df <- df[,c(2,1,3,4)]
OUTCOME <- df[,1]
x <- df[,c(2:4)]
x <- as.data.frame(lapply(x, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.8, 0.20), size = length(cc), replace = TRUE) ]))
colnames(x) <- c("Predictor_1", "Predictor_2", "Predictor_3")
df <- cbind(OUTCOME, x)
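# Convert every column to a factor; lapply keeps the data-frame structure
# (sapply would first collapse the columns into a character matrix)
df[] <- lapply(df, as.factor)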
df <- df[rowSums(is.na(df)) > 0,]

head(df)
  OUTCOME Predictor_1 Predictor_2 Predictor_3
3       0         1st        <NA>           1
4       0        <NA>        <NA>           1
5       0        <NA>      female        <NA>
6       1         1st        <NA>           0
7       1         1st        <NA>           1
8       0        <NA>        male           0


### STEP 1: convert categorical variables into dummy variables
x <- model.frame(OUTCOME ~ ., df, na.action=NULL)[,-1]

# since all rows contain at least one NA, the data frame remains unchanged


### STEP 2: Partitioning & imputing missing values
trainRowNumbers <- createDataPartition(df$OUTCOME, p=0.8, list=FALSE)
trainData <- df[trainRowNumbers,]
testData <- df[-trainRowNumbers,]

preProcess_missingdata_model <- preProcess(trainData, method='knnImpute')

# Warning in pre_process_options(method, column_types) :
#   The following pre-processing methods were eliminated:
# 'knnImpute', 'center', 'scale'

trainData <- predict(preProcess_missingdata_model, newdata = trainData)
testData <- predict(preProcess_missingdata_model, testData)


### STEP 3: build the model
# Setup a grid range of lambda values
lambda <- 10^seq(-3, 3, length = 100)

# Splitting parameters of the trainData
control <- trainControl(
                method="repeatedcv", 
                number=10, 
                repeats=3,
                savePredictions='final',
                summaryFunction=multiClassSummary
            )

ridge <- train(
    x,
    df$OUTCOME,
    method = "glmnet",
    trControl = control,
    tuneGrid = expand.grid(alpha = 0, lambda = lambda),
    na.action = na.pass
)

# Something is wrong; all the Accuracy metric values are missing: ...
# Warnings:
# ...
# 30: model fit failed for Fold10.Rep3: alpha=0, lambda=1000 Error in (function (x, y, family = c("gaussian", "binomial", "poisson",  :
#   unused argument (na.action = function (object, ...)
#   object)
#
# 31: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  ... :
#   There were missing values in resampled performance measures.

Is there a way to resolve this?
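For reference, here is a minimal sketch of one direction that might work. It is a modeling choice, not an established fix: it treats NA as its own category (base R `addNA`) and then expands the factors with `model.matrix`, so glmnet receives a fully numeric matrix with no missing values. The `toy` data frame and its values are hypothetical stand-ins for `df`. Treating missingness as a level is only reasonable if "missing" can plausibly carry information. The `na.action` argument is left out because, judging by the "unused argument" error above, it appears to be forwarded to glmnet when `train` is called with `x` and `y`.

```r
library(caret)  # train() with method = "glmnet" also needs the glmnet package

set.seed(123)

# Hypothetical toy data standing in for the question's df
n <- 200
toy <- data.frame(
  OUTCOME     = factor(sample(c("0", "1"), n, replace = TRUE)),
  Predictor_1 = factor(sample(c("1st", "2nd", "3rd", NA), n, replace = TRUE)),
  Predictor_2 = factor(sample(c("female", "male", NA), n, replace = TRUE))
)

# Assumption: NA is treated as a category of its own instead of being imputed
predictors <- toy[, -1]
predictors[] <- lapply(predictors, addNA)

# Expand the factors into dummy columns and drop the intercept column
X <- model.matrix(~ ., data = predictors)[, -1]

# Ridge fit through caret's x/y interface, without na.action
ridge <- train(
  x = X,
  y = toy$OUTCOME,
  method = "glmnet",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(-3, 3, length = 20))
)
```

Because `addNA` stores the missing value as a regular level, `model.matrix` keeps every row, and each predictor gets an extra dummy column for the NA level.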

...