I am trying to tokenize long texts into individual sentences:
library(dplyr)

dat <- data.frame(
  text = c("hi i am Apple, not an orange. that is an orange",
           "hello i am banana, not an pineapple. that is an pineapple"),
  received = c(1, 0),
  stringsAsFactors = FALSE
)

dat <- dat %>%
  mutate(token = sent_detect(text, language = "en"))
but I get this error:
Error: Column `token` must be length 2 (the number of rows) or one, not 3
This is because the sent_detect function (defined below) is not vectorised over rows: as.String() collapses the whole text column into a single string, and the detector returns one element per sentence (3 here), which matches neither the number of rows nor 1.
library(openNLP)
library(NLP)

sent_detect <- function(text, language = "en") {
  # Sentence annotator: the Apache OpenNLP Maxent sentence detector,
  # using the default model for the given language
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language)
  # Convert the input to class String from package NLP
  text <- as.String(text)
  # Compute sentence boundary annotations
  sentence.boundaries <- annotate(text, sentence_token_annotator)
  # Subset the String by the annotations to extract the sentences
  sentences <- text[sentence.boundaries]
  return(sentences)
}
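To illustrate the mismatch, calling the helper on the first row alone already yields one element per sentence (the output shown is what I would expect; the exact split depends on the OpenNLP model):

sent_detect("hi i am Apple, not an orange. that is an orange", language = "en")
# Expected: [1] "hi i am Apple, not an orange." "that is an orange"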
I have been looking at purrr::map, but I don't know how to apply it in this situation; my rough attempt is shown after the expected output below.
I expect a result that looks roughly like this:
text                                                          received  token
"hi i am Apple, not an orange. that is an orange"                    1  "hi i am Apple, not an orange."
"hi i am Apple, not an orange. that is an orange"                    1  "that is an orange"
"hello i am banana, not an pineapple. that is an pineapple"          0  "hello i am banana, not an pineapple."
"hello i am banana, not an pineapple. that is an pineapple"          0  "that is an pineapple"