Question

Я пытаюсь токенизировать длинные предложения:

dat <- data.frame(text = c("hi i am Apple, not an orange. that is an orange","hello i am banana, not an pineapple. that is an pineapple"),
                  received = c(1, 0))

dat <- dat %>%
  mutate(token = sent_detect(text, language = "en"))

, но получаю эту ошибку:

Error: Column `token` must be length 2 (the number of rows) or one, not 3

Это потому, что str_detect функция возвращает список предложений, которые не соответствуют длине исходного фрейма данных.

library(openNLP)
library(NLP)

sent_detect <- function(text, language) {
  # Function to compute sentence annotations using the Apache OpenNLP Maxent sentence detector employing the default model for language 'en'. 
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language)

  # Convert text to class String from package NLP
  text <- as.String(text)

  # Sentence boundaries in text
  sentence.boundaries <- annotate(text, sentence_token_annotator)

  # Extract sentences
  sentences <- text[sentence.boundaries]

  # return sentences
  return(sentences)
}

Я изучаю purrr :: map, но не знаю, как применить его в этой ситуации.

Я ожидаю результата, который будет выглядеть примерно так:

text                                                    received    token
"hi i am Apple, not an orange. that is an orange"           1       "hi i am Apple, not an orange."
"hi i am Apple, not an orange. that is an orange"           1       "that is an orange"
"hello i am banana, not an pineapple. that is an pineapple" 0       "hello i am banana, not an pineapple."
"hello i am banana, not an pineapple. that is an pineapple" 0       "that is an pineapple"

phiver · Answer 1 · 28 мая 2020

Использование tidyr + purrr приведет вас туда. map создаст вложенный вывод, который можно вывести на более высокий уровень с помощью unnest из tidyr.

library(tidyr)

dat %>% 
  mutate(sentences = purrr::map(text, sent_detect, "en")) %>% 
  unnest(sentences)


# A tibble: 4 x 3
  text                                                      received sentences                           
  <chr>                                                        <dbl> <chr>                               
1 hi i am Apple, not an orange. that is an orange                  1 hi i am Apple, not an orange.       
2 hi i am Apple, not an orange. that is an orange                  1 that is an orange                   
3 hello i am banana, not an pineapple. that is an pineapple        0 hello i am banana, not an pineapple.
4 hello i am banana, not an pineapple. that is an pineapple        0 that is an pineapple

Столбец `token` должен иметь длину 2 (количество строк) или один, а не 3.

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

1 Ответ

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Столбец `token` должен иметь длину 2 (количество строк) или один, а не 3.

Пожалуйста, войдите или зарегистрируйтесь чтобы ответить на этот вопрос.

1 Ответ

Пожалуйста, войдите или зарегистрируйтесь что бы добавить комментарий.

Нет похожих вопросов