Speeding up splitting and merging data frame rows in R
/ February 27, 2020

I have a data frame whose rows I want to split into sentences.

df <- data.frame(text=c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..","I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment."), id=c(1,2), stringsAsFactors = FALSE)

I want to split the text column into sentences and get the following:

df <- data.frame (text = c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..", 
                            "I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment.", 
                            "Lately, I haven't been able to view my Online Payment Card.", 
                            "It's prompting me to have to upgrade my account whereas before it didn't.", 
                            "I have used the Card at various online stores before and have successfully used it.", 
                            "But now it's starting to get very frustrating that I have to said upgrade my account.", 
                            "Do fix this|", "**I noticed some users have the same issue|", 
                            "I've been using this app for almost 2 years without any problems.", 
                            "Until, their system just blocked my virtual paying card without any notice.", 
                            "So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs.", 
                            "This app has been a big disappointment."), id = c(1, 2, 1, 1, 
                                                                               1, 1, 1, 1, 2, 2, 2, 2), tag = c("DONE", "DONE", NA, NA, NA, 
                                                                                                                NA, NA, NA, NA, NA, NA, NA), stringsAsFactors = FALSE)

I did this with the code below, but I think the for loop is far too slow. I need to do this for 73,000 rows, so I need a faster approach. Attempt 1:

library("qdap")
df$tag <- NA
for (review_num in 1:nrow(df)) {
  x = sent_detect(df$text[review_num])
  if (length(x) > 1) {
    for (sentence_num in 1:length(x)) {
      df <- rbind(df, df[review_num,])
      df$text[nrow(df)]   <- x[sentence_num]
    }
    df$tag[review_num] <- "DONE"
  }
}

Attempt 2: rows: 73,000, elapsed time: 252 minutes, or about 4 hours

reviews_df <- data.frame(id = character(0), text = character(0))
for (review_num in seq_len(nrow(df))) {
  preprocess_sent <- sent_detect(df$text[review_num])
  if (length(preprocess_sent) > 0) {
    # one row per detected sentence, all carrying the review's id
    x <- data.frame(id = df$id[review_num],
                    text = preprocess_sent)
    # accumulate into the same data frame on every iteration
    reviews_df <- rbind(reviews_df, x)
  }
}

Attempt 3: rows: 29,000, elapsed time: 170 minutes, or about 2.8 hours

library(qdap)
library(dplyr)
library(tidyr)

df <- data.frame(text=c("Lately, I haven't been able to view my Online Payment Card. It's prompting me to have to upgrade my account whereas before it didn't. I have used the Card at various online stores before and have successfully used it. But now it's starting to get very frustrating that I have to said \"upgrade\" my account. Do fix this... **I noticed some users have the same issue..","I've been using this app for almost 2 years without any problems. Until, their system just blocked my virtual paying card without any notice. So, I was forced to apply for an upgrade and it was rejected thrice, despite providing all of my available IDs. This app has been a big disappointment."), id=c(1,2), stringsAsFactors = FALSE)

df %>%
  group_by(text) %>% 
  mutate(sentences = list(sent_detect(df$text))) %>% 
  unnest(cols=sentences) -> out.df

out.df

1 Answer

/ February 28, 2020

It seems strange to me that this takes so long. You can turn your input into a list and use mclapply (if you are not on Windows) to speed things up further. Here is an example using data.table and parallel::mclapply on Womens Clothing E-Commerce Reviews.csv (23k rows). It takes about 21 seconds with lapply and 5.5 seconds with mclapply on 4 cores. Of course, these are not very long reviews and sentences, but it demonstrates the usefulness of running things in parallel.

library(data.table)
library(parallel)
library(qdap)
#> Loading required package: qdapDictionaries
#> Loading required package: qdapRegex
#> Loading required package: qdapTools
#> 
#> Attaching package: 'qdapTools'
#> The following object is masked from 'package:data.table':
#> 
#>     shift
#> Loading required package: RColorBrewer
#> Registered S3 methods overwritten by 'qdap':
#>   method               from
#>   t.DocumentTermMatrix tm  
#>   t.TermDocumentMatrix tm
#> 
#> Attaching package: 'qdap'
#> The following object is masked from 'package:base':
#> 
#>     Filter

dt <- fread("https://raw.githubusercontent.com/NadimKawwa/WomeneCommerce/master/Womens%20Clothing%20E-Commerce%20Reviews.csv")
system.time({
dfl <- setNames(as.list(dt$`Review Text`), dt$V1)
makeDT <- function(x) data.table(text = sent_detect(x))
out.dt <- rbindlist(mclapply(dfl, makeDT, mc.cores=4L), idcol = "id")
out.dt[, tag := NA_character_]
out.dt <- rbind(data.table(id=dt$V1, text=dt$`Review Text`, tag = "DONE"), out.dt)
})
#>    user  system elapsed 
#>  21.078   0.482   5.467
out.dt
#>            id
#>      1:     0
#>      2:     1
#>      3:     2
#>      4:     3
#>      5:     4
#>     ---      
#> 137388: 23484
#> 137389: 23484
#> 137390: 23484
#> 137391: 23485
#> 137392: 23485
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         text
#>      1:                                                                                                                                                                                                                                                                                                                                                                                                                                                                Absolutely wonderful - silky and sexy and comfortable
#>      2:                                                                                                                                                                                                     Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8"".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
#>      3: I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
#>      4:                                                                                                                                                                                                                                                                                                                                                                                         I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!
#>      5:                                                                                                                                                                                                                                                                                                                     This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!
#>     ---                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
#> 137388:                                                                                                                                                                                                                                                                                                                                                                                                                      the medium fits my waist perfectly, but was way too long and too big in the bust and shoulders.
#> 137389:                                                                                                                                                                                                                                                                                                                                                                                                              if i wanted to spend the money, i could get it tailored, but i just felt like it might not be worth it.
#> 137390:                                                                                                                                                                                                                                                                                                                                                                                               side note - this dress was delivered to me with a nordstrom tag on it and i found it much cheaper there after looking!
#> 137391:                                                                                                                                                                                                                                                                                                                                                                                                                         This dress in a lovely platinum is feminine and fits perfectly, easy to wear and comfy, too!
#> 137392:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    highly recommend!
#>          tag
#>      1: DONE
#>      2: DONE
#>      3: DONE
#>      4: DONE
#>      5: DONE
#>     ---     
#> 137388: <NA>
#> 137389: <NA>
#> 137390: <NA>
#> 137391: <NA>
#> 137392: <NA>
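
Since mclapply relies on forking and so is not an option on Windows, a socket cluster with parallel::parLapply might serve as a portable substitute. The following is only a minimal sketch: split_sentences is a hypothetical base-R stand-in for qdap::sent_detect (so the example runs without qdap), and the two toy reviews and the cluster size of 2 are assumptions for illustration.

```r
library(parallel)

# Hypothetical stand-in for qdap::sent_detect(): splits on whitespace that
# follows ., ! or ?. It only exists so the sketch runs without qdap.
split_sentences <- function(x) {
  trimws(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE)[[1]])
}

# Toy input: a character vector named by review id (assumed data)
reviews <- c("1" = "First review. It has two sentences.",
             "2" = "Second review! Short one.")

cl <- makeCluster(2L)                      # socket cluster, also works on Windows
pieces <- parLapply(cl, reviews, split_sentences)
stopCluster(cl)

# Bind once at the end instead of growing a data frame inside a loop
out <- data.frame(id   = rep(names(reviews), lengths(pieces)),
                  text = unlist(pieces, use.names = FALSE),
                  stringsAsFactors = FALSE)
out
```

To use the real sentence detector on the workers, it should be enough to call qdap::sent_detect inside the worker function, assuming qdap is installed on the machine.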

On second thought, maybe your code is the problem. Try changing

df %>%
    group_by(text) %>% 
    mutate(sentences = list(sent_detect(df$text))) %>% 
    unnest(cols=sentences) -> out.df

to

df %>%
    group_by(text) %>% 
    mutate(sentences = list(sent_detect(text))) %>% 
    unnest(cols=sentences) -> out.df

and see whether that was the culprit (I think it was).
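
To illustrate why the two pipelines above differ: inside a grouped mutate, df$text refers to the entire column, so every group re-processes all reviews, whereas the bare text refers only to the current group's rows. Here is a small sketch of the difference. split_sentences is a hypothetical stand-in for qdap::sent_detect and the two-row data frame is made up for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for qdap::sent_detect(), vectorised over its input
split_sentences <- function(x) {
  unlist(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE))
}

df <- data.frame(text = c("One. Two.", "Three. Four. Five."),
                 id = c(1, 2), stringsAsFactors = FALSE)

# df$text: every group splits the whole column again
slow <- df %>%
  group_by(text) %>%
  mutate(sentences = list(split_sentences(df$text))) %>%
  unnest(cols = sentences)

# text: each group splits only its own review
fast <- df %>%
  group_by(text) %>%
  mutate(sentences = list(split_sentences(text))) %>%
  unnest(cols = sentences)

nrow(slow)  # 10: each of the 2 reviews is paired with all 5 sentences
nrow(fast)  # 5: one row per sentence
```

With n reviews the first version does roughly n times the work and also returns n times too many rows, which would be consistent with the multi-hour run times reported in the question.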

...