Основная проблема заключается в том, что вы дали своему текстовому столбцу имя data
, а затем назвали его text
. Попробуйте что-то вроде этого:
library(tidyverse)
library(tidytext)
text <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation. . . Video: . . - #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
"#Copingwiththelockdown... Festac town, Lagos. #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
"Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- tibble(text)
remove_reg <- "&|<|>"
data_df %>%
mutate(text = str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text) %>%
anti_join(get_stopwords()) %>%
filter(str_detect(word, "[a-z]"))
#> Joining, by = "word"
#> # A tibble: 105 x 1
#> word
#> <chr>
#> 1 said
#> 2 cant
#> 3 lil
#> 4 dance
#> 5 party
#> 6 stuck
#> 7 quarantine
#> 8 happy
#> 9 friday
#> 10 cousins
#> # … with 95 more rows
Если вы особенно заинтересованы в данных Twitter, рассмотрите возможность использования token = "tweets"
:
data_df %>%
unnest_tokens(word, text, token = "tweets")
#> Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
#> # A tibble: 121 x 1
#> word
#> <chr>
#> 1 who
#> 2 said
#> 3 we
#> 4 cant
#> 5 have
#> 6 a
#> 7 lil
#> 8 dance
#> 9 party
#> 10 while
#> # … with 111 more rows
Создано в 2020-04-12 (v0.3.0)
Эта опция хорошо обрабатывает хэштеги и имена пользователей.