How to identify patterns and their frequency in a character column using R?
1 vote
/ May 30, 2020

I have a df that shows people's "chain of activities", which looks like this (snippet at the end of the question):

head(agents)
   id                                                                                                                                                                leg_activity
1   9                                                                                      home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home
2  10 home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home
3  11                                                                                                                                                      home, work, adpt, home
4  96                                                                                                                                home, car, work, car, home, work, adpt, home
5  97                              home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home
6 101                                       home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home

I am interested in detecting patterns in the occurrences of adpt. The simplest way is to use the count() function, which gives me a frequency table as output. Unfortunately, this result can be misleading.

This is what it looks like:

x                                 freq
home, adpt, work, adpt, home      2071
home, adpt, shop, adpt, home      653
home, adpt, education, adpt, home 545
home, pt, work, adpt, home        492
home, adpt, work, pt, home        468
home, adpt, work, home            283
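
For reference, a whole-string frequency table of this kind can be sketched in base R with table(). The `chains` vector below is made-up illustration data, not the real data set, and the column names x and freq simply mirror the output above.

```r
# Made-up example chains (an assumption for illustration, not the real data)
chains <- c("home, adpt, work, adpt, home",
            "home, adpt, work, adpt, home",
            "home, adpt, shop, adpt, home",
            "home, adpt, work, adpt, home, car, shop, car, home")

# Count identical strings and sort by frequency, most frequent first
freq <- as.data.frame(sort(table(x = chains), decreasing = TRUE),
                      responseName = "freq", stringsAsFactors = FALSE)
freq
```

The last chain starts with the most frequent pattern, yet it ends up in a row of its own because only complete strings are compared.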

The problem with this approach is that I cannot detect patterns in longer activity chains; for example:

 home, adpt, education, adpt, education, adpt, home, car, work, car, home, shop, adpt, home

In this case the chain starts with a sequence of activities that occurs very often, but because further activities follow it, the count function does not register it.

Is there a way to use a counting function that also takes into account what happens inside the cell? It would be interesting to have a table that shows all possible combinations and their frequency, for example:

x                                freq
home, adpt, home                 10
home, adpt, home, pt, work, home 4
home, pt, work, home             2

Thank you very much for your help!

Data:

structure(list(id = c(9L, 10L, 11L, 96L, 97L, 101L, 103L, 248L, 
499L, 1044L, 1215L, 1238L, 1458L, 1569L, 1615L, 1626L, 1734L, 
1735L, 1790L, 1912L, 9040L, 14858L, 14859L, 14967L, 15011L, 15012L, 
15015L, 15045L, 15050L, 15058L, 15060L, 15086L, 15088L, 15094L, 
15109L, 15113L, 15152L, 15157L, 15192L, 15193L, 15222L, 15230L, 
15231L, 15234L, 15235L, 15237L, 15256L, 15257L, 15258L, 15269L, 
15270L, 15318L, 15319L, 15338L, 15369L, 15371L, 15396L, 15397L, 
15399L, 15404L, 15505L, 15506L, 15515L, 15516L, 15525L, 15542L, 
15593L, 15602L, 15608L, 15643L, 15667L, 15727L, 15728L, 15729L, 
15752L, 15775L, 15808L, 15851L, 15869L, 15881L, 15882L, 15960L, 
15962L, 15966L, 16058L, 16107L, 16174L, 16229L, 16237L, 16238L, 
16291L, 16333L, 16416L, 16418L, 16449L, 16450L, 16451L, 16491L, 
16506L, 16508L), leg_activity = c("home, adpt, shop, car_passenger, home, adpt, work, adpt, home, work, outside, pt, home", 
"home, pt, outside, pt, home, car, leisure, car, other, car, leisure, car, leisure, car, other, car, leisure, car, other, car, leisure, car, home, adpt, leisure, adpt, home", 
"home, work, adpt, home", "home, car, work, car, home, work, adpt, home", 
"home, adpt, work, car_passenger, leisure, car_passenger, work, adpt, home, car_passenger, outside, car_passenger, outside, car_passenger, home", 
"home, bike, outside, car_passenger, outside, car_passenger, outside, bike, home, adpt, leisure, adpt, home, bike, leisure, bike, home", 
"home, adpt, work, adpt, home, walk, other, pt, home", "home, adpt, work, walk, home, adpt, work, walk, home", 
"home, adpt, leisure, adpt, home, bike, outside, bike, home", 
"home, pt, work, adpt, home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, outside, car, work, car, work, car, home", 
"home, work, leisure, adpt, home", "home, outside, pt, home, adpt, leisure, adpt, home", 
"home, car_passenger, leisure, walk, work, walk, leisure, walk, work, adpt, home, walk, home", 
"home, adpt, work, walk, work, walk, work, pt, home", "home, car, work, pt, leisure, adpt, work, car, home, car, home", 
"home, adpt, other, adpt, home, car, home", "home, adpt, other, adpt, home", 
"home, education, walk, shop, walk, education, pt, outside, home, adpt, leisure, adpt, home", 
"home, adpt, work, adpt, home, walk, home", "home, adpt, work, pt, leisure, adpt, work, adpt, work, adpt, home, adpt, other, walk, home", 
"home, adpt, work, adpt, home, adpt, work, adpt, home, walk, leisure, walk, home", 
"home, adpt, work, adpt, home, work, adpt, home, walk, leisure, walk, home", 
"home, adpt, work, adpt, home, car_passenger, outside, car_passenger, leisure, car_passenger, home, car_passenger, home", 
"home, adpt, other, adpt, home, car, work, car, home", "home, adpt, education, adpt, leisure, adpt, home, walk, leisure, walk, home", 
"home, car_passenger, other, pt, home, walk, other, walk, home, car_passenger, other, walk, home, adpt, other, adpt, home", 
"home, work, pt, work, adpt, work, adpt, home", "home, adpt, leisure, adpt, home, car, shop, car, other, car, home", 
"home, adpt, work, adpt, home, walk, other, adpt, home", "home, adpt, work, adpt, home, car_passenger, leisure, car_passenger, home", 
"home, car, other, car, home, adpt, shop, adpt, home", "home, pt, work, adpt, home", 
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home", 
"home, walk, education, adpt, home, walk, education, walk, home, bike, leisure, bike, home", 
"home, adpt, shop, adpt, home, car, home", "home, adpt, leisure, walk, leisure, walk, leisure, adpt, home", 
"home, adpt, shop, pt, home, adpt, other, adpt, home", "home, adpt, other, adpt, home, car_passenger, leisure, walk, home", 
"home, adpt, work, adpt, home, car_passenger, shop, car_passenger, home", 
"home, adpt, other, adpt, work, adpt, home", "home, adpt, work, adpt, home, adpt, other, walk, shop, walk, home, car, outside, car, outside, car, outside, car, home", 
"home, adpt, other, adpt, home", "home, adpt, education, adpt, home, adpt, education, adpt, home", 
"home, pt, work, adpt, work, adpt, work, adpt, work, adpt, home, adpt, work, adpt, home", 
"home, walk, other, car_passenger, education, walk, home, car_passenger, education, adpt, home", 
"home, walk, shop, walk, home, walk, leisure, adpt, leisure, adpt, home", 
"home, adpt, work, adpt, home, walk, shop, walk, home, walk, leisure, walk, home, walk, home", 
"home, adpt, leisure, adpt, home", "home, walk, leisure, walk, home, adpt, other, adpt, shop, walk, leisure, walk, home", 
"home, pt, leisure, adpt, home, pt, outside, pt, home, bike, leisure, bike, home", 
"home, pt, outside, pt, home, walk, home, walk, other, adpt, shop, pt, home, car_passenger, leisure, adpt, home", 
"home, adpt, work, adpt, home, adpt, shop, adpt, work, adpt, home", 
"home, adpt, shop, adpt, other, walk, home", "home, walk, other, walk, home, walk, home, adpt, other, adpt, home, adpt, shop, adpt, home, car, other, car, home, adpt, other, adpt, home", 
"home, adpt, leisure, pt, home", "home, leisure, adpt, home", 
"home, adpt, leisure, pt, shop, walk, home, walk, shop, walk, home", 
"home, car, outside, car, outside, leisure, car, outside, car, outside, car, home, adpt, other, adpt, home", 
"home, adpt, work, adpt, shop, walk, home", "home, adpt, other, walk, work, adpt, home, adpt, other, adpt, work, adpt, home, adpt, leisure, adpt, home", 
"home, adpt, leisure, adpt, home, car, shop, car, home", "home, walk, shop, adpt, home, car, other, car, home, adpt, other, adpt, home", 
"home, walk, leisure, walk, home, adpt, work, adpt, home", "home, adpt, work, adpt, home", 
"home, adpt, leisure, pt, shop, adpt, home, adpt, leisure, walk, home", 
"home, walk, other, walk, leisure, walk, home, car, leisure, car, home, walk, leisure, adpt, home", 
"home, adpt, work, adpt, home", "home, walk, leisure, walk, home, adpt, leisure, adpt, home, adpt, leisure, walk, home", 
"home, walk, home, walk, shop, walk, home, walk, leisure, walk, home, adpt, other, adpt, home", 
"home, car_passenger, outside, car_passenger, outside, car_passenger, home, adpt, other, adpt, home", 
"home, walk, education, adpt, home", "home, adpt, education, walk, home, bike, education, bike, home", 
"home, adpt, other, adpt, home, adpt, shop, pt, home", "home, adpt, other, adpt, shop, walk, home, adpt, leisure, car_passenger, home", 
"home, adpt, work, adpt, other, adpt, home", "home, adpt, work, adpt, home", 
"home, adpt, work, adpt, home, walk, home", "home, car, work, adpt, leisure, adpt, work, car, home", 
"home, adpt, shop, adpt, home, car, other, car, home, car_passenger, outside, car_passenger, home", 
"home, adpt, work, pt, home, car, shop, car, home", "home, walk, other, adpt, work, adpt, shop, adpt, shop, adpt, home", 
"home, adpt, leisure, adpt, shop, adpt, leisure, pt, home", "home, adpt, leisure, adpt, shop, adpt, home", 
"home, car, outside, car, outside, car, outside, car, outside, car, home, adpt, education, pt, home", 
"home, adpt, work, adpt, home", "home, adpt, shop, adpt, home", 
"home, adpt, education, adpt, home, adpt, education, adpt, home", 
"home, adpt, other, adpt, other, walk, leisure, adpt, other, adpt, home", 
"home, adpt, work, adpt, home", "home, adpt, work, adpt, home, car, other, car, home", 
"home, car, work, car, shop, car, home, adpt, work, adpt, home, car, home", 
"home, walk, other, walk, education, adpt, home, adpt, education, walk, home, walk, home", 
"home, adpt, shop, walk, leisure, adpt, home", "home, adpt, shop, walk, home, adpt, work, adpt, home", 
"home, adpt, leisure, adpt, shop, walk, home", "home, walk, other, adpt, shop, walk, home, walk, other, walk, home, walk, other, walk, other, adpt, home", 
"home, adpt, education, walk, home, walk, education, walk, home, walk, home", 
"home, bike, education, bike, home, adpt, education, adpt, home, walk, home"
)), row.names = c(NA, 100L), class = "data.frame")

1 Answer

1 vote
/ May 30, 2020

I am not entirely sure what exactly you want to do, but as I understand it you are interested in finding patterns in where the adpt activity occurs. This is commonly done in NLP; below is a solution using the tidytext package. I split the leg_activity column into so-called n-grams, that is, into runs of consecutive words. A sequence of two consecutive words is called a bigram, three consecutive words a trigram, and so on. When we then count these n-grams, we learn which activities most often come before adpt and which most often come after it.
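
To make the idea of an n-gram concrete, here is a minimal base R sketch on a single chain taken from the question's data (id 11). The helper `ngrams()` is a hypothetical name introduced only for this illustration; it is not part of tidytext.

```r
# One chain from the question's data (id 11)
chain <- "home, work, adpt, home"

# Split on the ", " separator to get the individual activities
tokens <- strsplit(chain, ", ", fixed = TRUE)[[1]]

# Hypothetical helper: every run of n consecutive tokens, pasted together
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngrams(tokens, 2)
# "home work" "work adpt" "adpt home"
ngrams(tokens, 3)
# "home work adpt" "work adpt home"
```

A chain of k activities yields k - n + 1 n-grams, so longer chains contribute more rows to the counts.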

Here is how to do it for bigrams:

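library(dplyr)     # %>%, filter(), count()
library(tidytext)  # unnest_tokens()
library(stringr)   # str_detect()

df %>%
  unnest_tokens(bigram, leg_activity, token = "ngrams", n = 2) %>%  # one row per bigram
  filter(str_detect(bigram, "adpt")) %>%                            # keep bigrams containing adpt
  count(bigram, sort = TRUE)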

           bigram   n
1       home adpt 100
2       adpt home  97
3       work adpt  51
4       adpt work  48
5    leisure adpt  27
6      adpt other  26
7      other adpt  26
8    adpt leisure  24
9       adpt shop  22
10      shop adpt  13
11 adpt education  10
12 education adpt  10

So adpt is most often preceded by "home", and "home" is also the activity that most often comes right after "adpt". If we are interested in three activities that occur consecutively and include "adpt", we can do the same for trigrams:

df %>%
  unnest_tokens(trigram, leg_activity, token = "ngrams", n = 3) %>%  # n = 3 instead of 2
  filter(str_detect(trigram, "adpt")) %>%
  count(trigram, sort = TRUE)

                        trigram  n
1                work adpt home 42
2                adpt work adpt 40
3                home adpt work 36
4               home adpt other 22
5               adpt other adpt 21
6             home adpt leisure 20
7             leisure adpt home 19
8               other adpt home 18
9             adpt leisure adpt 16
10               adpt home adpt 15
11               home adpt shop 12
12                adpt home car 11
13               adpt home walk 11
14               adpt shop adpt 11
15          home adpt education 10
16          education adpt home  9
[list continues]

This list is considerably longer because there are now more possible combinations. Here is a link to a good tutorial on n-grams if you want to learn more. Is this what you wanted to do?
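
If the table sketched in the question is instead read as a count of the pieces of a chain that run from one "home" to the next, one possible approach is to cut every chain at its "home" entries and tabulate the pieces. This is only one interpretation of the request, not something the question confirms; the `chains` vector is made-up illustration data and `home_tours()` is a hypothetical helper name.

```r
# Made-up example chains (an assumption for illustration, not the real data)
chains <- c("home, adpt, work, adpt, home, car, shop, car, home",
            "home, adpt, work, adpt, home",
            "home, pt, work, adpt, home, adpt, work, adpt, home")

# Hypothetical helper: cut one chain into its home-to-home pieces
home_tours <- function(chain) {
  tokens <- strsplit(chain, ", ", fixed = TRUE)[[1]]
  homes <- which(tokens == "home")
  if (length(homes) < 2) return(character(0))
  vapply(seq_len(length(homes) - 1),
         function(i) paste(tokens[homes[i]:homes[i + 1]], collapse = ", "),
         character(1))
}

# Collect the pieces of all chains and count identical ones
tours <- unlist(lapply(chains, home_tours))
tour_freq <- sort(table(tours), decreasing = TRUE)
tour_freq
```

With this example data, "home, adpt, work, adpt, home" is counted three times even though it appears as a complete chain only once. Patterns that span more than one home-to-home piece would need longer windows, which the n-gram approach above already covers.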

...