I have the following input data:
df <- data.frame(id = c(1,2,3,4,5,6), stocks = c("google stock, yahoo product stock", "google stock, yahoo product stock","amazon, yahoo product","yahoo product, amazon","yahoo product stock", "google stock"))
I expected to get this result:
df <- data.frame(id = c(1,2,3,4,5,6), stocks = c("google stock, yahoo product stock", "google stock, yahoo product stock","amazon, yahoo product stock","yahoo product stock, amazon","yahoo product stock", "google stock"))
  combination                        frequency
1 google stock - yahoo product stock         2
2 amazon - yahoo product stock               2
3 yahoo product stock                        1
4 google stock                               1
I tried this:
library(tidyverse)
df %>%
  separate_rows(stocks, sep = ",") %>%
  full_join(df %>%
              separate_rows(stocks), by = c("id" = "id")) %>%
  filter(stocks.x != stocks.y) %>%
  count(stocks.x, stocks.y) %>%
  transmute(stocks = paste(pmax(stocks.x, stocks.y), pmin(stocks.x, stocks.y), sep = "-"),
            n) %>%
  distinct(stocks, .keep_all = TRUE)
but I get this result:
# A tibble: 16 x 2
stocks n
<chr> <int>
1 amazon- yahoo product 2
2 product- yahoo product 2
3 yahoo- yahoo product 2
4 google- yahoo product stock 2
5 product- yahoo product stock 2
6 stock- yahoo product stock 4
7 yahoo- yahoo product stock 2
8 product-amazon 2
9 yahoo-amazon 2
10 google stock-google 3
11 product-google stock 2
12 stock-google stock 5
13 yahoo-google stock 2
14 yahoo product stock-product 1
15 yahoo product stock-stock 1
16 yahoo product stock-yahoo 1
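Single-word fragments such as "stock" and "yahoo" in the output above most likely come from the separators. The second `separate_rows(stocks)` call has no `sep` argument, so tidyr falls back to its default pattern `"[^[:alnum:].]+"`, which also splits on spaces. The first call uses `sep = ","`, which leaves a leading space on every element after the first. The sketch below illustrates the three behaviours on a one-row data frame; the names `one_row`, `default_split`, `comma_split` and `clean_split` are introduced here only for illustration.

```r
library(tidyr)

# One-row example taken from the input data (hypothetical helper object)
one_row <- data.frame(id = 1, stocks = "google stock, yahoo product stock")

# No sep given: the default "[^[:alnum:].]+" splits on spaces as well as commas
default_split <- separate_rows(one_row, stocks)$stocks

# sep = ",": splits on commas only, but keeps the space that follows each comma
comma_split <- separate_rows(one_row, stocks, sep = ",")$stocks

# sep = ",\\s*": splits on a comma plus any whitespace after it
clean_split <- separate_rows(one_row, stocks, sep = ",\\s*")$stocks
```

Using the same `sep = ",\\s*"` in both `separate_rows()` calls should at least keep multi-word names such as "yahoo product stock" intact.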
Using table() is not a good option in my case, because my real data set is larger than this example.
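One possible approach that avoids both the self-join and `table()` is to build a single combination label per `id` and then count the labels. This is a minimal sketch, not a definitive solution. It assumes that a "combination" means the set of stocks listed for one `id`, regardless of order, and it uses the data frame shown in the expected-result section. The names `combos`, `combination` and `frequency` are chosen here to mirror the expected output.

```r
library(dplyr)
library(tidyr)

# Data frame as given in the expected-result section
df <- data.frame(
  id = 1:6,
  stocks = c("google stock, yahoo product stock",
             "google stock, yahoo product stock",
             "amazon, yahoo product stock",
             "yahoo product stock, amazon",
             "yahoo product stock",
             "google stock")
)

combos <- df %>%
  separate_rows(stocks, sep = ",\\s*") %>%  # one row per id/stock pair, no stray spaces
  distinct(id, stocks) %>%                  # ignore a stock repeated within one id
  arrange(id, stocks) %>%                   # sort names so "a, b" and "b, a" give the same label
  group_by(id) %>%
  summarise(combination = paste(stocks, collapse = " - "), .groups = "drop") %>%
  count(combination, name = "frequency", sort = TRUE)
```

With this data, `combos` has four rows, and the two-stock combinations each have a frequency of 2. The order of rows with equal frequency may differ from the expected table. Whether this is fast enough for the real data set would need to be measured.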