We can use separate_rows from tidyr to split Varieties.grown into one row per variety, then right_join with the varieties table so that only the varieties we are interested in are kept. A right join keeps every row of varieties, so a variety that no orchard grows still appears, with NA for Orchard.Name. Finally, group_by(Varieties.grown) and count the non-NA values of Orchard.Name to get the Count for each variety:
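library(dplyr)
library(tidyr)

df %>%
  # one row per variety; the separator is a comma with optional whitespace around it
  separate_rows(Varieties.grown, sep = "\\s?,\\s?") %>%
  # keep every variety of interest, even those that no orchard grows
  right_join(varieties, by = c("Varieties.grown" = "Varieties")) %>%
  group_by(Varieties.grown) %>%
  # rows that exist only in varieties have NA Orchard.Name and contribute 0
  summarize(Count = sum(!is.na(Orchard.Name))) %>%
  rename(Varieties = Varieties.grown)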
Result:
# A tibble: 5 x 2
  Varieties    Count
  <chr>        <int>
1 Cara Cara        2
2 Juice Orange     0
3 Mandarin         1
4 Seville          1
5 Tangerine        1
Data:
df = structure(list(Orchard.Name = c("Orchard 1", "Orchard 2", "Orchard 3"
), City.Name = c("City", "City", "City"), State.Name = c("State",
"State", "State"), Varieties.grown = c("Cara Cara, Mandarin, Juice, Tangerine",
"Cara Cara", "Seville")), class = "data.frame", .Names = c("Orchard.Name",
"City.Name", "State.Name", "Varieties.grown"), row.names = c(NA,
-3L))
varieties = structure(list(Varieties = c("Cara Cara", "Mandarin", "Seville",
"Juice Orange", "Tangerine")), .Names = "Varieties", row.names = c(NA,
-5L), class = "data.frame")
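As a cross-check that needs no packages, the same counts can be sketched in base R. This is an alternative to the dplyr/tidyr pipeline, not part of it: it swaps the right join for a factor whose levels are the varieties of interest, so names outside that list (here "Juice") become NA and are dropped by table(), while listed varieties with no match get a zero. The data is rebuilt inline so the block runs on its own, and the name `grown` is introduced only for illustration.

```r
# Same example data as above, reduced to the columns that matter here
df <- data.frame(
  Orchard.Name = c("Orchard 1", "Orchard 2", "Orchard 3"),
  Varieties.grown = c("Cara Cara, Mandarin, Juice, Tangerine",
                      "Cara Cara", "Seville"),
  stringsAsFactors = FALSE
)
varieties <- data.frame(
  Varieties = c("Cara Cara", "Mandarin", "Seville", "Juice Orange", "Tangerine"),
  stringsAsFactors = FALSE
)

# Split every comma-separated string into individual variety names
grown <- unlist(strsplit(df$Varieties.grown, "\\s*,\\s*"))

# Tabulate against the full list of varieties: names not in the list
# become NA (and are ignored), listed names with no match count as 0
counts <- table(factor(grown, levels = varieties$Varieties))

result <- data.frame(
  Varieties = names(counts),
  Count = as.integer(counts),
  stringsAsFactors = FALSE
)
print(result)
```

Unlike the tibble above, the rows come out in the order of varieties$Varieties rather than alphabetically; sort with result[order(result$Varieties), ] if the order matters.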