R - find the lowest / highest hierarchy level within a group_by - PullRequest
1 vote
/ July 14, 2020

Suppose we have a table of food products:

product_id <- c(1, 1, 2, 2, 3, 3)
name <- c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza")
category <- c("Dairy", "Cheese", "Food", "Fruit", "Food", NA)
products <- data.frame(product_id, name, category)

With categories set up in an irregular hierarchy:

level_1 

My eventual goal is to delete duplicate products, keeping the row at the lowest hierarchy level (i.e. the most detailed category).

I don't necessarily need to keep the row with the most detail, just the label. So we could also just apply the most detailed category name to all rows in the group, and I can choose the row to remove later. But keep in mind there could be errors: we might have a row of Pizza = Fruit and Pizza = Pizza, which should just be ignored (that would need a manual fix).
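
The "apply the most detailed label to every row in the group" variant could be sketched roughly as follows. The categories table here is an assumption, chosen only to be consistent with the example (Food > Dairy > Cheese and Food > Fruit), and depth_lookup is a hypothetical helper, not something from the post. The sketch does not guard against the conflicting-category errors mentioned above.

```r
library(dplyr)

products <- data.frame(
  product_id = c(1, 1, 2, 2, 3, 3),
  name       = c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza"),
  category   = c("Dairy", "Cheese", "Food", "Fruit", "Food", NA),
  stringsAsFactors = FALSE
)

# ASSUMPTION: one row per path from the root, NA where a path stops early.
categories <- data.frame(
  level_1 = c("Food", "Food"),
  level_2 = c("Dairy", "Fruit"),
  level_3 = c("Cheese", NA),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: one row per known category with its depth (1 = top).
depth_lookup <- data.frame(
  category = unlist(categories, use.names = FALSE),
  depth    = rep(seq_along(categories), each = nrow(categories)),
  stringsAsFactors = FALSE
)
depth_lookup <- unique(depth_lookup[!is.na(depth_lookup$category), ])

labelled <- products %>%
  mutate(depth = depth_lookup$depth[match(category, depth_lookup$category)]) %>%
  group_by(product_id) %>%
  # which.max() skips NA depths; the trailing [1] yields NA if no depth is known
  mutate(most_detailed = category[which.max(depth)][1]) %>%
  ungroup()

labelled
```

Every row keeps its original category, and most_detailed carries the deepest label found in its product group, so the choice of which duplicate row to drop can be made later.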


Edit: The answers so far have been great, thank you for the help. There's just one thing missing from them:

In my real-world data I have errors in category, so I'm ignoring duplicates that are in different sections of the hierarchy tree. Imagine there's another section of this hierarchy for clothing > pants > jeans. Then if I had these product duplicates:

+---------+----------+
| Product | Category | 
+---------+----------+
|  Apple  |   Food   |
+---------+----------+
|  Apple  |   Jeans  |
+---------+----------+

I wouldn't want to keep "Jeans", even though it's a more specific category.

The only solution I can think of is this (and I don't know how to implement it in R):

  • Put every level of hierarchy on the products table, and populate based on category
  • Group by product
  • Check that all rows in group match at level_1
  • If yes, check level_2, if yes check level_3
  • At each stage, if the mismatch is due to an NA, we have a winner and apply the existing category at that level
  • If the mismatch is due to different categories, leave it

Alternatively, a solution could be a new column for the "highest-level common category", if that's an easier way to think about it.
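
One way to sketch the steps above is an equivalent check: a group can be resolved only if all of its categories lie on a single root-to-leaf chain, in which case the deepest one wins. The data below is illustrative, not the real dataset, and path_to() and resolve() are hypothetical helpers. The sketch assumes each category name occurs in only one branch of the hierarchy, and it returns NA where the post says to "leave it".

```r
library(dplyr)

# ASSUMPTION: illustrative hierarchy, one row per path, NA where a path stops.
categories <- data.frame(
  level_1 = c("Food", "Food", "Clothing"),
  level_2 = c("Dairy", "Fruit", "Pants"),
  level_3 = c("Cheese", NA, "Jeans"),
  stringsAsFactors = FALSE
)

products <- data.frame(
  product_id = c(1, 1, 2, 2, 3, 3),
  name       = c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza"),
  category   = c("Dairy", "Cheese", "Food", "Jeans", "Food", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: chain from the root down to `x`, or NULL if unknown.
path_to <- function(x, hierarchy) {
  m <- as.matrix(hierarchy)
  hit <- which(m == x, arr.ind = TRUE)   # NA comparisons are ignored by which()
  if (nrow(hit) == 0) return(NULL)
  unname(m[hit[1, "row"], seq_len(hit[1, "col"])])
}

# Hypothetical helper: deepest category if all categories share one chain,
# otherwise NA (conflicting or unknown categories need a manual fix).
resolve <- function(cats, hierarchy) {
  cats <- unique(cats[!is.na(cats)])
  if (length(cats) == 0) return(NA_character_)
  paths <- lapply(cats, path_to, hierarchy = hierarchy)
  if (any(vapply(paths, is.null, logical(1)))) return(NA_character_)
  deepest <- paths[[which.max(lengths(paths))]]
  if (all(cats %in% deepest)) deepest[length(deepest)] else NA_character_
}

resolved <- products %>%
  group_by(product_id) %>%
  mutate(resolved_category = resolve(category, categories)) %>%
  ungroup()

resolved
```

In this made-up data, Cheddar resolves to Cheese and Pizza to Food, while Apple (Food vs. Jeans) is left as NA because the two categories sit in different branches.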


Edit #2 - New datasets

product_id 

level_1 

Goal (either of two outputs, both shown only as images in the original post):

[image: expected output, first variant]

OR

[image: expected output, second variant]

Answers [ 2 ]

1 vote
/ July 14, 2020

Another dplyr / tidyr option could be:

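library(dplyr)
library(tidyr)

# assumes level_1, level_2 and level_3 exist as vectors, as in the question
products %>%
  # tag each category with the depth of the level it belongs to
  mutate(level = case_when(category %in% level_1 ~ 1,
                           category %in% level_2 ~ 2,
                           category %in% level_3 ~ 3
                           )) %>%
  group_by(product_id) %>%
  drop_na() %>%       # discard rows containing NA (here: a missing category)
  slice_max(level)    # keep the deepest category per product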

which returns:

# A tibble: 3 x 4
# Groups:   product_id [3]
  product_id name    category level
       <dbl> <chr>   <chr>    <dbl>
1          1 Cheddar Cheese       3
2          2 Apple   Fruit        2
3          3 Pizza   Food         1

1 vote
/ July 14, 2020

We can reshape categories to "long" format and arrange the rows so that the deepest level comes first:

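library(dplyr)
library(tidyr)

# stack all level columns into one lookup; note that the 'product_id' column
# created here actually holds the level names (level_1, level_2, ...)
newdat <- categories %>%
              pivot_longer(everything(), names_to = 'product_id',
                       values_drop_na = TRUE) %>%
              distinct() %>%
              # deepest level first
              arrange(factor(product_id, levels = rev(names(categories))))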

Then use that to match and slice the second dataset, products, after grouping by product_id and name:

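products %>%
   group_by(product_id, name) %>%
   # newdat$value is ordered deepest-first, so the first match is the most
   # detailed category present in the group
   slice(na.omit(match(newdat$value, category))[1])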
# A tibble: 3 x 3
# Groups:   product_id, name [3]
#  product_id name    category
#       <dbl> <chr>   <chr>   
#1          1 Cheddar Cheese  
#2          2 Apple   Fruit   
#3          3 Pizza   Food    