Question

Let's say we have a table of food products:

product_id <- c(1, 1, 2, 2, 3, 3)
name <- c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza")
category <- c("Dairy", "Cheese", "Food", "Fruit", "Food", NA)
products <- data.frame(product_id, name, category)

With categories set up in an irregular hierarchy:

[Table: category hierarchy with columns level_1, level_2, level_3]

My eventual goal is to delete duplicate products, keeping the lowest hierarchy level (i.e. more detail).

I don't necessarily need to keep the row with the most detail, just the label. So we could also just apply the most detailed category name to all rows in the group, and I can choose the row to remove later. But keep in mind there could be errors: we might have a row of Pizza = Fruit and Pizza = Pizza, which should just be ignored (that would need a manual fix).

Edit: The answers so far have been great, thank you for the help. There's just one thing missing from them:

In my real-world data I have errors in category, so I'm ignoring duplicates that are in different sections of the hierarchy tree. Imagine there's another section of this hierarchy for clothing > pants > jeans. Then if I had these product duplicates:

+---------+----------+
| Product | Category | 
+---------+----------+
|  Apple  |   Food   |
+---------+----------+
|  Apple  |   Jeans  |
+---------+----------+

I wouldn't want to keep "Jeans", even though it's a more specific category.

The only solution I can think of is this (and I don't know how to implement it in R):

1. Put every level of the hierarchy on the products table, populated from category.
2. Group by product.
3. Check that all rows in the group match at level_1.
4. If yes, check level_2; if yes, check level_3.
5. At each stage, if the mismatch is due to an NA, we have a winner and apply the existing category at that level.
6. If the mismatch is due to different categories, leave it.
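
The steps above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the contents of `categories` (including the Clothing > Pants > Jeans branch), the extra product "Pear", and the helper names `category_path` and `resolve_category` are all hypothetical, and each category name is assumed to sit on a single branch of the hierarchy. Instead of adding one column per level, it compares the ancestor path of each category, which amounts to the same level-by-level check.

```r
# Hypothetical hierarchy: one row per branch, one column per level.
categories <- data.frame(
  level_1 = c("Food", "Food", "Clothing"),
  level_2 = c("Dairy", "Fruit", "Pants"),
  level_3 = c("Cheese", NA, "Jeans"),
  stringsAsFactors = FALSE
)

# Products from the question plus a hypothetical conflicting product (id 4).
products <- data.frame(
  product_id = c(1, 1, 2, 2, 3, 3, 4, 4),
  name = c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza", "Pear", "Pear"),
  category = c("Dairy", "Cheese", "Food", "Fruit", "Food", NA, "Food", "Jeans"),
  stringsAsFactors = FALSE
)

# Return the ancestors of a category, ending with the category itself.
# NA or unknown categories give an empty path.
category_path <- function(category, hierarchy) {
  if (is.na(category)) return(character(0))
  hit <- which(as.matrix(hierarchy) == category, arr.ind = TRUE)
  if (nrow(hit) == 0) return(character(0))
  unlist(hierarchy[hit[1, "row"], seq_len(hit[1, "col"])], use.names = FALSE)
}

# For one product: if every known category lies on the path of the deepest
# one, return that deepest category; otherwise return NA to flag a conflict.
resolve_category <- function(cats, hierarchy) {
  paths <- lapply(cats, category_path, hierarchy = hierarchy)
  paths <- paths[lengths(paths) > 0]
  if (length(paths) == 0) return(NA_character_)
  deepest <- paths[[which.max(lengths(paths))]]
  same_branch <- all(vapply(
    paths,
    function(p) all(p == deepest[seq_along(p)]),
    logical(1)
  ))
  if (same_branch) deepest[length(deepest)] else NA_character_
}

# Apply the resolved label to every row of each product group.
products$resolved <- ave(
  products$category, products$product_id,
  FUN = function(x) rep(resolve_category(x, categories), length(x))
)
```

Under these assumptions, Cheddar resolves to "Cheese", Apple to "Fruit", Pizza to "Food", and the conflicting Food / Jeans pair is left as NA for a manual fix.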

Alternatively, a solution could be a new column for the "highest-level common category", if that's an easier way to think about it.
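
One possible reading of the "highest-level common category" is the most specific category shared by every row of a product. A minimal base R sketch under that reading is below; the `categories` contents, the extra product "Pear", and the helper names `expand_levels` and `common_category` are hypothetical, and each category name is assumed to appear on only one branch.

```r
# Hypothetical hierarchy: one row per branch, one column per level.
categories <- data.frame(
  level_1 = c("Food", "Food", "Clothing"),
  level_2 = c("Dairy", "Fruit", "Pants"),
  level_3 = c("Cheese", NA, "Jeans"),
  stringsAsFactors = FALSE
)

# Products from the question plus a hypothetical conflicting product (id 4).
products <- data.frame(
  product_id = c(1, 1, 2, 2, 3, 3, 4, 4),
  name = c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza", "Pear", "Pear"),
  category = c("Dairy", "Cheese", "Food", "Fruit", "Food", NA, "Food", "Jeans"),
  stringsAsFactors = FALSE
)

# Expand one category into a fixed-length vector of its levels,
# padded with NA below the level where the category sits.
expand_levels <- function(category, hierarchy) {
  out <- rep(NA_character_, ncol(hierarchy))
  hit <- which(as.matrix(hierarchy) == category, arr.ind = TRUE)
  if (!is.na(category) && nrow(hit) > 0) {
    depth <- hit[1, "col"]
    out[seq_len(depth)] <- unlist(
      hierarchy[hit[1, "row"], seq_len(depth)], use.names = FALSE
    )
  }
  out
}

# Walk down the levels and keep the last one on which all rows agree.
common_category <- function(cats, hierarchy) {
  levels_mat <- t(vapply(
    cats, expand_levels, character(ncol(hierarchy)), hierarchy = hierarchy
  ))
  # Ignore rows whose category is NA or not in the hierarchy.
  keep <- rowSums(!is.na(levels_mat)) > 0
  levels_mat <- levels_mat[keep, , drop = FALSE]
  common <- NA_character_
  for (j in seq_len(ncol(levels_mat))) {
    vals <- unique(levels_mat[, j])
    if (length(vals) != 1 || is.na(vals)) break
    common <- vals
  }
  common
}

products$common <- ave(
  products$category, products$product_id,
  FUN = function(x) rep(common_category(x, categories), length(x))
)
```

With this data, Cheddar gets "Dairy", Apple and Pizza get "Food", and the Food / Jeans pair gets NA because the rows already disagree at level_1.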

Edit #2 - New datasets

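[Table: new products dataset, first column product_id]

[Table: new categories dataset, first column level_1]

Goal:

[Two alternative target tables, shown as images in the original post]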

Martin Gal · Answer 1 · July 14, 2020

Another dplyr / tidyr option could be

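library(dplyr)
library(tidyr)

# Assumes the hierarchy vectors level_1, level_2 and level_3 are in scope
products %>%
  mutate(level = case_when(category %in% level_1 ~ 1,
                           category %in% level_2 ~ 2,
                           category %in% level_3 ~ 3
                           )) %>%
  group_by(product_id) %>%
  drop_na() %>%       # drop rows with any NA (missing or unmatched category)
  slice_max(level)    # keep the deepest level per product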

which returns

# A tibble: 3 x 4
# Groups:   product_id [3]
  product_id name    category level
       <dbl> <chr>   <chr>    <dbl>
1          1 Cheddar Cheese       3
2          2 Apple   Fruit        2
3          3 Pizza   Food         1

akrun · Answer 2 · July 14, 2020

We can reshape the hierarchy to "long" format and arrange the rows so that the deepest level comes first

library(dplyr)
library(tidyr)
newdat <- categories %>%
              pivot_longer(everything(), names_to = 'product_id',
                       values_drop_na = TRUE) %>% 
              distinct  %>%
              arrange(factor(product_id, levels = rev(names(categories))))

and then use it to match and slice the second dataset, products, after grouping by product_id and name

products %>%
   group_by(product_id, name) %>% 
   slice(na.omit(match(newdat$value, category))[1])
# A tibble: 3 x 3
# Groups:   product_id, name [3]
#  product_id name    category
#       <dbl> <chr>   <chr>   
#1          1 Cheddar Cheese  
#2          2 Apple   Fruit   
#3          3 Pizza   Food

R - find lowest / highest level in hierarchy in group_by
