Let's say we have a table of food products:
product_id <- c(1, 1, 2, 2, 3, 3)
name <- c("Cheddar", "Cheddar", "Apple", "Apple", "Pizza", "Pizza")
category <- c("Dairy", "Cheese", "Food", "Fruit", "Food", NA)
products <- data.frame(product_id, name, category)
With categories set up in an irregular hierarchy:
level_1
[Image: category hierarchy table]
My eventual goal is to delete duplicate products, keeping the lowest hierarchy level (i.e. more detail).
[Image: desired output]
I don't necessarily need to keep the row with the most detail, just the label. So we could also just apply the most detailed category name to all rows in the group, and I can choose the row to remove later. But keep in mind there could be errors: we might have a row of Pizza = Fruit and Pizza = Pizza, which should just be ignored (that would need a manual fix).
Edit: The answers so far have been great, thank you for the help. There's just one thing missing from them:
In my real-world data I have errors in category, so I'm ignoring duplicates that are in different sections of the hierarchy tree. Imagine there's another section of this hierarchy for clothing > pants > jeans. Then if I had these product duplicates:
+---------+----------+
| Product | Category |
+---------+----------+
| Apple | Food |
+---------+----------+
| Apple | Jeans |
+---------+----------+
I wouldn't want to keep "Jeans", even though it's a more specific category.
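One rough way to spot that case is to map every category to its top-level section and flag products whose duplicates land in more than one section. This is only a sketch: the `root_of` lookup is a hypothetical stand-in for the real hierarchy, and placing Fruit under Food is assumed for illustration.

```r
# Hypothetical lookup: category name -> top-level section of the tree.
root_of <- c(Food = "Food", Fruit = "Food",
             Clothing = "Clothing", Pants = "Clothing", Jeans = "Clothing")

# The duplicate pair from the table above.
dupes <- data.frame(
  product  = c("Apple", "Apple"),
  category = c("Food", "Jeans"),
  stringsAsFactors = FALSE
)

# Attach the top-level section to each row.
dupes$root <- unname(root_of[dupes$category])

# Count distinct sections per product; more than one means the
# duplicates sit in different parts of the tree.
n_roots <- tapply(dupes$root, dupes$product, function(x) length(unique(x)))
dupes$conflict <- as.vector(n_roots[dupes$product] > 1)

print(dupes)
```

Rows with `conflict == TRUE` would be left alone for a manual fix rather than collapsed to the deeper label.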
The only solution I can think of is this (and I don't know how to implement it in R):
- Put every level of hierarchy on the products table, and populate based on category
- Group by product
- Check that all rows in group match at level_1
- If yes, check level_2, if yes check level_3
- At each stage, if the mismatch is due to an NA, we have a winner and apply the existing category at that level
- If the mismatch is due to different categories, leave it
Alternatively, a solution could be a new column for the "highest-level common category", if that's an easier way to think about it.
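Here is a minimal base R sketch of the steps listed above, not a definitive implementation. The `hierarchy` table is hypothetical (the real one is only shown as an image), the product "Pear" is made up to show a conflict, and the code assumes each category name appears on only one path of the tree. A category that is missing from the hierarchy is treated the same as NA.

```r
# Hypothetical hierarchy: one row per path, NA where a branch stops early.
hierarchy <- data.frame(
  level_1 = c("Dairy",  "Food",  "Food",  "Clothing"),
  level_2 = c("Cheese", "Fruit", "Pizza", "Pants"),
  level_3 = c(NA,       NA,      NA,      "Jeans"),
  stringsAsFactors = FALSE
)
level_cols <- c("level_1", "level_2", "level_3")

# Build one path per category name, whatever level it sits at.
paths <- do.call(rbind, lapply(seq_along(level_cols), function(i) {
  p <- hierarchy
  p$category <- p[[level_cols[i]]]
  # Blank out the levels below the one this category sits at.
  if (i < length(level_cols)) {
    p[level_cols[(i + 1):length(level_cols)]] <- NA_character_
  }
  p[!is.na(p$category), ]
}))
paths <- unique(paths)

# Products from the question, plus a made-up cross-branch duplicate (Pear).
products <- data.frame(
  product_id = c(1, 1, 2, 2, 3, 3, 4, 4),
  name = c("Cheddar", "Cheddar", "Apple", "Apple",
           "Pizza", "Pizza", "Pear", "Pear"),
  category = c("Dairy", "Cheese", "Food", "Fruit",
               "Food", NA, "Fruit", "Jeans"),
  stringsAsFactors = FALSE
)

# Step 1: put every level of the hierarchy on the products table.
idx <- match(products$category, paths$category)
products[level_cols] <- paths[idx, level_cols]

# Steps 2-6: walk the levels top-down within one product group.
resolve_group <- function(g) {
  winner <- NA_character_
  for (col in level_cols) {
    vals <- unique(g[[col]][!is.na(g[[col]])])
    if (length(vals) > 1) return(NA_character_)  # different branches: leave it
    if (length(vals) == 1) winner <- vals        # agreed (or NA-only mismatch)
  }
  winner
}

winners <- sapply(split(products, products$product_id), resolve_group)

# Apply the winning label to every row; keep the original on conflict.
w <- unname(winners[as.character(products$product_id)])
products$resolved <- ifelse(is.na(w), products$category, w)

print(products[c("product_id", "name", "category", "resolved")])
# Cheddar -> Cheese, Apple -> Fruit, Pizza -> Food, Pear rows left unchanged
```

Because the label is written to every row in the group, the choice of which duplicate row to drop can be made afterwards.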
Edit #2 - New datasets
product_id
[Image: updated products table]
level_1
[Image: updated hierarchy table]
Goal:
[Image: desired output]
OR
[Image: alternative desired output]