Here is a (rather complicated) solution to a (rather complicated) problem:
Data:
df <- data.frame(
id = 1:2,
amenities = c('{"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating,Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"}',
'{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed","Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide}'))
Data preparation:
amenities_clean <- gsub('[{}"]', '', df$amenities) # remove unwanted stuff
amenities_split <- strsplit(amenities_clean, ",") # split rows into individual amenities
amenities_unique <- unique(unlist(strsplit(amenities_clean, ","))) # get a list of unique amenities
df[amenities_unique] <- NA # set up the columns for each amenity
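To make the intermediate objects easier to picture, here is a minimal sketch of the same three steps on a shortened input. The `toy` vector and the `toy_*` names are made up for illustration and are not part of the data above:

```r
# Hypothetical two-row input in the same format as df$amenities
toy <- c('{TV,"Cable TV",Kitchen}', '{Kitchen,"Wireless Internet"}')

toy_clean  <- gsub('[{}"]', '', toy)      # drop braces and double quotes
toy_split  <- strsplit(toy_clean, ",")    # list: one character vector per row
toy_unique <- unique(unlist(toy_split))   # amenities in order of first appearance

str(toy_split)
toy_unique
```

Note that `strsplit` returns a list, so a single row's amenities are retrieved with `toy_split[[1]]`, not `toy_split[1]`.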
Now for the core of the analysis, using str_detect from the stringr package:
# record presence/absence of individual amenities in each new column:
library(stringr)
for (i in seq_along(amenities_unique)) {
  for (j in seq_len(nrow(df))) {
    df[amenities_unique][j, i] <-
      ifelse(str_detect(amenities_split[j], names(df[amenities_unique][i])), 1, 0)
  }
}
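This produces warnings, most likely because amenities_split[j] is a one-element list that str_detect has to coerce to a string, but they appear harmless. Note that str_detect matches substrings, so row 1 is also flagged for Internet because it lists "Wireless Internet":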
df
id
1 1
2 2
amenities
1 {"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating,Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"}
2 {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed","Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide}
  Wireless Internet Wheelchair accessible Kitchen Elevator Buzzer/wireless intercom Heating Washer Dryer
1                 1                     1       1        1                        1       1      1     1
2                 1                     0       1        0                        1       1      0     0
  Essentials Shampoo Hangers Laptop friendly workspace TV Cable TV Internet Air conditioning Smoking allowed
1          1       1       1                         1  0        0        1                0               0
2          0       0       0                         0  1        1        1                1               1
  Pets allowed Family/kid friendly Smoke detector Carbon monoxide
1            0                   0              0               0
2            1                   1              1               1
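Because str_detect looks for substrings, Internet is set to 1 for row 1 in the output above even though that row only lists "Wireless Internet". If exact matches are wanted, one possible variant (a sketch, not the approach used above) tests each amenity against the split vectors with `%in%`, which needs no extra package and should not trigger the warnings. The `toy` data frame below is hypothetical:

```r
# Hypothetical input: row 1 has "Wireless Internet" but not "Internet"
toy <- data.frame(
  id = 1:2,
  amenities = c('{"Wireless Internet",Kitchen}', '{Internet,Kitchen,TV}'),
  stringsAsFactors = FALSE
)

toy_split  <- strsplit(gsub('[{}"]', '', toy$amenities), ",")
toy_unique <- unique(unlist(toy_split))

# One 0/1 column per amenity; %in% compares whole elements, not substrings
for (a in toy_unique) {
  toy[[a]] <- vapply(toy_split, function(x) as.integer(a %in% x), integer(1))
}

toy[toy_unique]
```

Here the Internet column is 0 for row 1 and 1 for row 2, which differs from what the substring-based loop would report.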
EDIT:
Alternatively, and perhaps more economically, instead of the nested for loop you can use a function from the apply family, like this (based on the objects amenities_split and amenities_unique from the data preparation step of the first solution):
cbind(df, t(sapply(amenities_split, function(x)
  table(factor(x, levels = amenities_unique)))))
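The reason this works is that converting each row's amenities to a factor with the full set of levels makes `table()` report a count for every amenity, including zeros for the absent ones. A small sketch with made-up values (`levels_all` and `row_items` are hypothetical stand-ins for amenities_unique and amenities_split):

```r
levels_all <- c("TV", "Kitchen", "Heating")         # hypothetical full amenity set
row_items  <- list(c("Kitchen", "TV"), "Heating")   # hypothetical per-row amenities

# table() counts every level; levels missing from x get a count of 0.
# sapply() returns one column per row of data, so t() puts rows back as rows.
counts <- t(sapply(row_items, function(x) table(factor(x, levels = levels_all))))
counts
```

Unlike the str_detect loop, this matches whole amenity names only, and an amenity listed twice in the same row would be counted as 2 rather than 1.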