If we slightly change your first input, we can use the fuzzyjoin, dplyr and stringr packages as follows:
df1 <- data.frame(
Category = "Stationary",
Question = "Where do I get stationary items from?",
Answer = "Hey <firstname>, you will find it <here>.", # <-notice the change!
stringsAsFactors = FALSE
)
df2 <- data.frame(
Category = c("Stat1", "Stat1"),
Question = c("Where to get books?", "Procedure to order stationary?"),
Answer = c("Hey Anil, you will find it at the helpdesk.", "Hey, Shekhar, you will find it at the helpdesk."),
stringsAsFactors = FALSE
)
We build a regular expression pattern from Answer:
df1 <- dplyr::mutate(
  df1,
  Answer_regex = gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", Answer), # escape regex metacharacters
  Answer_regex = gsub(" *?<.*?> *?", ".*?", Answer_regex), # replace placeholders with .*?
  Answer_regex = paste0("^", Answer_regex, "$") # anchor so the whole string must match
)
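To see what the three steps produce, here is a minimal base-R sketch that applies them to a single template and checks the result with grepl instead of stringr::str_detect. The helper name build_pattern is hypothetical and only for illustration; it is not part of any package.

```r
# Hypothetical helper: turns a template with <placeholders> into an anchored regex.
build_pattern <- function(template) {
  # 1. Escape regex metacharacters so literal text such as "." matches itself.
  escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", template)
  # 2. Replace each <placeholder>, together with adjacent spaces, by a lazy wildcard.
  loosened <- gsub(" *?<.*?> *?", ".*?", escaped)
  # 3. Anchor both ends so the whole answer has to match.
  paste0("^", loosened, "$")
}

pattern <- build_pattern("Hey <firstname>, you will find it <here>.")
answers <- c(
  "Hey Anil, you will find it at the helpdesk.",
  "Hey, Shekhar, you will find it at the helpdesk."
)
matches <- grepl(pattern, answers)

writeLines(pattern) # shows the raw pattern without print()'s backslash escaping
print(data.frame(answers, matches))
```

Use writeLines rather than print to inspect the pattern, since print doubles the backslash in its output.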
We use stringr::str_detect with fuzzyjoin::fuzzy_left_join to find the matches:
res <- fuzzyjoin::fuzzy_left_join(df2, df1, by= c(Answer="Answer_regex"), match_fun = stringr::str_detect )
res
# Category.x Question.x Answer.x Category.y
# 1 Stat1 Where to get books? Hey Anil, you will find it at the helpdesk. Stationary
# 2 Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk. Stationary
# Question.y Answer.y Answer_regex
# 1 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$
# 2 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$
Then we can count:
dplyr::count(res,Answer.y)
# # A tibble: 1 x 2
# Answer.y n
# <chr> <int>
# 1 Hey <firstname>, you will find it <here>. 2
Note that I included the spaces outside < and > as part of the placeholders. Had I not done so, "Hey, Shekhar" would not have matched because of the comma.
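A small sketch of that difference, with both patterns written out by hand for illustration: the strict one keeps the spaces around the placeholders as literal text, the loose one absorbs them into the wildcard.

```r
# Pattern where spaces around the placeholders stay literal.
strict_pattern <- "^Hey .*?, you will find it .*?\\.$"
# Pattern where those spaces are swallowed by the wildcard, as in the approach above.
loose_pattern <- "^Hey.*?, you will find it.*?\\.$"

answers <- c(
  "Hey Anil, you will find it at the helpdesk.",
  "Hey, Shekhar, you will find it at the helpdesk."
)

strict <- grepl(strict_pattern, answers)
loose <- grepl(loose_pattern, answers)
print(data.frame(answers, strict, loose))
```

The strict pattern requires a space right after "Hey", so the answer that starts with "Hey," is rejected, while the loose pattern accepts both.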
Edit to address the comment (this uses your original df1, where the answer ends in "here."):
df1 <- dplyr::mutate(df1, Answer_trimmed = gsub("<.*?>", "", Answer))
res <- fuzzyjoin::fuzzy_left_join(df2, df1, by = c(Answer = "Answer_trimmed"),
  match_fun = function(x, y) stringdist::stringdist(x, y) / nchar(y) < 0.7)
res
# Category.x Question.x Answer.x Category.y
# 1 Stat1 Where to get books? Hey Anil, you will find it at the helpdesk. Stationary
# 2 Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk. <NA>
# Question.y Answer.y Answer_trimmed
# 1 Where do I get stationary items from? Hey <firstname>, you will find it here. Hey , you will find it here.
# 2 <NA> <NA> <NA>
dplyr::count(res,Answer.y)
# # A tibble: 2 x 2
# Answer.y n
# <chr> <int>
# 1 <NA> 1
# 2 Hey <firstname>, you will find it here. 1
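To see which ratios the match function compares against the 0.7 threshold, here is a rough base-R sketch. It swaps in utils::adist (plain Levenshtein distance) for stringdist::stringdist, whose default method is "osa", so the numbers are an approximation under that assumption and may differ slightly from what the join computes.

```r
template <- "Hey <firstname>, you will find it here."
trimmed <- gsub("<.*?>", "", template) # drop the placeholder, as in Answer_trimmed

answers <- c(
  "Hey Anil, you will find it at the helpdesk.",
  "Hey, Shekhar, you will find it at the helpdesk."
)

# Assumption: Levenshtein distance from adist() stands in for stringdist().
distance <- as.vector(utils::adist(answers, trimmed))
ratio <- distance / nchar(trimmed) # edit distance relative to the template length
print(data.frame(answers, distance, ratio, match = ratio < 0.7))
```

Printing the ratios like this makes it easier to pick a threshold for your own data instead of relying on 0.7.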