Частичное или нечеткое совпадение в R - PullRequest
0 голосов
/ 22 марта 2019

Я хочу провести нечеткое сопоставление на 2 фреймах данных (s1 - данные, а s2 - ссылка) на основе порядка столбцов «Ответ», чтобы получить соответствующий счетчик вопросов и категорию из s2.Пример:

s1 <- data.frame(Category =c("Stationary","TransferRelocationClaim","IMS"),
Question =c( "Where do I get stationary items from?","Process for claiming Transfer relocation allowances.","What is IMS?"),Answer = c("Hey <firstname>, you will find it near helpdesk ","Hey <firstname>, moving to new places can be fun! To claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon).","ims or interview management system is a tool that helps interviewers schedule all the interviews"),
stringsAsFactors = FALSE)

s2 <- data.frame(
Question = c("Where to get books?", "Procedure to order stationary?","I would like to know about my relocation and relocation expenses","tell me about relocation expense claiming","how to claim relocation expense","IMS?"),
Answer = c("Hey Anil, you will find it at the helpdesk.", "Hey, Shekhar, you will find it at the helpdesk.", "hey sonali moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon)","hey piyush moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,assignments ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon). 3. attach the bills to the printout and secure approval sign-off / mail (from the pa support for new joinee relocation claims and the portal approver for existing employees). 4. drop the bills in the portal drop box (the duty manager amp, finance team can confirm the coordinates.", "hey vibha moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,assignments ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon). 3. attach the bills to the printout and secure approval sign-off / mail from the pa support for new joinee relocation claims and the portal approver for existing employees). 4. drop the bills in the portal drop box (the duty manager amp, finance team can confirm the coordinates", "ims or interview management system is a tool that helps interviewers schedule all the interviews")
stringsAsFactors = FALSE)

s1$Response=gsub('[[:punct:] ]+',' ',s1$Response)
s2$Response=gsub('[[:punct:] ]+',' ',s2$Response)
s1$Response <- tolower(s1$Response)
s2$Response <- tolower(s2$Response)
s1$Response<-as.character(s1$Response)
s2$Response<-as.character(s2$Response)
# data =s1, lookup=s2
d.matrix <- stringdistmatrix(a = s2$Response, b = s1$Response, useNames="strings",method="cosine", nthread = getOption("sd_num_thread"))

#list of minimun cosines
cosines<-apply(d.matrix, 2, min)

#return list of the row number of the minimum value
minlist<-apply(d.matrix, 2, which.min) 

#return list of best matching values
matchwith<-s2$Response[minlist]

#below table contains best match and cosines
answer<-data.frame(s1$Response, matchwith, cosines)
t11=merge(x=answer,y=s2, by.x="matchwith", by.y="Response", all.x=TRUE)
View(t11)`

t11 table looks like as shown below Далее мне нужно получить значение s1.Response = 3 для Вопроса: Процесс получения разрешений на перемещение перевода?вместе с названием категории.Пожалуйста, наведите меня на то, как это можно сделать.

Ответы [ 2 ]

0 голосов
/ 22 марта 2019

Если мы немного изменим ваш первый ввод, мы можем использовать пакеты fuzzyjoin / dplyr / stringr следующим образом:

df1 <- data.frame(
  Category = "Stationary",
  Question = "Where do I get stationary items from?",
  Answer = "Hey <firstname>, you will find it <here>.", # <-notice the change!
  stringsAsFactors = FALSE
)

df2 <- data.frame(
    Category = c("Stat1", "Stat1"),
    Question = c("Where to get books?", "Procedure to order stationary?"),
    Answer = c("Hey Anil, you will find it at the helpdesk.", "Hey, Shekhar, you will find it at the helpdesk."),
    stringsAsFactors = FALSE
  )

Мы создаем шаблон регулярного выражения из Answer:

df1 <- dplyr::mutate(
  df1,
  Answer_regex =gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", Answer), # escape special
  Answer_regex = gsub(" *?<.*?> *?",".*?", Answer_regex), # replace place holders by .*?
  Answer_regex = paste0("^",Answer_regex,"$"))  # make sure the match is exact

Мы используем stringr::str_detect с fuzzyjoin::fuzzy_left_join, чтобы найти совпадения:

res <- fuzzyjoin::fuzzy_left_join(df2, df1, by= c(Answer="Answer_regex"), match_fun = stringr::str_detect )
res
#   Category.x                     Question.x                                        Answer.x Category.y
# 1      Stat1            Where to get books?     Hey Anil, you will find it at the helpdesk. Stationary
# 2      Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk. Stationary
#                              Question.y                                  Answer.y                     Answer_regex
# 1 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$
# 2 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$

Тогда мы можем посчитать:

dplyr::count(res,Answer.y)
# # A tibble: 1 x 2
#          Answer.y                               n
#          <chr>                              <int>
# 1 Hey <firstname>, you will find it <here>.     2

Обратите внимание, что я включил пробелы за пределами < и > как часть заполнителей. Если бы я этого не делал, "Hey, Shekhar" не было бы совпадений из-за запятой.


изменить на адрес комментария:

df1 <- dplyr::mutate(df1, Answer_trimmed = gsub("<.*?>", "", Answer))
res <- fuzzy_left_join(df2, df1, by= c(Answer="Answer_trimmed"), 
                       match_fun = function(x,y) stringdist::stringdist(x, y) / nchar(y) < 0.7)
#   Category.x                     Question.x                                        Answer.x Category.y
# 1      Stat1            Where to get books?     Hey Anil, you will find it at the helpdesk. Stationary
# 2      Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk.       <NA>
#                              Question.y                                Answer.y               Answer_trimmed
# 1 Where do I get stationary items from? Hey <firstname>, you will find it here. Hey , you will find it here.
# 2                                  <NA>                                    <NA>                         <NA>


dplyr::count(res,Answer.y)
# # A tibble: 2 x 2
#   Answer.y                                    n
#   <chr>                                   <int>
# 1 <NA>                                        1
# 2 Hey <firstname>, you will find it here.     1
0 голосов
/ 22 марта 2019

Вы можете попробовать сопоставить, используя функцию agrepl, которая позволяет установить максимальное «расстояние», которое является суммой «трансформаций, необходимых для перехода от шаблона к цели. Я бы вынул материал вокруг боковых угловых скобок». с sub:

agrepl(sub("<.+>, ", "", df1$Answer), df2$Answer, 8)
[1]  TRUE  TRUE FALSE

(Примечание: FALSE происходит из-за того, что я изменил второй фрейм данных так, чтобы он имел несовпадающее значение "answer".

Добро пожаловать на сайт PullRequest, где вы можете задавать вопросы и получать ответы от других членов сообщества.
...