How to do fuzzy matching with quanteda and kwic?
2 votes
/ January 13, 2020

I have texts written by doctors, and I want to be able to highlight specific words in their context (5 words before and 5 words after the word I am searching for). Say I want to find the word "suicidal". I would then use the kwic function in the quanteda package:

kwic(dataset, pattern = "suicidal", window = 5)

So far so good, but say I want to allow for typos. In this case I want to allow three deviating characters, with no restriction on where in the word they occur.

Is it possible to do this with quanteda's kwic function?

Example:

dataset <- data.frame("patient" = 1:9, "text" = c("On his first appointment, the patient was suicidal when he showed up in my office", 
                                  "On his first appointment, the patient was suicidaa when he showed up in my office",
                                  "On his first appointment, the patient was suiciaaa when he showed up in my office",
                                  "On his first appointment, the patient was suicaaal when he showed up in my office",
                                  "On his first appointment, the patient was suiaaaal when he showed up in my office",
                                  "On his first appointment, the patient was saacidal when he showed up in my office",
                                  "On his first appointment, the patient was suaaadal when he showed up in my office",
                                  "On his first appointment, the patient was icidal when he showed up in my office",
                                  "On his first appointment, the patient was uicida when he showed up in my office"))

dataset$text <- as.character(dataset$text)
kwic(dataset$text, pattern = "suicidal", window = 5)

will give me only the first, correctly spelled sentence.

1 Answer

2 votes
/ January 14, 2020

Great question. We don't have approximate matching as a "valuetype", but that's an interesting idea for future development. In the meantime, I'd suggest generating a list of fixed fuzzy matches with base::agrep() and then matching on those. It would look like this:

library("quanteda")
## Package version: 1.5.2

dataset <- data.frame(
  "patient" = 1:9, "text" = c(
    "On his first appointment, the patient was suicidal when he showed up in my office",
    "On his first appointment, the patient was suicidaa when he showed up in my office",
    "On his first appointment, the patient was suiciaaa when he showed up in my office",
    "On his first appointment, the patient was suicaaal when he showed up in my office",
    "On his first appointment, the patient was suiaaaal when he showed up in my office",
    "On his first appointment, the patient was saacidal when he showed up in my office",
    "On his first appointment, the patient was suaaadal when he showed up in my office",
    "On his first appointment, the patient was icidal when he showed up in my office",
    "On his first appointment, the patient was uicida when he showed up in my office"
  ),
  stringsAsFactors = FALSE
)
corp <- corpus(dataset)

# get unique words
vocab <- tokens(corp, remove_numbers = TRUE, remove_punct = TRUE) %>%
  types()

Use agrep() to generate the closest fuzzy matches. Here I ran it a few times, increasing max.distance slightly each time from the default value of 0.1.

# get closest matches to "suicidal"
near_matches <- agrep("suicidal", vocab,
  max.distance = 0.3,
  ignore.case = TRUE, fixed = TRUE, value = TRUE
)
near_matches
## [1] "suicidal" "suicidaa" "suiciaaa" "suicaaal" "suiaaaal" "saacidal" "suaaadal"
## [8] "icidal"   "uicida"
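
A note on the max.distance value: according to the agrep() documentation, a value below 1 is read as a fraction of the pattern length and rounded up to the next integer, so 0.3 on the eight-character pattern "suicidal" should allow up to three edits. Assuming that reading, the three-character limit from the question can also be passed directly as an integer. A minimal sketch in base R, using a small hand-typed stand-in for vocab (the word list below is an assumption for illustration, not the output of types()):

```r
# Stand-in vocabulary (assumption: in the answer this comes from types(tokens(corp)))
variants <- c("suicidal", "suicidaa", "suiciaaa", "suicaaal", "suiaaaal",
              "saacidal", "suaaadal", "icidal", "uicida")
vocab <- c("On", "his", "first", "appointment", "the", "patient", "was",
           variants, "when", "he", "showed", "up", "in", "my", "office")

# Allow at most three insertions, deletions, or substitutions in total
near_matches_int <- agrep("suicidal", vocab,
  max.distance = 3,
  ignore.case = TRUE, fixed = TRUE, value = TRUE
)
near_matches_int
```

Keep in mind that agrep() looks for an approximate match anywhere inside each string, so a longer token that merely contains something close to the pattern can also be returned.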

Then use this as the pattern argument to kwic():

# use these for fuzzy matching
kwic(corp, near_matches, window = 3)
##                                                        
##  [text1, 9] the patient was | suicidal | when he showed
##  [text2, 9] the patient was | suicidaa | when he showed
##  [text3, 9] the patient was | suiciaaa | when he showed
##  [text4, 9] the patient was | suicaaal | when he showed
##  [text5, 9] the patient was | suiaaaal | when he showed
##  [text6, 9] the patient was | saacidal | when he showed
##  [text7, 9] the patient was | suaaadal | when he showed
##  [text8, 9] the patient was |  icidal  | when he showed
##  [text9, 9] the patient was |  uicida  | when he showed

There are other possibilities based on similar approaches, such as the fuzzyjoin or stringdist packages, but this is a simple solution using only the base package, and it should work very well.
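
As one more base-R variation on the same idea, utils::adist() returns the generalized Levenshtein distance between whole strings, which maps directly onto "at most three deviating characters". A minimal sketch, again with a hand-typed stand-in vocabulary (an assumption for illustration) rather than the types() output:

```r
# Stand-in vocabulary (assumption: normally taken from types(tokens(corp)))
variants <- c("suicidal", "suicidaa", "suiciaaa", "suicaaal", "suiaaaal",
              "saacidal", "suaaadal", "icidal", "uicida")
other_words <- c("on", "his", "first", "appointment", "the", "patient", "was",
                 "when", "he", "showed", "up", "in", "my", "office")
vocab <- c(other_words, variants)

# Edit distance from the target word to every vocabulary entry
d <- drop(adist("suicidal", vocab, ignore.case = TRUE))

# Keep only the entries that differ from the target by at most three edits
within_three <- vocab[d <= 3]
within_three
```

The resulting vector can be passed as the pattern argument to kwic() in the same way as near_matches. Unlike agrep(), this compares each token as a whole, so long tokens that only contain a near match are not selected.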

...