Computing similarity between texts to find duplicates
0 votes
/ 09 April 2019

I have some data like the following. Because of the way I processed it, I ended up with some duplicated rows, which was more or less unavoidable.

I want to compute the cosine distance between the texts and then try to remove the duplicated values (keeping the observation that has the most text).

Is this the best way to find duplicated text in the data? The texts may differ slightly because a few words have been removed, so unique(text) only solves part of the problem.

Data:

text <- c("Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football.[1][2] These different variations of football are known as football codes.",
          "Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football.[1][2]",
          "Tennis is a racket sport that can be played individually against a single opponent (singles) or between two teams of two players each (doubles). Each player uses a tennis racket that is strung with cord to strike a hollow rubber ball covered with felt over or around a net and into the opponent's court. The object of the game is to maneuver the ball in such a way that the opponent is not able to play a valid return. The player who is unable to return the ball will not gain a point, while the opposite player will.",
          "Tennis is a racket sport that can be played individually against a single opponent (singles) or between two teams of two players each (doubles). Each player uses a tennis racket that is strung with cord to strike a hollow rubber ball covered with felt over or around a net and into the opponent's court. The object of the game is to maneuver the ball in such a way that the opponent is not able to play a valid return.",
          "Rugby refers to the team sports rugby league and rugby union. Legend claims that rugby football was started about 1845 in Rugby School, Rugby, Warwickshire, England, although forms of football in which the ball was carried and tossed date to medieval times. Rugby eventually split into two sports in 1895 when twenty-one clubs split from the original Rugby Football Union, to form the Northern Union (later to be named rugby league in 1922) in the George Hotel, Huddersfield, Northern England over the issue of payment to players, thus making rugby league the first code to turn professional and pay its players, rugby union turned fully professional in 1995. Both sports are run by their respective world governing bodies World Rugby (rugby union) and the Rugby League International Federation (rugby league). Rugby football was one of many versions of football played at English public schools in the 19th century.[1][2] Although rugby league initially used rugby union rules, they are now wholly separate sports. In addition to these two codes, both American and Canadian football evolved from rugby football.")


ID <- c("Foot123", "Foot123", "Ten123", "Ten123", "Rugby123")

data <- data.frame(text, ID)
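
For reference, the cosine similarity mentioned above can be sketched in base R roughly as follows. This is only a minimal bag-of-words illustration: the helper names tokenize and cosine_similarity_matrix and the short docs vector are hypothetical and are not part of my real data or of any package.

```r
# Hypothetical helper: lower-case a text and split it into words.
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  words[nzchar(words)]
}

# Hypothetical helper: pairwise cosine similarity on raw word counts.
# Assumes every document contains at least one word (otherwise the norm is 0).
cosine_similarity_matrix <- function(docs) {
  tokens <- lapply(docs, tokenize)
  vocab <- sort(unique(unlist(tokens)))
  # Term-frequency matrix: one row per document, one column per word
  tf <- do.call(rbind, lapply(tokens, function(tk) {
    as.numeric(table(factor(tk, levels = vocab)))
  }))
  norms <- sqrt(rowSums(tf^2))
  # Dot products divided by the product of the vector lengths
  (tf %*% t(tf)) / (norms %o% norms)
}

# Toy documents, made up for illustration
docs <- c("kicking a ball to score a goal",
          "kicking a ball to score",
          "a racket sport played with a net")
sim <- cosine_similarity_matrix(docs)
round(sim, 2)
```

Cosine distance is then 1 - sim. Values close to 1 in sim would mark candidate duplicates, but the cutoff still has to be chosen by hand.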

1 Answer

1 vote
/ 09 April 2019

Maybe you could use jarowinkler from the RecordLinkage package.

Here is some example code.

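library(RecordLinkage)

# Pairwise Jaro-Winkler similarity: row i compares text[i] with every text
m <- lapply(text, function(x) jarowinkler(x, text))
m <- do.call(rbind, m)
colnames(m) <- paste0('X', seq_len(ncol(m)))
rownames(m) <- paste0('X', seq_len(nrow(m)))

# For each row, take the label of the first text with similarity >= 0.9.
# Picking the first match inside apply() always yields one label per row,
# even when every row has the same number of matches.
sim <- apply(m, 1, function(x) {
  names(x)[x >= 0.9][1]
})

# Texts that share an ID are treated as duplicates of each other
dplyr::tibble(ID = sim, text = text)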

Now you need to decide how similar the texts must be to count as duplicates, that is, which threshold to use in place of 0.9.
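
Once each text has a group label, keeping only the longest text per group (as the question asks) could look something like the sketch below. The data frame dat and its shortened texts are made up for illustration; the group column plays the role of the ID column built from the similarity matrix.

```r
# Hypothetical input: texts sharing a `group` label are treated as duplicates.
dat <- data.frame(
  group = c("X1", "X1", "X3", "X3", "X5"),
  text  = c("Football is a family of team sports. These are known as codes.",
            "Football is a family of team sports.",
            "Tennis is a racket sport. The player will not gain a point.",
            "Tennis is a racket sport.",
            "Rugby refers to the team sports rugby league and rugby union."),
  stringsAsFactors = FALSE
)

# Sort by group with the longest text first, then keep the first row per group
ord <- order(dat$group, -nchar(dat$text))
sorted <- dat[ord, ]
deduped <- sorted[!duplicated(sorted$group), ]
deduped
```

Note that labelling each text by its first match above the threshold is a simple heuristic rather than a full clustering, so it is worth checking the groups by eye before dropping rows.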

...