Excluding bad matches from unstructured date searches in R
1 vote
/ June 12, 2019

I have some highly unstructured date data that contains numerous errors. My current regex does a fairly good job of capturing all the dates, but it also captures numbers that are not dates. Those numbers are usually followed by a unit or label (such as "psi" or "casing") that should help predict whether a given number is a measurement or a date.

uglydates = c(
  "05-01-2018 Worked on PP&E valve. Specimens are unusually active.",
  "55.2 psi containment pressure nominal.",
  "August 11, 2018 Personal Journal, I thought I would like being alone. I was wrong.",
  "34.1 PSI reported on containment unit 34. Loss of pressure, cause unknown.",
  "10 3/4 casing seems to have ruptured. Exterior has numerous punctures",
  "perhaps caused by a wild animal.",
  "1.06.19 Hearing chittering noises in the woods.",
  "Thursday, February 2, 2019 Returned to Bunker, Mr. Higglies is missing.",
  "Fri, February 3, 2019 through Sunday, February 5, 2019 Searched for Mr. Higglies",
  "Thursday, Feb 9, 19 What remained of Mr. Higglies found me...",
  "Bleeding profusely, returning to the silo.",
  "Friday, 2 27 19 - Have not been able to stop bleeding. Don't feel like eating.",
  "Leaving bunker in search of help.",
  "3 27 Can't walk any longer. Going to lie here for just a few minutes.")

library(dplyr)
library(stringr)

# Function for adding parentheses around text
par <- function(x) paste0("(",x,")")

months <- month.name  %>% paste(collapse= "|") %>% par
monab  <- month.abb  %>% paste(collapse= "|") %>% par
days    <- (Sys.Date() + (0:6)) %>% format("%A") %>% paste(collapse= "|") %>% par
dayab   <- (Sys.Date() + (0:6)) %>% format("%a") %>% paste(collapse= "|") %>% par
num <- "([1-9]|[0-3][0-9]|201[6-9])" # 1-9, 00-39, 2016-2019

daydate <- paste(days, dayab, months, monab, num, sep= "|") %>% par

sep <-"[/\\-\\s/\\.,]*" # separators

end <- "[\\s:\\-\\.\n$]" # possible end characters ("$" inside [] is a literal dollar sign, not end of line)

datematch  <- paste0("^(?i)(",daydate,sep,"){1,5}(",end,")")
#"^(?i)(((Wednesday|Thursday|Friday|Saturday|Sunday|Monday|Tuesday)|(Wed|Thu|Fri|Sat|Sun|Mon|Tue)|(January|February|March|April|May|June|July|August|September|October|November|December)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|([1-9]|[0-3][0-9]|201[6-9]))[/\\-\\s/\\.,]*){1,5}([\\s:\\-\\.\n$])"

uglydates %>% str_extract(datematch)
# [1] "05-01-2018 "                 "55.2 "                       "August 11, 2018 "           
# [4] "34.1 "                       "10 3/4 "                     NA                           
# [7] "1.06.19 "                    "Thursday, February 2, 2019 " "Fri, February 3, 2019 "     
# [10] "Thursday, Feb 9, 19 "        NA                            "Friday, 2 27 19 - "         
# [13] NA                            "3 27 "   
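
A side note on the weekday alternation above: it is built by formatting Sys.Date(), so the names follow the session locale, whereas month.name and month.abb are always English. The following is a minimal sketch, assuming the log only ever contains English weekday names, that hard-codes them instead. The variable day_names is introduced here for illustration and is not part of the original code.

```r
# Same helper as in the question: wrap a pattern in parentheses.
par <- function(x) paste0("(", x, ")")

# Assumption: the log uses English weekday names, so spell them out instead of
# deriving them from Sys.Date(), whose output depends on the session locale.
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

days  <- par(paste(day_names, collapse = "|"))
dayab <- par(paste(substr(day_names, 1, 3), collapse = "|"))

days
#> [1] "(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
dayab
#> [1] "(Mon|Tue|Wed|Thu|Fri|Sat|Sun)"
```

The rest of the pattern construction can stay as it is; only the source of the weekday names changes.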

I tried adding a negative lookahead, (?!...), but it does not seem to reject everything I need it to. I want the entire match dropped for those lines, and instead I get a truncated match:

exclude = "(PSI|casing)"
datematch  <- paste0("^(?i)((",daydate,sep,"){1,5}(",end,"))(?!", exclude,")")
# "^(?i)((((Wednesday|Thursday|Friday|Saturday|Sunday|Monday|Tuesday)|(Wed|Thu|Fri|Sat|Sun|Mon|Tue)|(January|February|March|April|May|June|July|August|September|October|November|December)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|([1-9]|[0-3][0-9]|201[6-9]))[/\\-\\s/\\.,]*){1,5}([\\s:\\-\\.\n$]))(?!(PSI|casing))"

uglydates %>% str_extract(datematch)
# [1] "05-01-2018 "                 "55."                         "August 11, 2018 "           
# [4] "34."                         "10 "                         NA                           
# [7] "1.06.19 "                    "Thursday, February 2, 2019 " "Fri, February 3, 2019 "     
# [10] "Thursday, Feb 9, 19 "        NA                            "Friday, 2 27 19 - "         
# [13] NA                            "3 27 "                  

1 Answer

1 vote
/ June 13, 2019

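As written, the negative lookahead only discards the last repetition of the quantified group. When the lookahead fails, the regex engine backtracks, gives up the final optional piece of the match, and retries the lookahead at the shorter position, where it now succeeds. The dummy example below shows the effect; see also, for example, the related question about a regular expression with an optional part and a negative lookahead.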

str_extract("0-0-0 psi", "((0[-]?)+)(?!\\spsi)")
#> [1] "0-0-"

Created on 2019-06-13 by the reprex package (v0.3.0)
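
To make the backtracking visible, the same dummy string can be matched three ways: without the lookahead, with it, and with a possessive quantifier (++), which the ICU engine behind stringr supports and which forbids giving characters back. This is only a diagnostic sketch on the toy pattern; the variable names are illustrative.

```r
library(stringr)

x <- "0-0-0 psi"

# Greedy match with no lookahead: the whole run of digits and dashes.
greedy <- str_extract(x, "(0-?)+")
greedy
#> [1] "0-0-0"

# With the lookahead: the full match is followed by " psi", so the engine
# drops the last repetition and accepts the shorter match.
backtracked <- str_extract(x, "(0-?)+(?!\\spsi)")
backtracked
#> [1] "0-0-"

# Possessive quantifier: no repetition may be given back, so nothing matches.
possessive <- str_extract(x, "(0-?)++(?!\\spsi)")
possessive
#> [1] NA
```

The third result suggests that the truncated matches come from backtracking rather than from the lookahead itself.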

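A simple fix is to change the exclusion pattern to:

exclude <- "(.*(PSI|casing))"

With .* inside the lookahead, the lookahead fails at every position that still has PSI or casing somewhere after it, so backtracking cannot rescue a partial match and the whole line yields NA: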

uglydates = c(
    "05-01-2018 Worked on PP&E valve. Specimens are unusually active.",
    "55.2 psi containment pressure nominal.",
    "August 11, 2018 Personal Journal, I thought I would like being alone. I was wrong.",
    "34.1 PSI reported on containment unit 34. Loss of pressure, cause unknown.",
    "10 3/4 casing seems to have ruptured. Exterior has numerous punctures",
    "perhaps caused by a wild animal.",
    "1.06.19 Hearing chittering noises in the woods.",
    "Thursday, February 2, 2019 Returned to Bunker, Mr. Higglies is missing.",
    "Fri, February 3, 2019 through Sunday, February 5, 2019 Searched for Mr. Higglies",
    "Thursday, Feb 9, 19 What remained of Mr. Higglies found me...",
    "Bleeding profusely, returning to the silo.",
    "Friday, 2 27 19 - Have not been able to stop bleeding. Don't feel like eating.",
    "Leaving bunker in search of help.",
    "3 27 Can't walk any longer. Going to lie here for just a few minutes.")

library(dplyr)
library(stringr)

# Function for adding parentheses around text
par <- function(x) paste0("(",x,")")

months <- month.name  %>% paste(collapse= "|") %>% par
monab  <- month.abb  %>% paste(collapse= "|") %>% par
days    <- (Sys.Date() + (0:6)) %>% format("%A") %>% paste(collapse= "|") %>% par
dayab   <- (Sys.Date() + (0:6)) %>% format("%a") %>% paste(collapse= "|") %>% par
num <- "([1-9]|[0-3][0-9]|201[6-9])" # 1-9, 00-39, 2016-2019

daydate <- paste(days, dayab, months, monab, num, sep= "|") %>% par

sep <-"[/\\-\\s/\\.,]*" # separators

end <- "[\\s:\\-\\.\n$]" # possible end characters ("$" inside [] is a literal dollar sign, not end of line)

exclude <- "(.*(PSI|casing))"
datematch  <- paste0("^(?i)((",daydate,sep,"){1,5}(",end,"))(?!", exclude,")")
# "^(?i)((((Wednesday|Thursday|Friday|Saturday|Sunday|Monday|Tuesday)|(Wed|Thu|Fri|Sat|Sun|Mon|Tue)|(January|February|March|April|May|June|July|August|September|October|November|December)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|([1-9]|[0-3][0-9]|201[6-9]))[/\\-\\s/\\.,]*){1,5}([\\s:\\-\\.\n$]))(?!(.*(PSI|casing)))"

uglydates %>% str_extract(datematch)
#>  [1] "05-01-2018 "                 NA                           
#>  [3] "August 11, 2018 "            NA                           
#>  [5] NA                            NA                           
#>  [7] "1.06.19 "                    "Thursday, February 2, 2019 "
#>  [9] "Fri, February 3, 2019 "      "Thursday, Feb 9, 19 "       
#> [11] NA                            "Friday, 2 27 19 - "         
#> [13] NA                            "3 27 "

Created on 2019-06-13 by the reprex package (v0.3.0)
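
One caveat with .* in the lookahead: it rejects any line that mentions PSI or casing anywhere after the match, even when the leading number is a genuine date. If that is too broad, a possible alternative is to extract first and then inspect only the text immediately after the match. The sketch below assumes a hypothetical helper, extract_unless_followed, and uses a simplified numeric pattern as a stand-in for datematch so that it runs on its own.

```r
library(stringr)

# Hypothetical helper (not from the original answer): extract the first match
# of `pattern`, but return NA when the text right after the match begins with
# one of the words in `exclude` (case-insensitive).
extract_unless_followed <- function(x, pattern, exclude) {
  loc  <- str_locate(x, pattern)                  # start/end of the first match
  hit  <- str_sub(x, loc[, "start"], loc[, "end"])
  rest <- str_sub(x, loc[, "end"] + 1L)           # text following the match
  bad  <- str_detect(rest, regex(paste0("^\\s*", exclude), ignore_case = TRUE))
  hit[!is.na(bad) & bad] <- NA_character_
  hit
}

x <- c("05-01-2018 Worked on valve.",
       "55.2 psi containment pressure nominal.",
       "10 3/4 casing seems to have ruptured.",
       "3 27 Can't walk any longer.",
       "05-01-2018 Pressure back to 30 psi.")

# Simplified stand-in for datematch: a leading run of digits and separators.
pattern <- "^[0-9][0-9 ./\\-]*"

res <- extract_unless_followed(x, pattern, "(psi|casing)")
res
#> [1] "05-01-2018 " NA            NA            "3 27 "       "05-01-2018 "
```

The last element keeps its date because "psi" appears later in the sentence rather than directly after the number; the lookahead with .* would have returned NA for that line.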

...