Extracting numbers with decimals from large strings in R
0 votes
/ October 18, 2018

I would like to extract the numbers from this vector of 15 observations:

rs <- c("\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.0\n                    (1 rating)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            9 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.7\n                    (4 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            34 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    3.1\n                    (5 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            22 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    2.4\n                    (14 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            2,106 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.3\n                    (67 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            1,287 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (3 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            30 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n                \n                    \n\n    \n        New\n    \n\n\n                \n\n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    0.0\n                    (0 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            8 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n                \n                    \n\n    \n        Highest Rated\n    \n\n\n                \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (12 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            42 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.4\n                    (6 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            41 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.2\n                    (12 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            115 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.8\n                    (6 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            25 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (19 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            151 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.5\n                    (10 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            385 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.8\n                    (166 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            754 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    3.6\n                    (34 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            3,396 students enrolled\n        \n    \n\n\n    \n\n    "
)

As you can see, these are 15 very long and messy strings. Still, the pattern inside them is easy to identify. Each string contains 3 numbers (using the first observation as an example):

  • Rating: from 0 to 5. For example, 4.0
  • Number of ratings. For example, (1 rating)
  • Students enrolled. For example, 9 students enrolled.

I would like to extract all of these numeric values and build a data frame with 3 columns, one per variable.

I have checked several questions here on Stack Overflow, mostly focused on using gsub() or the stringr package. However, I cannot find a solution that fits my problem.

UPDATE

Here is the code I have tried:

as.numeric(str_extract(rs, "[0-9]+"))
as.numeric(str_extract(rs, "[0-9]+")[[1]])
as.numeric(str_extract(rs, "(?<=\\()[0-9]+(?=\\))"))
as.numeric(sapply(strsplit(rs, " "), "[[", 1))

Answers [ 3 ]

0 votes
/ October 18, 2018

Your problem is that the pattern does not treat "." as part of the number, because inside a string it is just another character. You need to match the digits and the decimal point explicitly.

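library(stringr)
library(magrittr)  # provides %>%
Rating <- as.numeric(str_extract(rs, "[0-9]\\.[0-9]"))
NRatings <- str_extract(rs, "\\([0-9]+") %>% str_replace("\\(", "") %>% as.numeric()  # [0-9]+ keeps multi-digit counts such as 14 or 166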

I'll let you figure out the last one based on these examples ;)
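For the third value, one possible approach (a sketch, not the answerer's own code) is to match the digits that sit directly before " students enrolled". The example below uses base R only and a shortened, hypothetical stand-in for `rs` called `sample_rs`; the name `Enrolled` is likewise an assumption.

```r
# Shortened, hypothetical stand-ins for two elements of `rs`
sample_rs <- c(
  "\n   4.0\n   (1 rating)\n   some text\n   9 students enrolled\n",
  "\n   2.4\n   (14 ratings)\n   some text\n   2,106 students enrolled\n"
)

# Digits (optionally with thousands separators) followed by " students enrolled".
# The lookahead (?= ...) needs perl = TRUE.
hits <- regmatches(
  sample_rs,
  regexpr("[0-9][0-9,]*(?= students enrolled)", sample_rs, perl = TRUE)
)

# Drop the commas before converting, otherwise as.numeric() returns NA
Enrolled <- as.numeric(gsub(",", "", hits, fixed = TRUE))
Enrolled
```

Note that regmatches() drops elements with no match, so on real data it is worth checking that the result has the same length as the input.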

0 votes
/ October 18, 2018

A one-dependency, otherwise base R solution with a commented, readable regex.

It also shows how to clean up the text before processing, in a way you can reuse.

library(stringi)

do.call(
  rbind.data.frame,
  lapply(
    stri_match_all_regex(
      stri_replace_all_regex(
        stri_trim_both(rs),             # clean up outer spaces
        "[[:blank:][:space:]]+", " "    # clean up inner spaces
      ),
      "
([[:digit:]\\.]+)[[:space:]]+\\(([[:digit:],]+)[[:space:]]+rating[s]*\\)# pick up the rating and total number of ratings
[^[:digit:]]*([[:digit:],]+)[[:space:]]+student[s]*[[:space:]]+enrolled                          # pick up the number of students enrolled
",
      opts_regex = stri_opts_regex(comments = TRUE)
    ),
    function(x) {
      as.list(
        setNames(
          x[2:4], c("rating", "n_ratings", "enrolled")
        ),
        stringsAsFactors = FALSE
      )
    }
  )
)

Result:

##    rating n_ratings enrolled
## 2     4.0         1        9
## 21    4.7         4       34
## 3     3.1         5       22
## 4     2.4        14    2,106
## 5     4.3        67    1,287
## 6     4.6         3       30
## 7     0.0         0        8
## 8     4.6        12       42
## 9     4.4         6       41
## 10    4.2        12      115
## 11    4.8         6       25
## 12    4.6        19      151
## 13    4.5        10      385
## 14    4.8       166      754
## 15    3.6        34    3,396

Turning the columns above into numbers after that is fairly straightforward.

0 votes
/ October 18, 2018

With extract from tidyr we can do:
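As a minimal sketch of that last step, assuming the result is a data frame of character columns: strip the thousands separators and convert every column with as.numeric(). The data frame `res` below is a hypothetical two-row stand-in for the output shown above, not the full result.

```r
# Hypothetical two-row stand-in for the extracted result (all character columns)
res <- data.frame(
  rating    = c("4.0", "2.4"),
  n_ratings = c("1", "14"),
  enrolled  = c("9", "2,106"),
  stringsAsFactors = FALSE
)

# Remove commas, then convert each column to numeric.
# Assigning into res[] keeps the data frame structure and column names.
res[] <- lapply(res, function(col) as.numeric(gsub(",", "", col, fixed = TRUE)))

str(res)
```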

library(dplyr)
library(tidyr)

data.frame(rs, stringsAsFactors = FALSE) %>%
  extract(rs, c("Rating", "Number_of_ratings", "Students_enrolled"),
          "(?s)(\\d\\.\\d).*?(\\d+)\\s*ratings?.*?(\\d+(?:,\\d+)?)\\s*students enrolled", 
          convert = TRUE) %>%
  mutate(Students_enrolled = as.numeric(sub(",", "", Students_enrolled)))

Output:

   Rating Number_of_ratings Students_enrolled
1     4.0                 1                 9
2     4.7                 4                34
3     3.1                 5                22
4     2.4                14              2106
5     4.3                67              1287
6     4.6                 3                30
7     0.0                 0                 8
8     4.6                12                42
9     4.4                 6                41
10    4.2                12               115
11    4.8                 6                25
12    4.6                19               151
13    4.5                10               385
14    4.8               166               754
15    3.6                34              3396

Notes:

The regex looks complicated, but it really isn't. extract pulls the match from each capture group (the parts enclosed in parentheses) and turns each one into its own column.

  1. (?s) is a modifier that turns on "DOTALL" mode. It lets the dot . match newline characters as well.

  2. (\\d\\.\\d) matches the Rating pattern

  3. (\\d+)\\s*ratings? matches the Number_of_ratings pattern, but extracts only the digits (\\d+)

  4. (\\d+(?:,\\d+)?)\\s*students enrolled matches the Students_enrolled pattern, but extracts only the digits, with or without a comma

  5. convert = TRUE tries to convert the resulting columns to their best data type, but since Students_enrolled contains commas, an additional mutate is required to convert it to numeric

Normally extract throws an error if the number of capture groups does not equal the number of output columns, but since modifiers like (?s) and non-capturing groups (?:...) do not count as capture groups, the number of capture groups matches the number of columns.
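To see the capture-group counting outside of tidyr, the same pattern can be tried with base R's regexec() and regmatches(). This is only an illustrative sketch: the string `x` is a shortened, hypothetical stand-in for one element of rs, and base R is used here in place of extract.

```r
# Shortened, hypothetical stand-in for one element of `rs`
x <- "\n   2.4\n   (14 ratings)\n   some text\n   2,106 students enrolled\n"

pattern <- "(?s)(\\d\\.\\d).*?(\\d+)\\s*ratings?.*?(\\d+(?:,\\d+)?)\\s*students enrolled"

# regexec() returns the full match followed by one entry per capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

# Full match + 3 capture groups; (?s) and (?:...) add no entries
length(m)

# The three captured values, still as character strings
groups <- m[-1]
groups
```

The non-capturing group (?:,\\d+)? contributes to what the third group matches ("2,106") without producing a fourth column.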

...