Очистка данных в R - PullRequest
0 голосов
/ 12 мая 2018

Я бы хотел использовать статистику с сайта Премьер-лиги для классного проекта. Это веб-сайт: https://www.premierleague.com/stats/top/players/goals

Существуют фильтры, которые позволяют нам фильтровать по сезону и другим факторам, а также кнопка внизу страницы, которая позволяет просматривать следующие 20 записей в таблице.

Мой код выглядит следующим образом:

library(tidyverse)
library(rvest)
url <- "https://www.premierleague.com/stats/top/players/goals?se=79"
url %>%
  read_html() %>% 
  html_nodes("table") %>% 
  .[[1]] %>% 
   html_table()

который выводит:

   Rank                  Player            Club         Nationality Stat
1     1            Alan Shearer               -             England  260
2     2            Wayne Rooney         Everton             England  208
3     3             Andrew Cole               -             England  187    
4     4           Frank Lampard               -             England  177
5     5           Thierry Henry               -              France  175
6     6           Robbie Fowler               -             England  163
7     7           Jermain Defoe AFC Bournemouth             England  162
8     8            Michael Owen               -             England  150
9     9           Les Ferdinand               -             England  149
10   10        Teddy Sheringham               -             England  146
11   11        Robin van Persie               -         Netherlands  144
12   12           Sergio Agüero Manchester City           Argentina  143
13   13 Jimmy Floyd Hasselbaink               -         Netherlands  127
14   14            Robbie Keane               -             Ireland  126
15   15          Nicolas Anelka               -              France  125
16   16            Dwight Yorke               - Trinidad And Tobago  123
17   17          Steven Gerrard               -             England  120
18   18              Ian Wright               -             England  113
19   19             Dion Dublin               -             England  111
20   20            Emile Heskey               -             England  110

Тем не менее, при изменении фильтров на сайте (например, в моем случае использования, ограничение таблицы текущим сезоном) и использовании стрелок для доступа к следующим 20 записям в таблице, URL-адрес не изменяется.

Я нашел соответствующие области исходного кода. Это:

<div data-script="pl_stats" data-widget="stats-table" data-current-size="20" 
data-stat="" data-type="player" data-page-size="20" data-page="0" data- 
comps="1" data-num-entries="2162">

<div class="dropDown noLabel topStatsFilterDropdown" data-listener="true">
    <div data-metric="mins_played" class="current currentStatContainer" 
aria-expanded="false">Minutes played</div>
    <ul class="dropdownList" role="listbox">

Я бы хотел иметь возможность изменять поля метрики данных и страницы данных.

Ответы [ 2 ]

0 голосов
/ 17 мая 2018

Другой способ - получить ресурс напрямую.

В вашем браузере откройте Инструменты разработчика ( F12 в Chrome / Chromium), перейдите в «Сеть», обновите ( F5 ) и посмотрите, что выглядит красиво отформатированным JSON. Найдя его, мы копируем адрес ссылки и заголовки (щелкните правой кнопкой мыши ресурс> Скопировать адрес ссылки, Скопировать заголовок запроса), а также выдать себя за браузер.

enter image description here

Здесь нам нужно 2: сначала идентификаторы сезона, а затем данные игрока:

library(httr)
library(purrr)
library(dplyr)

h <- add_headers(
  "Host" = "footballapi.pulselive.com",
  "Connection" = "keep-alive",
  "Origin" = "https://www.premierleague.com",
  "User-Agent" = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36",
  "Content-Type" = "application/x-www-form-urlencoded; charset=UTF-8",
  "Accept" = "*/*",
  "DNT" = "1",
  "Referer" = "https://www.premierleague.com/stats/top/players/goals",
  "Accept-Encoding" = "gzip, deflate, br",
  "Accept-Language" = "fr,en-US;q=0.9,en;q=0.8"
)


r <- GET("https://footballapi.pulselive.com/football/competitions/1/compseasons?page=0&pageSize=100", h)
seasons <- content(r, type = "application/json")$content %>% transpose() %>% map(unlist) %>% {setNames(.$id, .$label)}

res <- seasons %>% 
  map(~ {
    url <- paste0("https://footballapi.pulselive.com/football/stats/ranked/players/goals",
                  "?page=0", "&pageSize=", "250", "&compSeasons=", .x, "&comps=1&compCodeForActivePlayer=EN_PR&altIds=true")
    # Be gentle
    Sys.sleep(0.5 + runif(1))
    r <- GET(url, h)
    jsonlite::flatten(jsonlite::fromJSON(content(r, as = "text"))$stats$content)
  }) %>%
  bind_rows(.id = "season")

glimpse(res)

# Observations: 6,460
# Variables: 35
# $ season                        <chr> "2017/18", "2017/18", "2017/18", "2017/18", "2017/18", "2017/18", "2017/18"...
# $ rank                          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 14, 14, 16, 16, 16, 16, 16, 16, ...
# $ name                          <chr> "goals", "goals", "goals", "goals", "goals", "goals", "goals", "goals", "go...
# $ value                         <dbl> 32, 30, 21, 20, 18, 16, 15, 14, 13, 12, 12, 12, 12, 11, 11, 10, 10, 10, 10,...
# $ description                   <chr> "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", ...
# $ owner.playerId                <dbl> 47184, 6566, 6009, 11259, 31712, 49685, 19511, 48542, 31068, 6288, 11264, 4...
# $ owner.active                  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU...
# $ owner.age                     <chr> "25 years 336 days", "24 years 293 days", "29 years 349 days", "31 years 12...
# $ owner.id                      <dbl> 5178, 3960, 4328, 8979, 4316, 4290, 13511, 6899, 19680, 4503, 8983, 4772, 4...
# $ owner.info.position           <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "M", "M", "F", "M", "F", "F", ...
# $ owner.info.shirtNum           <dbl> 11, 32, 10, 9, NA, 9, 9, 9, 33, 10, 26, 17, 7, 7, 9, 14, 23, 19, 4, 10, 19,...
# $ owner.info.positionInfo       <chr> "Left/Centre/Right Second Striker", "Centre Striker", "Centre Striker", "Ce...
# $ owner.info.loan               <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
# $ owner.nationalTeam.isoCode    <chr> "EG", "GB-ENG", "AR", "GB-ENG", "GB-ENG", "BE", "BR", "FR", "BR", "BE", "DZ...
# $ owner.nationalTeam.country    <chr> "Egypt", "England", "Argentina", "England", "England", "Belgium", "Brazil",...
# $ owner.nationalTeam.demonym    <chr> "Egyptian", "English", NA, "English", "English", "Belgian", "Brazilian", "F...
# $ owner.currentTeam.name        <chr> "Liverpool", "Tottenham Hotspur", "Manchester City", "Leicester City", "Man...
# $ owner.currentTeam.teamType    <chr> "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FIRST", "FI...
# $ owner.currentTeam.shortName   <chr> "Liverpool", "Spurs", "Man City", "Leicester", "Man City", "Man Utd", "Live...
# $ owner.currentTeam.id          <dbl> 10, 21, 11, 26, 11, 12, 10, 1, 11, 4, 26, 131, 21, 25, 4, 1, 21, 10, 6, 7, ...
# $ owner.currentTeam.club.name   <chr> "Liverpool", "Tottenham Hotspur", "Manchester City", "Leicester City", "Man...
# $ owner.currentTeam.club.abbr   <chr> "LIV", "TOT", "MCI", "LEI", "MCI", "MUN", "LIV", "ARS", "MCI", "CHE", "LEI"...
# $ owner.currentTeam.club.id     <dbl> 10, 21, 11, 26, 11, 12, 10, 1, 11, 4, 26, 131, 21, 25, 4, 1, 21, 10, 6, 7, ...
# $ owner.currentTeam.altIds.opta <chr> "t14", "t6", "t43", "t13", "t43", "t1", "t14", "t3", "t43", "t8", "t13", "t...
# $ owner.birth.place             <chr> "Basyoun", NA, "Buenos Aires", NA, "Kingston", "Antwerp", "Maceio", "Lyon",...
# $ owner.birth.date.millis       <dbl> 708566400000, 743817600000, 581212800000, 537321600000, 786844800000, 73725...
# $ owner.birth.date.label        <chr> "15 June 1992", "28 July 1993", "2 June 1988", "11 January 1987", "8 Decemb...
# $ owner.birth.country.isoCode   <chr> "EG", "GB-ENG", "AR", "GB-ENG", "JM", "BE", "BR", "FR", "BR", "BE", "FR", "...
# $ owner.birth.country.country   <chr> "Egypt", "England", "Argentina", "England", "Jamaica", "Belgium", "Brazil",...
# $ owner.birth.country.demonym   <chr> "Egyptian", "English", NA, "English", "Jamaican", "Belgian", "Brazilian", "...
# $ owner.name.display            <chr> "Mohamed Salah", "Harry Kane", "Sergio Agüero", "Jamie Vardy", "Raheem Ster...
# $ owner.name.first              <chr> "Mohamed", "Harry", "Sergio", "Jamie", "Raheem", "Romelu", "Roberto Firmino...
# $ owner.name.last               <chr> "Salah Ghaly", "Kane", "Agüero", "Vardy", "Sterling", "Lukaku", "Barbosa de...
# $ owner.name.middle             <chr> NA, NA, NA, NA, "Shaquille", "Menama", NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
# $ owner.altIds.opta             <chr> "p118748", "p78830", "p37572", "p101668", "p103955", "p66749", "p92217", "p...

Редактировать
Кажется, вам нужны данные «Все сезоны», на самом деле это немного проще:

url <- paste0("https://footballapi.pulselive.com/football/stats/ranked/players/goals?",
              "page=0&pageSize=", 1000, "&comps=1&compCodeForActivePlayer=EN_PR&altIds=true")
r <- GET(url, h)
dat <- jsonlite::flatten(jsonlite::fromJSON(content(r, as = "text"))$stats$content) 

# Observations: 1,000
# Variables: 34
# $ rank                          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, ...
# $ name                          <chr> "goals", "goals", "goals", "goals", "goals", "goals", "goals", "goals", "go...
# $ value                         <dbl> 260, 208, 187, 177, 175, 163, 162, 150, 149, 146, 144, 143, 127, 126, 125, ...
# $ description                   <chr> "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", "Todo: goals", ...
# $ owner.playerId                <dbl> 1440, 49682, 4848, 14004, 25936, 55008, 47174, 6476, 3754, 110022, 93894, 6...
# $ owner.active                  <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, ...
# $ owner.age                     <chr> "47 years 277 days", "32 years 205 days", "46 years 214 days", "39 years 33...
# $ owner.id                      <dbl> 89, 2064, 725, 800, 1659, 277, 1526, 1208, 462, 576, 2616, 4328, 1413, 1692...
# $ owner.info.position           <chr> "F", "F", "F", "M", NA, "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "...
# $ owner.info.shirtNum           <dbl> 9, 10, 20, 18, NA, 27, 18, 10, NA, NA, 32, 10, 18, 20, 39, 19, 8, NA, NA, 1...
# $ owner.info.positionInfo       <chr> "Forward", "Centre Second Striker", "Forward", "Centre Central Midfielder",...
# $ owner.info.loan               <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
# $ owner.nationalTeam.isoCode    <chr> "GB-ENG", "GB-ENG", "GB-ENG", "GB-ENG", "FR", "GB-ENG", "GB-ENG", "GB-ENG",...
# $ owner.nationalTeam.country    <chr> "England", "England", "England", "England", "France", "England", "England",...
# $ owner.nationalTeam.demonym    <chr> "English", "English", "English", "English", "French", "English", "English",...
# $ owner.birth.place             <chr> NA, "Liverpool", NA, "Romford", "Les Ulis", NA, "Beckton", NA, NA, NA, "Rot...
# $ owner.birth.date.millis       <dbl> 19353600000, 498960000000, 56332800000, 267148800000, 240624000000, 1662336...
# $ owner.birth.date.label        <chr> "13 August 1970", "24 October 1985", "15 October 1971", "20 June 1978", "17...
# $ owner.birth.country.isoCode   <chr> "GB-ENG", "GB-ENG", "GB-ENG", "GB-ENG", "FR", "GB-ENG", "GB-ENG", "GB-ENG",...
# $ owner.birth.country.country   <chr> "England", "England", "England", "England", "France", "England", "England",...
# $ owner.birth.country.demonym   <chr> "English", "English", "English", "English", "French", "English", "English",...
# $ owner.name.display            <chr> "Alan Shearer", "Wayne Rooney", "Andrew Cole", "Frank Lampard", "Thierry He...
# $ owner.name.first              <chr> "Alan", "Wayne", "Andrew", "Frank", "Thierry", "Robbie", "Jermain", "Michae...
# $ owner.name.last               <chr> "Shearer", "Rooney", "Cole", "Lampard", "Henry", "Fowler", "Defoe", "Owen",...
# $ owner.name.middle             <chr> NA, "Mark", NA, "James", NA, NA, "Colin", NA, NA, NA, NA, NA, NA, NA, NA, N...
# $ owner.altIds.opta             <chr> "p1", "p13017", "p1820", "p2051", "p1619", "p1794", "p7958", "p1795", "p190...
# $ owner.currentTeam.name        <chr> NA, "Everton", NA, "Manchester City", "New York Red Bulls", NA, "AFC Bourne...
# $ owner.currentTeam.teamType    <chr> NA, "FIRST", NA, "FIRST", "FIRST", NA, "FIRST", NA, NA, NA, "FIRST", "FIRST...
# $ owner.currentTeam.shortName   <chr> NA, "Everton", NA, "Man City", "New York Red Bulls", NA, "AFC Bournemouth",...
# $ owner.currentTeam.id          <dbl> NA, 7, NA, 11, 599, NA, 127, NA, NA, NA, 236, 11, NA, NA, NA, NA, NA, NA, N...
# $ owner.currentTeam.club.name   <chr> NA, "Everton", NA, "Manchester City", "New York Red Bulls", NA, "Bournemout...
# $ owner.currentTeam.club.abbr   <chr> NA, "EVE", NA, "MCI", "NY", NA, "BOU", NA, NA, NA, "FEY", "MCI", NA, NA, NA...
# $ owner.currentTeam.club.id     <dbl> NA, 7, NA, 11, 479, NA, 127, NA, NA, NA, 236, 11, NA, NA, NA, NA, NA, NA, N...
# $ owner.currentTeam.altIds.opta <chr> NA, "t11", NA, "t43", "t399", NA, "t91", NA, NA, NA, "t198", "t43", NA, NA,...

(я не смотрел данные подробно, но на первый взгляд кажется, что это не то же самое, что просто агрегирование:

res %>% 
  group_by(owner.playerId, owner.name.first, owner.name.last) %>% 
  summarise(total_goals = sum(value, na.rm = TRUE)) %>% 
  arrange(desc(total_goals))

# # A tibble: 433 x 4
# # Groups:   owner.playerId, owner.name.first [433]
#    owner.playerId owner.name.first owner.name.last     total_goals
#             <dbl> <chr>            <chr>                     <dbl>
#  1           6566 Harry            Kane                         84
#  2           6009 Sergio           Agüero                       65
#  3          49685 Romelu           Lukaku                       59
#  4          11259 Jamie            Vardy                        57
#  5          93889 Alexis           Sánchez                      46
#  6          14848 Bamidele         Alli                         37
#  7          19511 Roberto Firmino  Barbosa de Oliveira          36
#  8          11264 Riyad            Mahrez                       35
#  9          94221 Olivier          Giroud                       35
# 10          30549 Sadio            Mané                         34
# # ... with 423 more rows
0 голосов
/ 17 мая 2018

Это решение требует, чтобы у вас был доступ к серверу селена.

library(RSelenium) # not on cran (install with devtools::install_github("ropensci/RSelenium"))
library(rvest)

# helper functions ---------------------------

# click_el() solves the problem mentioned here:
# https://stackoverflow.com/questions/11908249/debugging-element-is-not-clickable-at-point-error
click_el <- function(rem_dr, el) {
  rem_dr$executeScript("arguments[0].click();", args = list(el))
}

# wrapper around findElement()
find_el <- function(rem_dr, xpath) {
  rem_dr$findElement("xpath", xpath)
}

# check if an element exists on the dom
el_exists <- function(rem_dr, xpath) {
  maybe_el <- read_html(rem_dr$getPageSource()[[1]]) %>%
    xml_find_first(xpath = xpath)
  length(maybe_el) != 0
}

# try to click on a element if it exists
click_if_exists <- function(rem_dr, xpath) {
  if (el_exists(rem_dr, xpath)) {
    suppressMessages({
      try({
        el <- find_el(rem_dr, xpath)
        el$clickElement()
      }, silent = TRUE
      )
    })
  }
}

# close google adds so they don't get in the way of clicking other elements
maybe_close_ads <- function(rem_dr) {
  click_if_exists(rem_dr, '//a[@id="advertClose" and @class="closeBtn"]')
}

# click on button that requires we accept cookies
maybe_accept_cookies <- function(rem_dr) {
  click_if_exists(rem_dr, "//div[@class='btn-primary cookies-notice-accept']")
}

# parse the data table you're interested in
get_tbl <- function(rem_dr) {
  read_html(rem_dr$getPageSource()[[1]]) %>% 
    html_nodes("table") %>% 
    .[[1]] %>% 
    html_table()
}

# actual execution ---------------------------

# first u need to start selenium server...i'm running the server inside a 
# docker container and having it listen on port 4445 on my local machine
# (see http://rpubs.com/johndharrison/RSelenium-Basics for more details):
`docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.1`

# connect to selenium server from within r
rem_dr <- remoteDriver(
  remoteServerAddr = "localhost", port = 4445L, browserName = "firefox"
)
rem_dr$open()

# go to webpage
rem_dr$navigate("https://www.premierleague.com/stats/top/players/goals")

# close adds
maybe_close_ads(rem_dr)
Sys.sleep(3)

# the seasons to iterate over
start <- 1992:2017 # u may want to replace this with `start <- 1992:1995` when testing
seasons <- paste0(start, "/", substr(start + 1, 3, 4))

# list to hold each season's data
out_list <- vector("list", length(seasons))
names(out_list) <- seasons

for (season in seasons) {

  maybe_close_ads(rem_dr)

  # to filter the data by season, we first need to click on the "filter by season" drop down
  # menu, so that the divs representing the various seasons become active (otherwise, 
  # we can't click them)
  cur_season <- find_el(
    rem_dr, '//div[@class="current" and @data-dropdown-current="FOOTBALL_COMPSEASON" and @role="button"]'
  )
  click_el(rem_dr, cur_season)
  Sys.sleep(3)

  # now we can select the season of interest
  xpath <- sprintf(
    '//ul[@data-dropdown-list="FOOTBALL_COMPSEASON"]/li[@data-option-name="%s"]', 
    season
  )
  season_lnk <- find_el(rem_dr, xpath)
  click_el(rem_dr, season_lnk)
  Sys.sleep(3)

  # parse the table shown on the first page
  tbl <- get_tbl(rem_dr)

  # iterate over all additional pages 
  nxt_page_act <- '//div[@class="paginationBtn paginationNextContainer"]'
  nxt_page_inact <- '//div[@class="paginationBtn paginationNextContainer inactive"]'

  while (!el_exists(rem_dr, nxt_page_inact)) {

    maybe_close_ads(rem_dr)
    maybe_accept_cookies(rem_dr)

    rem_dr$maxWindowSize()
    btn <- find_el(rem_dr, nxt_page_act)
    click_el(rem_dr, btn) # click "next button"

    maybe_accept_cookies(rem_dr)
    new_tbl <- get_tbl(rem_dr)
    tbl <- rbind(tbl, new_tbl)
    cat(".")
    Sys.sleep(2)
  }

  # put this season's data into the output list
  out_list[[season]] <- tbl
  print(season)
}

Это займет немного времени для запуска.Когда я его запустил, у меня было 6 731 ряд данных (за все сезоны).

Добро пожаловать на сайт PullRequest, где вы можете задавать вопросы и получать ответы от других членов сообщества.
...