Question

I am trying to scrape a website using the following:

industryurl <- "https://finance.yahoo.com/industries"

library(rvest)

read <- read_html(industryurl) %>%
  html_table()

library(plyr)
industries <- ldply(read, data.frame)
industries = industries[-1,]

read <- read_html(industryurl)

industryurls <- html_attr(html_nodes(read, "a"), "href")

links <- industryurls[grep("/industry/", industryurls)]

industryurl <- "https://finance.yahoo.com"

links <- paste0(industryurl, links)
links
##############################################################################################

store <- NULL
tbl <- NULL

for(i in links){
  store[[i]] = read_html(i)
  tbl[[i]] = html_table(store[[i]])
}


#################################################################################################

I am mostly interested in the code between the ########## lines. I want to apply a function instead of the for loop, because I am running into timeout problems with Yahoo and I want to extract this data in a more human-like way (it is not much data).

My question: how can I take links, apply a function to it, and set some sort of delay timer for reading the contents that the for loop currently reads?

I can paste my own version of the for loop, which does not work.

Harro Cyranka · Answer 1 · October 28, 2018

This is the function I came up with:

##First argument is the link you need
##The second argument is the total time for Sys.sleep

extract_function <- function(define_link, define_time){
  print(paste0("The system will stop for: ", define_time, " seconds"))
  Sys.sleep(define_time)
  first <- read_html(define_link)
  print(paste0("It will now return the table for link ", define_link))
  return(html_table(first))
}

##I added the following tryCatch function
##It returns NA instead of stopping when a link fails
link_try_catch <- function(define_link, define_time){
  out <- tryCatch(extract_function(define_link, define_time),
                  error = function(e) NA)
  return(out)
}

##You can now retrieve the data using the links vector in two ways
##Picking the first ten, so it should not crash on link 5

p <- lapply(1:10, function(i)link_try_catch(links[i],1))

##OR (I subset the vector just for demo purposes)

p2 <- lapply(links[1:10], function(i)extract_function(i,1))
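
A possible variation on the two helpers above, as a minimal sketch only: it pauses before every request, retries a failed link a limited number of times with a longer pause each time, and keeps the URL as the name of each result. The name fetch_politely and the arguments delay, retries and fetch are hypothetical, not part of rvest. The default fetch assumes rvest is installed and simply calls read_html() followed by html_table(), as in the functions above.

```r
# Hypothetical helper: fetch every link with a pause and a few retries.
fetch_politely <- function(links, delay = 1, retries = 2,
                           fetch = function(u) rvest::html_table(rvest::read_html(u))) {
  fetch_one <- function(u) {
    for (attempt in seq_len(retries + 1)) {
      # Pause before each request; the pause grows with every retry
      Sys.sleep(delay * attempt)
      # Wrap the result in a list so that an error (NULL) can be told apart
      # from any value that fetch() itself returns
      out <- tryCatch(list(value = fetch(u)), error = function(e) NULL)
      if (!is.null(out)) return(out$value)
    }
    NA  # every attempt failed
  }
  # setNames() makes lapply() return a list named by URL
  lapply(setNames(links, links), fetch_one)
}
```

It could be called as tbl <- fetch_politely(links[1:10], delay = 2). Links that still fail after all retries come back as NA, so they can be spotted afterwards and requested again.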

Hope this helps.

Change a for loop to a function to scrape a website
