Downloading data and saving it to a folder in batches - PullRequest
/ February 20, 2019

I have 200,000 links that I am trying to download. I tried downloading them all in one go, but I ran into memory problems.

I am trying to write a function that downloads 1000 links at a time and saves them to a folder.

Packages:

library(dplyr)
library(purrr)
library(edgarWebR)

A small sample of the data looks like this:

Data 1:

urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm"
)

I then apply the following function to download these 10 links.

parsed_files <- map(urls_to_parse, possibly(parse_filing, otherwise = NA))

The result is stored as a tidy list, so I can then apply names(parsed_files) <- urls_to_parse to name each element after the link it was downloaded from. I can also use output <- plyr::ldply(parsed_files, data.frame) to store everything in a single data frame.

Using the data below, how can I download the links in batches of, say, 10?

What I have so far:

start = 1
end = 100

output <- NULL
output_fin <- NULL

for(i in start:end){
  output[[i]] <- map(urls_to_parse[[i]], possibly(parse_filing, otherwise = NA))
  names(output) <- urls_to_parse[start:end]
  save(output_fin, file = paste0("C:/Users/Downloads/data/",i, "output.RData"))
}

I am sure there is a better way to do this with a function, since this code breaks for some of the results.

Additional data: 100 links

urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746908008126/a2186742z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000110465907055173/a07-18543_110k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000110465906047248/a06-15961_110k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000110465905033688/a05-12324_110k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746904023905/a2140220z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000104746903028005/a2116671z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/1750/000091205702033450/a2087919z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000095012310108231/c61492e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000095015208010514/n48172e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000095013707018659/c22309e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000095013707000193/c11187e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000095013406000594/c01109e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000120677405000032/d16006.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000120677404000013/d13773.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000104746903001075/a2097401z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/61478/000091205702001614/a2067550z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/319126/000115752308008030/a5800571.htm", 
"https://www.sec.gov/Archives/edgar/data/319126/000115752307009801/a5515869.htm", 
"https://www.sec.gov/Archives/edgar/data/319126/000115752306009238/a5227919.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046908000102/alpharmainc_10k.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046907000017/alo10k2006.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046906000027/alo10k2005.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046905000021/alo10k2004final.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046904000058/alo10k2003master.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046903000001/alo10k.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046902000004/alo10k2001.htm", 
"https://www.sec.gov/Archives/edgar/data/730469/000073046901500003/alo.htm", 
"https://www.sec.gov/Archives/edgar/data/4515/000000620118000009/a10k123117.htm", 
"https://www.sec.gov/Archives/edgar/data/4515/000119312517051216/d286458d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/4515/000119312516474605/d78287d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/4515/000119312515061145/d829913d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000000620113000023/amr-10kx20121231.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000119312512063516/d259681d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000095012311014726/d78201e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000000620110000006/ar123109.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000000620109000009/ar120810k.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000000451508000014/ar022010k.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000095013407003888/d43815e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000095013406003715/d33303e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000095013405003726/d22731e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000095013404002668/d12953e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/6201/000104746903013301/a2108197z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/65695/000095013407003823/h42902e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/65695/000095012906002343/h31028e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/65695/000095012905002955/h22337e10vk.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000156459018005085/cece-10k_20171231.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000156459017004264/cece-10k_20161231.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000156459016015157/cece-10k_20151231.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312515095828/d864880d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312514098407/d661608d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312513109153/d444138d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312512119293/d293768d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312511067373/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312510069639/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312509055504/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312508058939/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312507071909/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312506068031/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312505077739/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/3197/000119312504052176/d10k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000110465910047121/a10-16705_110k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000114420409046933/v159572_10k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000110465906060737/a06-19311_110k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000104746905022854/a2162888z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000104746904028585/a2143353z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/2601/000104746903031974/a2119476z10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000143774918010388/avx20180331_10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916317000028/avx-20170331x10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916316000079/avx-20160331x10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916315000024/avx-20150331x10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916314000035/avx-20140331x10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916313000022/avx-20130331x10k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916312000024/avxform10kfy12.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916311000013/avxform10kfy11.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916310000020/avxform10kfy10.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916309000117/form10kfy09.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000192/form10qq1fy09.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000101/form10kfy08.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916307000122/form10kfy07.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916306000102/avxfy06form10-k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916305000094/fy0510k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916304000091/fy0410k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916303000020/fy0310k.htm", 
"https://www.sec.gov/Archives/edgar/data/859163/000085916302000007/r10k-0302.htm", 
"https://www.sec.gov/Archives/edgar/data/7286/000076462218000018/pnw2017123110-k.htm", 
"https://www.sec.gov/Archives/edgar/data/7286/000076462217000010/pnw2016123110-k.htm", 
"https://www.sec.gov/Archives/edgar/data/7286/000076462216000087/pnw2015123110-k.htm", 
"https://www.sec.gov/Archives/edgar/data/7286/000076462215000013/pnw12311410-k.htm", 
"https://www.sec.gov/Archives/edgar/data/7286/000110465914012068/a13-25897_110k.htm"
)

1 Answer

/ February 20, 2019

A plain loop like the one you showed is a poor way to do batch work. If you have thousands of files to download, how do you recover from errors?

Performance depends not only on your computer's configuration; network performance is critical as well.
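One way to make such a job recoverable, given here as a minimal sketch rather than a tested solution for this data: write every batch to its own file and, on a rerun, skip any batch whose file already exists. The names make_batches, run_batches, process_one and the batch_00001.rds file pattern are assumptions made for illustration; process_one stands in for the per-URL work, for example possibly(parse_filing, otherwise = NA) from the question.

```r
# Split a vector into consecutive batches of at most `batch_size` elements.
make_batches <- function(x, batch_size) {
  split(x, ceiling(seq_along(x) / batch_size))
}

# Process every batch and save it to its own .rds file in `out_dir`.
# `process_one` is a placeholder for the per-URL work (an assumption here).
run_batches <- function(urls, batch_size, out_dir, process_one) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  batches <- make_batches(urls, batch_size)
  for (i in seq_along(batches)) {
    out_file <- file.path(out_dir, sprintf("batch_%05d.rds", i))
    if (file.exists(out_file)) next        # already finished in an earlier run
    res <- lapply(batches[[i]], function(u) {
      # A failing URL yields NA instead of stopping the whole batch.
      tryCatch(process_one(u), error = function(e) NA)
    })
    names(res) <- batches[[i]]
    saveRDS(res, out_file)
    rm(res)                                # release the batch before the next one
  }
  invisible(list.files(out_dir, pattern = "\\.rds$", full.names = TRUE))
}
```

Because only one batch is held in memory at a time, this also addresses the memory problem from the question. The saved batches can later be read back one by one with readRDS() and combined as needed.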

Here are a few suggestions.

Option 1

Why use a queue? Because it makes it easy to retry after an error.

Pseudocode


file_url_partitions <- partition_as_batches(all_urls, batch_size)
attempt = 1
while( file_url_partitions is not empty && attempt <= 3 ) {
  batch = file_url_partitions.pop()

  tryCatch({
    download_parallel(batch)
  }, some_exception = function(se) {
    # put the failed batch back on the queue and count the failure
    file_url_partitions.push(batch)
    attempt = attempt + 1
  })
}

Note: I do not have access to RStudio or an R environment right now, so I could not test this.
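The queue idea can be written as runnable R roughly as follows. This is a sketch under assumptions: process_queue and download_batch are hypothetical names, with download_batch standing in for download_parallel from the pseudocode. Unlike the pseudocode, the attempt counter is kept per batch, so one persistently failing batch does not stop the remaining ones.

```r
# Process batches from a queue; a failed batch is pushed to the back of the
# queue and retried until it has been attempted `max_attempts` times.
# `download_batch` is a hypothetical function that handles one batch of URLs.
process_queue <- function(urls, batch_size, download_batch, max_attempts = 3) {
  batches <- split(urls, ceiling(seq_along(urls) / batch_size))
  queue <- lapply(batches, function(b) list(urls = b, attempt = 1))
  done <- list()
  failed <- list()
  while (length(queue) > 0) {
    item <- queue[[1]]                       # pop the batch at the front
    queue <- queue[-1]
    result <- tryCatch(download_batch(item$urls), error = function(e) e)
    if (!inherits(result, "error")) {
      done <- c(done, list(result))
    } else if (item$attempt < max_attempts) {
      item$attempt <- item$attempt + 1
      queue <- c(queue, list(item))          # push to the back for a retry
    } else {
      failed <- c(failed, list(item$urls))   # give up on this batch
    }
  }
  list(done = done, failed = failed)
}
```

The returned failed element lists the batches that never succeeded, so they can be inspected or rerun later.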

Option 2

Download the files separately with a download manager or a similar tool, then work with the downloaded files.
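If you would rather stay inside R than use an external download manager, the same separation of downloading from parsing might look like the sketch below. download_to_folder, the fetch parameter and the file naming scheme are assumptions for illustration; only utils::download.file is a real base R function. Files that already exist are skipped, so the function can simply be rerun after an interruption.

```r
# Download each URL into `dest_dir` unless its file is already there.
# `fetch` defaults to utils::download.file; it is a parameter so that a
# different downloader can be swapped in.
download_to_folder <- function(urls, dest_dir, fetch = utils::download.file) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  # Hypothetical naming scheme: running index plus the last URL segment.
  dest <- file.path(dest_dir, sprintf("%06d_%s", seq_along(urls), basename(urls)))
  ok <- file.exists(dest)
  for (i in which(!ok)) {
    ok[i] <- tryCatch({
      fetch(urls[i], dest[i], quiet = TRUE, mode = "wb")
      TRUE
    }, error = function(e) {
      unlink(dest[i])      # drop a partial file so a rerun retries it
      FALSE
    })
    Sys.sleep(0.1)         # short pause between requests (arbitrary value)
  }
  data.frame(url = urls, file = dest, ok = ok, stringsAsFactors = FALSE)
}
```

The returned data frame records which downloads succeeded; the rows with ok == FALSE are the ones to retry. Parsing can then run over the local files in a separate step.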

Some useful resources: https://www.r-bloggers.com/r-with-parallel-computing-from-user-perspectives/ and http://adv-r.had.co.nz/beyond-exception-handling.html
