I am currently trying to crawl a website. The crawler itself runs fairly smoothly. However, after 3-4 hours of crawling, the script sometimes crashes because the server / internet connection drops.
This is the error message:
2019-09-27 10:53:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 1 pages/min), scraped 4480 items (at 10 items/min)
2019-09-27 10:54:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 1 times): 504 Gateway Time-out
2019-09-27 10:54:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 0 pages/min), scraped 4480 items (at 0 items/min)
2019-09-27 10:55:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 2 times): 504 Gateway Time-out
2019-09-27 10:55:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 0 pages/min), scraped 4480 items (at 0 items/min)
2019-09-27 10:56:00 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 3 times): 504 Gateway Time-out
2019-09-27 10:56:00 [scrapy.core.engine] DEBUG: Crawled (504) <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (referer: https://blogabet.com/tipsters) ['partial']
2019-09-27 10:56:00 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <504 https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480>: HTTP status code is not handled or not allowed
2019-09-27 10:56:00 [scrapy.core.engine] INFO: Closing spider (finished)
I think it makes sense to increase RETRY_TIMES from 2 (the default) to 10. But I don't know how to add it to settings.py, or which part of my code it belongs in. You can find my code below.
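My guess is that it would go into the project's settings.py (presumably aufgehts/settings.py) roughly like this; this is only a sketch of how I imagine it, and the values are just what I would try:

# settings.py -- project-wide settings (sketch, not tested)
RETRY_ENABLED = True                                      # retries are already enabled by default
RETRY_TIMES = 10                                          # raise the retry count from the default of 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]    # 504 is already in the default list
DOWNLOAD_TIMEOUT = 180                                    # seconds before a request counts as timed out

But I am not sure whether that alone is the right place, or whether something also has to change in the spider itself.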
# -*- coding: utf-8 -*-
# Importing Scrapy is mandatory
import scrapy
# Request is imported because the headers have to be sent along (depends on the website)
from scrapy import Request
from aufgehts.items import AufgehtsItem
from scrapy.spiders import CrawlSpider, Rule
import re


class BlogmeSpider(scrapy.Spider):
    name = 'blogme'

    def start_requests(self):
        url = "https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=picks&f[start]=0"
        headers = {
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-US,en;q=0.9,pl;q=0.8,de;q=0.7',
            'Connection': 'keep-alive',
            'Host': 'blogabet.com',
            'Referer': 'https://blogabet.com/tipsters',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest'
        }
        yield Request(url, headers=headers)

    def parse(self, response):
        # Each tipster block on the listing page yields one item
        listenings = response.xpath('//*[@class="block row no-padding-lg tipster-block"]')
        for listening in listenings:
            username = listening.xpath('.//h3[@class="name-t u-db u-mb1"]/strong/text()').extract_first()
            link = listening.xpath('.//*[@class="e-mail u-db u-mb1 text-ellipsis"]/a/@href').extract_first()
            yield {'Username': username,
                   'Link': link}

        # The "load more" button carries the offset of the next page in its onclick attribute
        next_page_number = response.xpath('//*[@class="btn btn-danger"]/@onclick').re_first(r'-?\d+')
        url = "https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=picks&f[start]="
        if next_page_number:
            next_page_url = url + next_page_number
            headers = {
                'Accept': '*/*',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'en-US,en;q=0.9,pl;q=0.8,de;q=0.7',
                'Connection': 'keep-alive',
                'Host': 'blogabet.com',
                'Referer': 'https://blogabet.com/tipsters',
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
                'X-Requested-With': 'XMLHttpRequest'
            }
            yield Request(next_page_url, headers=headers, callback=self.parse)
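Or, if it belongs in the spider itself rather than in settings.py, I assume the same thing could be expressed as a per-spider override via custom_settings (again only a guess, not tested):

class BlogmeSpider(scrapy.Spider):
    name = 'blogme'
    # Per-spider override of the same retry settings instead of putting them in settings.py
    custom_settings = {
        'RETRY_TIMES': 10,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408],
    }
    # ... start_requests() and parse() unchanged from above ...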