How to integrate a custom download handler into Scrapy using Python - PullRequest
0 votes
/ 19 June 2020

I'm new to programming, so my knowledge is limited; apologies in advance for any silly mistakes.

I'm trying to scrape items from a website protected by Cloudflare...

Because of this I have to pass a cookie value, a user agent (both of which I retrieve from the cloudscraper API), and the proxy I used to obtain those keys from cloudscraper.

I plan to pass all of these keys to a custom download handler that runs a Firefox Selenium WebDriver. The idea is to use Firefox's JS capabilities together with these keys so that I go undetected, using a unique cookie, user agent, and proxy for each request.
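The per-request keys would travel in `request.meta`, which is where the commented-out lines in the handler below read them from. A minimal sketch of how that bundle could be built (the `build_meta` helper and its exact key names are my own illustration, not part of the project):

```python
# Hypothetical sketch: bundle the keys retrieved from cloudscraper so they
# can be attached to each Scrapy request via request.meta. The key names
# ('ua', 'proxy', '__cfduid') match what the handler below expects to read.
def build_meta(user_agent, proxy, cfduid):
    """Return the meta dict a request would carry for the download handler."""
    return {
        "ua": user_agent,       # read as request.meta['ua']
        "proxy": proxy,         # read as request.meta['proxy']
        "__cfduid": cfduid,     # read as request.meta['__cfduid']
    }

meta = build_meta(
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/77.0",
    "162.211.122.78:5836",
    "df8b1a45e028713c1d91006a180c101db1592572249",
)
# In the spider this would then be passed along with the URL, e.g.:
#   yield scrapy.Request(url, meta=meta)
```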

I have the following code in my middlewares.py file, which works as expected and seems to load the page fine while staying undetected (I know this because I can see the page load fully, without errors, when I run Firefox with headless mode disabled):

class JSDownloadHandler(object):

    def download_request(self, request, spider):
        #       RETRIEVE KEYS       #
        # ProxyKeys(self, request, spider)
        # AgentCookieKeys(self, request, spider)

        #       USER AGENT & SETTINGS        #
        profile = webdriver.FirefoxProfile()
        # profile.set_preference("general.useragent.override", "{USER_AGENT}".format(USER_AGENT=request.meta['ua']))
        profile.set_preference("general.useragent.override", "{USER_AGENT}".format(USER_AGENT='Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.1785.251 Mobile Safari/537.36'))
        profile.set_preference("dom.webdriver.enabled", False)
        profile.set_preference('useAutomationExtension', False)

        #       HEADLESS BROWSER        #
        # options = FirefoxOptions()
        # options.add_argument("--headless")

        #       PROXY SETTINGS      #
        # proxy = "{PROXY}".format(PROXY=request.meta['proxy'])
        proxy = "{PROXY}".format(PROXY='162.211.122.78:5836')
        firefox_capabilities = webdriver.DesiredCapabilities.FIREFOX
        firefox_capabilities['marionette'] = True
        firefox_capabilities['proxy'] = {
            "proxyType": "MANUAL",
            "httpProxy": proxy,
            "ftpProxy": proxy,
            "sslProxy": proxy
        }

        #       INSTANTIATE BROWSER     #
        self.driver = webdriver.Firefox(executable_path='/Users/lewbra/Desktop/geckodriver', capabilities=firefox_capabilities, firefox_profile=profile)
        self.driver.set_window_size(1366, 768)

        # RANDOM WAIT TIME GENERATOR [seconds] / lowest value = 2 / highest value = 8 / (3 d.p.)
        WAIT = round(random.uniform(2, 8), 3)

        #       GET URL HTML     #
        self.driver.get(request.url)
        # self.driver.add_cookie({"name": "__cfduid", "value": "{VALUE}".format(VALUE=request.meta['__cfduid'])})
        self.driver.add_cookie({"name": "__cfduid", "value": "{VALUE}".format(VALUE='df8b1a45e028713c1d91006a180c101db1592572249')})
        body = self.driver.page_source

        time.sleep(WAIT)
        loguru.logger.info("PAGE HTML EXTRACTED SUCCESSFULLY... /nn __CLOSING DRIVER__")
        self.driver.quit()

        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)

The commented-out lines containing the UA, proxy, and cookie are the lines of code I intend to use to retrieve the keys from the API; in the line below each of them the keys are entered manually for testing purposes, so that I could rule out key retrieval as the source of the problem.
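One detail worth noting in the handler above: `self.driver.current_url` is read on the return line after `self.driver.quit()` has already closed the session. A small self-contained sketch with a fake driver (my own stand-in class, not Selenium) shows why the URL has to be captured while the session is still alive:

```python
class FakeDriver:
    """Minimal stand-in for a Selenium driver, illustrating that
    current_url must be read before quit() closes the session."""
    def __init__(self, url):
        self._url = url
        self._alive = True

    @property
    def current_url(self):
        if not self._alive:
            raise RuntimeError("session already closed")
        return self._url

    def quit(self):
        self._alive = False

driver = FakeDriver("https://stockx.com/sneakers")
url = driver.current_url   # capture while the session is still alive
driver.quit()
# reading driver.current_url here would fail: the session is gone
```

In the real handler the equivalent fix would be to store `current_url` (and `page_source`) in local variables before calling `quit()`, and build the `HtmlResponse` from those.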

I registered the download handler in settings.py using:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': None,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': None,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': None,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
}

DOWNLOAD_HANDLERS = {
    'http': 'sxsneakers.middlewares.JSDownloadHandler',
    'https': 'sxsneakers.middlewares.JSDownloadHandler',
}

When I run the code I get the following output:

(scrape) lewbra@Lewiss-MacBook-Pro sxsneakers % scrapy crawl SXSneakers
2020-06-19 15:47:00 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: sxsneakers)
2020-06-19 15:47:00 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (v3.8.2:7b3ab5921f, Feb 24 2020, 17:52:18) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform macOS-10.15.4-x86_64-i386-64bit
2020-06-19 15:47:00 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-06-19 15:47:00 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'AUTOTHROTTLE_START_DELAY': 3,
 'BOT_NAME': 'sxsneakers',
 'CONCURRENT_REQUESTS': 5,
 'DOWNLOAD_DELAY': 3,
 'NEWSPIDER_MODULE': 'sxsneakers.spiders',
 'RETRY_TIMES': 5,
 'SPIDER_MODULES': ['sxsneakers.spiders']}
2020-06-19 15:47:00 [scrapy.extensions.telnet] INFO: Telnet Password: 383c71ee4edc8b10
2020-06-19 15:47:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2020-06-19 15:47:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-06-19 15:47:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-06-19 15:47:01 [scrapy.middleware] INFO: Enabled item pipelines:
['sxsneakers.pipelines.SxsneakersPipeline']
2020-06-19 15:47:01 [scrapy.core.engine] INFO: Spider opened
2020-06-19 15:47:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-19 15:47:01 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-19 15:47:01 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54293/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "firefox", "acceptInsecureCerts": true, "proxy": {"proxyType": "manual", "httpProxy": "162.211.122.78:5836", "ftpProxy": "162.211.122.78:5836", "sslProxy": "162.211.122.78:5836"}, "moz:firefoxOptions": {"profile": "UEsDBBQAAAAIAOB901AsLaQrFAQAANgNAAAHAAAAdXNlci5qc6VWWXPbNhB+76/g+MmeiWDJrpJM/aQ47jHx0bHq8SMHBJYiLBBAcYhSfn0XpOioDg+lfePxfVjs9e0GBzY1FvLTk8zqCt8I1yWphOK6IjyUhoCimQR+8i7xNsDZ1U/he46kHqwN6gCcU+l60AoqTzMSrETcCc108L9kkqr1yRDe0BUcd77z1PpgSGQgcjqMKnQJe+QRd3nLSvUGPRccSOm8VvUpYqW0hZ4D8BN/UM91gN1sSoTyVj893g6bj0kp6TZlhUXDqWNWGJ9ixFMvymj0ctpNOA5pwWjrUypl+uJS2DIwXmjlOpP+Qje0OZboBkZcoas/1DU+agmdJAW+0nZNGEXKBibRHpUTBGwEG8mrAxas8DvCXFuOPcC3yalAshiv0VIbYhLKuYh+0pFDKsi4RedsSlmMYBowt8F54CkD67ujeUByLkSLrySBH8COsJp4pIrGsKawAdVjaAUKLJWxAC16qDB7+9KNXt3pr0JKej4n0+T0VqiwvUoWilstePKeTK+Se9gGl8yTT0FIfn73uJh/vD9LFsZIeIbsi/Dn88sP5PJ9cvrl97/ubt8lUqwh+Q3YWp8l13XZns9nZEpmHz7OycV8ltzpTEhIljSnVuzZHWX/6upwkeDjInhd0pipm60H5fChG0uNIcFwFC1CkTMKOkp1uHBG0h3wm9l0ea89ljUSfu4B60pJTTn2qMJk2LqDngtQy1iIQq2Gbd083S4O09evzDXykjQFOabiQq2xpUGhAqCeN/V+MYZFdW7GRT9Y5znie5q2BVmgPEYGPDC86a/COr/AWLAeRXlVZJpD/YxRO3I+HDJKKitqj50sQC0r9mUxBpW5CybqXDvp+kYKuFiqOD7wGhZqEcixX1JmqStGrBQgJWEFttlnyGmQ/lPzZ5iG49QRdFs9qGup3YgrB+gHzHg3GCNCmzESo1oAlb5o3slRIj/ADya2ys1/ZLfW81hQj+HHry/1ahU/sma8/cA9jJaC7Uj8uAxZKeo8/1/+n/XHxbeG7joGNl5r6QgqhLb7iw9uclFpUcHqWVK3NceRq1ZpjHy3DWg11tUi+rkhLxmy4wCaTXvhmdRsLYXzwxVxwKgL/FqXBtUdh0bcBZRYFV7uRqnHSPj3cIX6ne+eehvpYAcaimq79TQqP9nr4AS3DR9c98ktpfAeN9hCuGI3ib8NrgcTCWrloyRczOf/prVH49TCfEipqzTbpbzRhM67GZQY413s7rTUHDeD45awHIc2PhFtBbZF2tRo3CtH8M5bwXz6ltZlKi7QCqWwhDIDu6xfe8IVy3wtPNmHrR2n+1LudLyleJBowdsdaUJRe34xhh0M0vdwCy/1OOu8ydYIlH3MFokuY02g+iPj7yBsn4VvlKoQaCZ2UQfjH1BLAQIUAxQAAAAIAOB901AsLaQrFAQAANgNAAAHAAAAAAAAAAAAAACkgQAAAAB1c2VyLmpzUEsFBgAAAAABAAEANQAAADkEAAAAAA=="}}}, "desiredCapabilities": {"browserName": "firefox", 
"acceptInsecureCerts": true, "proxy": {"proxyType": "MANUAL", "httpProxy": "162.211.122.78:5836", "ftpProxy": "162.211.122.78:5836", "sslProxy": "162.211.122.78:5836"}, "marionette": true, "moz:firefoxOptions": {"profile": "UEsDBBQAAAAIAOB901AsLaQrFAQAANgNAAAHAAAAdXNlci5qc6VWWXPbNhB+76/g+MmeiWDJrpJM/aQ47jHx0bHq8SMHBJYiLBBAcYhSfn0XpOioDg+lfePxfVjs9e0GBzY1FvLTk8zqCt8I1yWphOK6IjyUhoCimQR+8i7xNsDZ1U/he46kHqwN6gCcU+l60AoqTzMSrETcCc108L9kkqr1yRDe0BUcd77z1PpgSGQgcjqMKnQJe+QRd3nLSvUGPRccSOm8VvUpYqW0hZ4D8BN/UM91gN1sSoTyVj893g6bj0kp6TZlhUXDqWNWGJ9ixFMvymj0ctpNOA5pwWjrUypl+uJS2DIwXmjlOpP+Qje0OZboBkZcoas/1DU+agmdJAW+0nZNGEXKBibRHpUTBGwEG8mrAxas8DvCXFuOPcC3yalAshiv0VIbYhLKuYh+0pFDKsi4RedsSlmMYBowt8F54CkD67ujeUByLkSLrySBH8COsJp4pIrGsKawAdVjaAUKLJWxAC16qDB7+9KNXt3pr0JKej4n0+T0VqiwvUoWilstePKeTK+Se9gGl8yTT0FIfn73uJh/vD9LFsZIeIbsi/Dn88sP5PJ9cvrl97/ubt8lUqwh+Q3YWp8l13XZns9nZEpmHz7OycV8ltzpTEhIljSnVuzZHWX/6upwkeDjInhd0pipm60H5fChG0uNIcFwFC1CkTMKOkp1uHBG0h3wm9l0ea89ljUSfu4B60pJTTn2qMJk2LqDngtQy1iIQq2Gbd083S4O09evzDXykjQFOabiQq2xpUGhAqCeN/V+MYZFdW7GRT9Y5znie5q2BVmgPEYGPDC86a/COr/AWLAeRXlVZJpD/YxRO3I+HDJKKitqj50sQC0r9mUxBpW5CybqXDvp+kYKuFiqOD7wGhZqEcixX1JmqStGrBQgJWEFttlnyGmQ/lPzZ5iG49QRdFs9qGup3YgrB+gHzHg3GCNCmzESo1oAlb5o3slRIj/ADya2ys1/ZLfW81hQj+HHry/1ahU/sma8/cA9jJaC7Uj8uAxZKeo8/1/+n/XHxbeG7joGNl5r6QgqhLb7iw9uclFpUcHqWVK3NceRq1ZpjHy3DWg11tUi+rkhLxmy4wCaTXvhmdRsLYXzwxVxwKgL/FqXBtUdh0bcBZRYFV7uRqnHSPj3cIX6ne+eehvpYAcaimq79TQqP9nr4AS3DR9c98ktpfAeN9hCuGI3ib8NrgcTCWrloyRczOf/prVH49TCfEipqzTbpbzRhM67GZQY413s7rTUHDeD45awHIc2PhFtBbZF2tRo3CtH8M5bwXz6ltZlKi7QCqWwhDIDu6xfe8IVy3wtPNmHrR2n+1LudLyleJBowdsdaUJRe34xhh0M0vdwCy/1OOu8ydYIlH3MFokuY02g+iPj7yBsn4VvlKoQaCZ2UQfjH1BLAQIUAxQAAAAIAOB901AsLaQrFAQAANgNAAAHAAAAAAAAAAAAAACkgQAAAAB1c2VyLmpzUEsFBgAAAAABAAEANQAAADkEAAAAAA=="}}}
2020-06-19 15:47:01 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 127.0.0.1:54293
2020-06-19 15:47:04 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "POST /session HTTP/1.1" 200 867
2020-06-19 15:47:04 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:47:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54293/session/69979619-6f4b-8d44-ba38-328c7e6d6a48/window/rect {"x": null, "y": null, "width": 1366, "height": 768}
2020-06-19 15:47:04 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "POST /session/69979619-6f4b-8d44-ba38-328c7e6d6a48/window/rect HTTP/1.1" 200 50
2020-06-19 15:47:04 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:47:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54293/session/69979619-6f4b-8d44-ba38-328c7e6d6a48/url {"url": "https://stockx.com/sneakers"}
2020-06-19 15:48:40 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "POST /session/69979619-6f4b-8d44-ba38-328c7e6d6a48/url HTTP/1.1" 200 14
2020-06-19 15:48:40 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:48:40 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54293/session/69979619-6f4b-8d44-ba38-328c7e6d6a48/cookie {"cookie": {"name": "__cfduid", "value": "df8b1a45e028713c1d91006a180c101db1592572249"}}
2020-06-19 15:48:40 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "POST /session/69979619-6f4b-8d44-ba38-328c7e6d6a48/cookie HTTP/1.1" 200 14
2020-06-19 15:48:40 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:48:40 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54293/session/69979619-6f4b-8d44-ba38-328c7e6d6a48/source {}
2020-06-19 15:48:40 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "GET /session/69979619-6f4b-8d44-ba38-328c7e6d6a48/source HTTP/1.1" 200 935477
2020-06-19 15:48:40 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:48:44.635 | INFO     | sxsneakers.middlewares:download_request:214 - PAGE HTML EXTRACTED SUCCESSFULLY... /nn __CLOSING DRIVER__
2020-06-19 15:48:44 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:54293/session/69979619-6f4b-8d44-ba38-328c7e6d6a48 {}
2020-06-19 15:48:48 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "DELETE /session/69979619-6f4b-8d44-ba38-328c7e6d6a48 HTTP/1.1" 200 14
2020-06-19 15:48:48 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:48:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-19 15:48:48 [scrapy.core.scraper] ERROR: Error downloading <GET https://stockx.com/sneakers>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
    result = f(*args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 76, in download_request
    return handler.download_request(request, spider)
  File "/Users/lewbra/Desktop/sxsneakers/sxsneakers/middlewares.py", line 217, in download_request
    return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
NameError: name 'download_request' is not defined
2020-06-19 15:48:48 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-19 15:48:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/builtins.NameError': 1,
 'downloader/request_bytes': 76,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'elapsed_time_seconds': 107.342463,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 6, 19, 14, 48, 48, 434159),
 'log_count/DEBUG': 19,
 'log_count/ERROR': 1,
 'log_count/INFO': 11,
 'memusage/max': 55971840,
 'memusage/startup': 55443456,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 6, 19, 14, 47, 1, 91696)}
2020-06-19 15:48:48 [scrapy.core.engine] INFO: Spider closed (finished)

Any help figuring out what the problem is would be greatly appreciated, thanks in advance!!
