I'm new to programming, so my knowledge is limited; apologies in advance for any silly mistakes.
I'm trying to scrape elements from a website protected by Cloudflare...
Because of this, I have to pass a cookie value, a user agent (both of which I retrieve from the cloudscraper API), and the proxy I used to obtain those keys from cloudscraper.
I plan to pass all of these keys together to a custom download handler that runs on the Firefox Selenium WebDriver. This is so that I can use Firefox's JS capabilities together with these keys and go undetected, with a unique cookie, user agent and proxy for each request.
I have the following code in my middlewares.py file, which works as expected: it seems to load the page fine and stays undetected (I know this because I can see the page load fully, without errors, when I run Firefox with headless mode disabled):
# Imports this excerpt relies on (they live at the top of middlewares.py)
import random
import time

import loguru
from scrapy.http import HtmlResponse
from selenium import webdriver


class JSDownloadHandler(object):

    def download_request(self, request, spider):
        # RETRIEVE KEYS #
        # ProxyKeys(self, request, spider)
        # AgentCookieKeys(self, request, spider)

        # USER AGENT & SETTINGS #
        profile = webdriver.FirefoxProfile()
        # profile.set_preference("general.useragent.override", "{USER_AGENT}".format(USER_AGENT=request.meta['ua']))
        profile.set_preference("general.useragent.override", "{USER_AGENT}".format(USER_AGENT='Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.1785.251 Mobile Safari/537.36'))
        profile.set_preference("dom.webdriver.enabled", False)
        profile.set_preference('useAutomationExtension', False)

        # HEADLESS BROWSER #
        # options = FirefoxOptions()
        # options.add_argument("--headless")

        # PROXY SETTINGS #
        # proxy = "{PROXY}".format(PROXY=request.meta['proxy'])
        proxy = "{PROXY}".format(PROXY='162.211.122.78:5836')
        firefox_capabilities = webdriver.DesiredCapabilities.FIREFOX
        firefox_capabilities['marionette'] = True
        firefox_capabilities['proxy'] = {
            "proxyType": "MANUAL",
            "httpProxy": proxy,
            "ftpProxy": proxy,
            "sslProxy": proxy
        }

        # INSTANTIATE BROWSER #
        self.driver = webdriver.Firefox(executable_path='/Users/lewbra/Desktop/geckodriver', capabilities=firefox_capabilities, firefox_profile=profile)
        self.driver.set_window_size(1366, 768)

        # RANDOM WAIT TIME GENERATOR [seconds] / lowest value = 2 / highest value = 8 / (3 d.p.)
        WAIT = round(random.uniform(2, 8), 3)

        # GET URL HTML #
        self.driver.get(request.url)
        # self.driver.add_cookie({"name": "__cfduid", "value": "{VALUE}".format(VALUE=request.meta['__cfduid'])})
        self.driver.add_cookie({"name": "__cfduid", "value": "{VALUE}".format(VALUE='df8b1a45e028713c1d91006a180c101db1592572249')})
        body = self.driver.page_source
        time.sleep(WAIT)
        loguru.logger.info("PAGE HTML EXTRACTED SUCCESSFULLY... /nn __CLOSING DRIVER__")
        self.driver.quit()
        return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
The commented-out lines that reference the UA, proxy and cookie are the ones I intend to use to pull the keys from the API; in the lines directly below them the keys are hard-coded for testing, so that I could rule out key retrieval as the source of the problem.
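To make the intended switch from hard-coded values to per-request keys concrete, here is a minimal sketch of collecting them from a meta-like dict. The key names 'proxy', 'ua' and '__cfduid' mirror the commented-out lines in the handler; the helper `build_browser_settings` and the sample user agent string are hypothetical and only illustrate the shape of the data, not how the keys are actually fetched.

```python
from typing import Any, Dict


def build_browser_settings(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the per-request proxy, user agent and cookie from a meta-like dict.

    Hypothetical helper: the key names follow the commented-out lines
    in the handler (request.meta['proxy'], ['ua'], ['__cfduid']).
    """
    missing = [key for key in ("proxy", "ua", "__cfduid") if key not in meta]
    if missing:
        # Fail early with a readable message instead of deep inside Selenium.
        raise KeyError("request.meta is missing: " + ", ".join(missing))

    proxy = meta["proxy"]
    return {
        "user_agent": meta["ua"],
        # Same shape as the firefox_capabilities['proxy'] dict in the handler.
        "proxy": {
            "proxyType": "MANUAL",
            "httpProxy": proxy,
            "ftpProxy": proxy,
            "sslProxy": proxy,
        },
        # Same shape as the dict passed to driver.add_cookie() in the handler.
        "cookie": {"name": "__cfduid", "value": meta["__cfduid"]},
    }


# Example with the test values from the handler (the user agent is a placeholder).
settings = build_browser_settings({
    "proxy": "162.211.122.78:5836",
    "ua": "example-user-agent",
    "__cfduid": "df8b1a45e028713c1d91006a180c101db1592572249",
})
print(settings["proxy"]["httpProxy"])
```

Checking for missing keys up front would make it easier to tell a key-retrieval problem apart from a browser problem.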
I registered the download handler in settings.py using:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': None,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': None,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': None,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
}

DOWNLOAD_HANDLERS = {
    'http': 'sxsneakers.middlewares.JSDownloadHandler',
    'https': 'sxsneakers.middlewares.JSDownloadHandler',
}
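As a rough illustration of what the DOWNLOAD_HANDLERS mapping does, the sketch below looks up a handler path by the URL's scheme. It is a simplified stand-in written with the standard library, not Scrapy's actual implementation, and the function name `handler_path_for` is hypothetical.

```python
from urllib.parse import urlparse

# Same mapping as in settings.py above.
DOWNLOAD_HANDLERS = {
    "http": "sxsneakers.middlewares.JSDownloadHandler",
    "https": "sxsneakers.middlewares.JSDownloadHandler",
}


def handler_path_for(url: str) -> str:
    """Return the dotted path of the handler configured for the URL's scheme."""
    scheme = urlparse(url).scheme
    try:
        return DOWNLOAD_HANDLERS[scheme]
    except KeyError:
        raise ValueError(f"no download handler configured for scheme {scheme!r}") from None


print(handler_path_for("https://stockx.com/sneakers"))
# prints: sxsneakers.middlewares.JSDownloadHandler
```

With both schemes pointing at the same class, every http and https request in this project would be routed through the Selenium-based handler.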
When I run the code, I get the following output:
(scrape) lewbra@Lewiss-MacBook-Pro sxsneakers % scrapy crawl SXSneakers
2020-06-19 15:47:00 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: sxsneakers)
2020-06-19 15:47:00 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (v3.8.2:7b3ab5921f, Feb 24 2020, 17:52:18) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform macOS-10.15.4-x86_64-i386-64bit
2020-06-19 15:47:00 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-06-19 15:47:00 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
'AUTOTHROTTLE_START_DELAY': 3,
'BOT_NAME': 'sxsneakers',
'CONCURRENT_REQUESTS': 5,
'DOWNLOAD_DELAY': 3,
'NEWSPIDER_MODULE': 'sxsneakers.spiders',
'RETRY_TIMES': 5,
'SPIDER_MODULES': ['sxsneakers.spiders']}
2020-06-19 15:47:00 [scrapy.extensions.telnet] INFO: Telnet Password: 383c71ee4edc8b10
2020-06-19 15:47:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2020-06-19 15:47:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-06-19 15:47:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-06-19 15:47:01 [scrapy.middleware] INFO: Enabled item pipelines:
['sxsneakers.pipelines.SxsneakersPipeline']
2020-06-19 15:47:01 [scrapy.core.engine] INFO: Spider opened
2020-06-19 15:47:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-19 15:47:01 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-19 15:47:01 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54293/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "firefox", "acceptInsecureCerts": true, "proxy": {"proxyType": "manual", "httpProxy": "162.211.122.78:5836", "ftpProxy": "162.211.122.78:5836", "sslProxy": "162.211.122.78:5836"}, "moz:firefoxOptions": {"profile": "UEsDBBQAAAAIAOB901AsLaQrFAQAANgNAAAHAAAAdXNlci5qc6VWWXPbNhB+76/g+MmeiWDJrpJM/aQ47jHx0bHq8SMHBJYiLBBAcYhSfn0XpOioDg+lfePxfVjs9e0GBzY1FvLTk8zqCt8I1yWphOK6IjyUhoCimQR+8i7xNsDZ1U/he46kHqwN6gCcU+l60AoqTzMSrETcCc108L9kkqr1yRDe0BUcd77z1PpgSGQgcjqMKnQJe+QRd3nLSvUGPRccSOm8VvUpYqW0hZ4D8BN/UM91gN1sSoTyVj893g6bj0kp6TZlhUXDqWNWGJ9ixFMvymj0ctpNOA5pwWjrUypl+uJS2DIwXmjlOpP+Qje0OZboBkZcoas/1DU+agmdJAW+0nZNGEXKBibRHpUTBGwEG8mrAxas8DvCXFuOPcC3yalAshiv0VIbYhLKuYh+0pFDKsi4RedsSlmMYBowt8F54CkD67ujeUByLkSLrySBH8COsJp4pIrGsKawAdVjaAUKLJWxAC16qDB7+9KNXt3pr0JKej4n0+T0VqiwvUoWilstePKeTK+Se9gGl8yTT0FIfn73uJh/vD9LFsZIeIbsi/Dn88sP5PJ9cvrl97/ubt8lUqwh+Q3YWp8l13XZns9nZEpmHz7OycV8ltzpTEhIljSnVuzZHWX/6upwkeDjInhd0pipm60H5fChG0uNIcFwFC1CkTMKOkp1uHBG0h3wm9l0ea89ljUSfu4B60pJTTn2qMJk2LqDngtQy1iIQq2Gbd083S4O09evzDXykjQFOabiQq2xpUGhAqCeN/V+MYZFdW7GRT9Y5znie5q2BVmgPEYGPDC86a/COr/AWLAeRXlVZJpD/YxRO3I+HDJKKitqj50sQC0r9mUxBpW5CybqXDvp+kYKuFiqOD7wGhZqEcixX1JmqStGrBQgJWEFttlnyGmQ/lPzZ5iG49QRdFs9qGup3YgrB+gHzHg3GCNCmzESo1oAlb5o3slRIj/ADya2ys1/ZLfW81hQj+HHry/1ahU/sma8/cA9jJaC7Uj8uAxZKeo8/1/+n/XHxbeG7joGNl5r6QgqhLb7iw9uclFpUcHqWVK3NceRq1ZpjHy3DWg11tUi+rkhLxmy4wCaTXvhmdRsLYXzwxVxwKgL/FqXBtUdh0bcBZRYFV7uRqnHSPj3cIX6ne+eehvpYAcaimq79TQqP9nr4AS3DR9c98ktpfAeN9hCuGI3ib8NrgcTCWrloyRczOf/prVH49TCfEipqzTbpbzRhM67GZQY413s7rTUHDeD45awHIc2PhFtBbZF2tRo3CtH8M5bwXz6ltZlKi7QCqWwhDIDu6xfe8IVy3wtPNmHrR2n+1LudLyleJBowdsdaUJRe34xhh0M0vdwCy/1OOu8ydYIlH3MFokuY02g+iPj7yBsn4VvlKoQaCZ2UQfjH1BLAQIUAxQAAAAIAOB901AsLaQrFAQAANgNAAAHAAAAAAAAAAAAAACkgQAAAAB1c2VyLmpzUEsFBgAAAAABAAEANQAAADkEAAAAAA=="}}}, "desiredCapabilities": {"browserName": "firefox", 
"acceptInsecureCerts": true, "proxy": {"proxyType": "MANUAL", "httpProxy": "162.211.122.78:5836", "ftpProxy": "162.211.122.78:5836", "sslProxy": "162.211.122.78:5836"}, "marionette": true, "moz:firefoxOptions": {"profile": "UEsDBBQAAAAIAOB901AsLaQrFAQAANgNAAAHAAAAdXNlci5qc6VWWXPbNhB+76/g+MmeiWDJrpJM/aQ47jHx0bHq8SMHBJYiLBBAcYhSfn0XpOioDg+lfePxfVjs9e0GBzY1FvLTk8zqCt8I1yWphOK6IjyUhoCimQR+8i7xNsDZ1U/he46kHqwN6gCcU+l60AoqTzMSrETcCc108L9kkqr1yRDe0BUcd77z1PpgSGQgcjqMKnQJe+QRd3nLSvUGPRccSOm8VvUpYqW0hZ4D8BN/UM91gN1sSoTyVj893g6bj0kp6TZlhUXDqWNWGJ9ixFMvymj0ctpNOA5pwWjrUypl+uJS2DIwXmjlOpP+Qje0OZboBkZcoas/1DU+agmdJAW+0nZNGEXKBibRHpUTBGwEG8mrAxas8DvCXFuOPcC3yalAshiv0VIbYhLKuYh+0pFDKsi4RedsSlmMYBowt8F54CkD67ujeUByLkSLrySBH8COsJp4pIrGsKawAdVjaAUKLJWxAC16qDB7+9KNXt3pr0JKej4n0+T0VqiwvUoWilstePKeTK+Se9gGl8yTT0FIfn73uJh/vD9LFsZIeIbsi/Dn88sP5PJ9cvrl97/ubt8lUqwh+Q3YWp8l13XZns9nZEpmHz7OycV8ltzpTEhIljSnVuzZHWX/6upwkeDjInhd0pipm60H5fChG0uNIcFwFC1CkTMKOkp1uHBG0h3wm9l0ea89ljUSfu4B60pJTTn2qMJk2LqDngtQy1iIQq2Gbd083S4O09evzDXykjQFOabiQq2xpUGhAqCeN/V+MYZFdW7GRT9Y5znie5q2BVmgPEYGPDC86a/COr/AWLAeRXlVZJpD/YxRO3I+HDJKKitqj50sQC0r9mUxBpW5CybqXDvp+kYKuFiqOD7wGhZqEcixX1JmqStGrBQgJWEFttlnyGmQ/lPzZ5iG49QRdFs9qGup3YgrB+gHzHg3GCNCmzESo1oAlb5o3slRIj/ADya2ys1/ZLfW81hQj+HHry/1ahU/sma8/cA9jJaC7Uj8uAxZKeo8/1/+n/XHxbeG7joGNl5r6QgqhLb7iw9uclFpUcHqWVK3NceRq1ZpjHy3DWg11tUi+rkhLxmy4wCaTXvhmdRsLYXzwxVxwKgL/FqXBtUdh0bcBZRYFV7uRqnHSPj3cIX6ne+eehvpYAcaimq79TQqP9nr4AS3DR9c98ktpfAeN9hCuGI3ib8NrgcTCWrloyRczOf/prVH49TCfEipqzTbpbzRhM67GZQY413s7rTUHDeD45awHIc2PhFtBbZF2tRo3CtH8M5bwXz6ltZlKi7QCqWwhDIDu6xfe8IVy3wtPNmHrR2n+1LudLyleJBowdsdaUJRe34xhh0M0vdwCy/1OOu8ydYIlH3MFokuY02g+iPj7yBsn4VvlKoQaCZ2UQfjH1BLAQIUAxQAAAAIAOB901AsLaQrFAQAANgNAAAHAAAAAAAAAAAAAACkgQAAAAB1c2VyLmpzUEsFBgAAAAABAAEANQAAADkEAAAAAA=="}}}
2020-06-19 15:47:01 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): 127.0.0.1:54293
2020-06-19 15:47:04 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "POST /session HTTP/1.1" 200 867
2020-06-19 15:47:04 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:47:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54293/session/69979619-6f4b-8d44-ba38-328c7e6d6a48/window/rect {"x": null, "y": null, "width": 1366, "height": 768}
2020-06-19 15:47:04 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "POST /session/69979619-6f4b-8d44-ba38-328c7e6d6a48/window/rect HTTP/1.1" 200 50
2020-06-19 15:47:04 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:47:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54293/session/69979619-6f4b-8d44-ba38-328c7e6d6a48/url {"url": "https://stockx.com/sneakers"}
2020-06-19 15:48:40 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "POST /session/69979619-6f4b-8d44-ba38-328c7e6d6a48/url HTTP/1.1" 200 14
2020-06-19 15:48:40 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:48:40 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:54293/session/69979619-6f4b-8d44-ba38-328c7e6d6a48/cookie {"cookie": {"name": "__cfduid", "value": "df8b1a45e028713c1d91006a180c101db1592572249"}}
2020-06-19 15:48:40 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "POST /session/69979619-6f4b-8d44-ba38-328c7e6d6a48/cookie HTTP/1.1" 200 14
2020-06-19 15:48:40 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:48:40 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:54293/session/69979619-6f4b-8d44-ba38-328c7e6d6a48/source {}
2020-06-19 15:48:40 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "GET /session/69979619-6f4b-8d44-ba38-328c7e6d6a48/source HTTP/1.1" 200 935477
2020-06-19 15:48:40 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:48:44.635 | INFO | sxsneakers.middlewares:download_request:214 - PAGE HTML EXTRACTED SUCCESSFULLY... /nn __CLOSING DRIVER__
2020-06-19 15:48:44 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:54293/session/69979619-6f4b-8d44-ba38-328c7e6d6a48 {}
2020-06-19 15:48:48 [urllib3.connectionpool] DEBUG: http://127.0.0.1:54293 "DELETE /session/69979619-6f4b-8d44-ba38-328c7e6d6a48 HTTP/1.1" 200 14
2020-06-19 15:48:48 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2020-06-19 15:48:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-19 15:48:48 [scrapy.core.scraper] ERROR: Error downloading <GET https://stockx.com/sneakers>
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy/utils/defer.py", line 55, in mustbe_deferred
result = f(*args, **kw)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/scrapy/core/downloader/handlers/__init__.py", line 76, in download_request
return handler.download_request(request, spider)
File "/Users/lewbra/Desktop/sxsneakers/sxsneakers/middlewares.py", line 217, in download_request
return HtmlResponse(self.driver.current_url, body=body, encoding='utf-8', request=request)
NameError: name 'download_request' is not defined
2020-06-19 15:48:48 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-19 15:48:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/builtins.NameError': 1,
'downloader/request_bytes': 76,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'elapsed_time_seconds': 107.342463,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 6, 19, 14, 48, 48, 434159),
'log_count/DEBUG': 19,
'log_count/ERROR': 1,
'log_count/INFO': 11,
'memusage/max': 55971840,
'memusage/startup': 55443456,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 6, 19, 14, 47, 1, 91696)}
2020-06-19 15:48:48 [scrapy.core.engine] INFO: Spider closed (finished)
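One detail visible in both the handler and the log: the session is deleted (`self.driver.quit()`) before the return line reads `self.driver.current_url`, and a closed browser session can generally no longer be queried. The sketch below shows the safer ordering, reading everything first and closing afterwards. `FakeDriver` is a hypothetical stand-in used only so the example runs without a browser; it is not the Selenium API, and this ordering issue may not by itself explain the NameError in the traceback.

```python
class FakeDriver:
    """Hypothetical stand-in for a browser driver; not the real Selenium API."""

    def __init__(self, url, html):
        self._url = url
        self._html = html
        self._open = True

    def _require_open(self):
        if not self._open:
            raise RuntimeError("session already closed")

    @property
    def current_url(self):
        self._require_open()
        return self._url

    @property
    def page_source(self):
        self._require_open()
        return self._html

    def quit(self):
        self._open = False


def fetch_and_close(driver):
    """Read everything needed from the session, then close it."""
    try:
        url = driver.current_url
        body = driver.page_source
    finally:
        driver.quit()  # always release the browser, even if reading fails
    return url, body


url, body = fetch_and_close(FakeDriver("https://stockx.com/sneakers", "<html></html>"))
print(url, len(body))
```

In the real handler this would amount to storing `self.driver.current_url` in a local variable before calling `self.driver.quit()` and passing that variable to `HtmlResponse`.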
Any help in working out what the problem is would be much appreciated. Thanks in advance!