scrapy-splash does not return the html rendered by splash

I installed splash and scrapy-splash in a Python virtual environment (Ubuntu 16.04), following the instructions in the README file (including the middleware configuration, etc.). Although I don't get any errors in the log file (apparently), the html returned by scrapy-splash does not contain the html rendered by Splash, only the html downloaded by Scrapy (without splash).

In some situations I can get the correct HTML. These are:

However, scrapy-splash does not return the correct HTML when using SplashRequest:

yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 0.5})
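
The call above is issued from a spider along these lines (a minimal sketch; the spider name, the start_requests method and the CSS selector in parse are placeholders for illustration, not the real project code):

import scrapy
from scrapy_splash import SplashRequest

class EventsSpider(scrapy.Spider):
    name = 'events'  # hypothetical name; the log shows the bot is called "tampabay"

    def start_requests(self):
        # Same endpoint and wait argument as in the request above
        yield SplashRequest('https://www.tampabay.com/events/', self.parse,
                            endpoint='render.html', args={'wait': 0.5})

    def parse(self, response):
        # If Splash rendering worked, JavaScript-generated nodes should be visible here
        events = response.css('div.event')  # assumed selector, for illustration only
        self.logger.debug('EVENTS: %d', len(events))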

This is my configuration in the settings.py file:

SPIDER_MIDDLEWARES = {
  'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
  'scrapy_splash.SplashCookiesMiddleware': 723,
  'scrapy_splash.SplashMiddleware': 725,
  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://127.0.0.1:8050/'
SPLASH_COOKIES_DEBUG = True

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

The expected output is the html rendered by splash, but only the unrendered html is returned.
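
To narrow the problem down, Splash can also be queried directly over its HTTP API and the result compared with what the spider receives (a sketch using the requests library; the target URL and wait value mirror the ones above):

import requests

# render.html also accepts GET requests with the target url and a wait time
resp = requests.get('http://127.0.0.1:8050/render.html',
                    params={'url': 'https://www.tampabay.com/events/', 'wait': 0.5})
resp.raise_for_status()
html = resp.text

# If the JavaScript-generated content shows up here but not in the spider's
# response, the issue is on the Scrapy side rather than in Splash itself.
print(len(html))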

Splash Docker log:

process 1: D-Bus library appears to be incorrectly set up; failed to read machine uuid: UUID file '/etc/machine-id' should contain a hex string of length 32, not length 0, with no other text
See the manual page for dbus-uuidgen to correct this issue.
qt.network.ssl: QSslSocket: cannot resolve SSLv2_client_method
qt.network.ssl: QSslSocket: cannot resolve SSLv2_server_method
2019-04-17 14:35:28.198194 [events] {"timestamp": 1555511728, "status_code": 200, "user-agent": "Scrapy/1.3.3 (+http://scrapy.org)", "client_ip": "172.17.0.1", "load": [0.15, 0.38, 0.35], "rendertime": 5.785578966140747, "active": 0, "fds": 68, "qsize": 0, "method": "POST", "_id": 140284272664528, "path": "/render.html", "args": {"headers": {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "User-Agent": "Scrapy/1.3.3 (+http://scrapy.org)", "Accept-Language": "en", "Cookie": "__cfduid=d035cc38f38ee9f555aec777db4b1b8f81555511718"}, "uid": 140284272664528, "wait": 0.5, "url": "https://www.tampabay.com/events/"}, "maxrss": 159672}
2019-04-17 14:35:28.198893 [-] "172.17.0.1" - - [17/Apr/2019:14:35:27 +0000] "POST /render.html HTTP/1.1" 200 34075 "-" "Scrapy/1.3.3 (+http://scrapy.org)

Scrapy log output:

2019-04-17 16:35:18 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tampabay)
2019-04-17 16:35:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tampabay.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['tampabay.spiders'], 'BOT_NAME': 'tampabay', 'LOG_FILE': 'tampabay.log', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'DOWNLOAD_DELAY': 3}
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled item pipelines:
['tampabay.pipelines.TampabayPipeline']
2019-04-17 16:35:18 [scrapy.core.engine] INFO: Spider opened
2019-04-17 16:35:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-17 16:35:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-17 16:35:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tampabay.com/robots.txt> (referer: None)
2019-04-17 16:35:18 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://127.0.0.1:8050/robots.txt> (referer: None)
2019-04-17 16:35:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tampabay.com/events/ via http://127.0.0.1:8050/render.html> (referer: None)
2019-04-17 16:35:28 [tampabay] DEBUG: ############## INSIDE FUNCTION -> parse ############### 
2019-04-17 16:35:28 [tampabay] DEBUG: EVENTS: 0
2019-04-17 16:35:28 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-17 16:35:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1037,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 35911,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 17, 14, 35, 28, 333825),
 'log_count/DEBUG': 6,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/200': 1,
 'start_time': datetime.datetime(2019, 4, 17, 14, 35, 18, 83737)}
2019-04-17 16:35:28 [scrapy.core.engine] INFO: Spider closed (finished)
...