Scrapy pauses periodically
February 25, 2020

I'm trying to do a one-time scrape of a site with more than 30,000 pages using Scrapy. However, my spider pauses periodically (for example, at pages 148, 285, 425, and 558) and resumes after a few minutes (as the logs below show). I'm using scrapy_proxy_pool and scrapy-user-agents to rotate IPs and User-Agents. I tried setting DOWNLOAD_DELAY and AUTOTHROTTLE, but the problem persists. Any help is much appreciated!
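For context, a setup of the kind described is usually wired up in settings.py roughly like this. This is only a sketch and not the actual project configuration: the middleware paths and priorities are the ones the two packages' READMEs suggest, and every numeric value is an illustrative assumption.

```python
# settings.py (sketch; all numeric values are illustrative assumptions)

# Fixed delay between requests to the same site, in seconds.
DOWNLOAD_DELAY = 1

# AutoThrottle adjusts the delay dynamically based on observed latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

# Switch on proxy rotation from scrapy_proxy_pool.
PROXY_POOL_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in User-Agent middleware so the random one takes over.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
    # Proxy rotation and ban detection from scrapy_proxy_pool.
    "scrapy_proxy_pool.middlewares.ProxyPoolMiddleware": 610,
    "scrapy_proxy_pool.middlewares.BanDetectionMiddleware": 620,
}
```

Setting PROXY_POOL_ENABLED to False for a test run would be one way to check whether the pauses come from the proxy middleware rather than from the target site.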

2020-02-25 22:11:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), scraped 2768 items (at 12 items/min)
2020-02-25 22:12:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), scraped 2768 items (at 0 items/min)
2020-02-25 22:13:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), scraped 2768 items (at 0 items/min)
2020-02-25 22:14:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), scraped 2768 items (at 0 items/min)
2020-02-25 22:15:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), scraped 2768 items (at 0 items/min)
2020-02-25 22:16:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), scraped 2768 items (at 0 items/min)
2020-02-25 22:17:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), scraped 2768 items (at 0 items/min)
2020-02-25 22:18:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), scraped 2768 items (at 0 items/min)
2020-02-25 22:18:33 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36
2020-02-25 22:18:33 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): free-proxy-list.net:443
2020-02-25 22:18:33 [urllib3.connectionpool] DEBUG: https://free-proxy-list.net:443 "GET /anonymous-proxy.html HTTP/1.1" 200 None
2020-02-25 22:18:34 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.us-proxy.org:443
2020-02-25 22:18:34 [urllib3.connectionpool] DEBUG: https://www.us-proxy.org:443 "GET / HTTP/1.1" 200 None
2020-02-25 22:18:34 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:34 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:34 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): free-proxy-list.net:443
2020-02-25 22:18:34 [urllib3.connectionpool] DEBUG: https://free-proxy-list.net:443 "GET /uk-proxy.html HTTP/1.1" 200 None
2020-02-25 22:18:34 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.sslproxies.org:443
2020-02-25 22:18:35 [urllib3.connectionpool] DEBUG: https://www.sslproxies.org:443 "GET / HTTP/1.1" 200 None
2020-02-25 22:18:35 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.free-proxy-list.net:80
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: http://www.free-proxy-list.net:80 "GET / HTTP/1.1" 301 None
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTPS connection (1): www.free-proxy-list.net:443
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: https://www.free-proxy-list.net:443 "GET / HTTP/1.1" 200 None
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:36 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:36 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36
2020-02-25 22:18:36 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:37 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:37 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:37 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:37 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
2020-02-25 22:18:37 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:37 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:37 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.myanmartradeportal.gov.mm/commodity-search/view/30989> (referer: https://www.myanmartradeportal.gov.mm/commodity-search/view/1)
2020-02-25 22:18:46 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36
2020-02-25 22:18:46 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:46 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:46 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:46 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.myanmartradeportal.gov.mm/commodity-search/view/30988> (referer: https://www.myanmartradeportal.gov.mm/commodity-search/view/1)
2020-02-25 22:18:47 [scrapy_user_agents.middlewares] DEBUG: Assigned User-Agent Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36
2020-02-25 22:18:47 [urllib3.connectionpool] DEBUG: Starting new HTTP connection (1): www.proxy-daily.com:80
2020-02-25 22:18:47 [urllib3.connectionpool] DEBUG: http://www.proxy-daily.com:80 "GET / HTTP/1.1" 200 185
2020-02-25 22:18:47 [scrapy_proxy_pool.middlewares] WARNING: No proxies available.
2020-02-25 22:18:47 [scrapy_proxy_pool.middlewares] INFO: Try to download with host ip.
2020-02-25 22:18:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET SCRAPING_URL> (referer: REFERER_URL)

Sometimes this message appears during the pause:

2020-02-25 22:53:06 [scrapy_proxy_pool.middlewares] INFO: Blacklist is cleared.
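To pin down exactly when and for how long the spider stalls, the logstats lines can be scanned for consecutive reports of 0 pages/min. The snippet below is a minimal sketch using only the standard library; the function name `find_stalls` and the shortened sample lines are made up for illustration, and the regular expression assumes the default Scrapy log format seen in the log above.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2020-02-25 22:11:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), ...
LOGSTATS = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[scrapy\.extensions\.logstats\] "
    r"INFO: Crawled (\d+) pages \(at (\d+) pages/min\)"
)


def find_stalls(lines):
    """Return (start, end, pages_crawled) for each run of 0 pages/min reports."""
    stalls = []
    start = last = pages = None
    for line in lines:
        match = LOGSTATS.match(line)
        if not match:
            continue  # not a logstats line
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if int(match.group(3)) == 0:
            if start is None:
                # First zero-rate report of a new stall.
                start, pages = ts, int(match.group(2))
            last = ts
        elif start is not None:
            # Crawl rate recovered: close the current stall.
            stalls.append((start, last, pages))
            start = None
    if start is not None:
        # The log ended while the spider was still stalled.
        stalls.append((start, last, pages))
    return stalls


# Shortened sample taken from the log above (hypothetical subset).
sample = [
    "2020-02-25 22:11:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), scraped 2768 items (at 12 items/min)",
    "2020-02-25 22:12:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), scraped 2768 items (at 0 items/min)",
    "2020-02-25 22:18:08 [scrapy.extensions.logstats] INFO: Crawled 558 pages (at 0 pages/min), scraped 2768 items (at 0 items/min)",
]

for start, end, pages in find_stalls(sample):
    print(f"stalled at {pages} pages from {start} to {end} ({end - start})")
```

Lining up the stall windows with the scrapy_proxy_pool messages ("No proxies available.", "Blacklist is cleared.") should show whether the pauses coincide with the proxy list being refreshed.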