Scrapy crawl spider does not follow all links and does not populate the item loader
0 votes / 17 November 2018

I am trying to collect the links from this website (https://minerals.usgs.gov/science/mineral-deposit-database/#products) and scrape the title of each one. However, it is not working! It looks like the spider is not following the links at all!

CODE

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import datetime
import socket
from depositsusa.items import DepositsusaItem
from scrapy.loader import ItemLoader


class DepositsSpider(CrawlSpider):
    name = 'deposits'
    allowed_domains = ['web']
    start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products', ]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
             callback='parse'),
    )

    def parse(self, response):
        i = ItemLoader(item=DepositsusaItem(), response=response)
        i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
        i.add_value('url', response.url)
        i.add_value('project', self.settings.get('BOT_NAME'))
        i.add_value('spider', self.name)
        i.add_value('server', socket.gethostname())
        i.add_value('date', datetime.datetime.now())
        return i.load_item()

ITEMS

import scrapy
from scrapy.item import Item, Field


class DepositsusaItem(Item):
    # main fields
    name = Field()
    # Housekeeping Fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
    pass

OUTPUT

(base) C:\Users\User\Documents\Python WebCrawling Learing Projects\depositsusa>scrapy crawl deposits
2018-11-17 00:29:48 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: depositsusa)
2018-11-17 00:29:48 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2018-11-17 00:29:48 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'depositsusa', 'NEWSPIDER_MODULE': 'depositsusa.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['depositsusa.spiders']}
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-17 00:29:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-11-17 00:29:48 [scrapy.core.engine] INFO: Spider opened
2018-11-17 00:29:48 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-17 00:29:48 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-17 00:29:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://minerals.usgs.gov/robots.txt> (referer: None)
2018-11-17 00:29:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://minerals.usgs.gov/science/mineral-deposit-database/#products> (referer: None)
2018-11-17 00:29:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://minerals.usgs.gov/science/mineral-deposit-database/>
{'date': [datetime.datetime(2018, 11, 17, 0, 29, 49, 832526)],
'project': ['depositsusa'],
'server': ['DESKTOP-9CUE746'],
'spider': ['deposits'],
'url': ['https://minerals.usgs.gov/science/mineral-deposit-database/']}
2018-11-17 00:29:49 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-17 00:29:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 475,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 25123,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 16, 23, 29, 49, 848053),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 11, 16, 23, 29, 48, 520273)}
2018-11-17 00:29:49 [scrapy.core.engine] INFO: Spider closed (finished)

I am new to Python, so what is the problem? Does it have something to do with the link extraction or with the parse function?

1 Answer

0 votes / 17 November 2018

You need to change a couple of things.

First, when you use a CrawlSpider you cannot name your callback parse, because that overrides CrawlSpider's own parse method: https://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules

The docs' warning: when writing crawl spider rules, avoid using parse as the callback, since the CrawlSpider uses the parse method itself to implement its logic; if you override parse, the crawl spider will no longer work.

Second, you need a correct allowed_domains list: with allowed_domains = ['web'], any link the rule does extract would be filtered out as off-site.

Try something like this:

import datetime
import socket

from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule

from depositsusa.items import DepositsusaItem


class DepositsSpider(CrawlSpider):
    name = 'deposits'
    # match the domain(s) of the links you actually want to follow
    allowed_domains = ['doi.org']
    start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products', ]

    rules = (
        # any callback name other than "parse" works
        Rule(LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a'),
             callback='parse_x'),
    )

    def parse_x(self, response):
        i = ItemLoader(item=DepositsusaItem(), response=response)
        i.add_xpath('name', '//*[@class="container"][1]/header/h1/text()')
        i.add_value('url', response.url)
        i.add_value('project', self.settings.get('BOT_NAME'))
        i.add_value('spider', self.name)
        i.add_value('server', socket.gethostname())
        i.add_value('date', datetime.datetime.now())
        return i.load_item()
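If you want to see which URLs the rule would actually follow before running the whole crawl, one way (a quick sketch, assuming the page still exposes the #products block used in the question) is to test the same LinkExtractor inside scrapy shell:

# scrapy shell "https://minerals.usgs.gov/science/mineral-deposit-database/"
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_xpaths='//*[@id="products"][1]/p/a')
for link in le.extract_links(response):
    # each Link object carries the absolute URL the Rule would request
    print(link.url)

Whatever domains show up there are the ones that belong in allowed_domains.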
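A separate, minor point: by default the ItemLoader returns every field as a list, which is why the log above shows values like ['depositsusa'] and ['deposits']. If you want plain single values instead, one option, sketched here against the items.py from the question, is to attach a TakeFirst output processor to each field:

from scrapy.item import Item, Field
from scrapy.loader.processors import TakeFirst


class DepositsusaItem(Item):
    # main fields -- TakeFirst keeps only the first extracted value
    name = Field(output_processor=TakeFirst())
    # housekeeping fields
    url = Field(output_processor=TakeFirst())
    project = Field(output_processor=TakeFirst())
    spider = Field(output_processor=TakeFirst())
    server = Field(output_processor=TakeFirst())
    date = Field(output_processor=TakeFirst())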