Question

I wrote this short spider to extract the titles from the Hacker News front page (http://news.ycombinator.com/).

import scrapy

class HackerItem(scrapy.Item): #declaring the item
    hackertitle = scrapy.Field()


class HackerSpider(scrapy.Spider):
    name = 'hackernewscrawler'
    allowed_domains = ['news.ycombinator.com'] # website we chose
    start_urls = ['http://news.ycombinator.com/']

    def parse(self, response):
        sel = scrapy.Selector(response) #selector to help us extract the titles
        item=HackerItem() #the item declared up

        # xpath of the titles
        item['hackertitle'] = sel.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract()

        # printing titles using print statement.
        print(item['hackertitle'])

However, when I run scrapy crawl hackernewscrawler -o hntitles.json -t json, I get an empty .json file with no content.

FcknGioconda · Answer 1 · December 2, 2018

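Scrapy writes to the output file only the items that the callback yields or returns; print just sends them to the console. Change the print statement to yield: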

import scrapy

class HackerItem(scrapy.Item): #declaring the item
    hackertitle = scrapy.Field()


class HackerSpider(scrapy.Spider):
    name = 'hackernewscrawler'
    allowed_domains = ['news.ycombinator.com'] # website we chose
    start_urls = ['http://news.ycombinator.com/']

    def parse(self,response):
        sel = scrapy.Selector(response) #selector to help us extract the titles
        item=HackerItem() #the item declared up

        # xpath of the titles
        item['hackertitle'] = sel.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract()

        # return items
        yield item

Then run:

scrapy crawl hackernewscrawler -o hntitles.json -t json
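
To see why print leaves the file empty while yield fills it, here is a minimal pure-Python sketch. It does not use Scrapy at all: export_items is a hypothetical stand-in for the feed exporter, and parse_with_print / parse_with_yield are made-up names that mimic the two versions of parse above. The only assumption is that the exporter serializes whatever the callback hands back, which is roughly what happens with the -o option.

```python
import json


def parse_with_print(titles):
    """Mimics the question's callback: prints the titles and returns None."""
    print(titles)


def parse_with_yield(titles):
    """Mimics the answer's callback: yields one item to the caller."""
    yield {"hackertitle": titles}


def export_items(callback, titles):
    """Hypothetical stand-in for a feed exporter (not Scrapy's real API).

    It serializes only what the callback hands back. A callback that
    just prints returns None, so there is nothing to serialize.
    """
    result = callback(titles)
    items = list(result) if result is not None else []
    return json.dumps(items)
```

Calling export_items(parse_with_print, titles) gives an empty JSON list in this sketch, while export_items(parse_with_yield, titles) gives a list with one item holding the titles. Real Scrapy does more work around the callback, but the idea is the same: only yielded or returned items reach the output file.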
