Question

I'm scraping the Dmoz website. Scraping the about page works, but when I added another callback named parse_editor and tried to scrape the editor page with it, it gives me no result.

from ..items import DmoztutorialItem
import scrapy


class DmozSpiderSpider(scrapy.Spider):
    name = 'Dmoz'
    start_urls = ['http://dmoz-odp.org/']
    about_page = 'http://dmoz-odp.org/docs/en/about.html'
    editor = 'http://dmoz-odp.org/docs/en/help/become.html'

    def parse(self, response):
        # collect data on first page
        items = {
            'Navbar': response.css('#main-nav a::text').extract(),
            'Category_names': response.css('.top-cat a::text').extract(),
            'Subcategories': response.css('.sub-cat a::text').extract(),
            'About_page': self.about_page,
            'Become_an_editor': self.editor
        }

        # save and call request to another page
        yield response.follow(self.about_page, self.parse_about, self.editor, self.parse_editor, meta={'items': items})

    def parse_about(self, response):
        # do your stuff on second page
        items = response.meta['items'] 
        items['Headings'] = response.css('h2::text , #mainContent h1::text').extract()  # add your logics
        items['Paragraphs'] = response.css('p::text').extract()
        items['3 Projects'] = response.css('li~ li+ li b a::text , li:nth-child(1) b a::text').extract()
        items['About Dmoz'] = response.css('.nav ul a::text , li:nth-child(2) b a::text').extract()
        items['Languages'] = response.css('.nav~ .nav a::text').extract()
        items['You can make a difference'] = response.css('dd::text , #about-contribute::text').extract()
        items['Further information'] = response.css('li::text , #about-more-info a::text').extract()
        yield items

    def parse_editor(self, response):
        # do your stuff on third page
        editor_items = response.meta['items']
        editor_items['Heading'] = response.css('#mainContent h1::text').extract()
        yield editor_items

vezunchik · Answer 1 · April 16, 2019

You are putting everything into a single response.follow call, which is wrong. It takes exactly one URL-callback pair, so write two separate response.follow calls:

Incorrect:

yield response.follow(self.about_page, self.parse_about, self.editor, self.parse_editor, meta={'items': items})

Correct:

yield response.follow(self.about_page, self.parse_about, meta={'items': items})
yield response.follow(self.editor, self.parse_editor, meta={'items': items})

Alternatively, you can chain the requests: issue the first follow in parse, make the second follow from parse_about, and yield the final item in parse_editor.

Why is the last function not executed?
