import scrapy


class IdealistaspiderSpider(scrapy.Spider):
    name = 'idealistaspider'
    allowed_domains = ['idealista.pt']
    start_urls = [
        'https://www.idealista.pt/en/comprar-casas/lisboa/com-publicado_ultimas-24-horas//',
    ]

    def parse(self, response):
        print("Entered in parser..............................")
        next_page = response.css('a.icon-arrow-right-after::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
The spider never enters the parse function. The error logs are below:
2020-05-09 16:39:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-09 16:39:27 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-09 16:39:27 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.idealista.pt/en/comprar-casas/lisboa/com-publicado_ultimas-24-horas//> (referer: None)
2020-05-09 16:39:27 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.idealista.pt/en/comprar-casas/lisboa/com-publicado_ultimas-24-horas//>: HTTP status code is not handled or not allowed
2020-05-09 16:39:27 [scrapy.core.engine] INFO: Closing spider (finished)
I have already set USER_AGENT, as many solutions on the internet suggest, for example:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
I even tried creating a separate script (run.py) with the contents below, but the problem persists:
from scrapy.crawler import CrawlerProcess
from spiders import idealistaspider

### Idealista
process = CrawlerProcess(settings={
    'ROBOTSTXT_OBEY': False,
    'CONCURRENT_REQUESTS': 1,
    'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
})
process.crawl(idealistaspider.IdealistaspiderSpider)
process.start()
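
For context on the log above: the "Ignoring response <403 ...>" line comes from Scrapy's HttpError spider middleware, which by default drops non-2xx responses before they reach any callback. That is why the print in parse never runs. Below is a minimal sketch of a settings dict that lets the 403 through so it can be inspected. The name DEBUG_SETTINGS, the header values, and the delay are assumptions for illustration only, and none of this is guaranteed to get past the site's blocking.

```python
# Sketch of Scrapy settings for debugging the 403 (values are illustrative).
DEBUG_SETTINGS = {
    'ROBOTSTXT_OBEY': False,
    'CONCURRENT_REQUESTS': 1,
    'USER_AGENT': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
    ),
    # Let 403 responses pass the HttpError middleware so parse() is called
    # and the response status and body can be examined.
    'HTTPERROR_ALLOWED_CODES': [403],
    # Browser-like headers sent with every request (assumed values; the
    # site may check other things as well).
    'DEFAULT_REQUEST_HEADERS': {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9,pt;q=0.8',
    },
    # Pause between requests, in seconds (assumed value).
    'DOWNLOAD_DELAY': 2,
}
```

The dict could be passed as CrawlerProcess(settings=DEBUG_SETTINGS) in run.py. With 403 allowed, parse receives the response, so response.status and response.text can be logged to see what the server actually returns.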