Yesterday I started learning Scrapy to extract some information, but I can't get the pagination right. I followed the tutorial here, but I think this site uses a different pagination system.
Most sites have a class="next" link, but this one doesn't. There is only a list in which the current page is marked up as a span with the class current:
<div class="pagination">
<ul class="page-numbers">
<li><span class='page-numbers current'>1</span></li>
<li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/2/'>2</a></li>
<li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/3/'>3</a></li>
<li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/4/'>4</a></li>
<li><a class='page-numbers' href='https://www.musicfestivalwizard.com/all-festivals/page/5/'>5</a></li>
And here is my spider:
import scrapy


class MfwspiderSpider(scrapy.Spider):
    name = 'mfwspider'
    allowed_domains = ['www.musicfestivalwizard.com']
    start_urls = ['https://www.musicfestivalwizard.com/all-festivals/',]

    def parse(self, response):
        pagenumber = 1
        # Yield one item per festival block on the listing page
        for festival in response.css("span.festivalleft"):
            yield {
                'date' : festival.css(".festivaldate::text").extract(),
                'location' : festival.css(".festivallocation::text").extract_first(),
                'title' : festival.css(".festivaltitle > a::text").extract_first(),
            }
        # My attempt at building the next page URL by hand
        next_page = self.start_urls[0] + str(pagenumber) + "/"
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse,)
As you can see, I added a few print() statements for debugging, and here is my console output:
scrapy crawl mfwspider
2018-05-06 00:21:45 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: lineups)
2018-05-06 00:21:45 [scrapy.utils.log] INFO: Versions: lxml, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.4.0, Python 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-05-06 00:21:45 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'lineups', 'NEWSPIDER_MODULE': 'lineups.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['lineups.spiders']}
2018-05-06 00:21:45 [scrapy.middleware] INFO: Enabled extensions:
2018-05-06 00:21:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
2018-05-06 00:21:46 [scrapy.middleware] INFO: Enabled spider middlewares:
2018-05-06 00:21:46 [scrapy.middleware] INFO: Enabled item pipelines:
2018-05-06 00:21:46 [scrapy.core.engine] INFO: Spider opened
2018-05-06 00:21:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-06 00:21:46 [scrapy.extensions.telnet] DEBUG: Telnet console listening on
2018-05-06 00:21:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/robots.txt> (referer: None)
2018-05-06 00:21:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/all-festivals/> (referer: None)
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'Numero Uno, Malta', 'title': 'Lost And Found Malta 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 6, 2018'], 'location': 'New Orleans, LA', 'title': 'New Orleans Jazz Festival 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 2-May 6, 2018'], 'location': 'West Palm Beach, FL', 'title': 'Sunfest 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Memphis, TN', 'title': 'Beale Street Music Festival 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 5-6, 2018'], 'location': 'Liverpool, UK', 'title': 'Liverpool Sound City 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4–6, 2018'], 'location': 'Atlanta, GA', 'title': 'Shaky Knees Festival 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Concord, NC', 'title': 'Carolina Rebellion 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Winooski, VT', 'title': 'Waking Windows 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Texas Tour', 'title': 'JMBLYA 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'San Diego, CA', 'title': 'West Coast Weekender 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 12, 2017'], 'location': 'Australia Tour', 'title': 'Groovin’ The Moo 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 7-13. 2018'], 'location': 'Toronto, ON', 'title': 'Canadian Music Week 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 11-13, 2018'], 'location': 'London, UK', 'title': 'Peckham Rye 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 12-13, 2018'], 'location': 'Somerset, WI', 'title': 'Northern Invasion 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 6-13, 2018'], 'location': 'Lyon, France', 'title': 'Nuits Sonores 2018'}
2018-05-06 00:21:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/all-festivals/page/2/> (referer: https://www.musicfestivalwizard.com/all-festivals/)
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 12-13, 2018'], 'location': 'Chiba, Japan', 'title': 'Electric Daisy Carnival Japan 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 11-13, 2018'], 'location': 'Arcosanti, AZ', 'title': 'FORM Arcosanti Festival 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 11-13, 2018'], 'location': 'Atlanta, GA', 'title': 'Shaky Beats Festival 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 11-13, 2018'], 'location': 'Miami, FL', 'title': 'Rolling Loud Festival 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 17-19, 2018'], 'location': 'Brighton, UK', 'title': 'The Great Escape 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Gulf Shores, AL', 'title': 'Hangout Fest 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Saint-Laurent-de-Cuves, France', 'title': 'Papillons De Nuit 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['June 19-20, 2018'], 'location': 'Margny-lès-Compiègne, France', 'title': 'Imaginarium Festival 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': [' May 18-20, 2018'], 'location': 'Columbus, OH', 'title': 'Rock on the Range 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 17-20, 2018'], 'location': 'Durham, NC', 'title': 'Moogfest 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 19-20, 2018'], 'location': 'Paris, France', 'title': 'Marvellous Island Festival 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Montreal, QC', 'title': 'Pouzza Fest 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-20, 2018'], 'location': 'Houthalen-Helchteren, Belgium', 'title': 'Extrema Outdoor Belgium 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 17-20, 2018'], 'location': 'Joshua Tree, CA', 'title': 'Joshua Tree Festival Spring 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/page/2/>
{'date': ['May 18-21, 2018'], 'location': 'Las Vegas, NV', 'title': 'Electric Daisy Carnival Vegas 2018'}
2018-05-06 00:21:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musicfestivalwizard.com/all-festivals/> (referer: https://www.musicfestivalwizard.com/all-festivals/page/2/)
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'Numero Uno, Malta', 'title': 'Lost And Found Malta 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 6, 2018'], 'location': 'New Orleans, LA', 'title': 'New Orleans Jazz Festival 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 2-May 6, 2018'], 'location': 'West Palm Beach, FL', 'title': 'Sunfest 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Memphis, TN', 'title': 'Beale Street Music Festival 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 5-6, 2018'], 'location': 'Liverpool, UK', 'title': 'Liverpool Sound City 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4–6, 2018'], 'location': 'Atlanta, GA', 'title': 'Shaky Knees Festival 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Concord, NC', 'title': 'Carolina Rebellion 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Winooski, VT', 'title': 'Waking Windows 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 4-6, 2018'], 'location': 'Texas Tour', 'title': 'JMBLYA 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 3-6, 2018'], 'location': 'San Diego, CA', 'title': 'West Coast Weekender 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['April 27-May 12, 2017'], 'location': 'Australia Tour', 'title': 'Groovin’ The Moo 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 7-13. 2018'], 'location': 'Toronto, ON', 'title': 'Canadian Music Week 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 11-13, 2018'], 'location': 'London, UK', 'title': 'Peckham Rye 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 12-13, 2018'], 'location': 'Somerset, WI', 'title': 'Northern Invasion 2018'}
2018-05-06 00:21:47 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.musicfestivalwizard.com/all-festivals/>
{'date': ['May 6-13, 2018'], 'location': 'Lyon, France', 'title': 'Nuits Sonores 2018'}
2018-05-06 00:21:47 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.musicfestivalwizard.com/all-festivals/page/2/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2018-05-06 00:21:47 [scrapy.core.engine] INFO: Closing spider (finished)
2018-05-06 00:21:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1092,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 48590,
'downloader/response_count': 4,
'downloader/response_status_count/200': 4,
'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 5, 22, 21, 47, 746610),
'item_scraped_count': 45,
'log_count/DEBUG': 51,
'log_count/INFO': 7,
'memusage/max': 66899968,
'memusage/startup': 66899968,
'request_depth_max': 3,
'response_received_count': 4,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2018, 5, 5, 22, 21, 46, 20038)}
2018-05-06 00:21:47 [scrapy.core.engine] INFO: Spider closed (finished)
I think I need something that selects the li that comes right after the one holding the current page. How can I do that in Scrapy? Is there a better way to do this?