Python doesn't seem to get data from all available URLs
5 votes
/ November 26, 2011

I'm trying to scrape thesession.org to build a table of how many times each tune has been added to members' tunebooks, so that I can find some popular tunes to learn. I started from the Scrapy tutorial and am trying to adapt it to my purposes. The problem is that, although thesession.org appears to have 10,390 tunes, my scraper returns data for only 10 of them (just the ones linked from http://www.thesession.org/tunes/index.php). How do I get data for all of the tunes (or for the top hundred tunes)? Any advice would be much appreciated.

Here is what I have so far:

items.py

from scrapy.item import Item, Field

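class tuneItem(Item):
    url = Field()    # address of the tune page
    name1 = Field()  # tune title (the page's h1)
    name2 = Field()  # tune type, e.g. reel or polka (the page's h2)
    key = Field()    # key signature text
    count = Field()  # digits scraped from the third paragraph of the details box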

tune_spider.py

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import tuneItem

class tunesSpider(CrawlSpider):

    name = "irishtunes"
    allowed_domains = ["thesession.org"]
    start_urls = ["http://www.thesession.org/tunes"]
    rules = [
        Rule(
            SgmlLinkExtractor(
                allow=[r'/display/\d+'],
                deny=[r'/members/', r'/recordings/', r'/index/', r'/display/\d+/.'],
            ),
            'parse_tune',
        ),
    ]

    def parse_tune(self, response):
        x = HtmlXPathSelector(response)

        tune = tuneItem()
        tune['url'] = response.url
        tune['name1'] = x.select("//div[@id='details']//div[@class='box']/h1/text()").extract()
        tune['name2'] = x.select("//div[@id='details']//div[@class='box']/h2/text()").extract()
        tune['key']   = x.select("//div[@id='details']//div[@class='box']/p[1]/text()").extract()
        tune['count'] = x.select("//div[@id='details']//div[@class='box']/p[3]/text()").re('\d+')
        return tune

I run the scraper by opening a console, changing to the directory that contains the tutorial's cfg file, and running scrapy crawl irishtunes --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv

Here is what I get:

C:\Users\BM\Desktop\scrape\tutorial>scrapy crawl irishtunes --set FEED_URI=scrap
ed_data.csv --set FEED_FORMAT=csv
2011-11-25 22:45:47-0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: tutoria
l)
2011-11-25 22:45:47-0800 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled item pipelines:
2011-11-25 22:45:48-0800 [irishtunes] INFO: Spider opened
2011-11-25 22:45:48-0800 [irishtunes] INFO: Crawled 0 pages (at 0 pages/min), sc
raped 0 items (at 0 items/min)
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Redirecting (301) to <GET http://ww
w.thesession.org/tunes/> from <GET http://www.thesession.org/tunes>
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/> (referer: None)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11602> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11602>
        {'count': [u'1'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Brendan Begley's"],
         'name2': [u'polka'],
         'url': 'http://www.thesession.org/tunes/display/11602'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11593> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11593>
        {'count': [u'3'],
         'key': [u'Key signature: Amajor'],
         'name1': [u'Carleton County Breakdown'],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11593'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11597> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11597>
        {'count': [u'3'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Kasper's Rant"],
         'name2': [u'hornpipe'],
         'url': 'http://www.thesession.org/tunes/display/11597'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11594> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11594>
        {'count': [u'5'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u'The Full Of The Bag'],
         'name2': [u'hornpipe'],
         'url': 'http://www.thesession.org/tunes/display/11594'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11599> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11599>
        {'count': [u'1'],
         'key': [u'Key signature: Adorian'],
         'name1': [u'The New Steamboat'],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11599'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11598> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11598>
        {'count': [u'4'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u"Galen's Arrival"],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11598'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11596> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11596>
        {'count': [u'2'],
         'key': [u'Key signature: Amixolydian'],
         'name1': [u'Culloden Day'],
         'name2': [u'strathspey'],
         'url': 'http://www.thesession.org/tunes/display/11596'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11595> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11595>
        {'count': [u'2'],
         'key': [u'Key signature: Aminor'],
         'name1': [u'Miss Sine Flemington'],
         'name2': [u'barndance'],
         'url': 'http://www.thesession.org/tunes/display/11595'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11600> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11600>
        {'count': [u'2'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Joan Martin's"],
         'name2': [u'polka'],
         'url': 'http://www.thesession.org/tunes/display/11600'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11601> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11601>
        {'count': [u'2'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u'My Time Inside 2005'],
         'name2': [u'waltz'],
         'url': 'http://www.thesession.org/tunes/display/11601'}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Closing spider (finished)
2011-11-25 22:45:49-0800 [irishtunes] INFO: Stored csv feed (10 items) in: scrap
ed_data.csv
2011-11-25 22:45:49-0800 [irishtunes] INFO: Dumping spider stats:
        {'downloader/request_bytes': 3655,
         'downloader/request_count': 12,
         'downloader/request_method_count/GET': 12,
         'downloader/response_bytes': 31620,
         'downloader/response_count': 12,
         'downloader/response_status_count/200': 11,
         'downloader/response_status_count/301': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2011, 11, 26, 6, 45, 49, 500000),
         'item_scraped_count': 10,
         'request_depth_max': 1,
         'scheduler/memory_enqueued': 12,
         'start_time': datetime.datetime(2011, 11, 26, 6, 45, 48, 10000)}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Spider closed (finished)
2011-11-25 22:45:49-0800 [scrapy] INFO: Dumping global stats:
        {}

EDIT: The answer from @reclosedev set me on the right path. For anyone curious about the results, here's a snapshot ...

(1) The vast majority of tunes are in fewer than 10 members' tunebooks

(2) The popularity of all 10,379 tunes I was able to scrape from the site (measured by the number of tunebooks they are in) follows a power-law distribution

(3) The tunes that are in more than 1,000 tunebooks on the site, showing the names of the highest-ranked tunes and how many tunebooks each one is in
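
For anyone who wants to produce a ranking like this from the CSV feed, here is a minimal sketch using only the standard library. It assumes Python 3 and that scraped_data.csv has a header row with the item field names (count, key, name1, name2, url); top_tunes is a hypothetical helper, and the sample rows reuse values from the log above in place of the real file.

```python
import csv
import io


def top_tunes(csv_file, limit=100):
    """Return (name, count) pairs sorted by tunebook count, highest first."""
    rows = []
    for row in csv.DictReader(csv_file):
        try:
            count = int(row['count'])
        except (KeyError, ValueError, TypeError):
            # Skip rows where the count is missing or not a number.
            continue
        rows.append((row.get('name1', ''), count))
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:limit]


# Stand-in for scraped_data.csv; the column order here is an assumption.
sample = io.StringIO(
    "count,key,name1,name2,url\n"
    "3,Key signature: Amajor,Carleton County Breakdown,reel,"
    "http://www.thesession.org/tunes/display/11593\n"
    "5,Key signature: Gmajor,The Full Of The Bag,hornpipe,"
    "http://www.thesession.org/tunes/display/11594\n"
    "4,Key signature: Gmajor,Galen's Arrival,reel,"
    "http://www.thesession.org/tunes/display/11598\n"
)

print(top_tunes(sample, limit=2))
# [('The Full Of The Bag', 5), ("Galen's Arrival", 4)]
```

With the real feed, the same function could be called as top_tunes(open('scraped_data.csv', newline='')).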

Answers [ 2 ]

5 votes
/ November 26, 2011

You need to add a Rule that extracts the links to all of the index pages, so that the spider follows them:

rules = [
    ...,  # your existing parse_tune rule
    Rule(
        SgmlLinkExtractor(
             allow=(r'/index/new\?new_start=\d+',)
        ),
        follow=True,
    ),
]
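
To see which links each rule would pick up, the patterns can be tried outside Scrapy. The sketch below uses only the re module and assumes the link extractor applies its allow and deny patterns as regular-expression searches against the full URL. The sample URLs are made up for illustration (the real pagination links may look different), and matches and classify are hypothetical helpers.

```python
import re

# Patterns copied from the question's rule and from the rule added above.
TUNE_ALLOW = [r'/display/\d+']
TUNE_DENY = [r'/members/', r'/recordings/', r'/index/', r'/display/\d+/.']
PAGE_ALLOW = [r'/index/new\?new_start=\d+']


def matches(url, allow, deny=()):
    """True if url matches any allow pattern and none of the deny patterns."""
    allowed = any(re.search(p, url) for p in allow)
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied


def classify(url):
    """Say which rule, if any, would handle the link (first match wins)."""
    if matches(url, TUNE_ALLOW, TUNE_DENY):
        return 'parse_tune'
    if matches(url, PAGE_ALLOW):
        return 'follow'
    return 'ignore'


# Hypothetical URLs, invented to exercise the patterns.
sample_urls = [
    'http://www.thesession.org/tunes/display/11602',
    'http://www.thesession.org/tunes/display/11602/comments',
    'http://www.thesession.org/tunes/index/new?new_start=10',
    'http://www.thesession.org/members/display/1',
]

for url in sample_urls:
    print(classify(url), url)
```

The '/index/' entry in the first rule's deny list does not block the pagination links, because each Rule has its own link extractor.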

Edit:

follow=True is not required, because callback=None (the default) implies follow=True.
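
To make that default concrete, here is a small sketch that mimics it in plain Python. effective_follow is a hypothetical helper written for illustration, not part of Scrapy; it only mirrors the documented behaviour that follow defaults to True when a Rule has no callback and to False when it has one.

```python
def effective_follow(callback=None, follow=None):
    """Mimic how a Rule resolves follow when it is not set explicitly."""
    if follow is None:
        # No explicit value: follow links only when there is no callback.
        return callback is None
    return bool(follow)


# The pagination rule has no callback, so its links are followed.
print(effective_follow())                                    # True
# The tune rule has a callback, so links on tune pages are not followed.
print(effective_follow(callback='parse_tune'))               # False
# An explicit follow=True overrides the default.
print(effective_follow(callback='parse_tune', follow=True))  # True
```

This is also why the original spider stopped at depth 1: its only rule had a callback, so nothing beyond the links on the start page was requested.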

0 votes
/ November 26, 2011
...