Question

I am trying to scrape the Twitter handles of sports teams that play in various tournaments. To reach the handles, I first have to visit a page that links to all the tournaments, then follow each link to a page listing all the teams in that tournament, and then visit each team's page to get its Twitter handle. I run into trouble at the team pages: I am not sure how to return the Twitter name to the previous callback so that I can collect all the Twitter names for a tournament in one list.

In my last callback, parse_twitter, I tried returning the result as a dictionary and then adding it to the item in parse_schedule, but I have not had much luck.

def parse(self, response):
    # Get list of tournaments
    tournaments = Selector(response).xpath('//td/a')
    del tournaments[0]

    # Go through each tournament
    for tourney in tournaments:
        item = FrisbeeItem()
        item['tournament_name'] = tourney.xpath('./text()').extract()[0]
        item['tournament_url'] = tourney.xpath('./@href').extract()[0]

        # make the URL to the teams in the tournament
        tournament_schedule = item['tournament_url'] + '/schedule/Men/CollegeMen/'

        # Request to tournament page
        yield scrapy.Request(url=tournament_schedule, callback=self.parse_schedule, meta={'item' : item})

def parse_schedule(self, response):
    item = response.meta.get('item')

    # Get the list of teams
    tourney_teams = Selector(response).xpath('//div[@class = "pool"]//td/a')

    # For each team in the tournament, get name and URL to team page
    for team in tourney_teams:
        team_name = team.xpath('./text()').extract()[0]
        team_url = 'https://play.usaultimate.org/' + team.xpath('./@href').extract()[0]

        # Request to team page
        yield scrapy.Request(url=team_url, callback=self.parse_twitter, meta={'item': item, 'team_name': team_name})



def parse_twitter(self, response):
    item = response.meta.get('item')
    team_name = response.meta.get('team_name')

    result = {}
    # Get the list containing the twitter
    team_twitter = Selector(response).xpath('//dl[@id="CT_Main_0_dlTwitter"]//a/text()').extract()

    # If no Twitter handle is listed, use an empty string
    if len(team_twitter) == 0:
        result = {'name': team_name, 'twitter': ''}
    else:
        result = {'name': team_name, 'twitter': team_twitter[0]}

    item['tournament_teams'] = result

    yield item

I want something close to the following format:

    {'tournament_name': X,
     'teams': [{'team_name': team1, 'twitter_name': twitter1},
               {'team_name': team2, 'twitter_name': twitter2},
               {'team_name': team3, 'twitter_name': twitter3},
               ...]
     }
    {'tournament_name': Y,
     'teams': [{'team_name': team1, 'twitter_name': twitter1},
               {'team_name': team2, 'twitter_name': twitter2},
               {'team_name': team3, 'twitter_name': twitter3},
               ...]
     }

So basically just one item per tournament, containing the name and Twitter handle of every team in that tournament.
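
One way to get a single item per tournament might be to visit the team pages one after another, passing the item and the list of teams still to visit along with each request, and to emit the item only when that list is empty. The sketch below shows just that control flow in plain Python so it runs without Scrapy. FAKE_TEAM_PAGES, next_step, crawl and the URLs are hypothetical names made up for illustration; in a real spider each "request" tuple would presumably become a `yield scrapy.Request(url, callback=self.parse_twitter, meta=meta)`, and the "item" tuple a `yield item`.

```python
# Hypothetical stand-in for the team pages: URL -> Twitter handle ('' if none listed).
FAKE_TEAM_PAGES = {
    "/teams/a": "@team_a",
    "/teams/b": "",
    "/teams/c": "@team_c",
}


def next_step(item, pending):
    """Return the next request, or the finished item once no teams are left."""
    if not pending:
        return ("item", item)
    (team_name, team_url), rest = pending[0], pending[1:]
    meta = {"item": item, "team_name": team_name, "pending": rest}
    return ("request", team_url, meta)


def parse_schedule(tournament_name, teams):
    """Start the chain: one item per tournament, teams visited one by one."""
    item = {"tournament_name": tournament_name, "teams": []}
    return next_step(item, list(teams))


def parse_twitter(twitter_name, meta):
    """Record one team, then hand the item on to the next team page."""
    item = meta["item"]
    item["teams"].append(
        {"team_name": meta["team_name"], "twitter_name": twitter_name}
    )
    return next_step(item, meta["pending"])


def crawl(tournament_name, teams):
    """Tiny driver that plays the role of the Scrapy engine in this sketch."""
    step = parse_schedule(tournament_name, teams)
    while step[0] == "request":
        _, url, meta = step
        step = parse_twitter(FAKE_TEAM_PAGES.get(url, ""), meta)
    return step[1]


result = crawl(
    "X",
    [("Team A", "/teams/a"), ("Team B", "/teams/b"), ("Team C", "/teams/c")],
)
```

Two caveats if this is carried over to a real spider: the chain stops if any team request fails, so an `errback` on the request that continues with the next team is probably needed, and a team page that appears in several tournaments may be dropped by Scrapy's duplicate filter unless the request sets `dont_filter=True`.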

Right now, the code I posted emits one item per team page (one item for each team in each tournament).
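
Since the spider already produces one record per team, one option is to keep that behavior and group the records by tournament afterwards. The sketch below assumes each record is a flat dict with tournament_name, team_name and twitter_name keys, which is not exactly what the code above yields; group_by_tournament and the sample data are hypothetical. The same grouping could presumably live in an item pipeline that collects items and writes the result in `close_spider`.

```python
from collections import defaultdict


def group_by_tournament(team_records):
    """Collapse flat per-team records into one dict per tournament."""
    grouped = defaultdict(list)
    for record in team_records:
        grouped[record["tournament_name"]].append(
            {
                "team_name": record["team_name"],
                "twitter_name": record["twitter_name"],
            }
        )
    # Dicts keep insertion order, so tournaments appear in first-seen order.
    return [
        {"tournament_name": name, "teams": teams}
        for name, teams in grouped.items()
    ]


# Made-up sample data in the assumed flat format.
flat = [
    {"tournament_name": "X", "team_name": "team1", "twitter_name": "@one"},
    {"tournament_name": "Y", "team_name": "team1", "twitter_name": "@one"},
    {"tournament_name": "X", "team_name": "team2", "twitter_name": ""},
]
tournaments = group_by_tournament(flat)
```

This avoids any dependency between requests, at the cost of holding all team records in memory until the crawl finishes.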

Return data to the previous callback function?
