Question

I am scraping news article links from this page: https://time.com/search/?q=China%20COVID-19&page=1 I wrote code to collect the links from page 1 and page 2, but it only returns the articles from page 1. I don't know how to fix it so that it returns results from multiple pages.

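import requests
from time import time, sleep
from random import randint
from bs4 import BeautifulSoup as bs
from IPython.display import clear_output

def scrap(url):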
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    request = 0
    params = {
        'q': 'China%20COVID-19',
    }
    pagelinks = []

    myarticle = []
    for page_no in range(1,3):
        params['page'] = page_no
        response = requests.get(url=url,
                                headers=user_agent,
                                params=params) 

        # controlling the crawl-rate
        start_time = time()
        # pause the loop
        sleep(randint(8,15))
        # monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
        clear_output(wait = True)

        # parse the content
        soup_page = bs(response.text, 'lxml')
        # select all the articles for a single page
        containers = soup_page.findAll("article", {'class': 'partial tile media image-top margin-16-right search-result'})

        # scrape the links of the articles
        for i in containers:
            url = i.find('a')['href']
            pagelinks.append(url)
        print(pagelinks)

scrap('https://time.com/search/')

Any suggestions would be greatly appreciated!

Andrej Kesely · Answer 1 · May 27, 2020

Change the part of the code where you append to pagelinks (do not overwrite the variable url, which is reused for the request on the next iteration):

#scrape the links of the articles
for i in containers:
    pagelinks.append(i.find('a')['href'])

With this change, the script prints:

Request:1; Frequency: 838860.8 request/s
Request:2; Frequency: 1398101.3333333333 request/s
['https://time.com/5841895/global-coronavirus-battle/', 'https://time.com/5842256/world-health-organization-china-coronavirus-outbreak/', 'https://time.com/5826025/taiwan-who-trump-coronavirus-covid19/', 'https://time.com/5836611/china-superpower-reopening-coronavirus/', 'https://time.com/5783401/covid19-hubei-cases-classification/', 'https://time.com/5782633/covid-19-drug-remdesivir-china/', 'https://time.com/5778994/coronavirus-china-country-future/', 'https://time.com/5830420/trump-china-rivalry-coronavirus-intelligence/', 'https://time.com/5810493/coronavirus-china-united-states-governments/', 'https://time.com/5813628/china-coronavirus-statistics-wuhan/', 'https://time.com/5793363/china-coronavirus-covid19-abandoned-pets-wuhan/', 'https://time.com/5779678/li-wenliang-coronavirus-china-doctor-death/', 'https://time.com/5820389/africans-guangzhou-china-coronavirus-discrimination/', 'https://time.com/5824599/china-coronavirus-covid19-economy/', 'https://time.com/5784286/covid-19-china-plasma-treatment/', 'https://time.com/5796425/china-coronavirus-lockdown/', 'https://time.com/5825362/china-coronavirus-lawsuit-missouri/', 'https://time.com/5811222/wuhan-coronavirus-death-toll/']
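To illustrate why the original loop never reached page 2, here is a minimal sketch that swaps the network call for a stand-in function recording which URL each request would target. The names `fake_get`, `buggy`, `fixed` and `SAMPLE_LINKS` are hypothetical and not part of the original post, and the query parameters are reduced to `page` for brevity.

```python
from urllib.parse import urlencode

# Hypothetical stand-in for the links found on one results page.
SAMPLE_LINKS = ['https://time.com/5841895/global-coronavirus-battle/']

def fake_get(url, params):
    """Stand-in for requests.get: return the full URL that would be requested."""
    return f"{url}?{urlencode(params)}"

def buggy(url):
    requested = []
    pagelinks = []
    for page_no in range(1, 3):
        requested.append(fake_get(url, {'page': page_no}))
        for href in SAMPLE_LINKS:
            url = href  # overwrites the search URL used by the next request
            pagelinks.append(url)
    return requested

def fixed(url):
    requested = []
    pagelinks = []
    for page_no in range(1, 3):
        requested.append(fake_get(url, {'page': page_no}))
        for href in SAMPLE_LINKS:
            pagelinks.append(href)  # url is left untouched
    return requested

print(buggy('https://time.com/search/'))
print(fixed('https://time.com/search/'))
```

In the buggy version the second request goes to the last article URL instead of the search page, so no search-result containers are found there and no new links are added.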

Only first-page results were returned when iterating over pages
