Unable to find the element and scrape its content with BeautifulSoup
0 votes
/ May 27, 2020

I am scraping the body text of this page: https://time.com/5841895/global-coronavirus-battle/

First I used soup.find to locate the container; then I used find_all to get each paragraph inside it.

But I got this error message:


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-41-93c17c229d31> in <module>
----> 1 scrap('https://time.com/search/')

<ipython-input-40-db7010f17eac> in scrap(url)
     41                 #containerr = soup.find("div", class_=['article-content', 'karma-main-column'])
     42                 containerr = soup.find("div", {'class': 'padded'})
---> 43                 articletext = containerr.find_all('p')
     44                 thearticle = [] # clear from the previous loop
     45                 paragraphtext = [] # clear from the previous loop

AttributeError: 'NoneType' object has no attribute 'find_all'

I thought it might be because I picked the wrong container element, but I have tried several different elements and none of them work.

Here is my code:


def scrap(url):
    user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}
    request = 0
    params = {
        'q': 'China%20COVID-19',
    }
    pagelinks = []

    myarticle = []
    for page_no in range(1,3):
        params['page'] = page_no
        response = requests.get(url=url,
                                headers=user_agent,
                                params=params) 

        # controlling the crawl-rate
        start_time = time() 
        # pause the loop
        sleep(randint(8,15))
        # monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
        clear_output(wait = True)

        #parse the content
        soup_page = bs(response.text, 'lxml') 
        #select all the articles for a single page
        containers = soup_page.findAll("article", {'class': 'partial tile media image-top margin-16-right search-result'})
        #scrape the links of the articles
        for i in containers:
            pagelinks.append(i.find('a')['href'])

            for pagelink in pagelinks:
                #get page text
                page = requests.get(pagelink)
                #parse with BeautifulSoup
                soup = bs(page.text, 'lxml')
                #containerr = soup.find("div", class_=['article-content', 'karma-main-column'])
                containerr = soup.find("div", {'class': 'padded'})
                articletext = containerr.find_all('p')
                thearticle = [] # clear from the previous loop
                paragraphtext = [] # clear from the previous loop                        
                for paragraph in articletext:
                    #get the text only
                    text = paragraph.get_text()

                    paragraphtext.append(text)

                thearticle.append(paragraphtext)

            myarticle.append(thearticle)   

Any suggestions are welcome!

1 Answer

1 vote
/ May 27, 2020

The exception is probably caused by the request rather than by the parsing: when you use requests you need to specify a user agent. In your code the search request passes headers=user_agent, but the per-article call requests.get(pagelink) does not. In your case urllib will work just fine (I have also had problems with requests: sometimes it doesn't load the whole page).
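To make the failure mode concrete: soup.find returns None when no element matches, so calling find_all on its result raises the AttributeError from the question whenever the response is not the expected article page. Below is a minimal sketch, not tested against time.com itself. The helper names build_request and paragraphs_in are hypothetical; the user-agent string and the 'padded' class are copied from the question.

```python
import urllib.request
from bs4 import BeautifulSoup

# User-agent string copied from the question's code.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'


def build_request(url):
    """Hypothetical helper: a urllib request that carries an explicit user agent."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def paragraphs_in(html, container_class='padded'):
    """Hypothetical helper: paragraph texts inside the container, or [] if it is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', {'class': container_class})
    if container is None:
        # find() matched nothing: return an empty list instead of crashing on find_all
        return []
    return [p.get_text() for p in container.find_all('p')]


# Offline check with inline HTML (no network access needed)
good = '<div class="padded"><p>first</p><p>second</p></div>'
blocked = '<html><body><h1>Access denied</h1></body></html>'
print(paragraphs_in(good))
print(paragraphs_in(blocked))
```

In the loop from the question, an empty result would be a cue to inspect page.status_code or the returned HTML before parsing any further.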

Updating your code

Assuming you are scraping from a list of links:

from bs4 import BeautifulSoup
import urllib.request  # import the submodule explicitly; a bare "import urllib" does not guarantee it is loaded

pagelinks = ["https://time.com/5841895/global-coronavirus-battle/", "https://time.com/5842982/japan-arrest-anime-arson/"]
for url_page in pagelinks:
    # fetch the raw HTML with urllib instead of requests
    req = urllib.request.Request(url_page, data=None)
    with urllib.request.urlopen(req) as f:
        page = f.read().decode('utf-8')
    soup = BeautifulSoup(page, 'html.parser')

    # collect every <p> element on the page
    articletext = soup.find_all('p')
    text = str(articletext)
    print(text)

On each iteration this prints something like:

[<p>(Berlin) — New coronavirus cases in China fell to zero on Saturday for the first time but surged in India and overwhelmed hospitals across Latin America – both in countries lax about lockdowns and those lauded for firm, early confinement. The virus hit a reopened church in Germany and probably a restaurant, too.</p>, <p>The pandemic’s persistence stymied authorities struggling to keep people safe and revive their economies at the same time, disrupting Memorial Day weekend in the United States and collective celebrations around the Muslim world marking the end of the holy month of Ramadan.</p>, <p>Rain dampened the start of the holiday weekend in the northeastern U.S., where newly opened beaches had been expected to attract throngs of people and test the effectiveness of social distancing rules.</p>, <p>However, President Donald Trump visited one of his private golf clubs for the first time during pandemic — the Trump National Golf Club in northern Virginia. He has been <a href="https://time.com/5836607/reopening-risks-coronavirus/">pushing for states to fully reopen</a> months after closing businesses and outdoor venues to help slow the spread of the virus.</p>, <p>In countries with weak health care systems, impoverished populations and not enough clean water, fighting the virus is increasingly difficult.</p>, <p>“I’m a mother, if I don’t go out and sell, my children won’t have food to eat. 
I am obliged to go out and come here to sell products, despite the danger that we are in,” said Nagnouma Kante, a market vendor in Guinea’s capital Conakry.</p>, <p>Turkey imposed its toughest lockdown measures yet starting Saturday for the Eid al-Fitr holiday marking the end of Ramadan, and Yemen’s Houthi rebels urged believers to use masks and stay inside, as authorities try to contain infections at a time usually marked by days of multigenerational feasting and collective prayer.</p>, <p>Elsewhere, many governments are easing restrictions as they face a political backlash and historic recessions brought on by the battle against the virus. In just a few months, the pandemic has killed at least 338,000 people worldwide and infected more than 5.2 million, according to a tally kept by Johns Hopkins University.</p>, <p>In <a href="https://time.com/5812555/germany-coronavirus-deaths/">Germany, which has drawn praise for its handling of the virus</a>, seven people appear to have been infected at a restaurant in the northwest of the country. It would be the first known such case since restaurants started reopening two weeks ago.</p>, <p>And in the southwestern city of Frankfurt, more than 40 people tested positive after a church service of the Evangelical Christian Baptist congregation on May 10. The city’s health office said one is hospitalized.</p>, <p>A church leader said the community had complied with all hygiene rules but has canceled all gatherings and is now holding services online. 
Authorities in nearby Hanau decided to call off Muslim prayers planned for a stadium Sunday as a precaution.</p>, <p>The new infections are not seen as a threat to Germany’s overall virus strategy, and Chancellor Angela Merkel said the country had “succeeded so far in achieving the aim of preventing our health system being overwhelmed.”</p>, <p>Religious events helped spread the virus early in the pandemic, and <a href="https://time.com/5837693/should-churches-reopen-thinking-about-exile/">resuming gatherings of the faithful is an especially thorny issue</a>.</p>, <p>Mindful of evangelical Christians who are key to his support base ahead of November’s election, Trump on Friday labeled houses of worship as “essential” and called on governors to let them reopen this weekend.</p>, <p>France allowed religious services to resume starting Saturday after a legal challenge to the government’s ban on gatherings in places of worship.</p>, <p>One of the world’s major pilgrimage sites is reopening Sunday: the Church of the Holy Sepulcher in Jerusalem, built on the site where Christians believe Jesus was crucified, buried and resurrected.</p>, <p>Latin America is the latest epicenter of the virus, and experts note the limits of government action in a region where millions have informal jobs and many police forces are weak or corrupt and unable to enforce restrictions.</p>, <p><a href="https://time.com/5840208/brazil-coronavirus/">Brazil</a> and Mexico reported record numbers of infections and deaths almost daily this week, fueling criticism of their presidents for limited lockdowns. But infections also rose and intensive care units were swamped in Peru, Chile and Ecuador, all countries lauded for imposing early and aggressive business shutdowns and quarantines.</p>, <p>In the U.S., some regions are opening more quickly than others. 
California is preparing its wineries for visitors next week, and Las Vegas casinos could reopen June 4.</p>, <p>New Yorkers were offered an unexpected reprieve when Gov. Andrew Cuomo eased the virus-ravaged state’s ban on gatherings in time for the Memorial Day weekend, when Americans honor fallen military service members, hold picnics and head outdoors on what’s traditionally seen as the kickoff to summer.</p>, <p>Some families plan to go to beaches or national parks for the first time since the virus hit, and Interior Secretary David Bernhardt is scheduled to visit the Grand Canyon on Saturday.</p>, <p>The U.S. has been the hardest-hit country, with more than 96,000 deaths among 1.6 million confirmed cases, followed by Russia and Brazil, according to the Johns Hopkins count.</p>, <p>One sign of hope emerged Saturday: China, where the outbreak began late last year, reported no new confirmed cases for the first time.</p>, <p>As Japan reopens, guidelines were released for bar hostesses and other nightlife workers to wear masks, gargle every 30 minutes and disinfect karaoke microphones after each use. South Korea reopened then shut down thousands of clubs after more than <a href="https://time.com/5834991/south-korea-coronavirus-nightclubs/">200 recent infections were linked to clubgoers in Seoul.</a></p>, <p><a href="https://time.com/5812394/india-coronavirus-lockdown-modi/">Concerns are rising in India</a>, where new cases showed another record jump Saturday, topping 6,000 for a second consecutive day as a two-month lockdown has eased. 
States with relatively few cases have seen spikes in recent days as residents, including migrant workers traveling on special trains, have returned home.</p>, <p>While some countries are facing a second wave of infections, badly hit <a href="https://time.com/5836890/russia-coronavirus/">Russia is still struggling with its first</a>, and reported more than 9,000 new daily cases Saturday.</p>, <p>___</p>, <p>Charlton reported from Paris and Kageyama from Tokyo. Associated Press writers around the world contributed.</p>, <p class="author-feedback-text"><strong>Contact us</strong> at <a href="mailto:editors@time.com?subject=(READER FEEDBACK) As New China Cases of COVID-19 Drop to Zero, Infections Skyrocket in India and Latin America" rel="noopener noreferrer" target="_self">editors@time.com</a>.</p>]
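Because soup.find_all('p') collects every paragraph on the page, the output above also includes the trailing "Contact us" paragraph with class author-feedback-text. Below is a sketch of dropping it, assuming that class (taken from the output above) is the only boilerplate to exclude. The helper name article_paragraphs is hypothetical, and the inline HTML is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup


def article_paragraphs(html, skip_class='author-feedback-text'):
    """Hypothetical helper: <p> elements whose class list does not contain skip_class."""
    soup = BeautifulSoup(html, 'html.parser')
    # p.get('class', []) is the list of classes, or [] when the tag has no class attribute
    return [p for p in soup.find_all('p') if skip_class not in p.get('class', [])]


sample = (
    '<p>One sign of hope emerged Saturday.</p>'
    '<p class="author-feedback-text"><strong>Contact us</strong> at editors@time.com.</p>'
)
for p in article_paragraphs(sample):
    print(p.get_text())
```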

You can store the text of each article in another list, so that you can process the data and extract what you need; for example:

Extracting the text

from bs4 import BeautifulSoup
import urllib.request

articles_list = []
pagelinks = ["https://time.com/5841895/global-coronavirus-battle/", "https://time.com/5842982/japan-arrest-anime-arson/"]
for url_page in pagelinks:
    req = urllib.request.Request(url_page, data=None)
    with urllib.request.urlopen(req) as f:
        page = f.read().decode('utf-8')
    soup = BeautifulSoup(page, 'html.parser')

    # keep the <p> elements of this article for later processing
    articletext = soup.find_all('p')
    articles_list.append(articletext)

    text = str(articletext)
    print(text)

# extract the text in each article
for article in articles_list:
    for element in article:
        print(element.get_text())
...
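If one string per article is more convenient than printing paragraph by paragraph, the paragraph texts can be joined. This is a sketch under the assumption that articles_list has the shape built above (a list of find_all('p') results). The inline HTML stands in for downloaded pages, and to_plain_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def to_plain_text(paragraphs):
    """Hypothetical helper: join paragraph texts into one newline-separated string."""
    return '\n'.join(p.get_text().strip() for p in paragraphs)


# Inline stand-ins for two downloaded article pages
pages = [
    '<p>First article, paragraph one.</p><p>Paragraph <a href="#">two</a>.</p>',
    '<p>Second article.</p>',
]
articles_list = [BeautifulSoup(html, 'html.parser').find_all('p') for html in pages]
texts = [to_plain_text(article) for article in articles_list]
print(texts)
```

Each entry of texts then holds one article as plain text, with links flattened to their visible text.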