Parsing news from an RSS feed with Python 3.7: I am not getting the exact data. Help me extract the right values
1 vote
/ June 19, 2020

I am trying to fetch news from an RSS feed, but I am not getting the exact information. I am using requests and BeautifulSoup for this. Each entry in the feed looks like the following item.

<item>
 <title>
  US making very good headway in respect to Covid-19 vaccines: Donald Trump
 </title>
 <description>
  <a href="https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms"><img border="0" hspace="10" align="left" style="margin-top:3px;margin-right:5px;" src="https://timesofindia.indiatimes.com/photo/76399892.cms" /></a>Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.
 </description>
 <link>
  https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms
 </link>
 <guid>
  https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms
 </guid>
 <pubDate>
  Mon, 15 Jun 2020 22:11:06 PT
 </pubDate>
</item>

Here is the code in question:

import requests
from bs4 import BeautifulSoup


def timesofindiaNews():
    URL = 'https://timesofindia.indiatimes.com/rssfeeds_us/72258322.cms'

    page = requests.get(URL)
    soup = BeautifulSoup(page.content, features='xml')

    # print(soup.prettify())

    news_elems = soup.find_all('item')
    news = []
    print(news_elems[0].prettify())
    for news_elem in news_elems:

        title = news_elem.title.text
        news_description = news_elem.description.text
        image = news_elem.description.img
        # news_date = news_elem.pubDate.text
        news_link = news_elem.link.text

I need the description text from the <description> tag, but it also contains extra markup, such as the <a> and <img> elements, which I do not want in the description. The code above gives the following result.

    {
      "image": null,
      "news_description": "<a href=\"https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms\"><img border=\"0\" hspace=\"10\" align=\"left\" style=\"margin-top:3px;margin-right:5px;\" src=\"https://timesofindia.indiatimes.com/photo/76399892.cms\" /></a>Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.",
      "news_link": "https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms",
      "source": "trucknews",
      "title": "US making very good headway in respect to Covid-19 vaccines: Donald Trump"
    }

Expected result:

    {
      "image": "image/link/from/the/description",
      "news_description": "Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.",
      "news_link": "https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms",
      "source": "trucknews",
      "title": "US making very good headway in respect to Covid-19 vaccines: Donald Trump"
    }

1 Answer

1 vote
/ June 19, 2020

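Inside the description, the < and > characters are replaced with &lt; and &gt;, so the embedded HTML is plain text rather than markup. That is why I use formatter=None and then parse the result again to work with it. See news_description. I think this gives you the result you want. You can try: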

import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}


def timesofindiaNews():
    URL = 'https://timesofindia.indiatimes.com/rssfeeds_us/72258322.cms'

    page = requests.get(URL,headers=headers)
    soup = BeautifulSoup(page.text, 'xml')

    # print(soup.prettify())

    news_elems = soup.find_all('item')
    news = []
    # print(news_elems[0].prettify())
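    for news_elem in news_elems:

        title = news_elem.title.text
        n_description = news_elem.description
        # formatter=None writes the description without escaping < and >,
        # so the embedded HTML can be parsed again as markup.
        store = n_description.prettify(formatter=None)
        sp = BeautifulSoup(store, 'xml')
        # The text node right after the <a> element is the description itself.
        news_description = sp.find("a").next_sibling
        print(news_description)
        # Read the image URL from the re-parsed markup (None if there is no <img>).
        img = sp.find("img")
        image = img["src"] if img else None
        # news_date = news_elem.pubDate.text
        news_link = news_elem.link.text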


timesofindiaNews()

The output will be:

Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.

The proposed suspension could extend into the government's new fiscal year beginning October 1, when many new visas are issued, The Wall Street Journal reported on Thursday, quoting unnamed administration officials.

The team of researchers at the University of Georgia (UGA) in the US noted that the SARS-CoV-2 protein PLpro is essential for the replication and the ability of the virus to suppress host immune function.

After two weeks of protests over the death of George Floyd, hundreds of New Yorkers took to the streets again calling for reform in law enforcement and the withdrawal of police department funding.

Indian-origin California Senator Kamala Harris has joined former vice president and 2020 Democratic presidential nominee Joe Biden to raise USD 3.5 million for the upcoming November elections.


and so on....
...
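
As a possible alternative to re-parsing with prettify(formatter=None), the sketch below splits the description string directly, using only the standard library. It assumes the description arrives as an HTML fragment in a plain string, as in the question's output for news_elem.description.text. The names DescriptionSplitter and split_description are hypothetical helpers introduced here for illustration, not part of requests or BeautifulSoup.

```python
from html.parser import HTMLParser


class DescriptionSplitter(HTMLParser):
    """Collect the first <img> src and the visible text of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.image = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are routed here as well.
        if tag == "img" and self.image is None:
            self.image = dict(attrs).get("src")

    def handle_data(self, data):
        # Everything outside the tags is description text.
        self.text_parts.append(data)


def split_description(raw_html):
    """Return (image_url_or_None, plain_text) for a description fragment."""
    parser = DescriptionSplitter()
    parser.feed(raw_html)
    parser.close()  # flush any text still buffered by the parser
    text = " ".join("".join(parser.text_parts).split())
    return parser.image, text


# Sample fragment copied from the item shown in the question.
sample = (
    '<a href="https://timesofindia.indiatimes.com/international/us/'
    'us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/'
    'articleshow/76399892.cms"><img border="0" hspace="10" align="left" '
    'style="margin-top:3px;margin-right:5px;" '
    'src="https://timesofindia.indiatimes.com/photo/76399892.cms" /></a>'
    'Washington, Jun 16 () The United States is making very good headway '
    'in respect to vaccines for the coronavirus pandemic and also '
    'therapeutically, President Donald Trump has said.'
)

image, news_description = split_description(sample)
print(image)
print(news_description)
```

Inside the loop, this helper could be called as image, news_description = split_description(news_elem.description.text). Unlike the sp.find("a") lookup, it does not depend on the fragment being well-formed XML, and a description with no markup at all yields None for the image instead of raising an error.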