How do I get only the news article text from HTML?
0 votes
/ February 19, 2019

I'm fairly new to programming. I only need the news article itself. Is there a simple way to strip the unwanted HTML from the text? I also need to loop over several links and then run sentiment analysis on them.

import requests
from bs4 import BeautifulSoup

p = 'https://www.moneycontrol.com/news/business/earnings/cadila-health-consolidated-december-2018-net-sales-at-rs-3577-90-crore-up-9-77-y-o-y-3497711.html'
html = requests.get(p)
soup1 = BeautifulSoup(html.text, 'html.parser')
# Block with the "Last Updated" date
date = soup1.find_all("div", {"class": "arttidate"})
print(date)
# Article container (a div, as the output below shows)
article = soup1.find_all("div", {"class": "arti-flow"})
print(article)

The output is as follows:

[<div class="arttidate ">Last Updated: Feb 07, 2019 03:05 PM IST | Source: <span>Moneycontrol.com</span></div>]
[<div class="arti-flow" id="article-main">
<!-- .CONTENT BODY -->
<p><div class="top_dis" id="div_app_container"><b>Reported Consolidated quarterly numbers for Cadila Healthcare are:</b></div></p>
<p>Net Sales at Rs 3,577.90 crore in December 2018 up 9.77% from Rs. 3,259.60 crore in December 2017.</p>
<p>Quarterly Net Profit at Rs. 510.70 crore in December 2018 down 6% from Rs. 543.30 crore in December 2017.</p>
<div class="ads-320-250 show-moblie mid-arti-ad"><div id="Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_300x250_Middle_2">
<script type="text/javascript">
    var width = window.innerWidth || document.documentElement.clientWidth;
    adKey = "Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_300x250_Middle_2";
    if (width >= 768 && adKey.indexOf("Moneycontrol") != -1 && adKey.indexOf("Moneycontrol_Mobile_WAP") < 0) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_300x250_Middle_2")
        });
    }

    if (width <= 768 && adKey.indexOf("Moneycontrol_Mobile_WAP") != -1) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_300x250_Middle_2")
        });
    }
</script>
</div></div>
<div class="hide-moblie mid-arti-ad"><div id="Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_OutStream">
<script type="text/javascript">
    var width = window.innerWidth || document.documentElement.clientWidth;
    adKey = "Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_OutStream";
    if (width >= 768 && adKey.indexOf("Moneycontrol") != -1 && adKey.indexOf("Moneycontrol_Mobile_WAP") < 0) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_OutStream")
        });
    }

    if (width <= 768 && adKey.indexOf("Moneycontrol_Mobile_WAP") != -1) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol_Mobile_WAP/MC_WAP_News/MC_WAP_News_Internal_OutStream")
        });
    }
</script>
</div></div>
<script>
    date = new Date();
    date.setTime(date.getTime() + (1 * 24 * 60 * 60 * 1000));
    $.cookie("dfp_cookie_article", "Y1", {
        expires: date,
        path: "/",
        domain: ".moneycontrol.com"
    });
</script>
<p>EBITDA stands at Rs. 870.90 crore in December 2018 down 1.29% from Rs. 882.30 crore in December 2017.</p>
<div class="hide-moblie mid-arti-ad"><div id="Moneycontrol/MC_News/MC_News_Internal_Article_Native">
<script type="text/javascript">
    var width = window.innerWidth || document.documentElement.clientWidth;
    adKey = "Moneycontrol/MC_News/MC_News_Internal_Article_Native";
    if (width >= 768 && adKey.indexOf("Moneycontrol") != -1 && adKey.indexOf("Moneycontrol_Mobile_WAP") < 0) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol/MC_News/MC_News_Internal_Article_Native")
        });
    }

    if (width <= 768 && adKey.indexOf("Moneycontrol_Mobile_WAP") != -1) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol/MC_News/MC_News_Internal_Article_Native")
        });
    }
</script>
</div></div>
<div class="hide-moblie mid-arti-ad"><div id="Moneycontrol/MC_News/MC_News_Internal_OutStream">
<script type="text/javascript">
    var width = window.innerWidth || document.documentElement.clientWidth;
    adKey = "Moneycontrol/MC_News/MC_News_Internal_OutStream";
    if (width >= 768 && adKey.indexOf("Moneycontrol") != -1 && adKey.indexOf("Moneycontrol_Mobile_WAP") < 0) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol/MC_News/MC_News_Internal_OutStream")
        });
    }

    if (width <= 768 && adKey.indexOf("Moneycontrol_Mobile_WAP") != -1) {
        googletag.cmd.push(function() {
            googletag.display("Moneycontrol/MC_News/MC_News_Internal_OutStream")
        });
    }
</script>
</div></div>
<script>
    date = new Date();
    date.setTime(date.getTime() + (1 * 24 * 60 * 60 * 1000));
    $.cookie("dfp_cookie_article", "Y1", {
        expires: date,
        path: "/",
        domain: ".moneycontrol.com"
    });
</script>
<p>Cadila Health EPS has decreased to Rs. 4.99 in December 2018 from Rs. 5.31 in December 2017.</p>
<p>Cadila Health shares closed at 317.95 on February 06, 2019 (NSE) and has given -16.39% returns over the last 6 months and -21.40% over the last 12 months.</p>
</div>]

The desired output would be: Net Sales at Rs 3,577.90 crore in December 2018 up 9.77% from Rs. 3,259.60 crore in December 2017.

Quarterly Net Profit at Rs. 510.70 crore in December 2018 down 6% from Rs. 543.30 crore in December 2017. EBITDA stands at Rs. 870.90 crore in December 2018 down 1.29% from Rs. 882.30 crore in December 2017. Cadila Health EPS has decreased to Rs. 4.99 in December 2018 from Rs. 5.31 in December 2017.

Cadila Health shares closed at 317.95 on February 06, 2019 (NSE) and has given -16.39% returns over the last 6 months and -21.40% over the last 12 months.

Edit: while writing out this output I realized that all the news text I need is inside "p" tags, so I would need to capture the news article into another object and read only its "p" tags. Can anyone help me with how to go about doing this?

Answers [ 2 ]

0 votes
/ February 19, 2019

You can also get the article in JSON format from the <script> tags.

import requests
import bs4
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

p = 'https://www.moneycontrol.com/news/business/earnings/cadila-health-consolidated-december-2018-net-sales-at-rs-3577-90-crore-up-9-77-y-o-y-3497711.html'
html = requests.get(p, headers=headers)
soup1 = bs4.BeautifulSoup(html.text, 'html.parser')
date = soup1.find_all("div", {"class": "arttidate"})
print(date)
# The page embeds article metadata as JSON-LD inside <script> tags
scripts = soup1.find_all("script", {'type': 'application/ld+json'})

article = None  # stays None if no script contains an articleBody

for script in scripts:
    if "articleBody" in script.text:
        jsonStr = script.text.strip()
        # strict=False tolerates control characters (raw newlines) inside strings
        jsonObj = json.loads(jsonStr, strict=False)
        article = jsonObj[0]['articleBody']
        break

print(article)

Output:

Reported Consolidated quarterly numbers for Cadila Healthcare are:

Net Sales at Rs 3,577.90 crore in December 2018 up 9.77% from Rs. 3,259.60 crore in December 2017.

Quarterly Net Profit at Rs. 510.70 crore in December 2018 down 6% from Rs. 543.30 crore in December 2017.

EBITDA stands at Rs. 870.90 crore in December 2018 down 1.29% from Rs. 882.30 crore in December 2017.

Cadila Health EPS has decreased to Rs. 4.99 in December 2018 from Rs. 5.31 in December 2017.

Cadila Health shares closed at 317.95 on February 06, 2019 (NSE) and has given -16.39% returns over the last 6 months and -21.40% over the last 12 months.

Cadila Healthcare
Consolidated Quarterly Results (in Rs. Cr.) | Dec'18 | Sep'18 | Dec'17
Net Sales/Income from operations | 3,516.10 | 2,844.10 | 3,191.80
Other Operating Income | 61.80 | 117.10 | 67.80
Total Income From Operations | 3,577.90 | 2,961.20 | 3,259.60
EXPENDITURE
Consumption of Raw Materials | 590.50 | 658.30 | 661.00
Purchase of Traded Goods | 620.50 | 465.10 | 495.90
Increase/Decrease in Stocks | 141.20 | -131.50 | -32.30
Power & Fuel | -- | -- | --
Employees Cost | 524.00 | 521.20 | 460.80
Depreciation | 153.70 | 147.50 | 147.30
Excise Duty | -- | -- | --
Admin. And Selling Expenses | -- | -- | --
R & D Expenses | -- | -- | --
Provisions And Contingencies | -- | -- | --
Exp. Capitalised | -- | -- | --
Other Expenses | 861.80 | 760.30 | 833.00
P/L Before Other Inc., Int., Excpt. Items & Tax | 686.20 | 540.30 | 693.90
Other Income | 31.00 | 30.40 | 41.10
P/L Before Int., Excpt. Items & Tax | 717.20 | 570.70 | 735.00
Interest | 45.50 | 35.70 | 13.50
P/L Before Exceptional Items & Tax | 671.70 | 535.00 | 721.50
Exceptional Items | -- | -- | --
P/L Before Tax | 671.70 | 535.00 | 721.50
Tax | 158.60 | 124.70 | 178.60
P/L After Tax from Ordinary Activities | 513.10 | 410.30 | 542.90
Prior Year Adjustments | -- | -- | --
Extra Ordinary Items | -- | -- | --
Net Profit/(Loss) For the Period | 513.10 | 410.30 | 542.90
Minority Interest | -10.90 | -10.70 | -10.10
Share Of P/L Of Associates | 8.50 | 17.90 | 10.50
Net P/L After M.I & Associates | 510.70 | 417.50 | 543.30
Equity Share Capital | 102.40 | 102.40 | 102.40
Reserves Excluding Revaluation Reserves | -- | -- | --
Equity Dividend Rate (%) | -- | -- | --
EPS Before Extra Ordinary
Basic EPS | 4.99 | 4.08 | 5.31
Diluted EPS | 4.99 | 4.08 | 5.31
EPS After Extra Ordinary
Basic EPS | 4.99 | 4.08 | 5.31
Diluted EPS | 4.99 | 4.08 | 5.31
Public Share Holding
No Of Shares (Crores) | -- | -- | --
Share Holding (%) | -- | -- | --
Promoters and Promoter Group Shareholding
a) Pledged/Encumbered
- Number of shares (Crores) | -- | -- | --
- Per. of shares (as a % of the total sh. of prom. and promoter group) | -- | -- | --
- Per. of shares (as a % of the total Share Cap. of the company) | -- | -- | --
b) Non-encumbered
- Number of shares (Crores) | -- | -- | --
- Per. of shares (as a % of the total sh. of prom. and promoter group) | -- | -- | --
- Per. of shares (as a % of the total Share Cap. of the company) | -- | -- | --
Source : Dion Global Solutions Limited
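
The lookup `jsonObj[0]['articleBody']` assumes the JSON-LD payload is a list whose first element holds the article. As a rough sketch of a more forgiving variant, the helper below accepts either a single object or a list and skips scripts that are not valid JSON. The function name `article_body_from_jsonld` and the sample strings are made up for illustration; on the real page the strings would come from the `scripts` list above, for example `[s.get_text() for s in scripts]`.

```python
import json


def article_body_from_jsonld(script_texts):
    """Return the first "articleBody" found in JSON-LD script contents, else None."""
    for text in script_texts:
        try:
            # strict=False tolerates raw control characters (e.g. newlines) in strings
            data = json.loads(text.strip(), strict=False)
        except json.JSONDecodeError:
            continue  # not valid JSON, try the next script
        # JSON-LD may be a single object or a list of objects
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and "articleBody" in item:
                return item["articleBody"]
    return None


# Hypothetical stand-ins for the text of the page's <script> tags
sample_scripts = [
    '{"@type": "WebSite", "name": "Example"}',
    '[{"@type": "NewsArticle", "articleBody": "Net Sales at Rs 3,577.90 crore."}]',
]
print(article_body_from_jsonld(sample_scripts))
```

As the output above shows, on this page the `articleBody` string also carries the results table, so the prose paragraphs would still need to be separated from it before sentiment analysis.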
0 votes
/ February 19, 2019

I think you just need the text inside the <p> tags. To do that, find all the <p> tags and call get_text() on each of them:

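import requests
from bs4 import BeautifulSoup

p = 'https://www.moneycontrol.com/news/business/earnings/cadila-health-consolidated-december-2018-net-sales-at-rs-3577-90-crore-up-9-77-y-o-y-3497711.html'

html = requests.get(p)
soup1 = BeautifulSoup(html.text, 'html.parser')

para = soup1.find_all('p')

# Collect the text of every paragraph; the loop variable is named tag
# so that it does not overwrite the URL stored in p
result = []
for tag in para:
    result.append(tag.get_text())

print(result)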

The output will be:

['Reported Consolidated quarterly numbers for Cadila Healthcare are:',
 'Net Sales at Rs 3,577.90 crore in December 2018 up 9.77% from Rs. 3,259.60 '
 'crore in December 2017.',
 'Quarterly Net Profit at Rs. 510.70 crore in December 2018 down 6% from Rs. '
 '543.30 crore in December 2017.',
 'EBITDA stands at Rs. 870.90 crore in December 2018 down 1.29% from Rs. '
 '882.30 crore in December 2017.',
 'Cadila Health EPS has decreased to Rs. 4.99 in December 2018 from Rs. 5.31 '
 'in December 2017.',
 'Cadila Health shares closed at 317.95 on February 06, 2019 (NSE) and has '
 'given -16.39% returns over the last 6 months and -21.40% over the last 12 '
 'months.',
 'Podcast | NSE Invest O Cast episode 5: Harsh Roongta on the benefits of SIP',
 ' Copyright © e-Eighteen.com Ltd. All rights reserved. Reproduction of news '
 'articles, photos, videos or any other content in whole or in part in any '
 'form \r\n'
 '        or medium without express writtern permission of moneycontrol.com is '
 'prohibited.',
 '\n'
 ' Copyright © e-Eighteen.com Ltd All rights resderved. Reproduction of news '
 'articles, photos, videos or any other content in whole or in part in any '
 'form or medium without express writtern permission of moneycontrol.com is '
 'prohibited.\r\n'
 '\t\t']

Finally, you can skip some of these entries or filter them with a regular expression.
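
One possible way to do that filtering is sketched below. The `BOILERPLATE` pattern and the helper name `keep_article_paragraphs` are assumptions chosen to match the unwanted entries in the output above (the podcast teaser and the copyright notices); other pages will likely need different patterns.

```python
import re

# Assumed patterns for paragraphs that are not part of the article;
# adjust them to the boilerplate the target pages actually contain.
BOILERPLATE = re.compile(r"^\s*(Copyright\b|Podcast\s*\|)", re.IGNORECASE)


def keep_article_paragraphs(paragraphs):
    """Normalize whitespace and drop empty or boilerplate paragraphs."""
    cleaned = []
    for text in paragraphs:
        text = " ".join(text.split())  # collapse \r\n, tabs and repeated spaces
        if text and not BOILERPLATE.match(text):
            cleaned.append(text)
    return cleaned


# Shortened stand-in for the `result` list printed above
result = [
    "Net Sales at Rs 3,577.90 crore in December 2018 up 9.77% from Rs. 3,259.60 crore in December 2017.",
    "Podcast | NSE Invest O Cast episode 5: Harsh Roongta on the benefits of SIP",
    "\n Copyright \u00a9 e-Eighteen.com Ltd All rights resderved.\r\n\t\t",
]
article_text = " ".join(keep_article_paragraphs(result))
print(article_text)
```

Another option, given that the question's output shows the article inside a div with class "arti-flow", would be to call find_all('p') on that div rather than on the whole page, so the footer paragraphs are never collected in the first place.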
