You can pull the JSON out of the script tag and work with it:
import requests
from bs4 import BeautifulSoup
import json

url = "https://seekingalpha.com/article/4253393-boeing-bear-wakens"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")

# Find the script tag that embeds the page data as a JSON object
for script in soup(["script"]):
    if 'window.SA = ' in script.text:
        jsonStr = script.text.split('window.SA = ')[1]
        jsonStr = jsonStr.rsplit(';', 1)[0]
        jsonObj = json.loads(jsonStr)

title = jsonObj['pageConfig']['Data']['article']['title']
print(title)
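To see why the two splits recover valid JSON, here is the same string surgery on a tiny made-up script payload (the contents of `script_text` and the "Demo" title are assumptions for illustration, not the site's actual data):

```python
import json

# Hypothetical contents of the matching <script> tag (shape assumed for illustration)
script_text = 'window.SA = {"pageConfig": {"Data": {"article": {"title": "Demo"}}}};'

# Drop everything up to and including the assignment, then strip the trailing ';'
json_str = script_text.split('window.SA = ')[1]
json_str = json_str.rsplit(';', 1)[0]

obj = json.loads(json_str)
print(obj['pageConfig']['Data']['article']['title'])  # Demo
```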
There is a lot of information in there. And to get the article:
article = soup.find('div', {'itemprop': 'articleBody'})
ps = article.find_all('p', {'class': 'p p1'})
for para in ps:
    print(para.text)
Output:
The Boeing Bear Wakens
Article:
With the Boeing (NYSE:BA) 737 MAX fleet being grounded and deliveries to customers being halted, Boeing is feeling the heat from two sides. While insurers have part of the damages covered, it is unlikely that a multi-month grounding will be fully covered. Initially, it seemed that Boeing was looking for a relatively fast fix to minimize disruptions as it was relatively quick with presenting a fix to stakeholders. Based on that quick roll-out, it seemed that Boeing was looking to have the fleet back in the air within 3 months. However, as the fix got delayed and Boeing and the FAA came under international scrutiny, it seems that timeline has slipped significantly as additional improvements are to be made. Initially, I expected that Boeing would be cleared to send the 737 MAX back to service in June/July, signalling a 3-4-month grounding and expected that Boeing's delivery target for the full year would decline by 40 units.
Source: Everett Herald
On the 5th of April, Boeing announced that it would be reducing the production rate for the Boeing 737 temporarily, which is a huge decision:
As we continue to work through these steps, we're adjusting the 737 production system temporarily to accommodate the pause in MAX deliveries, allowing us to prioritize additional resources to focus on software certification and returning the MAX to flight. We have decided to temporarily move from a production rate of 52 airplanes per month to 42 airplanes per month starting in mid-April.
You can also get the json response for the comments:
url = 'https://seekingalpha.com/account/ajax_get_comments?id=4253393&type=Article&commentType=topLiked'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
jsonObj_comments = requests.get(url, headers=headers).json()
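The exact layout of that comments JSON can change, so rather than hard-coding a chain of keys, one defensive option is to walk the whole structure and collect every value stored under a likely key. The `'content'` key name and the sample data below are assumptions for illustration, not the site's documented schema:

```python
def find_values(obj, key):
    """Recursively collect every value stored under `key` in nested dicts/lists."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                found.append(v)
            found.extend(find_values(v, key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_values(item, key))
    return found

# Made-up sample mimicking a nested comments payload
sample = {"comments": {"101": {"content": "Great analysis",
                               "replies": [{"content": "Agreed"}]}}}
print(find_values(sample, "content"))  # ['Great analysis', 'Agreed']
```

The same helper would be applied to the real response, e.g. `find_values(jsonObj_comments, 'content')`, adjusting the key name once you have inspected the payload.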
As for a general approach, that will be difficult, since every website has its own structure, formats, use of tags and attribute names, and so on. However, I noticed that both sites you provided use the <p> tag for their articles, so I suppose you could pull the text from those tags. With a generic approach, though, you get generic output: you may end up with too much text, or with pieces of the article missing.
import requests
from bs4 import BeautifulSoup

url1 = "https://seekingalpha.com/article/4253393-boeing-bear-wakens"
url2 = "https://www.dqindia.com/accenture-helps-del-monte-foods-unlock-innovation-drive-business-growth-cloud/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url1, headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")

# Generic scrape: grab every <p> on the page
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)
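Since a bare `find_all('p')` also picks up footers, share links, and other boilerplate, one cheap way to trim the noise is a minimum-length filter. This is a sketch, not a robust extractor, and the 10-word threshold is an arbitrary assumption you would tune per site:

```python
def filter_paragraphs(texts, min_words=10):
    """Keep only paragraphs with at least `min_words` words (boilerplate heuristic)."""
    return [t for t in texts if len(t.split()) >= min_words]

# Made-up sample mixing widget text with a real paragraph
sample = [
    "Subscribe now",
    "Share",
    "Boeing announced it would reduce the 737 production rate from 52 to 42 airplanes per month starting in mid-April.",
]
print(filter_paragraphs(sample))  # only the long paragraph survives
```

On the real pages you would pass `[p.text for p in paragraphs]` through this filter instead of printing every tag.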