This is a good site. If you look closely at how the site loads its data and view the page source, you will see that all the data is stored inside a script tag as a JavaScript object. It is not JSON. If you extract the data from that script, you get all the article links, images, and so on.
Because it is a JavaScript object, you need a third-party library to convert it to JSON. I used the demjson library for this: https://github.com/dmeranda/demjson
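To see why a plain JSON parser is not enough here, the small sketch below feeds a made-up JavaScript object literal (unquoted keys, single-quoted strings, a trailing comma) to Python's standard json module, which rejects it. The sample string and the helper name is_valid_json are invented for illustration and are not taken from the CNN page.

```python
import json

# Hypothetical JavaScript object literal; not taken from the CNN page.
js_object = "{vertical: 'politics', hasVideo: false,}"


def is_valid_json(text):
    """Return True if the standard json module can parse the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# The JavaScript literal is rejected, the strict JSON equivalent is accepted.
print(is_valid_json(js_object))
print(is_valid_json('{"vertical": "politics", "hasVideo": false}'))
```

A more lenient parser such as demjson is what bridges that gap.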
The script below saves the data to a JSON file. Once you have the JSON, getting all the links is straightforward.
import json

import demjson
import requests
from bs4 import BeautifulSoup

res = requests.get("https://edition.cnn.com/politics")
soup = BeautifulSoup(res.text, "html.parser")

# Find the script tag that holds the page data.
script = None
for i in soup.find_all("script"):
    if "window.CNN" in i.text:
        script = i.get_text(strip=True)

if script is None:
    print("No data found")
else:
    # Keep only the text between "CNN.contentModel" and "FAVE.settings".
    data = script.partition("CNN.contentModel")[-1].partition("FAVE.settings")[0]
    # Decode from the first "{" onward, dropping the trailing character.
    json_data = demjson.decode(data[data.index('{'):-1])
    with open("data.json", "w") as f:
        json.dump(json_data, f)
Output:
{
"hasVideo": false,
"layout": "no-rail",
"vertical": "politics",
"sectionName": "politics",
"pageType": "section",
"env": "prod",
"type": "page",
"analytics": {
"pageTop": {},
"headline": "",
"author": "",
"showName": "",
"subSectionName": "",
"isArticleVideoCollection": false,
"publishDate": "2014-02-27T01:35:32Z",
"lastUpdatedDate": "2020-08-06T09:31:15Z",
"pageBranding": "10-minute-preview",
"cep_topics": {
"brsf": [],
"buzz": [],
"iabt": [],
"sent": [
"16B6"
],
"tags": [],
"shortSource": "se_politics",
"source": "section_politics"
},
"chartbeat": {
"sections": ""
},
"branding_content_page": "10-minute-preview",
"branding_content_zone": [
"default"
],
"branding_content_container": [
"default"
],
"branding_content_card": [
""
]
},
"edition": "international",
"sourceId": "section_politics",
"title": "CNNPolitics - Political News, Analysis and Opinion",
"siblings": {
"articleList": [
{
"uri": "/2020/08/06/politics/donald-trump-mail-in-voting-election/index.html",
"headline": "Trump's mail-in voting falsehoods are part of a wide campaign to discredit the election",
"thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200805203446-02-donald-trump-0805-small-11.jpg",
"duration": "",
"description": "<a href=\"http://www.cnn.com/specials/politics/president-donald-trump-45\" target=\"_blank\">President Donald Trump's</a> barrage of <a href=\"http://www.cnn.com/2020/08/05/politics/fact-check-trump-fox-friends-pandemic-biden-protests/index.html\" target=\"_blank\">challenges to the reputation, structures and traditions</a> of elections is conjuring up a contentious and potentially constitutionally critical three-month period for America's democracy.",
"layout": ""
},
{
"uri": "/2020/08/05/politics/donald-trump-press-briefing-beirut-coronavirus-voting-fact-check/index.html",
"headline": "Fact check: At briefing, Trump continues to mislead on coronavirus, mail-in voting and Beirut",
"thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200805203446-02-donald-trump-0805-small-11.jpg",
"duration": "",
"description": "President Donald Trump ended his Wednesday much like he began it, by repeating falsehood after falsehood.",
"layout": ""
},
{
"uri": "/2020/08/05/politics/state-department-russian-disinformation-report/index.html",
"headline": "US accuses Russia of conducting sophisticated disinformation and propaganda campaign ",
"thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/170626163907-russia-dnc-hacking-ron-2-00000808-small-11.jpg",
"duration": "",
"description": "A <a href=\"https://content.govdelivery.com/attachments/USSTATEBPA/2020/08/05/file_attachments/1512230/Pillars%20of%20Russias%20Disinformation%20and%20Propaganda%20Ecosystem_08-04-20%20%281%29.pdf\" target=\"_blank\">new report</a> from the US State Department accuses Russia of conducting a sophisticated disinformation and propaganda campaign that uses a variety of approaches including Kremlin-aligned news sites to promote their agenda.",
"layout": ""
},
{
"uri": "/2020/08/05/politics/fact-check-trump-ad-biden-basement-delaware-photos-iowa/index.html",
"headline": "<strong>Fact check: </strong>Trump ad edits out microphone and trees from Biden photo to make him seem alone in basement",
"thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200803235935-01-joe-biden-campaign-0720-small-11.jpg",
"duration": "",
"description": "A new <a href=\"https://www.youtube.com/watch?v=9PUfxZQa7WQ&feature=emb_title\" target=\"_blank\">ad</a> from President Donald Trump's campaign deceptively alters a photo of former Vice President Joe Biden campaigning outdoors in Iowa to make it seem as if Biden is \"hiding\" in his Delaware basement.",
"layout": ""
},
{
"uri": "/2020/08/05/politics/mark-meadows-unemployment-benefits-extension-coronavirus-relief-cnntv/index.html",
"headline": "White House chief of staff floats executive action on unemployment and evictions if Congress can't strike deal",
"thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/191219132522-03-mark-meadows-lead-image-small-11.jpg",
"duration": "",
"description": "White House chief of staff Mark Meadows said Wednesday that <a href=\"https://www.cnn.com/specials/politics/president-donald-trump-45\" target=\"_blank\">President Donald Trump</a> is prepared to take executive action on eviction protection and extending enhanced unemployment benefits if Congress isn't close to <a href=\"https://www.cnn.com/2020/08/05/politics/congress-stimulus-negotiations/index.html\" target=\"_blank\">a coronavirus recovery package</a> by Friday. ",
"layout": ""
},
{
"uri": "/2020/08/05/politics/trump-campaign-four-debates/index.html",
"headline": "Trump campaign calls for a fourth presidential debate, citing early voting",
"thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200709094609-trump-biden-split-small-1-1.jpg",
"duration": "",
"description": "<a href=\"https://www.cnn.com/election/2020/candidate/trump\" target=\"_blank\">Donald Trump's</a> presidential campaign called for an additional presidential debate in a letter to the Commission on Presidential Debates on Wednesday. ",
"layout": ""
},
{
"uri": "/2020/08/05/politics/schlapp-mail-voting-expansion-nevada-fact-check/index.html",
"headline": "<strong>Fact Check: </strong>With vote by mail expansion, can Nevada voters cast ballots after Election Day?",
"thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200610082429-voting-north-las-vegas-small-11.jpg",
"duration": "",
"description": "President Donald Trump reversed his stance on voting by mail Tuesday when he <a href=\"https://www.cnn.com/2020/08/04/politics/donald-trump-mail-in-voting-florida/index.html\" target=\"_blank\">tweeted</a> that doing so in Florida is \"safe and secure.\" When asked about the reversal later Tuesday afternoon, Trump seemed to imply that Republican-run states with existing mail-in voting programs were up to par, but Democratic states establishing or expanding mail-in voting during the pandemic were not.",
"layout": ""
},
...
...
...
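
As a rough sketch of how the links could then be pulled out of that structure, the snippet below walks siblings.articleList and turns each uri and thumbnail into an absolute URL. The function name article_links, the BASE_URL constant, and the trimmed sample dictionary are assumptions made for illustration; in practice you would load data.json with json.load instead of using the sample.

```python
from urllib.parse import urljoin

# Assumption: the site root matching the URL requested by the script.
BASE_URL = "https://edition.cnn.com"


def article_links(data, base_url=BASE_URL):
    """Return (article_url, thumbnail_url) pairs from the decoded data."""
    articles = data.get("siblings", {}).get("articleList", [])
    return [
        (urljoin(base_url, a.get("uri", "")), urljoin(base_url, a.get("thumbnail", "")))
        for a in articles
    ]


# Trimmed sample with the same shape as the output shown above.
sample = {
    "siblings": {
        "articleList": [
            {
                "uri": "/2020/08/05/politics/trump-campaign-four-debates/index.html",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200709094609-trump-biden-split-small-1-1.jpg",
            }
        ]
    }
}

for url, thumb in article_links(sample):
    print(url)
    print(thumb)
```

urljoin is used here because the uri values are site-relative paths and the thumbnail values are protocol-relative, so both need the base URL to become complete links.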