Finding missing tags on a website with BeautifulSoup in Python
2 votes
/ August 6, 2020

I am working on a project in which I am trying to extract all the URLs from the CNN Politics front page. I looked through the HTML source and found that the links to the articles sit inside li tags.

I get all the content for that tag as follows:

import requests
from bs4 import BeautifulSoup

url = 'https://edition.cnn.com/politics'

r1 = requests.get(url)
coverpage = r1.content

soup = BeautifulSoup(coverpage, 'lxml')

links = soup.find_all('li')

This gives me a list of objects that look like this one: "Sitemap"

I don't specify the class, since the class changes from URL to URL.

However, I don't get all the li elements when I run this code. Inspecting the page source shows many more li elements with the class name "cd blabla", but BeautifulSoup doesn't seem to recognize them. I don't know whether they are somehow embedded in another tag or why they aren't being retrieved.

I wish to extract the links to the articles that can be reached from the politics front page. How can I go about solving this? Is there an easier way to find the links to the other articles on the page?

Answers [ 3 ]

2 votes
/ August 6, 2020

For pages that use JavaScript to load their elements, try Selenium; in most cases it will work. You need to go through the documentation at https://selenium-python.readthedocs.io/index.html for details such as installation and the driver.

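from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Raw string so the backslashes in the Windows path are not treated as escapes
PATH = r"C:\Program Files (x86)\chromedriver.exe"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(PATH))

url = "https://edition.cnn.com/politics"
driver.get(url)
req = driver.page_source  # HTML after the browser has run the page's JavaScript
driver.quit()  # end the browser session and the chromedriver process
soup = BeautifulSoup(req, "html.parser")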

result = soup.find_all(class_="cd__headline-text")


for i in result:
    print(i.text)

Output:

Trump's mail-in voting falsehoods are part of a wide campaign to discredit the election
Fact check: At briefing, Trump continues to mislead on coronavirus, mail-in voting and Beirut
US accuses Russia of conducting sophisticated disinformation and propaganda campaign  
Fact check: Trump ad edits out microphone and trees from Biden photo to make him seem alone in basement
White House chief of staff floats executive action on unemployment and evictions if Congress can't strike deal
Trump campaign calls for a fourth presidential debate, citing early voting
Fact Check: With vote by mail expansion, can Nevada voters cast ballots after Election Day?
Trump bests Biden in July fundraising but money gap between the campaigns has essentially closed
New York Times: Prosecutors subpoenaed Trump's bank in criminal inquiry 
Analysis: But, seriously -- what is this country going to do with its kids this fall?
Analysis: This week's 'smooth' primaries almost felt normal. Here's why.
Brianna Keilar debunks Trump campaign official: You've got to shovel B.S.
Illinois Republican congressman tests positive for coronavirus
Former Army Delta Force officer, US ambassador sign secretive contract to develop Syrian oil fields
Supreme Court lifts lower court order that would have required more Covid-related safety measures in California jail
Ex-acting AG Sally Yates defends FBI investigation into Flynn, calls Barr move to drop charges 'highly irregular'
Esper says 'most believe' Beirut explosion 'was an accident' after Trump claimed it was an attack
Fact check: Trump makes at least 20 false claims in Fox & Friends interview
Trump trashes Obama's Lewis eulogy that pressed for voting rights
Trump still not grasping the severity of the pandemic, source tells CNN 
Republican senators grow anxious over direction of stimulus talks with no deal in sight
Joe Biden will no longer travel to Milwaukee to accept Democratic nomination
Analysis: Trump's interview debacle sends a warning for the fall campaign  
Fauci says US has suffered from pandemic 'as much or worse than anyone' 
Primary results: Key takeaways from Kansas
CNN holds elected officials and candidates accountable. View our Facts First database
Seven governors join deal in pursuit of first multistate coordinated testing strategy
Hogan overrules Maryland county order delaying in-person education at private schools, including Barron Trump's 
Birx defends herself as Pelosi accuses Trump administration of spreading disinformation on Covid-19
See latest Trump and Biden head-to-head polling
Top Senate Republican pushes back against Trump's unsubstantiated claims mail-in-voting leads to mass fraud
Republican operatives are helping Kanye West get on general election ballots
Progressive who unseated longtime Democratic congressman says 'people are looking for a fighter right now'
Trump said he may deliver convention speech from White House
Biden clarifies he has not taken cognitive test
Fact check: Biden says he hasn't taken a cognitive test. Is he flip-flopping?
WNBA players wear shirts supporting Sen. Kelly Loeffler's challenger -- including some from team she co-owns
Trump campaign sues Nevada over plan to mail ballots to all registered voters
Analysis: Trump may finally realize he's suppressing his own vote
Trump continues to lose ground in 2020 election as nation grapples with coronavirus 
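
Since the question asks for article URLs rather than headline text, the same rendered page source can also be mined for links. This is only a minimal sketch: it assumes each headline span sits inside an `<a>` element with a site-relative `href`, which this thread does not confirm. The helper name `extract_article_links` and the sample HTML are illustrative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://edition.cnn.com"


def extract_article_links(html, base_url=BASE_URL):
    """Return absolute URLs of the links wrapping each headline span."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for span in soup.find_all(class_="cd__headline-text"):
        anchor = span.find_parent("a")  # assumed: headline text is nested in <a>
        if anchor is None or not anchor.get("href"):
            continue  # skip headlines that carry no link
        url = urljoin(base_url, anchor["href"])
        if url not in links:  # keep the first occurrence, preserve page order
            links.append(url)
    return links


# Hypothetical markup standing in for driver.page_source
sample_html = """
<ul>
  <li class="cd"><a href="/2020/08/05/politics/trump-campaign-four-debates/index.html">
    <span class="cd__headline-text">Trump campaign calls for a fourth presidential debate</span></a></li>
  <li class="cd"><span class="cd__headline-text">Headline without a link</span></li>
</ul>
"""
print(extract_article_links(sample_html))
```

With the real page, `req` from the code above would be passed in place of `sample_html`; if the markup differs, the `find_parent("a")` step is the part to adjust.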

2 votes
/ August 6, 2020

This is a nice site. When you look in detail at how the website loads its data and view the page source, you will see that all the data is stored inside a script tag in the form of a JavaScript object. It is not JSON. If you extract the data inside that script, you get all the article links, images, etc.

Since it is a JavaScript object, you need a third-party library to convert it to JSON. I used the demjson library for this: https://github.com/dmeranda/demjson
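
To see why a plain JSON parser is not enough, consider a hypothetical JavaScript object literal with unquoted keys. This is an assumption about the kind of syntax involved, not a copy of CNN's script. The standard `json` module rejects it, while the same data written as strict JSON parses fine:

```python
import json

# Hypothetical JavaScript object literal: keys are not quoted, so it is not valid JSON
js_literal = '{hasVideo: false, layout: "no-rail"}'

try:
    json.loads(js_literal)
    parsed_as_json = True
except json.JSONDecodeError as exc:
    parsed_as_json = False
    print("Not valid JSON:", exc)

# The same data written as strict JSON parses without error
strict = json.loads('{"hasVideo": false, "layout": "no-rail"}')
print(strict["layout"])
```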

The script below saves the data to a JSON file. Once you have the JSON, getting all the links is easy.

import requests, demjson, json
from bs4 import BeautifulSoup

res = requests.get("https://edition.cnn.com/politics")

soup = BeautifulSoup(res.text, "html.parser")

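script = None
# Find the script tag that holds the page data
for i in soup.find_all("script"):
    if "window.CNN" in i.text:
        script = i.get_text(strip=True)

if script is None:
    print("No data found")
else:
    # Keep the text between "CNN.contentModel" and "FAVE.settings"
    data = script.partition("CNN.contentModel")[-1].partition("FAVE.settings")[0]
    # Decode from the first "{"; the slice drops the last character of that text
    json_data = demjson.decode(data[data.index('{'):-1])

    with open("data.json", "w") as f:
        json.dump(json_data, f)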

Result:

{
    "hasVideo": false,
    "layout": "no-rail",
    "vertical": "politics",
    "sectionName": "politics",
    "pageType": "section",
    "env": "prod",
    "type": "page",
    "analytics": {
        "pageTop": {},
        "headline": "",
        "author": "",
        "showName": "",
        "subSectionName": "",
        "isArticleVideoCollection": false,
        "publishDate": "2014-02-27T01:35:32Z",
        "lastUpdatedDate": "2020-08-06T09:31:15Z",
        "pageBranding": "10-minute-preview",
        "cep_topics": {
            "brsf": [],
            "buzz": [],
            "iabt": [],
            "sent": [
                "16B6"
            ],
            "tags": [],
            "shortSource": "se_politics",
            "source": "section_politics"
        },
        "chartbeat": {
            "sections": ""
        },
        "branding_content_page": "10-minute-preview",
        "branding_content_zone": [
            "default"
        ],
        "branding_content_container": [
            "default"
        ],
        "branding_content_card": [
            ""
        ]
    },
    "edition": "international",
    "sourceId": "section_politics",
    "title": "CNNPolitics - Political News, Analysis and Opinion",
    "siblings": {
        "articleList": [
            {
                "uri": "/2020/08/06/politics/donald-trump-mail-in-voting-election/index.html",
                "headline": "Trump's mail-in voting falsehoods are part of a wide campaign to discredit the election",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200805203446-02-donald-trump-0805-small-11.jpg",
                "duration": "",
                "description": "<a href=\"http://www.cnn.com/specials/politics/president-donald-trump-45\" target=\"_blank\">President Donald Trump's</a> barrage of <a href=\"http://www.cnn.com/2020/08/05/politics/fact-check-trump-fox-friends-pandemic-biden-protests/index.html\" target=\"_blank\">challenges to the reputation, structures and traditions</a> of elections is conjuring up a contentious and potentially constitutionally critical three-month period for America's democracy.",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/donald-trump-press-briefing-beirut-coronavirus-voting-fact-check/index.html",
                "headline": "Fact check: At briefing, Trump continues to mislead on coronavirus, mail-in voting and Beirut",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200805203446-02-donald-trump-0805-small-11.jpg",
                "duration": "",
                "description": "President Donald Trump ended his Wednesday much like he began it, by repeating falsehood after falsehood.",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/state-department-russian-disinformation-report/index.html",
                "headline": "US accuses Russia of conducting sophisticated disinformation and propaganda campaign  ",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/170626163907-russia-dnc-hacking-ron-2-00000808-small-11.jpg",
                "duration": "",
                "description": "A <a href=\"https://content.govdelivery.com/attachments/USSTATEBPA/2020/08/05/file_attachments/1512230/Pillars%20of%20Russias%20Disinformation%20and%20Propaganda%20Ecosystem_08-04-20%20%281%29.pdf\" target=\"_blank\">new report</a> from the US State Department accuses Russia of conducting a sophisticated disinformation and propaganda campaign that uses a variety of approaches including Kremlin-aligned news sites to promote their agenda.",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/fact-check-trump-ad-biden-basement-delaware-photos-iowa/index.html",
                "headline": "<strong>Fact check: </strong>Trump ad edits out microphone and trees from Biden photo to make him seem alone in basement",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200803235935-01-joe-biden-campaign-0720-small-11.jpg",
                "duration": "",
                "description": "A new <a href=\"https://www.youtube.com/watch?v=9PUfxZQa7WQ&feature=emb_title\" target=\"_blank\">ad</a> from President Donald Trump's campaign deceptively alters a photo of former Vice President Joe Biden campaigning outdoors in Iowa to make it seem as if Biden is \"hiding\" in his Delaware basement.",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/mark-meadows-unemployment-benefits-extension-coronavirus-relief-cnntv/index.html",
                "headline": "White House chief of staff floats executive action on unemployment and evictions if Congress can't strike deal",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/191219132522-03-mark-meadows-lead-image-small-11.jpg",
                "duration": "",
                "description": "White House chief of staff Mark Meadows said Wednesday that <a href=\"https://www.cnn.com/specials/politics/president-donald-trump-45\" target=\"_blank\">President Donald Trump</a> is prepared to take executive action on eviction protection and extending enhanced unemployment benefits if Congress isn't close to <a href=\"https://www.cnn.com/2020/08/05/politics/congress-stimulus-negotiations/index.html\" target=\"_blank\">a coronavirus recovery package</a> by Friday. ",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/trump-campaign-four-debates/index.html",
                "headline": "Trump campaign calls for a fourth presidential debate, citing early voting",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200709094609-trump-biden-split-small-1-1.jpg",
                "duration": "",
                "description": "<a href=\"https://www.cnn.com/election/2020/candidate/trump\" target=\"_blank\">Donald Trump's</a> presidential campaign called for an additional presidential debate in a letter to the Commission on Presidential Debates on Wednesday. ",
                "layout": ""
            },
            {
                "uri": "/2020/08/05/politics/schlapp-mail-voting-expansion-nevada-fact-check/index.html",
                "headline": "<strong>Fact Check: </strong>With vote by mail expansion, can Nevada voters cast ballots after Election Day?",
                "thumbnail": "//cdn.cnn.com/cnnnext/dam/assets/200610082429-voting-north-las-vegas-small-11.jpg",
                "duration": "",
                "description": "President Donald Trump reversed his stance on voting by mail Tuesday when he <a href=\"https://www.cnn.com/2020/08/04/politics/donald-trump-mail-in-voting-florida/index.html\" target=\"_blank\">tweeted</a> that doing so in Florida is \"safe and secure.\" When asked about the reversal later Tuesday afternoon, Trump seemed to imply that Republican-run states with existing mail-in voting programs were up to par, but Democratic states establishing or expanding mail-in voting during the pandemic were not.",
                "layout": ""
            },

...
...
...
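
Once the data is decoded, the article links can be read from `siblings.articleList`, the structure visible in the result above. A minimal sketch, assuming the relative `uri` values resolve against `https://edition.cnn.com`; the helper name `article_urls` and the trimmed sample dictionary are illustrative:

```python
from urllib.parse import urljoin

BASE_URL = "https://edition.cnn.com"  # assumed base for the relative "uri" values


def article_urls(json_data, base_url=BASE_URL):
    """Build absolute article URLs from the decoded data."""
    articles = json_data.get("siblings", {}).get("articleList", [])
    return [urljoin(base_url, a["uri"]) for a in articles if a.get("uri")]


# Trimmed sample mirroring the structure shown in the result above
sample = {
    "siblings": {
        "articleList": [
            {"uri": "/2020/08/05/politics/trump-campaign-four-debates/index.html",
             "headline": "Trump campaign calls for a fourth presidential debate, citing early voting"},
            {"uri": "", "headline": "entry without a uri"},
        ]
    }
}

for url in article_urls(sample):
    print(url)
```

In the script above, `json_data` would be passed in place of `sample`.
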
1 vote
/ August 6, 2020

Your code works correctly; I tried it. But check that you are not missing a requirement, for example that lxml is installed. Here is what I did:

from bs4 import BeautifulSoup
import requests

url = 'https://edition.cnn.com/politics'

r1 = requests.get(url)
soup = BeautifulSoup(r1.content, 'lxml')
li = soup.find_all('li')
print(li)

Also note that the find_all method returns a list, so if you want the items one at a time, you can simply loop over it and print each single li, as shown below:

for i in li:
    print(i.prettify())
...
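
To get URLs rather than whole `li` elements, each item's anchors can be read for their `href`. A minimal sketch; the sample HTML and the helper name `li_links` are illustrative, and it only sees links present in the HTML that `requests` receives, not ones added later by JavaScript:

```python
from bs4 import BeautifulSoup


def li_links(html):
    """Return the href of every <a> found inside an <li>."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for item in soup.find_all("li"):
        for anchor in item.find_all("a", href=True):  # ignore anchors without href
            hrefs.append(anchor["href"])
    return hrefs


# Hypothetical markup standing in for r1.content
sample_html = """
<ul>
  <li><a href="/sitemap.html">Sitemap</a></li>
  <li>No link here</li>
  <li><a href="/politics">Politics</a> <a name="top">anchor without href</a></li>
</ul>
"""
print(li_links(sample_html))
```

If the page nests `li` elements inside one another, the same link can appear more than once, so deduplicating the result may be needed.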