Question

Hi, I followed and understood this article on how to read content from websites, and it worked perfectly: geeksforgeeks.org: Reading selected webpage content using Python Web Scraping

But when I changed my code to work with a different site, it does not return any value. I am trying to get the values Value1, Value2, etc., as shown below.

Please note: reading content from this web page is legally permitted.

import requests 
from bs4 import BeautifulSoup 

# the target we want to open     
url='https://hackerone.com/directory?offers_bounties=true&asset_type=URL&order_direction=DESC&order_field=started_accepting_at'

#open with GET method 
resp=requests.get(url) 

#http_respone 200 means OK status 
if resp.status_code==200: 
    print("Successfully opened the web page") 
    print("The news are as follow :-\n") 

    # we need a parser,Python built-in HTML parser is enough . 
    soup=BeautifulSoup(resp.text,'html.parser')     

    # l is the list which contains all the text i.e news  
    l=soup.find("tr","spec-directory-entry daisy-table__row fade fade--show") 

    #now we want to print only the text part of the anchor. 
    #find all the elements of a, i.e anchor 
    for i in l: 
        print(i.text) 
else: 
    print("Error")

Here is the site's source code:

<tr class="spec-directory-entry daisy-table__row fade fade--show">
    <a href="/livestream" class="daisy-link spec-profile-name">Value1</a>
<tr class="spec-directory-entry daisy-table__row fade fade--show">
    <a href="/livestream" class="daisy-link spec-profile-name">Value2</a>
<tr class="spec-directory-entry daisy-table__row fade fade--show">
.
.
.

Khalid Ali · Answer 1 · April 16, 2019

JavaScript is required to render the content of this web page. Using the prerender.io service is a simple way to get the data you are looking for from the page.

import requests 
from bs4 import BeautifulSoup 

# the target we want to open
# changed to use prerenderio service 
url='http://service.prerender.io/https://hackerone.com/directory?offers_bounties=true&asset_type=URL&order_direction=DESC&order_field=started_accepting_at'

#open with GET method 
resp=requests.get(url) 

#http_respone 200 means OK status 
if resp.status_code==200: 
    print("Successfully opened the web page") 
    print("The news are as follow :-\n") 

    # we need a parser,Python built-in HTML parser is enough . 
    soup=BeautifulSoup(resp.text,'html.parser')     

    # find() returns only the first matching <tr> (or None if nothing matches)
    l=soup.find("tr","spec-directory-entry daisy-table__row fade fade--show") 

    # iterating over a tag walks its direct children, here the row's cells,
    # so this prints the text of each cell in that first row
    for i in l: 
        print(i.text) 
else: 
    print("Error")

Data returned by the code above:

Successfully opened the web page
The news are as follow :-

LivestreamManaged
04 / 2019
73
$100
$150-$250
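
The only change from the question's code is the URL: the target address is appended verbatim to the service prefix. A minimal sketch of that construction follows; the helper name `prerender_url` is hypothetical, the prefix is the one used in this answer, and whether the service still accepts such requests is an assumption not verified here.

```python
# Prefix taken from the answer above; availability of the service is an assumption.
PRERENDER_PREFIX = "http://service.prerender.io/"


def prerender_url(target_url: str) -> str:
    """Return the URL that asks the prerender service to render target_url."""
    # The answer simply concatenates the prefix and the full target URL.
    return PRERENDER_PREFIX + target_url


target = "https://hackerone.com/directory?offers_bounties=true&asset_type=URL"
print(prerender_url(target))
```

The resulting string can be passed to `requests.get()` exactly as in the code above.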

Edited: reply to Ahmad's comment

Here is code that gets the values for the "Livestream" table row only.

import requests 
from bs4 import BeautifulSoup 

# the target we want to open
# changed to use prerenderio service 
url='http://service.prerender.io/https://hackerone.com/directory?offers_bounties=true&asset_type=URL&order_direction=DESC&order_field=started_accepting_at'

#open with GET method 
resp=requests.get(url) 

#http_respone 200 means OK status 
if resp.status_code==200: 
    print("Successfully opened the web page") 
    print("The news are as follow :-\n") 

    # we need a parser,Python built-in HTML parser is enough . 
    soup=BeautifulSoup(resp.text,'html.parser')     

    # l is the list which contains all "tr" tags  
    l=soup.findAll("tr","spec-directory-entry daisy-table__row fade fade--show")

    # looping through the list of table rows
    for i in l:
        # checking if the current row is for 'Livestream'
        if i.find('a').text == 'Livestream':
          # printing the row's values except the first "td" tag
          for e in i.findAll('td')[1:]:
            print(e.text)
else: 
    print("Error")

Result:

Successfully opened the web page
The news are as follow :-

04 / 2019
73
$100
$150-$250
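
For the Value1, Value2 names the question asks about, a CSS selector may be more robust than matching the full class string, since it does not depend on the exact set or order of classes on the row. A minimal sketch, assuming the rendered HTML has roughly the shape shown in the question; `SAMPLE_HTML` is hand-written stand-in markup (the `<table>`/`<td>` wrappers and the `/example` link are made up), not output fetched from the site.

```python
from bs4 import BeautifulSoup

# Stand-in for the rendered page; an assumption modelled on the question's snippet.
SAMPLE_HTML = """
<table>
  <tr class="spec-directory-entry daisy-table__row fade fade--show">
    <td><a href="/livestream" class="daisy-link spec-profile-name">Value1</a></td>
  </tr>
  <tr class="spec-directory-entry daisy-table__row fade fade--show">
    <td><a href="/example" class="daisy-link spec-profile-name">Value2</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Select every profile-name anchor that sits inside a directory row.
anchors = soup.select("tr.spec-directory-entry a.spec-profile-name")
names = [a.get_text(strip=True) for a in anchors]

for name in names:
    print(name)
```

Against the real page, `SAMPLE_HTML` would be replaced by `resp.text` from a request that returns the rendered HTML.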

KunduK · Answer 2 · April 16, 2019

It looks like the page is rendered by JavaScript. You can use Selenium together with Beautiful Soup to get the value.

from selenium import webdriver
import time
from bs4 import BeautifulSoup

driver=webdriver.Chrome()
driver.get("https://hackerone.com/directory?offers_bounties=true&asset_type=URL&order_direction=DESC&order_field=started_accepting_at")
time.sleep(5)
html=driver.page_source
soup=BeautifulSoup(html,'html.parser')
for a in soup.select("a.spec-profile-name[href='/livestream']"):
    print(a.text)

joe-fivefifty · Answer 3 · April 16, 2019

Looking at what the request actually receives, it seems this page relies on dynamic content. The following text is returned for your request:

It looks like your JavaScript is disabled. To use HackerOne, enable JavaScript in your browser and refresh this page.

You get "TypeError: 'NoneType' object is not iterable" because, without JavaScript, there are no "tr" elements for BeautifulSoup to find and iterate over. You will need something like Selenium to simulate a browser that runs JavaScript in order to get the HTML you expect.
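
The failure can be reproduced offline. A minimal sketch, assuming the response body contains only the notice quoted above; `JS_ONLY_HTML` is a hand-written stand-in, not the actual response from the site.

```python
from bs4 import BeautifulSoup

# Stand-in for a response that contains no table rows (an assumption).
JS_ONLY_HTML = """
<html><body>
  <p>It looks like your JavaScript is disabled.</p>
</body></html>
"""

soup = BeautifulSoup(JS_ONLY_HTML, "html.parser")

# find() returns None when nothing matches.
row = soup.find("tr", class_="spec-directory-entry")
print(row)

# Iterating over None is what raises the TypeError from the question.
error_message = ""
try:
    for cell in row:
        print(cell.text)
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# find_all() returns an empty result instead, so a loop over it simply does nothing.
rows = soup.find_all("tr", class_="spec-directory-entry")
print(len(rows))
```

Checking `if row is None` (or looping over `find_all()`) avoids the exception, but the rows still have to come from a response that contains the rendered HTML.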

beautifulsoup4 does not return content
