Parsing HTML from a bloated web page with BeautifulSoup - PullRequest
1 vote
/ October 10, 2019

I'm having trouble parsing Basketball Reference. The page I'm looking at (https://www.basketball-reference.com/contracts/IND.html) is very bloated, with lots of ad trackers and extraneous menus. I'm trying to extract the "payroll" data table, which has the following HTML source (buried in a pile of other junk, or at least it looks like junk to me).

<table class="suppress_glossary sortable stats_table" id="contracts" data-cols-to-freeze=1><caption>Payroll Table</caption>
   <colgroup><col><col><col><col><col><col><col><col><col><col></colgroup>
   <thead>

      <tr class="over_header">
         <th aria-label="" data-stat="&nbsp;" colspan="2" class=" over_header center" >&nbsp;</th>
         <th aria-label="" data-stat="header_salary" colspan="6" class=" over_header center" >Salary</th>
         <th aria-label="" data-stat="&nbsp;" colspan="2" class=" over_header center" >&nbsp;</th>
      </tr>



      <tr>
         <th aria-label="Player" data-stat="player" scope="col" class=" poptip sort_default_asc center" >Player</th>
         <th aria-label="Age" data-stat="age_today" scope="col" class=" poptip center" >Age</th>
         <th aria-label="2019-20" data-stat="y1" scope="col" class=" poptip center" data-over-header="Salary" >2019-20</th>
         <th aria-label="2020-21" data-stat="y2" scope="col" class=" poptip center" data-over-header="Salary" >2020-21</th>
         <th aria-label="2021-22" data-stat="y3" scope="col" class=" poptip center" data-over-header="Salary" >2021-22</th>
         <th aria-label="2022-23" data-stat="y4" scope="col" class=" poptip center" data-over-header="Salary" >2022-23</th>
         <th aria-label="2023-24" data-stat="y5" scope="col" class=" poptip center" data-over-header="Salary" >2023-24</th>
         <th aria-label="2024-25" data-stat="y6" scope="col" class=" poptip center" data-over-header="Salary" >2024-25</th>
         <th aria-label="Signed Using" data-stat="signed_using" scope="col" class=" poptip sort_default_asc center" >Signed Using</th>
         <th aria-label="The amount of a player's remaining salary that is guaranteed." data-stat="remain_gtd" scope="col" class=" poptip center" data-tip="The amount of a player's remaining salary that is guaranteed." >Guaranteed</th>
      </tr>

   </thead>
   <tbody>
<tr ><th scope="row" class="left " data-append-csv="oladivi01" data-stat="player" csk="oladivi01" ><a href="/players/o/oladivi01.html">Victor Oladipo</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="21000000" >$21,000,000</td><td class="right " data-stat="y2" csk="21000000" >$21,000,000</td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="42000000" >$42,000,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="brogdma01" data-stat="player" csk="brogdma01" ><a href="/players/b/brogdma01.html">Malcolm Brogdon</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" csk="20000000" >$20,000,000</td><td class="right " data-stat="y2" csk="20700000" >$20,700,000</td><td class="right " data-stat="y3" csk="21700000" >$21,700,000</td><td class="right " data-stat="y4" csk="22600000" >$22,600,000</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="85000000" >$85,000,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="turnemy01" data-stat="player" csk="turnemy01" ><a href="/players/t/turnemy01.html">Myles Turner</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="18000000" >$18,000,000</td><td class="right " data-stat="y2" csk="18000000" >$18,000,000</td><td class="right " data-stat="y3" csk="18000000" >$18,000,000</td><td class="right " data-stat="y4" csk="18000000" >$18,000,000</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st round pick</td><td class="right " data-stat="remain_gtd" csk="72000000" >$72,000,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="warretj01" data-stat="player" csk="warretj01" ><a href="/players/w/warretj01.html">T.J. Warren</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" csk="10810000" >$10,810,000</td><td class="right " data-stat="y2" csk="11750000" >$11,750,000</td><td class="right " data-stat="y3" csk="12690000" >$12,690,000</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="35250000" >$35,250,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="lambje01" data-stat="player" csk="lambje01" ><a href="/players/l/lambje01.html">Jeremy Lamb</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="10500000" >$10,500,000</td><td class="right " data-stat="y2" csk="10500000" >$10,500,000</td><td class="right " data-stat="y3" csk="10500000" >$10,500,000</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="31500000" >$31,500,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="mcderdo01" data-stat="player" csk="mcderdo01" ><a href="/players/m/mcderdo01.html">Doug McDermott</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="7333334" >$7,333,334</td><td class="right " data-stat="y2" csk="7333333" >$7,333,333</td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="14666667" >$14,666,667</td></tr>
<tr ><th scope="row" class="left " data-append-csv="holidju01" data-stat="player" csk="holidju01" ><a href="/players/h/holidju01.html">Justin Holiday</a></th><td class="center " data-stat="age_today" >30</td><td class="right " data-stat="y1" csk="4767000" >$4,767,000</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Room Exception</td><td class="right " data-stat="remain_gtd" csk="4767000" >$4,767,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="sabondo01" data-stat="player" csk="sabondo01" ><a href="/players/s/sabondo01.html">Domantas Sabonis</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="3529555" >$3,529,555</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round pick</td><td class="right " data-stat="remain_gtd" csk="3529555" >$3,529,555</td></tr>
<tr ><th scope="row" class="left " data-append-csv="mccontj01" data-stat="player" csk="mccontj01" ><a href="/players/m/mccontj01.html">T.J. McConnell</a></th><td class="center " data-stat="age_today" >27</td><td class="right " data-stat="y1" csk="3500000" >$3,500,000</td><td class="right " data-stat="y2" csk="3500000" ><em>$3,500,000</em></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Cap Space</td><td class="right " data-stat="remain_gtd" csk="4500000" >$4,500,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="bitadgo01" data-stat="player" csk="bitadgo01" ><a href="/players/b/bitadgo01.html">Goga Bitadze</a></th><td class="center " data-stat="age_today" >20</td><td class="right " data-stat="y1" csk="2816760" >$2,816,760</td><td class="right " data-stat="y2" csk="2957520" >$2,957,520</td><td class="right salary-tm" data-stat="y3" csk="3098400" >$3,098,400</td><td class="right salary-tm" data-stat="y4" csk="4765339" >$4,765,339</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="5774280" >$5,774,280</td></tr>
<tr ><th scope="row" class="left " data-append-csv="leaftj01" data-stat="player" csk="leaftj01" ><a href="/players/l/leaftj01.html">T.J. Leaf</a></th><td class="center " data-stat="age_today" >22</td><td class="right " data-stat="y1" csk="2813280" >$2,813,280</td><td class="right salary-tm" data-stat="y2" csk="4326825" >$4,326,825</td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="2813280" >$2,813,280</td></tr>
<tr ><th scope="row" class="left " data-append-csv="holidaa01" data-stat="player" csk="holidaa01" ><a href="/players/h/holidaa01.html">Aaron Holiday</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="2239200" >$2,239,200</td><td class="right salary-tm" data-stat="y2" csk="2345640" >$2,345,640</td><td class="right salary-tm" data-stat="y3" csk="3980551" >$3,980,551</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >1st Round Pick</td><td class="right " data-stat="remain_gtd" csk="2239200" >$2,239,200</td></tr>
<tr ><th scope="row" class="left " data-append-csv="sumneed01" data-stat="player" csk="sumneed01" ><a href="/players/s/sumneed01.html">Edmond Sumner</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="2000000" >$2,000,000</td><td class="right " data-stat="y2" csk="2160000" >$2,160,000</td><td class="right salary-tm" data-stat="y3" csk="2320000" >$2,320,000</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="4160000" >$4,160,000</td></tr>
<tr ><th scope="row" class="left " data-append-csv="sampsja02" data-stat="player" csk="sampsja02" ><a href="/players/s/sampsja02.html">JaKarr Sampson</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" csk="1737145" >$1,737,145</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right " data-stat="remain_gtd" csk="1737145" >$1,737,145</td></tr>
<tr ><th scope="row" class="left " data-append-csv="johnsal02" data-stat="player" csk="johnsal02" ><a href="/players/j/johnsal02.html">Alize Johnson</a></th><td class="center " data-stat="age_today" >23</td><td class="right " data-stat="y1" csk="1416852" >$1,416,852</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right " data-stat="remain_gtd" csk="1416852" >$1,416,852</td></tr>
<tr ><th scope="row" class="left " data-append-csv="mitrona01" data-stat="player" csk="mitrona01" ><a href="/players/m/mitrona01.html">Naz Mitrou-Long</a></th><td class="center " data-stat="age_today" >26</td><td class="right " data-stat="y1" >&nbsp;</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Two-Way Contract</td><td class="right " data-stat="remain_gtd" >&nbsp;</td></tr>
<tr ><th scope="row" class="left " data-append-csv="wilcocj01" data-stat="player" csk="wilcocj01" ><a href="/players/w/wilcocj01.html">C.J. Wilcox</a></th><td class="center " data-stat="age_today" >28</td><td class="right iz" data-stat="y1" ></td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right iz" data-stat="remain_gtd" ></td></tr>
<tr ><th scope="row" class="left " data-append-csv="brimaam01" data-stat="player" csk="brimaam01" ><a href="/players/b/brimaam01.html">Amida Brimah</a></th><td class="center " data-stat="age_today" >25</td><td class="right iz" data-stat="y1" ></td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right iz" data-stat="remain_gtd" ></td></tr>
<tr ><th scope="row" class="left " data-append-csv="gantja01" data-stat="player" csk="gantja01" ><a href="/players/g/gantja01.html">Jakeenan Gant</a></th><td class="center " data-stat="age_today" >23</td><td class="right iz" data-stat="y1" ></td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Minimum Salary</td><td class="right iz" data-stat="remain_gtd" ></td></tr>
<tr ><th scope="row" class="left " data-append-csv="bowenbr02" data-stat="player" csk="bowenbr02" ><a href="/players/b/bowenbr02.html">Brian Bowen</a></th><td class="center " data-stat="age_today" >21</td><td class="right " data-stat="y1" >&nbsp;</td><td class="right iz" data-stat="y2" ></td><td class="right iz" data-stat="y3" ></td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left " data-stat="signed_using" >Two-Way Contract</td><td class="right " data-stat="remain_gtd" >&nbsp;</td></tr>
<tr class='thead'><td colspan='10'></td></tr>
<tr class="partial_table" ><th scope="row" class="left " data-append-csv="ellismo01" data-stat="player" csk="ellismo01" ><a href="/players/e/ellismo01.html"><em>Monta Ellis</em></a></th><td class="center " data-stat="age_today" >33</td><td class="right " data-stat="y1" csk="2245400" >$2,245,400</td><td class="right " data-stat="y2" csk="2245400" >$2,245,400</td><td class="right " data-stat="y3" csk="2245400" >$2,245,400</td><td class="right iz" data-stat="y4" ></td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" csk="6736200" >$6,736,200</td></tr>

   </tbody>
   <tfoot><tr ><th scope="row" class="left " data-stat="player" >Team Totals</th><td class="center iz" data-stat="age_today" ></td><td class="right " data-stat="y1" >$114,708,526</td><td class="right " data-stat="y2" >$106,818,718</td><td class="right " data-stat="y3" >$74,534,351</td><td class="right " data-stat="y4" >$45,365,339</td><td class="right iz" data-stat="y5" ></td><td class="right iz" data-stat="y6" ></td><td class="left iz" data-stat="signed_using" ></td><td class="right " data-stat="remain_gtd" >$318,090,179</td></tr>

   </tfoot>

</table>

When I run the following Python code, the variable l is an empty list.

#import beautiful soup, requests, time, pandas
from bs4 import BeautifulSoup
import requests

#assign the URL for contract scraping
url = 'https://www.basketball-reference.com/teams/IND.html'

#pull html from page
page = requests.get(url)

#format html using BS
soup = BeautifulSoup(page.text, "html.parser")

#take only table rows
l = soup.find_all('a',{'class':'left'})

print(l)

I wonder whether I'm passing the wrong argument for the class. Or is there another reason why print(l) returns []?

Answers [ 2 ]

1 vote
/ October 11, 2019

You say you want the payroll table. You can use pandas' read_html for that:

import pandas as pd

table = pd.read_html('https://www.basketball-reference.com/contracts/IND.html')[0]
print(table)
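
As a rough offline sketch of the same idea: read_html also accepts an attrs filter, so the table can be picked by its id rather than by position. The HTML string below is a trimmed stand-in for the real page, the helper name read_payroll is hypothetical, and the example assumes lxml is installed (pandas uses it as the default HTML parser).

```python
import re
from io import StringIO

import pandas as pd

# Trimmed stand-in for the page (assumption: same id and two header rows as the real table).
HTML = """
<table id="contracts">
  <thead>
    <tr><th colspan="2"></th><th colspan="2">Salary</th></tr>
    <tr><th>Player</th><th>Age</th><th>2019-20</th><th>2020-21</th></tr>
  </thead>
  <tbody>
    <tr><th>Victor Oladipo</th><td>27</td><td>$21,000,000</td><td>$21,000,000</td></tr>
    <tr><th>Justin Holiday</th><td>30</td><td>$4,767,000</td><td></td></tr>
  </tbody>
</table>
"""


def read_payroll(html: str) -> pd.DataFrame:
    """Hypothetical helper: load the payroll table and make the salaries numeric."""
    # attrs narrows the search to the table whose id is "contracts".
    df = pd.read_html(StringIO(html), attrs={"id": "contracts"})[0]
    # Two header rows yield MultiIndex columns; keep only the bottom level.
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.get_level_values(-1)
    # Season columns look like "2019-20"; strip "$" and "," and convert to numbers.
    season_cols = [c for c in df.columns if re.fullmatch(r"\d{4}-\d{2}", str(c))]
    for col in season_cols:
        cleaned = df[col].str.replace(r"[$,]", "", regex=True)
        df[col] = pd.to_numeric(cleaned, errors="coerce")
    return df


payroll = read_payroll(HTML)
print(payroll)
```

Empty salary cells come back as NaN, which keeps the columns numeric for sums or sorting.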

1 vote
/ October 10, 2019

The left class you are searching for is not on the anchor tag; it belongs to the parent th element, which is why you get an empty result. Try the code below.

from bs4 import BeautifulSoup
import requests
r=requests.get("https://www.basketball-reference.com/contracts/IND.html")
soup=BeautifulSoup(r.text,'html.parser')
l=soup.select('.left > a')
print(l)
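
The difference between the two lookups can be checked offline. The sketch below uses two rows cut down from the table in the question; scoping the selector with #contracts is an addition for illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Two rows cut down from the table in the question.
HTML = """
<table id="contracts"><tbody>
<tr><th scope="row" class="left " data-stat="player"><a href="/players/o/oladivi01.html">Victor Oladipo</a></th><td class="center " data-stat="age_today">27</td></tr>
<tr><th scope="row" class="left " data-stat="player"><a href="/players/b/brogdma01.html">Malcolm Brogdon</a></th><td class="center " data-stat="age_today">26</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# The <a> tags have no class attribute, so filtering anchors by class finds nothing.
anchors_with_class = soup.find_all("a", {"class": "left"})

# Selector: an <a> that is a direct child of <th class="left"> inside the element with id "contracts".
names = [a.get_text(strip=True) for a in soup.select("#contracts th.left > a")]

print(anchors_with_class)  # []
print(names)  # ['Victor Oladipo', 'Malcolm Brogdon']
```

Restricting the selector to the table's id may help on a page this crowded, since a bare .left > a would also match links in any other element that happens to use the left class.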

If you want the player names:

from bs4 import BeautifulSoup
import requests
r=requests.get("https://www.basketball-reference.com/contracts/IND.html")
soup=BeautifulSoup(r.text,'html.parser')
l=[item.text for item in soup.select('.left > a')]
print(l)

Output:

['Victor Oladipo', 'Malcolm Brogdon', 'Myles Turner', 'T.J. Warren', 'Jeremy Lamb', 'Doug McDermott', 'Justin Holiday', 'Domantas Sabonis', 'T.J. McConnell', 'Goga Bitadze', 'T.J. Leaf', 'Aaron Holiday', 'Edmond Sumner', 'JaKarr Sampson', 'Alize Johnson', 'Brian Bowen', 'Naz Mitrou-Long', 'C.J. Wilcox', 'Amida Brimah', 'Jakeenan Gant', 'Monta Ellis']
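
If the salary figures are wanted as well, one possible approach is to walk each row and key the cells by their data-stat attribute. This is only a sketch on a trimmed copy of the question's HTML: the helper name parse_rows is hypothetical, and it assumes the csk attribute always holds the plain numeric value of a salary cell, as it does in the rows shown in the question.

```python
from bs4 import BeautifulSoup

# Trimmed copy of rows from the question, including the empty spacer row.
HTML = """
<table id="contracts"><tbody>
<tr><th scope="row" class="left " data-stat="player" csk="oladivi01"><a href="/players/o/oladivi01.html">Victor Oladipo</a></th><td class="center " data-stat="age_today">27</td><td class="right " data-stat="y1" csk="21000000">$21,000,000</td><td class="right iz" data-stat="y3"></td></tr>
<tr class='thead'><td colspan='10'></td></tr>
<tr><th scope="row" class="left " data-stat="player" csk="holidju01"><a href="/players/h/holidju01.html">Justin Holiday</a></th><td class="center " data-stat="age_today">30</td><td class="right " data-stat="y1" csk="4767000">$4,767,000</td><td class="right iz" data-stat="y3"></td></tr>
</tbody></table>
"""


def parse_rows(html):
    """Hypothetical helper: return one dict per player row, keyed by data-stat."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("#contracts tbody tr"):
        player = tr.find("th", attrs={"data-stat": "player"})
        if player is None:
            # Spacer rows such as <tr class='thead'> have no player cell.
            continue
        row = {"player": player.get_text(strip=True)}
        for td in tr.find_all("td"):
            if td.has_attr("csk"):
                # Assumption: csk carries the unformatted number for salary cells.
                row[td["data-stat"]] = int(td["csk"])
            else:
                # Cells without csk keep their text; empty cells become None.
                row[td["data-stat"]] = td.get_text(strip=True) or None
        rows.append(row)
    return rows


rows = parse_rows(HTML)
print(rows)
```

Reading csk avoids stripping "$" and "," from the displayed text, though the attribute is site-specific and could change.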