How can I fix this BeautifulSoup scrape of the NHL Reference site?
1 vote
/ August 5, 2020
I have also tried solutions that I found in other posts, but without success. Any help is appreciated. Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd

dict={}
for i in range (2010,2020):
    year = str(i)
    source = requests.get('https://www.hockey-reference.com/leagues/NHL_'+year+'_skaters.html').text
    soup = BeautifulSoup(source,features='lxml')

    #identifying table in html
    table = soup.find('table', id="stats")
    #grabbing <tr> tags in html
    rows = table.findAll("tr")
    #creating passable values for each "stat" in td tag
    data_stats = [
        "player",
        "age",
        "team_id",
        "pos",
        "games_played",
        "goals",
        "assists",
        "points",
        "plus_minus",
        "pen_min",
        "ps",
        "goals_ev",
        "goals_pp",
        "goals_sh",
        "goals_gw",
        "assists_ev",
        "assists_pp",
        "assists_sh",
        "shots",
        "shot_pct",
        "time_on_ice",
        "time_on_ice_avg",
        "blocks",
        "hits",
        "faceoff_wins",
        "faceoff_losses",
        "faceoff_percentage"
    ]


    for rownum in rows:
        # grabbing player name and using as key
        filter = { "data-stat":'player' }
        cell = rows[3].findAll("td",filter)
        nameval = cell[0].string
        list = []
        for data in data_stats:
            #iterating through data_stat to grab values
            filter = { "data-stat":data }
            cell = rows[3].findAll("td",filter)
            value = cell[0].string
            list.append(value)

        dict[nameval] = list
        dict[nameval].append(year)

# conversion to numeric values and creating dataframe
columns = [
 "player",
 "age",
 "team_id",
 "pos",
 "games_played",
 "goals",
 "assists",
 "points",
 "plus_minus",
 "pen_min",
 "ps",
 "goals_ev",
 "goals_pp",
 "goals_sh",
 "goals_gw",
 "assists_ev",
 "assists_pp",
 "assists_sh",
 "shots",
 "shot_pct",
 "time_on_ice",
 "time_on_ice_avg",
 "blocks",
 "hits",
 "faceoff_wins",
 "faceoff_losses",
 "faceoff_percentage",
 "year"
]
df = pd.DataFrame.from_dict(dict,orient='index',columns=columns)
cols = df.columns.drop(['player','team_id','pos','year'])
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

print(df)

Result

Craig Adams              Craig Adams   32  ...               43.9  2010
Luke Adam                  Luke Adam   22  ...              100.0  2013
Justin Abdelkader  Justin Abdelkader   29  ...               29.4  2017
Will Acton                Will Acton   27  ...               50.0  2015
Noel Acciari            Noel Acciari   24  ...               44.1  2016
Pontus Aberg            Pontus Aberg   25  ...               10.5  2019

[6 rows x 28 columns]
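
A note on why so few rows come back: the inner loop reads `rows[3]` on every pass instead of the row being iterated, so each season contributes the same single player, and because the dictionary is keyed by player name, a repeated name replaces the earlier entry. Below is a minimal sketch of a row-by-row version. The inline HTML is a hypothetical miniature of the stats table (the `<th>` rank cell and the repeated header row are assumptions about the markup), and `parse_rows` is an illustrative helper name, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the stats table; the real page has many more columns.
SAMPLE_HTML = """
<table id="stats">
  <tr><th>Rk</th><th>Player</th><th>Age</th><th>Tm</th></tr>
  <tr><th data-stat="ranker">1</th><td data-stat="player">Justin Abdelkader</td>
      <td data-stat="age">22</td><td data-stat="team_id">DET</td></tr>
  <tr><th data-stat="ranker">2</th><td data-stat="player">Craig Adams</td>
      <td data-stat="age">32</td><td data-stat="team_id">PIT</td></tr>
  <tr><th>Rk</th><th>Player</th><th>Age</th><th>Tm</th></tr>
  <tr><th data-stat="ranker">3</th><td data-stat="player">Maxim Afinogenov</td>
      <td data-stat="age">30</td><td data-stat="team_id">ATL</td></tr>
</table>
"""


def parse_rows(html, data_stats, year):
    """Return one list per player row, reading each row rather than a fixed index."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="stats")
    records = []
    for row in table.find_all("tr"):
        # Header rows contain only <th> cells, so they have no player <td>.
        if row.find("td", {"data-stat": "player"}) is None:
            continue
        values = []
        for stat in data_stats:
            cell = row.find("td", {"data-stat": stat})
            values.append(cell.get_text(strip=True) if cell is not None else None)
        values.append(year)
        records.append(values)
    return records


records = parse_rows(SAMPLE_HTML, ["player", "age", "team_id"], "2010")
for record in records:
    print(record)
```

Collecting the rows in a list rather than a dictionary keyed by name keeps players who appear in several seasons, or several times in one season after a trade.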

Answers [ 2 ]

1 vote
/ August 5, 2020

I would just use pandas' .read_html(). It does the heavy lifting of parsing the tables for you (it relies on an HTML parser such as lxml or BeautifulSoup under the hood).

Code:

import pandas as pd

frames = []
for i in range(2010, 2020):
    print(i)
    year = str(i)
    url = 'https://www.hockey-reference.com/leagues/NHL_' + year + '_skaters.html'

    # header=1 takes the column names from the second header row of the table
    df = pd.read_html(url, header=1)[0]
    df['year'] = year
    frames.append(df)

# DataFrame.append was removed in pandas 2.0, so concatenate the yearly frames instead
result = pd.concat(frames, sort=False)
# drop the header rows that are repeated inside the table body
result = result[~result['Age'].astype(str).str.contains("Age")]
result = result.reset_index(drop=True)

You can then save the result to a file with result.to_csv('filename.csv', index=False)
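
To illustrate the clean-up step without hitting the site, here is a small sketch on in-memory data. The two frames are hypothetical stand-ins for what `pd.read_html` returns per season, with one row imitating a header repeated in the table body; the numeric conversion at the end is an optional extra step, not something the code above does.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly tables returned by pd.read_html;
# the second row of the first frame imitates a header repeated in the table body.
df_2010 = pd.DataFrame({"Player": ["Craig Adams", "Player", "Bryan Allen"],
                        "Age": ["32", "Age", "29"],
                        "GP": ["82", "GP", "74"]})
df_2010["year"] = "2010"
df_2019 = pd.DataFrame({"Player": ["Tom Wilson"], "Age": ["24"], "GP": ["63"]})
df_2019["year"] = "2019"

result = pd.concat([df_2010, df_2019], sort=False)
# Keep only rows whose Age cell is not the literal header text.
result = result[result["Age"] != "Age"].reset_index(drop=True)
# The surviving cells are still strings; convert the numeric columns.
numeric_cols = ["Age", "GP"]
result[numeric_cols] = result[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(result)
```

Because the repeated header rows force every column to be read as text, converting the stat columns afterwards is what makes sums and averages work.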

Output:

print (result)
        Rk             Player Age   Tm Pos  GP  ...  BLK  HIT  FOW  FOL    FO%  year
0        1  Justin Abdelkader  22  DET  LW  50  ...   20  152  148  170   46.5  2010
1        2        Craig Adams  32  PIT  RW  82  ...   58  193  243  311   43.9  2010
2        3   Maxim Afinogenov  30  ATL  RW  82  ...   21   32    1    2   33.3  2010
3        4     Andrew Alberts  28  TOT   D  76  ...   88  216    0    1    0.0  2010
4        4     Andrew Alberts  28  CAR   D  62  ...   67  172    0    0    NaN  2010
5        4     Andrew Alberts  28  VAN   D  14  ...   21   44    0    1    0.0  2010
6        5  Daniel Alfredsson  37  OTT  RW  70  ...   36   41   14   25   35.9  2010
7        6        Bryan Allen  29  FLA   D  74  ...  137  120    0    0    NaN  2010
8        7        Cody Almond  20  MIN   C   7  ...    5    7   18   12   60.0  2010
9        8        Karl Alzner  21  WSH   D  21  ...   21   15    0    0    NaN  2010
10       9     Artem Anisimov  21  NYR   C  82  ...   41   45  310  380   44.9  2010
11      10       Nik Antropov  29  ATL   C  76  ...   35   82  481  627   43.4  2010
12      11    Colby Armstrong  27  ATL  RW  79  ...   29   74   10   10   50.0  2010
13      12    Derek Armstrong  36  STL   C   6  ...    0    4    7    8   46.7  2010
14      13       Jason Arnott  35  NSH   C  63  ...   17   24  526  551   48.8  2010
15      14        Dean Arsene  29  EDM   D  13  ...   13   18    0    0    NaN  2010
16      15   Evgeny Artyukhin  26  TOT  RW  54  ...   10  127    1    1   50.0  2010
17      15   Evgeny Artyukhin  26  ANA  RW  37  ...    8   90    0    1    0.0  2010
18      15   Evgeny Artyukhin  26  ATL  RW  17  ...    2   37    1    0  100.0  2010
19      16        Arron Asham  31  PHI  RW  72  ...   16   92    2   11   15.4  2010
20      17      Adrian Aucoin  36  PHX   D  82  ...   67  131    1    0  100.0  2010
21      18       Keith Aucoin  31  WSH   C   9  ...    0    2   31   25   55.4  2010
22      19         Sean Avery  29  NYR   C  69  ...   17  145    4   10   28.6  2010
23      20       David Backes  25  STL  RW  79  ...   60  266  504  561   47.3  2010
24      21    Mikael Backlund  20  CGY   C  23  ...    4   12  100   86   53.8  2010
25      22  Nicklas Backstrom  22  WSH   C  82  ...   61   90  657  660   49.9  2010
26      23        Josh Bailey  20  NYI   C  73  ...   36   67  171  255   40.1  2010
27      24      Keith Ballard  27  FLA   D  82  ...  201  156    0    0    NaN  2010
28      25         Krys Barch  29  DAL  RW  63  ...   13  120    0    3    0.0  2010
29      26         Cam Barker  23  TOT   D  70  ...   53   75    0    0    NaN  2010
   ...                ...  ..  ...  ..  ..  ...  ...  ...  ...  ...    ...   ...
10251  885      Chris Wideman  29  TOT   D  25  ...   26   35    0    0    NaN  2019
10252  885      Chris Wideman  29  OTT   D  19  ...   25   26    0    0    NaN  2019
10253  885      Chris Wideman  29  EDM   D   5  ...    1    7    0    0    NaN  2019
10254  885      Chris Wideman  29  FLA   D   1  ...    0    2    0    0    NaN  2019
10255  886    Justin Williams  37  CAR  RW  82  ...   32   55   92  150   38.0  2019
10256  887       Colin Wilson  29  COL   C  65  ...   31   55   20   32   38.5  2019
10257  888     Garrett Wilson  27  PIT  LW  50  ...   16  114    3    4   42.9  2019
10258  889       Scott Wilson  26  BUF   C  15  ...    2   29    1    2   33.3  2019
10259  890         Tom Wilson  24  WSH  RW  63  ...   52  200   29   24   54.7  2019
10260  891     Luke Witkowski  28  DET   D  34  ...   27   67    0    0    NaN  2019
10261  892  Christian Wolanin  23  OTT   D  30  ...   31   11    0    0    NaN  2019
10262  893         Miles Wood  23  NJD  LW  63  ...   27   97    0    2    0.0  2019
10263  894      Egor Yakovlev  27  NJD   D  25  ...   22   12    0    0    NaN  2019
10264  895    Kailer Yamamoto  20  EDM  RW  17  ...   11   18    0    0    NaN  2019
10265  896       Keith Yandle  32  FLA   D  82  ...   76   47    0    0    NaN  2019
10266  897        Pavel Zacha  21  NJD   C  61  ...   24   68  348  364   48.9  2019
10267  898       Filip Zadina  19  DET  RW   9  ...    3    6    3    3   50.0  2019
10268  899     Nikita Zadorov  23  COL   D  70  ...   67  228    0    0    NaN  2019
10269  900     Nikita Zaitsev  27  TOR   D  81  ...  151  139    0    0    NaN  2019
10270  901       Travis Zajac  33  NJD   C  80  ...   38   66  841  605   58.2  2019
10271  902       Jakub Zboril  21  BOS   D   2  ...    0    3    0    0    NaN  2019
10272  903     Mika Zibanejad  25  NYR   C  82  ...   66  134  830  842   49.6  2019
10273  904    Mats Zuccarello  31  TOT  LW  48  ...   43   57   10   20   33.3  2019
10274  904    Mats Zuccarello  31  NYR  LW  46  ...   42   57   10   20   33.3  2019
10275  904    Mats Zuccarello  31  DAL  LW   2  ...    1    0    0    0    NaN  2019
10276  905       Jason Zucker  27  MIN  LW  81  ...   38   87    2   11   15.4  2019
10277  906     Valentin Zykov  23  TOT  LW  28  ...    6   26    2    7   22.2  2019
10278  906     Valentin Zykov  23  CAR  LW  13  ...    2    6    2    6   25.0  2019
10279  906     Valentin Zykov  23  VEG  LW  10  ...    3   18    0    1    0.0  2019
10280  906     Valentin Zykov  23  EDM  LW   5  ...    1    2    0    0    NaN  2019

[10281 rows x 29 columns]
0 votes
/ August 5, 2020

Scraping heavily formatted tables with Beautiful Soup is clearly painful (not to bash Beautiful Soup; it is wonderful for several use cases). There is a small trick I use for scraping data surrounded by dense markup, if you are willing to be a bit utilitarian:

1. Select entire table on web page
2. Copy + paste into Evernote (simplifies and reformats the HTML)
3. Copy + paste from Evernote to Excel or another spreadsheet software (removes the HTML)
4. Save as .csv

Input: heavily formatted data surrounded by dense HTML
Output: minimally formatted data in CSV

It isn't perfect. There will be blank lines in the CSV, but blank lines are easier and far less time-consuming to remove than such data is to scrape. Good luck!
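
If removing the blank lines by hand gets tedious, a few lines of Python can do it. This is only a sketch: the CSV text below is a hypothetical example of what the copy/paste steps might leave behind, and `drop_blank_rows` is an illustrative helper name.

```python
import csv
import io

# Hypothetical CSV text as it might look after the copy/paste steps,
# with empty and comma-only lines left behind.
raw_csv = "Player,Age,Tm\n\nCraig Adams,32,PIT\n,,\nBryan Allen,29,FLA\n\n"


def drop_blank_rows(text):
    """Return the rows that have at least one non-empty cell."""
    reader = csv.reader(io.StringIO(text))
    return [row for row in reader if any(cell.strip() for cell in row)]


rows = drop_blank_rows(raw_csv)

# Write the surviving rows back out as CSV text.
cleaned = io.StringIO()
csv.writer(cleaned, lineterminator="\n").writerows(rows)
print(cleaned.getvalue())
```

The same function works on a real file if you pass it the file's contents read as text.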

As reference, I've linked my own conversions below.
