BeautifulSoup не умеет анализировать содержимое таблицы - PullRequest
0 голосов
/ 21 марта 2020

Я пытаюсь почистить данные из таблицы в ссылке. https://www.chp.ca.gov/traffic

Это то, что я пробовал, но не получилось.

wd = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
wd.get("https://www.chp.ca.gov/traffic")
html = wd.page_source
soup = BeautifulSoup(html, "lxml")
l = []
div = soup.find("div" , {"id": "pnlIncidents"})
table = div.find("table", {"id":"gvIncidents"})
​
for row in table.findAll(a):
    l.append(row.text)

HTML

<div id="pnlIncidents" style="overflow-y:scroll;">


                    <div>
        <table tabindex="1" cellspacing="0" rules="rows" border="1" id="gvIncidents" style="border-collapse:collapse;">
            <tbody><tr class="gvHeader" style="white-space:nowrap;">
                <th tabindex="1" scope="col">Details</th><th tabindex="1" scope="col">No.</th><th tabindex="1" scope="col" style="white-space:nowrap;">Time</th><th tabindex="1" scope="col">Type</th><th tabindex="1" scope="col">Location</th><th tabindex="1" scope="col">Location Desc.</th><th tabindex="1" scope="col">Area</th>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$0')">Details</a></td><td>00082</td><td style="white-space:nowrap;">9:35 AM</td><td>Hit and Run w/Injuries</td><td>Nb Sr99 Jno Merle Haggard Dr</td><td>NB SR99 JNO Merle Haggard Dr</td><td>Bakersfield</td>
            </tr><tr class="gvAltRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$1')">Details</a></td><td>00002</td><td style="white-space:nowrap;">12:00 AM</td><td>Traffic Advisory</td><td>Bakersfield Traffic Advisories</td><td>Bakersfield Traffic Advisories</td><td>BF</td>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$2')">Details</a></td><td>00091</td><td style="white-space:nowrap;">11:02 AM</td><td>CLOSURE of a Road</td><td>Cerro Noroeste Rd / Klipstein Canyon Rd</td><td>&nbsp;</td><td>Fort Tejon</td>
            </tr><tr class="gvAltRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$3')">Details</a></td><td>00074</td><td style="white-space:nowrap;">10:15 AM</td><td>CLOSURE of a Road</td><td>Klipstein Canyon Rd / Sr166</td><td>&nbsp;</td><td>Buttonwillow</td>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$4')">Details</a></td><td>00073</td><td style="white-space:nowrap;">10:14 AM</td><td>CLOSURE of a Road</td><td>Mil Potrero Hwy / Cerro Noroeste Rd</td><td>&nbsp;</td><td>Fort Tejon</td>
            </tr>
        </tbody></table>
    </div>


</div>

1 Ответ

1 голос
/ 21 марта 2020

Я разместил код запроса в качестве комментария, вы можете раскомментировать, чтобы получить данные непосредственно с сайта. Но так как веб-сайт недоступен в моем местоположении, я обработал его для вашего HTML, как показано ниже: -

#import requests
import pandas as pd

html = ''' 
<div id="pnlIncidents" style="overflow-y:scroll;">


                    <div>
        <table tabindex="1" cellspacing="0" rules="rows" border="1" id="gvIncidents" style="border-collapse:collapse;">
            <tbody><tr class="gvHeader" style="white-space:nowrap;">
                <th tabindex="1" scope="col">Details</th><th tabindex="1" scope="col">No.</th><th tabindex="1" scope="col" style="white-space:nowrap;">Time</th><th tabindex="1" scope="col">Type</th><th tabindex="1" scope="col">Location</th><th tabindex="1" scope="col">Location Desc.</th><th tabindex="1" scope="col">Area</th>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$0')">Details</a></td><td>00082</td><td style="white-space:nowrap;">9:35 AM</td><td>Hit and Run w/Injuries</td><td>Nb Sr99 Jno Merle Haggard Dr</td><td>NB SR99 JNO Merle Haggard Dr</td><td>Bakersfield</td>
            </tr><tr class="gvAltRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$1')">Details</a></td><td>00002</td><td style="white-space:nowrap;">12:00 AM</td><td>Traffic Advisory</td><td>Bakersfield Traffic Advisories</td><td>Bakersfield Traffic Advisories</td><td>BF</td>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$2')">Details</a></td><td>00091</td><td style="white-space:nowrap;">11:02 AM</td><td>CLOSURE of a Road</td><td>Cerro Noroeste Rd / Klipstein Canyon Rd</td><td>&nbsp;</td><td>Fort Tejon</td>
            </tr><tr class="gvAltRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$3')">Details</a></td><td>00074</td><td style="white-space:nowrap;">10:15 AM</td><td>CLOSURE of a Road</td><td>Klipstein Canyon Rd / Sr166</td><td>&nbsp;</td><td>Buttonwillow</td>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$4')">Details</a></td><td>00073</td><td style="white-space:nowrap;">10:14 AM</td><td>CLOSURE of a Road</td><td>Mil Potrero Hwy / Cerro Noroeste Rd</td><td>&nbsp;</td><td>Fort Tejon</td>
            </tr>
        </tbody></table>
    </div>


</div>
'''
tables = pd.read_html(html)

#url = 'Enter your URL'
#html = requests.get(url).content 
df_list = pd.read_html(html)
df = df_list[-1]
print(df)

Редактирование с использованием BeautifulSoup

from bs4 import BeautifulSoup
html = ''' 
<div id="pnlIncidents" style="overflow-y:scroll;">


                    <div>
        <table tabindex="1" cellspacing="0" rules="rows" border="1" id="gvIncidents" style="border-collapse:collapse;">
            <tbody><tr class="gvHeader" style="white-space:nowrap;">
                <th tabindex="1" scope="col">Details</th><th tabindex="1" scope="col">No.</th><th tabindex="1" scope="col" style="white-space:nowrap;">Time</th><th tabindex="1" scope="col">Type</th><th tabindex="1" scope="col">Location</th><th tabindex="1" scope="col">Location Desc.</th><th tabindex="1" scope="col">Area</th>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$0')">Details</a></td><td>00082</td><td style="white-space:nowrap;">9:35 AM</td><td>Hit and Run w/Injuries</td><td>Nb Sr99 Jno Merle Haggard Dr</td><td>NB SR99 JNO Merle Haggard Dr</td><td>Bakersfield</td>
            </tr><tr class="gvAltRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$1')">Details</a></td><td>00002</td><td style="white-space:nowrap;">12:00 AM</td><td>Traffic Advisory</td><td>Bakersfield Traffic Advisories</td><td>Bakersfield Traffic Advisories</td><td>BF</td>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$2')">Details</a></td><td>00091</td><td style="white-space:nowrap;">11:02 AM</td><td>CLOSURE of a Road</td><td>Cerro Noroeste Rd / Klipstein Canyon Rd</td><td>&nbsp;</td><td>Fort Tejon</td>
            </tr><tr class="gvAltRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$3')">Details</a></td><td>00074</td><td style="white-space:nowrap;">10:15 AM</td><td>CLOSURE of a Road</td><td>Klipstein Canyon Rd / Sr166</td><td>&nbsp;</td><td>Buttonwillow</td>
            </tr><tr class="gvRow" align="left" style="white-space:nowrap;">
                <td class="gvSelectColumn"><a href="javascript:__doPostBack('gvIncidents','Select$4')">Details</a></td><td>00073</td><td style="white-space:nowrap;">10:14 AM</td><td>CLOSURE of a Road</td><td>Mil Potrero Hwy / Cerro Noroeste Rd</td><td>&nbsp;</td><td>Fort Tejon</td>
            </tr>
        </tbody></table>
    </div>


</div>
'''



soup = BeautifulSoup(html, "html.parser")
tables = soup.find('table')
table_rows = tables.find_all('tr')

res = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text.strip() for tr in td if tr.text.strip()]
    if row:
        res.append(row)
Добро пожаловать на сайт PullRequest, где вы можете задавать вопросы и получать ответы от других членов сообщества.
...