How to get data from an HTML row, including the ID from a tag inside it
0 votes
/ October 19, 2019

I have a sample of HTML text, shown below:

..........
<a href="d?racename=&country=1000&startmonth=1&endmonth=10&startdate=2018&enddate=2019&maxdist=unlimitied&class=any&x=1&order=winner&z=Px_8iD">Winner</a>
</th>
<th background="b8.gif" width="30" title="Winning time - click on this header to sort results by this column">
    <a href="d?racename=&country=1000&startmonth=1&endmonth=10&startdate=2018&enddate=2019&maxdist=unlimitied&class=any&x=1&order=wintime&z=Px_8iD">Wintime</a>
</th>
<th background="b8.gif" title="races with icon have video available for download">Film</th>
</tr>\n<tr>
    <td><a href="d?r=4552510&z=Px_8iD">OAKS AT LOGAN PARK (1-2 WINS)</a></td>
    <td>Warragul</td>
    <td>18;OCT;2019</td>
    <td>7</td>
    <td>GR;Tier</td>
    <td>460;503</td>
    <td><a href="d?i=2390975">Madalia Ken</a></td>
    <td>26.00</td>
    <td></td>
</tr>\n<tr bgcolor="#cccccc">
    <td><a href="d?r=4552511&z=Px_8iD">AUSTRALIAN QUALITY PET FOODS</a></td>
    <td>Warragul</td>
    <td>18;OCT;2019</td>
    <td>8</td>
    <td>GR;Grad</td>
    <td>460;503</td>
    <td><a href="d?i=2304665">Midnight Storm</a></td>
    <td>26.24</td>
    <td></td>
</tr>\n<tr>
    <td><a href="d?r=4552512&z=Px_8iD">EAST IVANHOE GROCERS</a></td>
    <td>Warragul</td>
    <td>18;OCT;2019</td>
    <td>9</td>
    <td>GR;Grad</td>
    <td>400;437</td>
    <td><a href="d?i=2362422">Early Promise</a></td>
    <td>23.15</td>
    <td></td>
</tr>

I need to extract the data into one column per value, as shown below:

                                                                    row 1
\n<tr ><td><a href="d?r=4552510&z=Px_8iD">                                  column name = "r_ID" , value = 4552510
OAKS AT LOGAN PARK (1-2 WINS)</a></td>                                      column name = "r_name" , value = OAKS AT LOGAN PARK (1-2 WINS)
<td>Warragul</td>                                                           column name = "s_name" , value = Warragul
<td>18;OCT;2019</td>                                                        column name = "date" , value = 18;OCT;2019
<td>7</td>                                                                  column name = "h" , value = 7
<td>GR;Tier</td>                                                            column name = "g" , value = GR;Tier
<td>460;503</td>                                                            column name = "d" , value = 460;503
<td><a href="d?i=2390975">                                                  column name = "w_ID" , value = 2390975
Madalia Ken</a></td>                                                        column name = "w_name" , value = Madalia Ken
<td>26.00</td>                                                              column name = "wt" , value = 26.00
<td></td></tr>                                                              column name = "f" , value = ''
                                                                    row 2
\n<tr  bgcolor="#cccccc" ><td><a href="d?r=4552511&z=Px_8iD">               column name = "r_ID" , value = 4552511
AUSTRALIAN QUALITY PET FOODS</a></td>                                       column name = "r_name" , value = AUSTRALIAN QUALITY PET FOODS
<td>Warragul</td>                                                           column name = "s_name" , value = Warragul
<td>18;OCT;2019</td>                                                        column name = "date" , value = 18;OCT;2019
<td>8</td>                                                                  column name = "h" , value = 8
<td>GR;Grad</td>                                                            column name = "g" , value = GR;Grad
<td>460;503</td>                                                            column name = "d" , value = 460;503
<td><a href="d?i=2304665">                                                  column name = "w_ID" , value = 2304665
Midnight Storm</a></td>                                                     column name = "w_name" , value = Midnight Storm
<td>26.24</td>                                                              column name = "wt" , value = 26.24
<td></td></tr>                                                              column name = "f" , value = ''
                                                                    row 3
\n<tr ><td><a href="d?r=4552512&z=Px_8iD">                                  column name = "r_ID" , value = 4552512
EAST IVANHOE GROCERS</a></td>                                               column name = "r_name" , value = EAST IVANHOE GROCERS
<td>Warragul</td>                                                           column name = "s_name" , value = Warragul
<td>18;OCT;2019</td>                                                        column name = "date" , value = 18;OCT;2019
<td>9</td>                                                                  column name = "h" , value = 9
<td>GR;Grad</td>                                                            column name = "g" , value = GR;Grad
<td>400;437</td>                                                            column name = "d" , value = 400;437
<td><a href="d?i=2362422">                                                  column name = "w_ID" , value = 2362422
Early Promise</a></td>                                                      column name = "w_name" , value = Early Promise
<td>23.15</td>                                                              column name = "wt" , value = 23.15
<td></td></tr>                                                              column name = "f" , value = ''

I tried BeautifulSoup, but it does not work because: 1) part of the data is inside a tag; 2) when I use soup=getPage(url).find("table"), part of the markup turns into &gt;, for example: <a href="d?i=2383236">Porsche Monelli / a &gt; / t d &gt; t d &gt; 2 2 . 8 8 / t d &gt; t d &gt; / t d &gt; / t r &gt;

Any help? Thanks.

Answers [ 2 ]

0 votes
/ October 20, 2019

@chitown88

How can I read a table nested inside another table as two separate tables? Thanks a lot!

<table style="border-width:0px;width:100%;">

                    <tr valign="middle">

                        <td style="width:400px;"><span><span style='font-size: 12px;'>Race 1</span><br /><br /></span><span><span style='font-size: 12px;'><strong>Grade:</strong>&nbsp;&nbsp;M&nbsp;&nbsp;&nbsp;400 metres</span>
                            <br /></span>
                            <span><span style='font-size: 12px;'><strong>Prize Money:</strong> $1180</span>&nbsp;&nbsp;&nbsp;$825 - $235 - $120<br /><br /></span>
                            <table>

                            <tr valign="middle">

                                <td style="width:105px;"><span>Race Time:</span></td><td align="left" style="width:50px;"><span>(8.44)</span></td><td align="left" style="width:50px;"><span>(0.00)</span></td><td align="left" style="width:50px;"><span>(22.95)</span></td><td></td>

                            </tr><tr valign="middle">

                                <td style="width:105px;"><span>Sectional Time:</span></td><td align="left" style="width:50px;"><span>8.44</span></td><td align="left" style="width:50px;"><span>0.00</span></td><td align="left" style="width:50px;"><span>14.51</span></td><td></td>

                            </tr><tr valign="middle">

                                <td style="width:150px;"><span>1<sup>st</sup> In-Running Position:</span></td><td colspan="4"><span><img src='/Images/BoxNumber1_s.gif' width='20px' alt='1' />&nbsp;<img src='/Images/BoxNumber5_s.gif' width='20px' alt='5' />&nbsp;<img src='/Images/BoxNumber2_s.gif' width='20px' alt='2' />&nbsp;<img src='/Images/BoxNumber4_s.gif' width='20px' alt='4' />&nbsp;<img src='/Images/BoxNumber7_s.gif' width='20px' alt='7' />&nbsp;</span></td>

                            </tr><tr valign="middle">

                                <td><span>2<sup>nd</sup> In-Running Position:</span></td><td colspan="4"><span><img src='/Images/BoxNumber1_s.gif' width='20px' alt='1' />&nbsp;<img src='/Images/BoxNumber5_s.gif' width='20px' alt='5' />&nbsp;<img src='/Images/BoxNumber2_s.gif' width='20px' alt='2' />&nbsp;<img src='/Images/BoxNumber7_s.gif' width='20px' alt='7' />&nbsp;<img src='/Images/BoxNumber4_s.gif' width='20px' alt='4' />&nbsp;</span></td>

                            </tr>

                            </table>
                        </td>
                        <td class="ResultsPageRightColumn" valign="bottom"></td>

                    </tr>

                </table>
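
One possible way to treat the nested table separately, offered as a minimal sketch rather than a tested solution for the real page: detach the inner <table> from the tree with BeautifulSoup's extract(), then read each table on its own. The HTML below is a shortened stand-in for the markup above, and the helper name table_rows is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the nested markup in the question
html = """
<table>
  <tr>
    <td><span>Race 1</span>
      <table>
        <tr><td><span>Race Time:</span></td><td><span>(8.44)</span></td></tr>
        <tr><td><span>Sectional Time:</span></td><td><span>8.44</span></td></tr>
      </table>
    </td>
    <td class="ResultsPageRightColumn"></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("table")

# extract() removes the inner table from the outer one and returns it,
# so the two tables no longer share any rows or cells
inner = outer.find("table").extract()


def table_rows(table):
    """Return the text of every <td>, grouped by <tr> (illustrative helper)."""
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]


inner_rows = table_rows(inner)
outer_rows = table_rows(outer)
print(inner_rows)
print(outer_rows)
```

Without the extract() call, outer.find_all("tr") would also return the rows of the inner table, because find_all searches all descendants by default.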
0 votes
/ October 20, 2019

You just need to iterate over the rows and catch the <a> tags to pull the IDs out of their href attributes. I dump each row's data into a dictionary, then turn that into a single row that is added to the dataframe. The last step is simply renaming the columns.

from bs4 import BeautifulSoup
import re
import pandas as pd

html_doc = """<a href="d?racename=&country=1000&startmonth=1&endmonth=10&startdate=2018&enddate=2019&maxdist=unlimitied&class=any&x=1&order=winner&z=Px_8iD">Winner</a>
</th>
<th background="b8.gif" width="30" title="Winning time - click on this header to sort results by this column">
    <a href="d?racename=&country=1000&startmonth=1&endmonth=10&startdate=2018&enddate=2019&maxdist=unlimitied&class=any&x=1&order=wintime&z=Px_8iD">Wintime</a>
</th>
<th background="b8.gif" title="races with icon have video available for download">Film</th>
</tr>\n<tr>
    <td><a href="d?r=4552510&z=Px_8iD">OAKS AT LOGAN PARK (1-2 WINS)</a></td>
    <td>Warragul</td>
    <td>18;OCT;2019</td>
    <td>7</td>
    <td>GR;Tier</td>
    <td>460;503</td>
    <td><a href="d?i=2390975">Madalia Ken</a></td>
    <td>26.00</td>
    <td></td>
</tr>\n<tr bgcolor="#cccccc">
    <td><a href="d?r=4552511&z=Px_8iD">AUSTRALIAN QUALITY PET FOODS</a></td>
    <td>Warragul</td>
    <td>18;OCT;2019</td>
    <td>8</td>
    <td>GR;Grad</td>
    <td>460;503</td>
    <td><a href="d?i=2304665">Midnight Storm</a></td>
    <td>26.24</td>
    <td></td>
</tr>\n<tr>
    <td><a href="d?r=4552512&z=Px_8iD">EAST IVANHOE GROCERS</a></td>
    <td>Warragul</td>
    <td>18;OCT;2019</td>
    <td>9</td>
    <td>GR;Grad</td>
    <td>400;437</td>
    <td><a href="d?i=2362422">Early Promise</a></td>
    <td>23.15</td>
    <td></td>
</tr>"""

soup = BeautifulSoup(html_doc, 'html.parser')

rows = soup.find_all('tr')

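frames = []
for row in rows:
    data = row.find_all('td')
    data_dict = {}
    idx = 0
    for each in data:
        try:
            # Cells that link to a race (d?r=...) or a winner (d?i=...) yield two columns
            if 'd?r' in each.find('a')['href'] or 'd?i' in each.find('a')['href']:
                rid = each.find('a')['href']
                # The first run of digits in the href is the ID
                temp = re.findall(r'\d+', rid)
                res = list(map(int, temp))

                data_dict[idx] = res[0]
                idx+=1
                data_dict[idx] = each.find('a').text
                idx+=1
                continue

        except (TypeError, KeyError, IndexError):
            # No <a> tag, no href, or no digits: fall through and store the cell text
            pass

        data_dict[idx] = each.text
        idx+=1

    # DataFrame.append was removed in pandas 2.0, so collect one-row frames instead
    frames.append(pd.DataFrame([data_dict]))

df = pd.concat(frames, ignore_index=True)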

cols = ["r_ID" ,"r_name" ,"s_name" , "date" ,"h" , "g" ,"d" , "w_ID" , 
        "w_name" , "wt" , "f"]

df.columns = cols

Output:

print (df.to_string())
      r_ID                         r_name    s_name         date  h        g        d     w_ID          w_name     wt f
0  4552510  OAKS AT LOGAN PARK (1-2 WINS)  Warragul  18;OCT;2019  7  GR;Tier  460;503  2390975     Madalia Ken  26.00  
1  4552511   AUSTRALIAN QUALITY PET FOODS  Warragul  18;OCT;2019  8  GR;Grad  460;503  2304665  Midnight Storm  26.24  
2  4552512           EAST IVANHOE GROCERS  Warragul  18;OCT;2019  9  GR;Grad  400;437  2362422   Early Promise  23.15  
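
The regex in the code above takes the first run of digits in the href, which works for these links. A possibly more explicit alternative, sketched here under the assumption that the IDs always sit in the r or i query parameter as in the sample, is to parse the query string with the standard library. The function name link_id is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def link_id(href, key):
    """Return query parameter `key` of `href` as an int, or None if it is absent."""
    query = urlsplit(href).query          # e.g. "r=4552510&z=Px_8iD"
    values = parse_qs(query).get(key)     # e.g. ["4552510"], or None
    return int(values[0]) if values else None


print(link_id("d?r=4552510&z=Px_8iD", "r"))
print(link_id("d?i=2390975", "i"))
```

This reads the parameter by name instead of relying on digit order, so a number appearing elsewhere in the URL would not be picked up by mistake. A non-numeric value would raise ValueError from int().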