Прочитайте HTML-таблицу правильно - PullRequest
1 голос
/ 21 октября 2019

Вот таблица HTML:

<table width="100%" cellpadding="4" cellspacing="0" style="page-break-before: always">
        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <col width="32*"/>

        <tr valign="top">
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">A</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">B</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">C</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">D</font></font></font></p>
                </td>
        </tr>
        <tr valign="top">
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">E</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">F</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">G</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">H</font></font></font></p>
                </td>
        </tr>
        <tr valign="top">
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">I</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">J</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">K</font></font></font></p>
                </td>
                <td colspan="2" width="25%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">L</font></font></font></p>
                </td>
        </tr>
        <tr valign="top">
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">M</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">M2</font></font></font></p>
                </td>
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">N</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">N2</font></font></font></p>
                </td>
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">O</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">O2</font></font></font></p>
                </td>
                <td width="12%" style="background: transparent" style="border: none; padding: 0cm"><p align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">P</font></font></font></p>
                </td>
                <td width="13%" style="background: transparent" style="border: none; padding: 0cm"><p lang="ru-RU" align="left" style="font-variant: normal; font-style: normal; font-weight: normal; text-decoration: none">
                        <font color="#000000"><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt">P2</font></font></font></p>
                </td>
        </tr>
</table>

Последняя строка здесь содержит в 2 раза больше столбцов, чем другие. Когда я пытаюсь прочитать его в фрейм данных Pandas, я получаю такой результат:

table = pd.read_html('1111.html')
table[0]

   0   1  2   3  4   5  6   7
0  A   A  B   B  C   C  D   D
1  E   E  F   F  G   G  H   H
2  I   I  J   J  K   K  L   L
3  M  M2  N  N2  O  O2  P  P2

Как правильно прочитать его без дублирования? Мне не нужен последний ряд.

1 Ответ

0 голосов
/ 21 октября 2019

Вы можете использовать BeautifulSoup, чтобы проанализировать таблицу и затем преобразовать результаты в кадр данных:

import pandas as pd
from bs4 import BeautifulSoup as soup
df = pd.DataFrame([[k[1:-1] for i in b.find_all('td') if (k:=i.text) is not None] for b in soup(html, 'html.parser').table.find_all('tr')])

Вывод:

   0   1  2   3     4     5     6     7
0  A   B  C   D  None  None  None  None
1  E   F  G   H  None  None  None  None
2  I   J  K   L  None  None  None  None
3  M  M2  N  N2     O    O2     P    P2

Редактировать: решение без выражения присваивания:

df = pd.DataFrame([[i.text[1:-1] if i else i for i in b.find_all('td')] for b in soup(html, 'html.parser').table.find_all('tr')])

Выход:

   0   1  2   3     4     5     6     7
0  A   B  C   D  None  None  None  None
1  E   F  G   H  None  None  None  None
2  I   J  K   L  None  None  None  None
3  M  M2  N  N2     O    O2     P    P2
Добро пожаловать на сайт PullRequest, где вы можете задавать вопросы и получать ответы от других членов сообщества.
...