Is there a way to programmatically edit nested tables in an HTML file using BeautifulSoup?
2 votes
/ January 15, 2020

I am scraping a table from a web page with BeautifulSoup. I managed to write its text to a text file.

However, some cells contain several tables of their own. I assume the developers had some aesthetic directive and could not format the cell any other way to meet their requirements. I am having a lot of trouble scraping the tables as they are, so I was wondering whether there is a way to programmatically edit the HTML so that the text from these nested tables is pulled up into the original cell.

Here is an example of what I mean.

From a nested table like this

<tr class="table">
             <td class="table" valign="top">
                <p class="tbl-cod">0403</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Buttermilk, curdled milk and&nbsp;cream, yoghurt, kephir and other fermented or acidified milk and&nbsp;cream, whether or not concentrated or&nbsp;containing added sugar or other sweetening matter or flavoured or&nbsp;containing added fruit, nuts or&nbsp;cocoa</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Manufacture in which:</p>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—</p>
                         </td>
                         <td valign="top">
                            <p class="normal">all the materials of Chapter&nbsp;4 used are wholly obtained,</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—</p>
                         </td>
                         <td valign="top">
                            <p class="normal">all the fruit juice (except that of pineapple, lime or&nbsp;grapefruit) of heading&nbsp;2009 used is originating,</p>
                            <p class="normal">and</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—</p>
                         </td>
                         <td valign="top">
                            <p class="normal">the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;% of the ex-works price of the product</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
             </td>
             <td class="table" valign="top">
                <p class="normal">&nbsp;</p>
             </td>
          </tr>

I would like to edit the HTML file to get

<tr class="table">
             <td class="table" valign="top">
                <p class="tbl-cod">0403</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Buttermilk, curdled milk and&nbsp;cream, yoghurt, kephir and other fermented or acidified milk and&nbsp;cream, whether or not concentrated or&nbsp;containing added sugar or other sweetening matter or flavoured or&nbsp;containing added fruit, nuts or&nbsp;cocoa</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Manufacture in which: all the materials of Chapter&nbsp;4 used are wholly obtained, — all the fruit juice (except that of pineapple, lime or&nbsp;grapefruit) of heading&nbsp;2009 used is originating, — the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;% of the ex-works price of the product</p>
             </td>
             <td class="table" valign="top">
                <p class="normal">&nbsp;</p>
             </td>
          </tr>

for every nested table inside a cell.

1 Answer

1 vote
/ January 15, 2020

Yes, you can do this, provided your HTML always has this shape. Find all the columns (td) inside each row (tr), then check whether the column has any table descendants. If it does, collect the text of all the p tags in that column and replace the text of the first p tag with it. Finally, decompose() every table tag in the column.

Code:

html='''<tr class="table">
             <td class="table" valign="top">
                <p class="tbl-cod">0403</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Buttermilk, curdled milk and&nbsp;cream, yoghurt, kephir and other fermented or acidified milk and&nbsp;cream, whether or not concentrated or&nbsp;containing added sugar or other sweetening matter or flavoured or&nbsp;containing added fruit, nuts or&nbsp;cocoa</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Manufacture in which:</p>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—</p>
                         </td>
                         <td valign="top">
                            <p class="normal">all the materials of Chapter&nbsp;4 used are wholly obtained,</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—</p>
                         </td>
                         <td valign="top">
                            <p class="normal">all the fruit juice (except that of pineapple, lime or&nbsp;grapefruit) of heading&nbsp;2009 used is originating,</p>
                            <p class="normal">and</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—</p>
                         </td>
                         <td valign="top">
                            <p class="normal">the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;% of the ex-works price of the product</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
             </td>
             <td class="table" valign="top">
                <p class="normal">&nbsp;</p>
             </td>
          </tr>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for row in soup.find_all('tr', class_='table'):
    for col in row.find_all('td'):
        # Only process cells that contain nested tables
        if col.find_all('table'):
            # Collect the text of every p tag in this cell, including those inside the nested tables
            ptag_text = ''.join(p.text for p in col.find_all('p'))
            # Replace the text of the first p tag with the combined text
            col.find('p').next_element.replace_with(ptag_text)
            # Remove the nested tables from the cell
            for item in col.find_all('table'):
                item.decompose()

print(soup)

Output:

<html><body><tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">0403</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p>



</td>
<td class="table" valign="top">
<p class="normal"> </p>
</td>
</tr></body></html>
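
The output above joins the paragraphs with no separator ("which:—all", "originating,and—"), while the question asked for spaces between the pieces. Below is a minimal sketch of a variant that joins with a space. The function name `flatten_nested_tables`, the `sep` parameter, and the shortened `sample` markup are hypothetical and not part of the original answer. It uses the built-in `html.parser` instead of `lxml`, and it assumes the outer rows carry `class="table"` as in the question.

```python
from bs4 import BeautifulSoup


def flatten_nested_tables(markup, sep=" "):
    """Fold the text of nested tables into the first <p> of each outer cell.

    Hypothetical helper: assumes outer rows are <tr class="table">.
    """
    soup = BeautifulSoup(markup, "html.parser")
    for row in soup.find_all("tr", class_="table"):
        # recursive=False keeps us on the row's own cells, not cells of nested tables
        for cell in row.find_all("td", recursive=False):
            if cell.find("table") is None:
                continue
            # Text of every <p> in the cell, nested ones included, in document order
            parts = [p.get_text(" ", strip=True) for p in cell.find_all("p")]
            # Reuse the cell's own first <p>, or create one if the cell has none
            first = cell.find("p", recursive=False)
            if first is None:
                first = soup.new_tag("p")
                cell.insert(0, first)
            first.string = sep.join(part for part in parts if part)
            # Remove nested tables one at a time until none are left
            table = cell.find("table")
            while table is not None:
                table.decompose()
                table = cell.find("table")
    return soup


# Shortened stand-in for the markup in the question; \u2014 is the dash bullet
sample = """<table><tr class="table">
<td class="table"><p class="tbl-cod">0403</p></td>
<td class="table"><p class="tbl-txt">Manufacture in which:</p>
<table><tbody><tr><td><p class="normal">\u2014</p></td>
<td><p class="normal">all the materials used are wholly obtained</p></td></tr></tbody></table>
</td></tr></table>"""

flattened = flatten_nested_tables(sample)
print(flattened.find("p", class_="tbl-txt").get_text())
```

Like the original code, this sketch copies the text of every p in the cell, so a cell with extra paragraphs outside the nested tables would keep them and also see their text repeated in the first paragraph.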

If you do not want those leftover newlines, strip all newlines as shown below.

finalhtml=str(soup).replace('\n','')
print(finalhtml)

Output:

<html><body><tr class="table"><td class="table" valign="top"><p class="tbl-cod">0403</p></td><td class="table" valign="top"><p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p></td><td class="table" valign="top"><p class="tbl-txt">Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p></td><td class="table" valign="top"><p class="normal"> </p></td></tr></body></html>
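
Note that `str(soup).replace('\n','')` also removes newlines that sit inside real text, which can glue two words together. As a hedged alternative, the sketch below removes only the text nodes that consist entirely of spaces, tabs, and newlines. The helper name `drop_blank_text_nodes` and the sample markup are hypothetical. Non-breaking spaces are deliberately left alone so that cells such as `<p class="normal">&nbsp;</p>` keep their content.

```python
from bs4 import BeautifulSoup


def drop_blank_text_nodes(soup):
    """Remove text nodes made up only of spaces, tabs and newlines (hypothetical helper)."""
    # find_all returns a list, so extracting nodes while looping is safe
    for node in soup.find_all(string=True):
        # Strip only ASCII whitespace so a lone non-breaking space survives
        if not node.strip(" \t\r\n"):
            node.extract()
    return soup


doc = BeautifulSoup(
    "<tr>\n<td>\n<p>line one\nline two</p>\n<p>&nbsp;</p>\n</td>\n</tr>",
    "html.parser",
)
drop_blank_text_nodes(doc)
print(str(doc))
```

The newline between "line one" and "line two" is kept because it belongs to a text node that also holds real text. This sketch does not special-case `pre` elements, where whitespace-only nodes may matter.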

Now, if you want to pretty-print the result again, try this:

finalhtml=str(soup).replace('\n','')
soup=BeautifulSoup(finalhtml,'lxml')
print(soup.prettify(formatter=None))

Output:

<html>
 <body>
  <tr class="table">
   <td class="table" valign="top">
    <p class="tbl-cod">
     0403
    </p>
   </td>
   <td class="table" valign="top">
    <p class="tbl-txt">
     Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa
    </p>
   </td>
   <td class="table" valign="top">
    <p class="tbl-txt">
     Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product
    </p>
   </td>
   <td class="table" valign="top">
    <p class="normal">
    </p>
   </td>
  </tr>
 </body>
</html>
...