Yes, you can do this, provided your HTML always has this structure. Find all the columns (`td`) inside each row (`tr`), then check whether a column has any child `table` elements. If it does, collect the text of all the `p` tags in that column and replace the text of the first `p` tag with the combined text. Finally, call `decompose()` on every `table` in that column.
Code:
html='''<tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">0403</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Manufacture in which:</p>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">all the materials of Chapter 4 used are wholly obtained,</p>
</td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,</p>
<p class="normal">and</p>
</td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<colgroup><col width="4%">
<col width="96%">
</colgroup><tbody>
<tr>
<td valign="top">
<p class="normal">—</p>
</td>
<td valign="top">
<p class="normal">the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p>
</td>
</tr>
</tbody>
</table>
</td>
<td class="table" valign="top">
<p class="normal"> </p>
</td>
</tr>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for row in soup.find_all('tr', class_='table'):
    for col in row.find_all('td'):
        if col.find_all('table'):
            # Collect the text of every p tag in the column that contains tables
            ptag_text = ''.join(p.text for p in col.find_all('p'))
            # Replace the text of the first p tag with the combined text
            col.find('p').next_element.replace_with(ptag_text)
            # Remove the nested tables from the column
            for item in col.find_all('table'):
                item.decompose()
print(soup)
Output:
<html><body><tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">0403</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p>
</td>
<td class="table" valign="top">
<p class="normal"> </p>
</td>
</tr></body></html>
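The output above runs the pieces of text together (for example `originating,and`) because they are joined with an empty string. Below is a minimal sketch of the same steps wrapped in a function with a configurable separator. The function name `flatten_nested_tables`, the `sep` parameter, the shortened `sample` markup, and the use of the built-in `html.parser` (chosen so the sketch does not depend on `lxml`) are my assumptions and not part of the original code.

```python
from bs4 import BeautifulSoup


def flatten_nested_tables(html, sep=" "):
    """Merge the text of nested tables into the first <p> of each outer cell.

    Hypothetical helper: the name and the `sep` parameter are illustrative.
    """
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr", class_="table"):
        # Only the outer cells carry class="table" in the sample markup
        for col in row.find_all("td", class_="table"):
            first_p = col.find("p")
            if first_p is None or col.find("table") is None:
                continue
            # Collect the stripped text of every <p> in the cell, skipping blanks
            texts = [p.get_text(strip=True) for p in col.find_all("p")]
            first_p.string = sep.join(t for t in texts if t)
            # Remove nested tables one at a time until none are left
            table = col.find("table")
            while table is not None:
                table.decompose()
                table = col.find("table")
    return soup


# Shortened, made-up sample with the same shape as the markup above
sample = (
    '<tr class="table"><td class="table"><p class="tbl-txt">Manufacture in which:</p>'
    '<table><tbody><tr><td><p>-</p></td>'
    '<td><p>all materials are wholly obtained</p></td></tr></tbody></table>'
    '</td></tr>'
)
print(flatten_nested_tables(sample))
```

Passing `sep=""` should give the same concatenation style as the original loop, apart from the whitespace stripping done by `get_text(strip=True)`.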
If you don't want those newlines in the result, strip them all out as shown below.
finalhtml=str(soup).replace('\n','')
print(finalhtml)
Output:
<html><body><tr class="table"><td class="table" valign="top"><p class="tbl-cod">0403</p></td><td class="table" valign="top"><p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p></td><td class="table" valign="top"><p class="tbl-txt">Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p></td><td class="table" valign="top"><p class="normal"> </p></td></tr></body></html>
Now, if you want to pretty-print the result again, try this:
finalhtml=str(soup).replace('\n','')
soup=BeautifulSoup(finalhtml,'lxml')
print(soup.prettify(formatter=None))
Output:
<html>
 <body>
  <tr class="table">
   <td class="table" valign="top">
    <p class="tbl-cod">
     0403
    </p>
   </td>
   <td class="table" valign="top">
    <p class="tbl-txt">
     Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa
    </p>
   </td>
   <td class="table" valign="top">
    <p class="tbl-txt">
     Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product
    </p>
   </td>
   <td class="table" valign="top">
    <p class="normal">
    </p>
   </td>
  </tr>
 </body>
</html>
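All of the outputs above are wrapped in `<html><body>`, which the `lxml` parser adds around the fragment. If you would rather keep just the `<tr>` fragment, one option is Python's built-in `html.parser`, which does not add those wrappers. This is only a sketch on a shortened, made-up fragment; swapping the parser is my suggestion, not something the code above does.

```python
from bs4 import BeautifulSoup

# Shortened, made-up fragment in the same shape as the row above
fragment = '<tr class="table"><td class="table"><p class="tbl-cod">0403</p></td></tr>'

# html.parser keeps the fragment as it is, without <html>/<body> wrappers
soup = BeautifulSoup(fragment, "html.parser")
compact = str(soup)
print(compact)

# prettify() still works on the fragment if you want indented output
print(soup.prettify())
```

Note that `html.parser` is more lenient about structure than `lxml`, so badly broken markup may be repaired differently by the two parsers.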