Try the following code. For each USER_ID cell it uses find_all_next('td') to walk the cells that follow, and it breaks out of the loop as soon as the next USER_ID cell is reached.
import re
from bs4 import BeautifulSoup
html='''<table align="center" border="0" style="width:550px">
<tbody>
<tr>
<td colspan="2">USER_ID 11111</td>
</tr>
<tr>
<td colspan="2">string_a</td>
</tr>
<tr>
<td colspan="2"><strong>content: aaa</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://aaa.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2">USER_ID 22222</td>
</tr>
<tr>
<td colspan="2">string_b</td>
</tr>
<tr>
<td colspan="2"><strong>content: bbb</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://aaa.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2">USER_ID 33333</td>
</tr>
<tr>
<td colspan="2">string_c</td>
</tr>
<tr>
<td colspan="2"><strong>content: ccc</strong></td>
</tr>
<tr>
<td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
</tr>
<tr>
<td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td>
</tr>
<tr>
<td colspan="2"><strong>URL:https://ccc.com</strong></td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
<tr>
<td colspan="2"> </td>
</tr>
</tbody>
</table>'''
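soup = BeautifulSoup(html, 'html.parser')
final_list = []
# Each <td> whose text contains "USER_ID" starts a new record.
# (string= is the current name for the older text= argument.)
for item in soup.find_all('td', string=re.compile("USER_ID")):
    row_list = [item.text.strip()]
    # Walk every <td> that follows, stopping at the next USER_ID cell.
    for sibling in item.find_all_next('td'):
        if "USER_ID" in sibling.text:
            break
        # Skip the blank spacer cells.
        if sibling.text.strip() != '':
            row_list.append(sibling.text.strip())
    final_list.append(row_list)
print(final_list)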
Output:
[['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']]
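If the grouped lists need to be looked up by field name afterwards, each one can be turned into a dict. This is only a rough sketch, not part of the original answer: the helper name row_to_record and the 'user_id' and 'label' keys are made up for illustration, and it assumes every field cell has the form "name:value", as in the sample table.

```python
def row_to_record(row):
    """Turn one grouped row (a list of cell strings) into a dict."""
    # 'USER_ID 11111' -> '11111'
    record = {"user_id": row[0].split()[-1]}
    for cell in row[1:]:
        # Split on the first colon only, so URLs keep their 'https://' part.
        key, sep, value = cell.partition(":")
        if sep:
            record[key.strip()] = value.strip()
        else:
            # Cells without a colon (e.g. 'string_c') are collected
            # under a hypothetical 'label' key.
            record.setdefault("label", []).append(cell)
    return record


sample_row = ['USER_ID 33333', 'string_c', 'content: ccc',
              'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59',
              'PID:ABCDE', 'URL:https://ccc.com']
record = row_to_record(sample_row)
print(record)
```

Applied to the whole result, this would be [row_to_record(row) for row in final_list].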
If you want each list printed separately, try this.
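soup = BeautifulSoup(html, 'html.parser')
# (string= is the current name for the older text= argument.)
for item in soup.find_all('td', string=re.compile("USER_ID")):
    row_list = [item.text.strip()]
    for sibling in item.find_all_next('td'):
        # Stop at the next USER_ID cell: it belongs to the next record.
        if "USER_ID" in sibling.text:
            break
        # Skip the blank spacer cells.
        if sibling.text.strip() != '':
            row_list.append(sibling.text.strip())
    # Print each record as soon as it is complete.
    print(row_list)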
Output:
['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']
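Note that find_all_next('td') rescans the rest of the document for every USER_ID cell. For a large table, the same grouping can probably be done in a single pass over the cell texts. The sketch below shows only that grouping step on a plain list of strings; the names group_cells, marker and cells are hypothetical, and with BeautifulSoup the input list could be built with something like [td.get_text() for td in soup.find_all('td')].

```python
def group_cells(cell_texts, marker="USER_ID"):
    """Group a flat list of cell texts into records that start at `marker`."""
    groups = []
    for text in cell_texts:
        text = text.strip()
        if not text:
            continue  # skip the blank spacer cells
        if marker in text:
            groups.append([text])  # a marker cell opens a new record
        elif groups:
            groups[-1].append(text)  # otherwise extend the current record
    return groups


# Hypothetical sample input, shortened from the table above.
cells = ["USER_ID 11111", "string_a", "content: aaa", " ", " ",
         "USER_ID 22222", "string_b", ""]
groups = group_cells(cells)
print(groups)
```

Cells that appear before the first marker are dropped, which matches the behavior of the loop above.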