Beautiful Soup - Extracting data that contains only td tags (no tags such as div, id, class, ...)
1 vote
/ 30 April 2020

I'm new to Beautiful Soup, and I have data like the following, which contains 3 sets of user data (in this case).

I want to get all of the information for each USER_ID and save it to the database:

  • User ID
  • Title
  • Content
  • PID (not every user has this line)
  • Date
  • URL
<table align="center" border="0" style="width:550px">
    <tbody>
        <tr>
            <td colspan="2">USER_ID 11111</td>
        </tr>
        <tr>
            <td colspan="2">string_a</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: aaa</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 22222</td>
        </tr>
        <tr>
            <td colspan="2">string_b</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: bbb</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 33333</td>
        </tr>
        <tr>
            <td colspan="2">string_c</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: ccc</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://ccc.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
    </tbody>
</table>

My problem is that all of the data sits only inside td elements; there are no div wrappers, id/class attributes, or distinctive parent tags, so I can't split it into the 3 data sets.

I tried the following code. It finds all of the USER_ID cells, but I don't know how to get the rest of the data for each USER_ID:

soup = BeautifulSoup(content, 'html.parser')
p = soup.find_all('td', text=re.compile("^USER_ID"))
for item in p:
    title = item.find_next_siblings('td')  # <--- returns an empty list
    ...

I'm using Python 3.6 and Django 2.0.2.
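
For context, the rough model I plan to store this in looks something like this (the field names are just my placeholders):

from django.db import models

class UserRecord(models.Model):
    user_id = models.CharField(max_length=20)
    title = models.CharField(max_length=255)
    content = models.TextField()
    pid = models.CharField(max_length=50, blank=True)  # not every user has this
    date_range = models.CharField(max_length=100)
    url = models.URLField()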

Answers [ 3 ]

2 votes
/ 30 April 2020
from bs4 import BeautifulSoup
import re
from more_itertools import split_when

data = """<table align="center" border="0" style="width:550px">
    <tbody>
        <tr>
            <td colspan="2">USER_ID 11111</td>
        </tr>
        <tr>
            <td colspan="2">string_a</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: aaa</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 22222</td>
        </tr>
        <tr>
            <td colspan="2">string_b</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: bbb</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 33333</td>
        </tr>
        <tr>
            <td colspan="2">string_c</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: ccc</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://ccc.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
    </tbody>
</table>"""

soup = BeautifulSoup(data, 'html.parser')

target = soup.find("table", align="center")

goal = [item.text for item in target.select("td")
        if item.text.strip() != '']  # drop the blank &nbsp; spacer cells


final = list(split_when(goal, lambda _, y: y.startswith("USER")))

print(final)  # list of lists

for x in final:  # or loop
    print(x)

Output

[['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']]

And with the loop:

['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']
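
If you then need each group as a dictionary of fields, for example to feed a database insert, one possible post-processing step is sketched below; the key names are my assumptions based on the labels in the cells:

def to_record(rows):
    # rows[0] is the USER_ID cell, rows[1] the title; the rest are "label:value" cells
    record = {
        "user_id": rows[0].replace("USER_ID", "").strip(),
        "title": rows[1],
    }
    for cell in rows[2:]:
        key, _, value = cell.partition(":")  # split on the first colon only
        record[key.strip().lower()] = value.strip()
    return record

records = [to_record(group) for group in final]
print(records)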
1 vote
/ 30 April 2020

Try the following code: it uses find_all_next('td') from each USER_ID cell and breaks out of the loop when the next USER_ID is reached, which separates the data sets.

import re
from bs4 import BeautifulSoup

html='''<table align="center" border="0" style="width:550px">
    <tbody>
        <tr>
            <td colspan="2">USER_ID 11111</td>
        </tr>
        <tr>
            <td colspan="2">string_a</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: aaa</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 22222</td>
        </tr>
        <tr>
            <td colspan="2">string_b</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: bbb</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://aaa.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">USER_ID 33333</td>
        </tr>
        <tr>
            <td colspan="2">string_c</td>
        </tr>
        <tr>
            <td colspan="2"><strong>content: ccc</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td>
        </tr>
        <tr>
            <td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td>
        </tr>
        <tr>
            <td colspan="2"><strong>URL:https://ccc.com</strong></td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
        <tr>
            <td colspan="2">&nbsp;</td>
        </tr>
    </tbody>
</table>'''

soup=BeautifulSoup(html,'html.parser')

final_list=[]
for item in soup.find_all('td',text=re.compile("USER_ID")):
    row_list=[]
    row_list.append(item.text.strip())
    siblings=item.find_all_next('td')
    for sibling in siblings:
        if "USER_ID" in sibling.text:
            break
        else:
            if sibling.text.strip()!='':
               row_list.append(sibling.text.strip())
    final_list.append(row_list)

print(final_list)

Output:

[['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com'], ['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']]

If you want each list printed separately, try this.

soup=BeautifulSoup(html,'html.parser')

for item in soup.find_all('td',text=re.compile("USER_ID")):
    row_list=[]
    row_list.append(item.text.strip())
    siblings=item.find_all_next('td')
    for sibling in siblings:
        if "USER_ID" in sibling.text:
            break
        else:
            if sibling.text.strip()!='':
               row_list.append(sibling.text.strip())
    print(row_list)

Output:

['USER_ID 11111', 'string_a', 'content: aaa', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 22222', 'string_b', 'content: bbb', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'URL:https://aaa.com']
['USER_ID 33333', 'string_c', 'content: ccc', 'date:2020-05-01 00:00:00 To 2020-05-03 23:59:59', 'PID:ABCDE', 'URL:https://ccc.com']
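
One thing to keep in mind: find_all_next('td') walks everything that follows the cell in the whole document, not only this table, so the last user's list could pick up stray cells if anything comes after the table. If that matters for your real page (an assumption on my part), a single pass over just the table's cells gives the same grouping:

table = soup.find('table', align='center')

groups = []
for td in table.find_all('td'):
    text = td.get_text(strip=True)
    if not text:                    # skip the &nbsp; spacer cells
        continue
    if text.startswith('USER_ID'):  # a USER_ID cell starts a new group
        groups.append([])
    if groups:                      # ignore anything before the first USER_ID
        groups[-1].append(text)

print(groups)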
0 votes
/ 30 April 2020

You can simply use soup.select('table tr').

Example

from bs4 import BeautifulSoup

html = '<table align="center" border="0" style="width:550px"><tbody>' \
       '<tr><td colspan="2">USER_ID 11111</td></tr>' \
        '<tr><td colspan="2">string_a</td></tr>' \
        '<tr><td colspan="2"><strong>content: aaa</strong></td></tr>' \
        '<tr><td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td></tr>' \
        '<tr><td colspan="2"><strong>URL:https://aaa.com</strong></td></tr>' \
        '<tr><td colspan="2">&nbsp;</td></tr>' \
        '<tr><td colspan="2">&nbsp;</td></tr>' \
        '<tr><td colspan="2">USER_ID 22222</td></tr>' \
        '<tr><td colspan="2">string_b</td></tr>' \
        '<tr><td colspan="2"><strong>content: bbb</strong></td></tr>' \
        '<tr><td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td></tr>' \
        '<tr><td colspan="2"><strong>URL:https://aaa.com</strong></td></tr>' \
        '<tr><td colspan="2">&nbsp;</td></tr>' \
        '<tr><td colspan="2">&nbsp;</td></tr>' \
        '<tr><td colspan="2">USER_ID 33333</td></tr>' \
        '<tr><td colspan="2">string_c</td></tr>' \
        '<tr><td colspan="2"><strong>content: ccc</strong></td></tr>' \
        '<tr><td colspan="2"><strong>date:</strong>2020-05-01 00:00:00 To 2020-05-03 23:59:59</td></tr>' \
        '<tr><td colspan="2"><strong>PID:</strong><strong>ABCDE</strong></td></tr>' \
        '<tr><td colspan="2"><strong>URL:https://ccc.com</strong></td></tr>' \
        '<tr><td colspan="2">&nbsp;</td></tr>' \
        '<tr><td colspan="2">&nbsp;</td></tr></tbody></table>'

soup = BeautifulSoup(html, features="lxml")
elements = soup.select('table tr')
print(elements)

for element in elements:
    print(element.text)

Prints

USER_ID 11111
string_a
content: aaa
date:2020-05-01 00:00:00 To 2020-05-03 23:59:59
URL:https://aaa.com
 
 
USER_ID 22222
string_b
content: bbb
date:2020-05-01 00:00:00 To 2020-05-03 23:59:59
URL:https://aaa.com
 
 
USER_ID 33333
string_c
content: ccc
date:2020-05-01 00:00:00 To 2020-05-03 23:59:59
PID:ABCDE
URL:https://ccc.com
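
To split this flat output back into one list per user, one option (assuming each record is followed by blank spacer rows, as in your sample) is:

records, current = [], []
for element in elements:
    text = element.get_text(strip=True)
    if text:
        current.append(text)
    elif current:              # a blank spacer row closes the current record
        records.append(current)
        current = []
if current:                    # in case the table does not end with blank rows
    records.append(current)

for record in records:
    print(record)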