BeautifulSoup: grab the <p> tags between <h3> tags - PullRequest
0 votes
/ 08 November 2018

Basically, I want to grab all the text (the p tags) between h3 tags, automatically and separately for each section.
How do I write code that grabs all the text between h3 tags?
For example, the following two pieces:

PARAGRAPH 1:

<p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href="/cancer/small-cell-lung-cancer/treating/surgery.html">surgery</a> to remove the tumor and the nearby lymph nodes.</p>
    <p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p>

PARAGRAPH 2:

<p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href="/cancer/small-cell-lung-cancer/treating/chemotherapy.html">

These come from the text below. I want to write code that does this without hard-coding anything, for example without specifying the exact line that contains the p tag.

It should AUTOMATICALLY and SEPARATELY grab the text between the h3 tags. Ideally it should also be reusable for other pages, not just this one.

<h3>Stage I cancers</h3>
<p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href="/cancer/small-cell-lung-cancer/treating/surgery.html">surgery</a> to remove the tumor and the nearby lymph nodes.</p>
<p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p>
<h3>Other limited stage cancers</h3>
<p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href="/cancer/small-cell-lung-cancer/treating/chemotherapy.html">

How can I do this?

Answers [ 3 ]

0 votes
/ 08 November 2018
html = """<h3>Stage I cancers</h3><p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href='/cancer/small-cell-lung-cancer/treating/surgery.html'>surgery</a> to remove the tumor and the nearby lymph nodes.</p><p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p><h3>Other limited stage cancers</h3><p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href='/cancer/small-cell-lung-cancer/treating/chemotherapy.html'>"""


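from bs4 import BeautifulSoup  # import missing from the original snippet

soup = BeautifulSoup(html, 'html.parser')
find = soup.find_all('h3')  # every <h3> tag on the page
for h3 in find:
    print(h3.text)  # prints only the heading text, not the paragraphs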
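
The loop above prints only the headings. To also collect the paragraphs under each heading, one option is a single pass over the h3 and p tags in document order. This is only a sketch: the function name group_paragraphs is made up for illustration, and it assumes that every p tag belongs to the most recent h3 before it, which may not hold on every page.

```python
from bs4 import BeautifulSoup


def group_paragraphs(html_text):
    """Return a list of (heading, [paragraph texts]) pairs in document order."""
    soup = BeautifulSoup(html_text, "html.parser")
    groups = []
    # find_all with a list of names returns the matching tags in document order.
    for tag in soup.find_all(["h3", "p"]):
        text = " ".join(tag.get_text().split())  # collapse runs of whitespace
        if tag.name == "h3":
            groups.append((text, []))  # start a new section
        elif groups:
            groups[-1][1].append(text)  # attach to the latest heading
        # Paragraphs that appear before the first <h3> are ignored.
    return groups


sample = (
    "<h3>Stage I cancers</h3><p>One.</p><p>Two.</p>"
    "<h3>Other limited stage cancers</h3><p>Three.</p>"
)
for heading, paragraphs in group_paragraphs(sample):
    print(heading, paragraphs)
```

Because it does not look at sibling relationships, this version also picks up p tags nested inside other containers (for example a div), and it would attach a paragraph that follows an h2 to the previous h3.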
0 votes
/ 09 November 2018

Use find_next_sibling()

from bs4 import BeautifulSoup

html = '''<h3>Stage I cancers</h3>
<p>If you only have one small tumoremove</p>
<p>People who arent healthy enough.</p>
<h2>Skip this</h2>
<p>also Skip this</p>
<h3>Other limited stage cancers</h3>
<p>For most people with limited stage SCLC</p>'''

soup = BeautifulSoup(html, 'html.parser')
for section in soup.find_all('h3'):
    next_node = section
    print("=================== %s ===================" % section.text)
    while True:
        # Step to the next tag at the same level as the heading.
        next_node = next_node.find_next_sibling()
        if next_node and next_node.name == 'p':
            print(next_node)
        else:
            # Any non-<p> tag (or the end of the document) closes the section.
            print("-------------------- h3 end --------------------\n")
            break
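
If you want the result as data rather than printed output, the same sibling walk can be wrapped in a function. This is a rough sketch, not a definitive implementation: the name paragraphs_by_heading and its heading parameter are made up for illustration, and it keeps the stop rule from the loop above (the first non-p tag ends the section).

```python
from bs4 import BeautifulSoup


def paragraphs_by_heading(html_text, heading="h3"):
    """Map each heading's text to the <p> texts that directly follow it."""
    soup = BeautifulSoup(html_text, "html.parser")
    sections = {}
    for section in soup.find_all(heading):
        texts = []
        # find_next_siblings() yields the following tags at the same nesting level.
        for node in section.find_next_siblings():
            if node.name != "p":
                break  # first non-<p> tag ends the section
            texts.append(node.get_text().strip())
        sections[section.get_text().strip()] = texts
    return sections


sample = """<h3>Stage I cancers</h3>
<p>First paragraph.</p>
<h2>Skip this</h2>
<p>also Skip this</p>
<h3>Other limited stage cancers</h3>
<p>Second paragraph.</p>"""

for title, paragraphs in paragraphs_by_heading(sample).items():
    print(title, paragraphs)
```

Note that two headings with identical text would overwrite each other in the dictionary. If that can happen on your pages, returning a list of pairs instead is safer.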
0 votes
/ 08 November 2018

If you already have the text in a variable, add from bs4 import BeautifulSoup and run the code below. Otherwise, if you are trying to go to a website and scrape the page, it is slightly different: you need to import requests, add a variable url = 'whatever website', then a variable page = requests.get(url), and finally, instead of the soup line below, use soup = BeautifulSoup(page.text, 'lxml'). Keep the find variable and the for loop. All of this assumes that you are only trying to grab ALL the <h3> tags on the page.

html = """<h3>Stage I cancers</h3><p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href='/cancer/small-cell-lung-cancer/treating/surgery.html'>surgery</a> to remove the tumor and the nearby lymph nodes.</p><p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p><h3>Other limited stage cancers</h3><p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href='/cancer/small-cell-lung-cancer/treating/chemotherapy.html'>"""
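from bs4 import BeautifulSoup  # import needed to run this snippet on its own

soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package
find = soup.find_all('h3')  # find_all is the current name for findAll
for h3 in find:
    print(h3.text)  # prints only the heading text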
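
For the website case described in this answer, the steps could be put together roughly as follows. Treat it as a sketch under assumptions: the helper names h3_headings and h3_headings_from_url are made up, the URL is whatever page you want to scrape, and html.parser is used as the default so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup


def h3_headings(html_text, parser="html.parser"):
    """Return the text of every <h3> tag in the given HTML string."""
    soup = BeautifulSoup(html_text, parser)
    return [h3.get_text(strip=True) for h3 in soup.find_all("h3")]


def h3_headings_from_url(url):
    """Download a page and return its <h3> headings (hypothetical helper)."""
    # requests is a third-party package; it is imported here so that
    # h3_headings() can still be used when requests is not installed.
    import requests

    page = requests.get(url, timeout=10)
    page.raise_for_status()  # raise an exception on HTTP error codes
    return h3_headings(page.text)


print(h3_headings("<h3>Stage I cancers</h3><p>Some text.</p>"))
```

Pass parser="lxml" to h3_headings if you have lxml installed and prefer it.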