How to read the data of a parent heading <h> and its <p> with BeautifulSoup - PullRequest
0 votes
/ January 28, 2019

I want to read the data of each <h1> heading together with its corresponding <p> paragraph in the following example ...

I have many headings and paragraphs that are related to each other, so once I find a heading, I need to extract the data of the paragraph that belongs to it:

<h1>Supplementary Materials </h1>\n
    <p />\n
    <p>The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al. </p>\n
<h1>Testing data</h1>
    <p>The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.</p>\n
    <p />

<h1>Supplementary Materials </h1>\n
    <p />\n
    <p>The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al. </p>\n
<h1>Testing data</h1>
    <p>The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.</p>\n
    <p />

1 Answer

0 votes
/ January 28, 2019

Is the HTML really repeated, or is that a typo?

html = '''<h1>Supplementary Materials </h1>\n
    <p />\n
    <p>The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al. </p>\n
<h1>Testing data</h1>
    <p>The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.</p>\n
    <p />

<h1>Supplementary Materials </h1>\n
    <p />\n
    <p>The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al. </p>\n
<h1>Testing data</h1>
    <p>The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.</p>\n
    <p /> '''

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')
heads = soup.find_all('h1')

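for head in heads:
    # string=True matches only a <p> that has text, so empty <p /> tags are skipped
    # (older bs4 releases call this argument text=; find_next returns None if nothing matches)
    para = head.find_next('p', string=True).text
    print('Header: %s\nParagraph: %s\n' % (head.text, para))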

Output:

Header: Supplementary Materials 
Paragraph: The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al. 

Header: Testing data
Paragraph: The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.

Header: Supplementary Materials 
Paragraph: The workshop entitled “Next generation MRA (Microbiological Risk Assessment); integration of Omics data into assessment” took place in Athens, Greece, May 13-14, 2016, and resulted in four papers that are published in this issue, namely, Cocolin et al., Rantsiou et al., Den Besten et al., and Haddad et al. 

Header: Testing data
Paragraph: The supplementary materials, Table S1 and Table S2, are integrated parts of these four papers.
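
If a heading can be followed by several paragraphs, or by none at all, taking only the first non-empty <p> may not be enough. Below is a minimal sketch of one alternative, assuming the headings and paragraphs are siblings at the same nesting level, as in the HTML from the question. The function name `paragraphs_by_heading` and the sample string are made up for illustration.

```python
from bs4 import BeautifulSoup


def paragraphs_by_heading(html):
    """Return a list of (heading_text, [paragraph_texts]) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for head in soup.find_all("h1"):
        paragraphs = []
        # Walk the tags that follow the heading at the same nesting level.
        for sibling in head.find_next_siblings():
            if sibling.name == "h1":
                break  # the next heading starts a new section
            if sibling.name == "p":
                text = sibling.get_text(strip=True)
                if text:  # ignore empty <p /> tags
                    paragraphs.append(text)
        result.append((head.get_text(strip=True), paragraphs))
    return result


# Hypothetical sample: one heading with two paragraphs, one with none.
sample = (
    "<h1>A</h1><p /><p>first</p><p>second</p>"
    "<h1>B</h1><p />"
    "<h1>C</h1><p>third</p>"
)
for heading, paragraphs in paragraphs_by_heading(sample):
    print(heading, paragraphs)
```

Unlike `find_next`, this stops at the next <h1>, so a heading with no text paragraph gets an empty list instead of borrowing the paragraph of a later heading.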