BeautifulSoup soup.select cuts off child tags - PullRequest
2 votes
/ March 17, 2019

When I run a script to extract all blockquote tags with the class "FlatParagraph", some of the child tags inside the blockquote appear to be cut off. Is there a query that will include all child tags? The problem seems to be tied to the tag sequence <blockquote><i><a>text</a></i>, so it does not affect every child.

I am using the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page as raw bytes
fhand = urlopen('https://www.legislation.qld.gov.au/view/whole/html/2018-07-01/sl-2006-0200').read()

soup = BeautifulSoup(fhand, 'html.parser')
# Match blockquotes whose class attribute is exactly "FlatParagraph"
fp = soup.select('blockquote[class="FlatParagraph"]')
for i in fp:
    print(i.text)
    print('---------')

Then I extract the text from each element using a for loop:

changedfplist = list()
for i in fp:
    changedfplist.append(i.text.replace(u'\xa0', ' ').encode('utf-8'))
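
One thing to note about the loop above: in Python 3, `.encode('utf-8')` turns each string into a `bytes` object, so the list ends up holding bytes rather than text. A minimal sketch, assuming the goal is a list of clean `str` values; `clean_text` is a hypothetical helper name and the sample strings are made up for illustration:

```python
def clean_text(raw):
    """Replace non-breaking spaces, then collapse runs of whitespace."""
    # '\xa0' is the non-breaking space produced by &nbsp; in the HTML
    return ' '.join(raw.replace('\xa0', ' ').split())


# Made-up strings standing in for the .text of each selected tag
samples = ['section\xa028(1) of the  repealed regulation\n', '(a)\xa0before']
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

Collapsing whitespace is an extra step the original loop does not perform; drop the `split()`/`join()` part if the original line breaks matter.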

Here is a sample of what I am parsing:

<blockquote class="FlatParagraph"><blockquote class="Paragraph"><span class="ListNumber">(1)</span>This section applies if—<blockquote class="Paragraph List"><span class="ListNumber">(a)</span>before the commencement—<blockquote class="Paragraph List"><span class="ListNumber">(i)</span>a person applied under <a href="#sec.28">section&nbsp;28</a>(1) of the repealed regulation for approval of a proposed fire engineering design brief for stated building work; and</blockquote>
<blockquote class="Paragraph List"><span class="ListNumber">(ii)</span>an authorised representative of the service attended a former fire engineering brief meeting relating to the approval of the proposed fire engineering design brief; and</blockquote>
<blockquote class="Paragraph List"><span class="ListNumber">(iii)</span>the service had not decided whether or not to approve the proposed fire engineering design brief; and</blockquote>
</blockquote>
<blockquote class="Paragraph List"><span class="ListNumber">(b)</span>the person has not paid the former fire engineering design brief meeting fee for the attendance of the representative of the service at the former fire engineering brief meeting.</blockquote>
</blockquote><blockquote class="Paragraph"><span class="ListNumber">(2)</span>For assessing the fire engineering design brief for the stated building work—<blockquote class="Paragraph List"><span class="ListNumber">(a)</span><a href="#sec.61">section&nbsp;61</a> applies as if the reference to a fire engineering brief were a reference to the proposed fire engineering design brief; and</blockquote>
<blockquote class="Paragraph List"><span class="ListNumber">(b)</span><a href="#sec.62">section&nbsp;62</a>(1)(d) applies as if the reference to each fire engineering brief meeting included a reference to each former fire engineering brief meeting; and</blockquote>
<blockquote class="Paragraph List"><span class="ListNumber">(c)</span><a href="#sch.2">schedule&nbsp;2</a>, <a href="#sch.2-pt.3">part&nbsp;3</a>, item 3 applies as if a reference to a meeting included a reference to a former fire engineering brief meeting.</blockquote>
</blockquote><blockquote class="Paragraph"><span class="ListNumber">(3)</span>In this section—<blockquote class="Paragraph-No-Number"><b><i><a name="sec.90-ssec.3-def.formerfireengineeringbriefmeeting"></a>former fire engineering brief meeting</i></b> means a fire engineering brief meeting under <a href="#sec.28">section&nbsp;28</a>(2)(d) of the repealed regulation.</blockquote><blockquote class="Paragraph-No-Number"><b><i><a name="sec.90-ssec.3-def.formerfireengineeringdesignbriefmeetingfee"></a>former fire engineering design brief meeting fee</i></b> means the fire engineering design brief meeting fee stated in <a href="#sch.3">schedule&nbsp;3</a> of the repealed regulation.</blockquote></blockquote></blockquote>

and when I parse it, I get:

(1) This section applies if - (a) before the commencement - (i) a person applied under section 28(1) of the repealed regulation for approval of a proposed fire engineering design brief for stated building work; and

(ii) an authorised representative of the service attended a former fire engineering brief meeting relating to the approval of the proposed fire engineering design brief; and

(iii) the service had not decided whether or not to approve the proposed fire engineering design brief; and

(b) the person has not paid the former fire engineering design brief meeting fee for the attendance of the representative of the service at the former fire engineering brief meeting.

(2) For assessing the fire engineering design brief for the stated building work - (a) section 61 applies as if the reference to a fire engineering brief were a reference to the proposed fire engineering design brief; and

(b) section 62(1)(d) applies as if the reference to each fire engineering brief meeting included a reference to each former fire engineering brief meeting; and

(c) schedule 2, part 3, item 3 applies as if a reference to a meeting included a reference to a former fire engineering brief meeting.

(3) In this section - former fire engineering brief meeting

The text at the end of the last line is missing. It was cut off at:

<blockquote class="Paragraph-No-Number"><b><i><a name="sec.90-ssec.3-def.formerfireengineeringbriefmeeting"></a>former fire engineering brief meeting</i></b> 

UPDATE - there is a class I am trying to avoid, so using .FlatParagraph did not work. I am trying to avoid class="FlatParagraph view-history-note". "FlatParagraph view-history-note" is the class of a child tag inside the tag with class FlatParagraph.
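
The exclusion described in the update can also be written as a CSS `:not()` filter. This is only a sketch: it assumes Beautiful Soup 4.7 or later (where `select()` is backed by Soup Sieve), and the markup below is a made-up stand-in for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a FlatParagraph block that contains a history note child
html = (
    '<blockquote class="FlatParagraph">Body text. '
    '<blockquote class="FlatParagraph view-history-note">History note.</blockquote>'
    '</blockquote>'
)

soup = BeautifulSoup(html, 'html.parser')

# .FlatParagraph tests for a single class token, so it matches both blocks
all_blocks = soup.select('blockquote.FlatParagraph')

# :not() drops any block that also carries the unwanted class
wanted = soup.select('blockquote.FlatParagraph:not(.view-history-note)')
print(len(all_blocks), len(wanted))

# The parent's text still includes the note, so remove the note tags
# from the tree before reading the text
for note in soup.select('.view-history-note'):
    note.decompose()
print(repr(wanted[0].get_text()))
```

The `decompose()` step modifies the tree in place, so it should only be used when the history notes are not needed later.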

I tried the code above with both lxml and html.parser: I get the full text with lxml, but truncated text with html.parser. If anyone knows why, I would love to hear it!

1 Answer

2 votes
/ March 17, 2019

You can use select() or find(). With the code below I get the full text!

from bs4 import BeautifulSoup

html = '''
<blockquote class="FlatParagraph"><blockquote class="Paragraph"><span class="ListNumber">(1)</span>This section applies if—<blockquote class="Paragraph List"><span class="ListNumber">(a)</span>before the commencement—<blockquote class="Paragraph List"><span class="ListNumber">(i)</span>a person applied under <a href="#sec.28">section&nbsp;28</a>(1) of the repealed regulation for approval of a proposed fire engineering design brief for stated building work; and</blockquote>
<blockquote class="Paragraph List"><span class="ListNumber">(ii)</span>an authorised representative of the service attended a former fire engineering brief meeting relating to the approval of the proposed fire engineering design brief; and</blockquote>
<blockquote class="Paragraph List"><span class="ListNumber">(iii)</span>the service had not decided whether or not to approve the proposed fire engineering design brief; and</blockquote>
</blockquote>
<blockquote class="Paragraph List"><span class="ListNumber">(b)</span>the person has not paid the former fire engineering design brief meeting fee for the attendance of the representative of the service at the former fire engineering brief meeting.</blockquote>
</blockquote><blockquote class="Paragraph"><span class="ListNumber">(2)</span>For assessing the fire engineering design brief for the stated building work—<blockquote class="Paragraph List"><span class="ListNumber">(a)</span><a href="#sec.61">section&nbsp;61</a> applies as if the reference to a fire engineering brief were a reference to the proposed fire engineering design brief; and</blockquote>
<blockquote class="Paragraph List"><span class="ListNumber">(b)</span><a href="#sec.62">section&nbsp;62</a>(1)(d) applies as if the reference to each fire engineering brief meeting included a reference to each former fire engineering brief meeting; and</blockquote>
<blockquote class="Paragraph List"><span class="ListNumber">(c)</span><a href="#sch.2">schedule&nbsp;2</a>, <a href="#sch.2-pt.3">part&nbsp;3</a>, item 3 applies as if a reference to a meeting included a reference to a former fire engineering brief meeting.</blockquote>
</blockquote><blockquote class="Paragraph"><span class="ListNumber">(3)</span>In this section—<blockquote class="Paragraph-No-Number"><b><i><a name="sec.90-ssec.3-def.formerfireengineeringbriefmeeting"></a>former fire engineering brief meeting</i></b> means a fire engineering brief meeting under <a href="#sec.28">section&nbsp;28</a>(2)(d) of the repealed regulation.</blockquote><blockquote class="Paragraph-No-Number"><b><i><a name="sec.90-ssec.3-def.formerfireengineeringdesignbriefmeetingfee"></a>former fire engineering design brief meeting fee</i></b> means the fire engineering design brief meeting fee stated in <a href="#sch.3">schedule&nbsp;3</a> of the repealed regulation.</blockquote></blockquote></blockquote>
'''
# Note: this uses the lxml parser, not html.parser as in the question
soup = BeautifulSoup(html, 'lxml')
fp = soup.select('.FlatParagraph')
for i in fp:
    print(i.text)

or

fp = soup.find('blockquote', attrs={'class': 'FlatParagraph'})
print(fp.text)

Output:

(1)This section applies if—(a)before the commencement—(i)a person applied under section 28(1) of the repealed regulation for approval of a proposed fire engineering design brief for stated building work; and
(ii)an authorised representative of the service attended a former fire engineering brief meeting relating to the approval of the proposed fire engineering design brief; and
(iii)the service had not decided whether or not to approve the proposed fire engineering design brief; and

(b)the person has not paid the former fire engineering design brief meeting fee for the attendance of the representative of the service at the former fire engineering brief meeting.
(2)For assessing the fire engineering design brief for the stated building work—(a)section 61 applies as if the reference to a fire engineering brief were a reference to the proposed fire engineering design brief; and
(b)section 62(1)(d) applies as if the reference to each fire engineering brief meeting included a reference to each former fire engineering brief meeting; and
(c)schedule 2, part 3, item 3 applies as if a reference to a meeting included a reference to a former fire engineering brief meeting.
(3)In this section—former fire engineering brief meeting means a fire engineering brief meeting under section 28(2)(d) of the repealed regulation.former fire engineering design brief meeting fee means the fire engineering design brief meeting fee stated in schedule 3 of the repealed regulation.
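
The snippet above passes 'lxml' to BeautifulSoup, while the question used 'html.parser', and the asker reports that only html.parser truncates. Parsers can build different trees from the same markup, especially when the markup is not well formed, so a quick check is to run the same selector under each parser and compare the results. A minimal sketch: `text_by_parser` is a hypothetical helper, the sample markup is a shortened stand-in for the real page, and parsers that are not installed are skipped:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def text_by_parser(markup, selector, parsers=('html.parser', 'lxml', 'html5lib')):
    """Return {parser name: text of the first match} for each installed parser."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The library for this parser is not installed; skip it
            continue
        tag = soup.select_one(selector)
        results[name] = tag.get_text() if tag is not None else None
    return results


# Shortened stand-in for one definition from the page
sample = (
    '<blockquote class="FlatParagraph"><b><i><a name="x"></a>'
    'former fire engineering brief meeting</i></b> means a meeting.</blockquote>'
)

for name, text in text_by_parser(sample, 'blockquote.FlatParagraph').items():
    # Compare lengths and endings to spot where one parser stops early
    print(name, len(text or ''), repr((text or '')[-20:]))
```

On a well-formed fragment like this one the parsers should agree. Running the same helper on the full downloaded page would show whether the truncation depends on the parser rather than on select() or find().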