How do I wrap specific sentences in a <mark> tag after extracting them from paragraphs, while preserving the paragraph formatting in the final output?
0 votes
/ June 20, 2019

I have an HTML file that contains only <p> and <a> tags, like the one below:

<p>For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current <a href="https://www.theguardian.com/politics/conservative-leadership" title="">Conservative party leadership contest</a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</p> <p>In 2016, Theresa May’s<a href="https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title=""> rivals withdrew before the final round</a>. In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent.</p>

What I need to do is extract sentences with certain properties, for example sentences containing Britain or party, and then wrap each such sentence in <mark> tags while keeping the paragraph formatting as it is.

To do this:

  1. First, I stripped all the tags to get clean paragraphs with clean sentences.
  2. Then I used spaCy to extract the sentences:
import spacy

with open('a.html') as f:
    given_text = f.read()    # Read from the file
#given_text = '' #copy paste the above html as string
nlp = spacy.load('en_core_web_sm')
doc = nlp(given_text)
  3. Finally, I iterate over the sentences with for sent in doc.sents and use a regular expression to decide whether each sentence should be marked.

The problem with this approach is that once I sanitize the text (removing all the <p> and <a> tags), I lose track of the individual paragraphs. So after marking the sentences, I end up with one huge string.

How can I preserve the <p> formatting and still iterate over the sentences to mark them?

The idea is to produce output exactly as it was received, except that a few sentences are highlighted.

Answers [ 3 ]

0 votes
/ June 20, 2019

Here is one option:

from bs4 import BeautifulSoup

html_doc = '''<p>For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current <a href="https://www.theguardian.com/politics/conservative-leadership" title="">Conservative party leadership contest</a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</p> <p>In 2016, Theresa May’s<a href="https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title=""> rivals withdrew before the final round</a>. In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent.</p>'''
src_soup = BeautifulSoup(html_doc, 'html.parser')
dst_soup = BeautifulSoup('', 'html.parser')

WORDS_TO_LOOK_FOR = ['Britain', 'party']


def mark_if_needed(text):
    # can be improved using regex
    for word in WORDS_TO_LOOK_FOR:
        if word in text:
            return '<mark>' + text + '</mark>'
    return text


p_elements = src_soup.find_all('p')
for p in p_elements:
    a_elements = p.find_all('a')
    p.string = mark_if_needed(p.text)
    dst_soup.append(p)
    for a in a_elements:
        a.string = mark_if_needed(a.text)
        p.append(a)

print(dst_soup.prettify())

Output:

<p>
 &lt;mark&gt;For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current Conservative party leadership contest proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.&lt;/mark&gt;
 <a href="https://www.theguardian.com/politics/conservative-leadership" title="">
  &lt;mark&gt;Conservative party leadership contest&lt;/mark&gt;
 </a>
</p>
<p>
 In 2016, Theresa May’s rivals withdrew before the final round. In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent.
 <a href="https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title="">
  rivals withdrew before the final round
 </a>
</p>
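
In the output above the <mark> tags come out escaped as &lt;mark&gt;, the whole paragraph is marked rather than single sentences, and each link is moved to the end of its paragraph. One way around this, sketched below under some loud assumptions, is to leave the markup untouched, build the visible text of each paragraph together with a map back to positions in the markup, and insert the tags at the mapped positions. The names mark_paragraph and mark_html and the naive sentence pattern are made up for illustration; the sketch assumes plain <p> tags without attributes and that no sentence starts or ends inside a link. For real text, a spaCy sentence splitter (using sent.start_char and sent.end_char) could replace the SENTENCE pattern, since the naive one splits after abbreviations such as "Gov.".

```python
import re

KEYWORDS = re.compile(r"\b(?:Britain|party)\b")
TAG = re.compile(r"<[^>]+>")
PARAGRAPH = re.compile(r"(<p>)(.*?)(</p>)", re.S)
# Naive splitter (an assumption): a sentence ends at . ! or ? followed by
# whitespace or by the end of the paragraph.
SENTENCE = re.compile(r"\S.*?(?:[.!?](?=\s|$)|$)", re.S)


def mark_paragraph(inner_html):
    """Wrap keyword sentences of one paragraph in <mark>, keeping <a> tags."""
    # Collect the visible text and, for each of its characters,
    # the index of that character in the original markup.
    chunks, offsets, pos = [], [], 0
    for tag in TAG.finditer(inner_html):
        chunks.append(inner_html[pos:tag.start()])
        offsets.extend(range(pos, tag.start()))
        pos = tag.end()
    chunks.append(inner_html[pos:])
    offsets.extend(range(pos, len(inner_html)))
    text = "".join(chunks)

    # Record where the opening and closing tags belong in the markup.
    inserts = []
    for sent in SENTENCE.finditer(text):
        if KEYWORDS.search(sent.group()):
            inserts.append((offsets[sent.start()], "<mark>"))
            inserts.append((offsets[sent.end() - 1] + 1, "</mark>"))

    # Insert from the end so that earlier indices stay valid.
    for index, tag in sorted(inserts, reverse=True):
        inner_html = inner_html[:index] + tag + inner_html[index:]
    return inner_html


def mark_html(html):
    """Apply mark_paragraph to the contents of every <p>...</p> pair."""
    return PARAGRAPH.sub(
        lambda m: m.group(1) + mark_paragraph(m.group(2)) + m.group(3), html
    )
```

Calling mark_html(html_doc) on the string from the question should return the same markup with only the matching sentences wrapped, so the paragraphs and links stay where they were.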
0 votes
/ June 27, 2019

After several days of trying, I finally figured out how to do it. Below is a complete code example:

import re
import spacy
from bs4 import BeautifulSoup

nlp = spacy.load('en_core_web_sm')

html_doc = '''<p>For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current <a href="https://www.theguardian.com/politics/conservative-leadership" title="">Conservative party leadership contest</a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</p> <p>This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit. This sentence should not be marked.</p> <p> This sentence should not be marked. This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit.</p> <p>This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit.</p> <p>This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit. This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit.</p> <p>This is an unmarked random sentence. This year, those leaders — Gov. Andrew Cuomo; the Senate majority leader, Andrea Stewart-Cousins; and the Assembly speaker, Carl Heastie — deserve significant credit. Another unmarked random sentnce.</p>'''

src_soup = BeautifulSoup(html_doc, 'html.parser') 
dst_soup = BeautifulSoup('', 'html.parser')

word_re = "Britain"

def mark_if_needed(text):
    doc = nlp(text)
    for sent in doc.sents:
        check = re.search(word_re, sent.text)
        if check is None:
            yield (0, sent.text)
        else:
            yield (1, sent.text)

p_elements = src_soup.find_all('p')
for p in p_elements:
    # Build a fresh <p> and refill it sentence by sentence.
    par = dst_soup.new_tag('p')

    for marked, sentence in mark_if_needed(p.text):
        if marked == 1:
            # Wrap the matching sentence in a real <mark> element.
            m = dst_soup.new_tag('mark')
            m.append(sentence)
            par.append(m)
        else:
            par.append(sentence)

    dst_soup.append(par)

print(dst_soup.prettify())
html = dst_soup.prettify("utf-8")
with open("output.html", "wb") as file:
    file.write(html)
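
The word_re above only looks for Britain, while the question asks for Britain or party. A small sketch of a pattern that covers both and matches whole words only; the helper name should_mark is made up for illustration. Because re.search also accepts a compiled pattern as its first argument, the compiled word_re can be used in mark_if_needed unchanged.

```python
import re

# Match either keyword as a whole word, so "counterparty" does not count.
word_re = re.compile(r"\b(?:Britain|party)\b")


def should_mark(sentence):
    """Return True if the sentence contains one of the keywords."""
    return word_re.search(sentence) is not None
```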
0 votes
/ June 20, 2019

You can try something like this:

  1. Find the sentences containing Britain or party. I use the re module for the regular expressions.
  2. Replace those sentences with the same text wrapped in <mark> at the start and end of the sentence.

Here is the code:

import re

text = """<p> For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations. For example, if the current < a href = "https://www.theguardian.com/politics/conservative-leadership" title = "" > Conservative party leadership contest </a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method. < /p > <p > In 2016, Theresa May’s < a href = "https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title = "" > rivals withdrew before the final round < /a > . In previous applications of the rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent. < /p >
"""



sentences_to_modify = re.findall(r"([^.]*?(party|Britain)[^.]*\.)", text)

for sentence in sentences_to_modify:
    # sentence[0] is the whole matched sentence; sentence[1] is the keyword
    text = text.replace(sentence[0], "<mark>" + sentence[0] + "</mark>")

print(text)
# <mark><p> For a country that takes pride in the venerable stability of its democracy, Britain is strangely prone to constitutional improvisations.</mark> For example, if the current < a href = "https://www.theguardian.<mark>com/politics/conservative-leadership" title = "" >
# Conservative party leadership contest </a> proceeds as far as a ballot of party members, it will be the first time a prime minister is chosen by that method.</mark> < /p > <p > In 2016, Theresa May’s < a href = "https://www.theguardian.com/politics/2016/jun/30/conservative-leadership-race-who-are-the-five-candidates" title = "" > rivals withdrew before the final round < /a > . In previous applications of the
# rules it was the leader of the opposition being chosen, not a head of government. The system itself only dates back to 1998. Fine-tuning
# of the rules was completed by the 1922 Committee just three weeks ago. The process looks undemocratic and has no basis in ancient precedent. < /p >
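
As the printed result shows, the pattern treats every "." as the end of a sentence, including the dots inside an href, so a <mark> can land in the middle of a URL. A small demonstration on a made-up snippet (the example.com link is not from the question):

```python
import re

pattern = re.compile(r"([^.]*?(party|Britain)[^.]*\.)")

# Hypothetical snippet: a link whose URL contains dots.
snippet = '<a href="https://www.example.com/x">party rules</a> apply.'

match = pattern.search(snippet)
# The match cannot cross the dots in the URL, so it starts inside the href.
print(match.group(1))  # prints: com/x">party rules</a> apply.
```

This is why matching against the visible text, rather than against the raw markup, tends to be safer when the paragraphs contain links.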

Hope that helps!

...