Beautiful Soup: how to remove links *and* the link text from the soup
0 votes
/ November 8, 2019

I'm using Beautiful Soup to get cleaned-up text from a web page: no HTML, just the text that is shown to the user. However, I don't want the code to treat text that has a link attached as visible text. To make clear what I mean:

This text is the problem

The text above links to the Beautiful Soup documentation. At present I strip out the actual link, but the text "This text is the problem" remains. Ideally, I would like to remove that text as well.

1 Answer

1 vote
/ November 8, 2019

You can select the <a> tags that have an href attribute and remove them with either .extract() or .decompose().

Here is the full example, first without removing anything:

from bs4 import BeautifulSoup

html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
    </div>'''

soup = BeautifulSoup(html, 'html.parser')

p_tags = soup.find_all('p')

# Print the text of every paragraph; the link text is still included here.
for each in p_tags:
    print(each.text)

Output:

I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: 
This text is the problem
The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.

And then, after removing the links:

from bs4 import BeautifulSoup

html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
    </div>'''

soup = BeautifulSoup(html, 'html.parser')

# Remove every <a> tag that has an href attribute, together with its text.
for a in soup.find_all('a', href=True):
    a.extract()

p_tags = soup.find_all('p')

for each in p_tags:
    print(each.text)

Output:

I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: 

The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.
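
Both .extract() and .decompose() take the tag out of the tree; the difference is in what you get back. The following is a minimal sketch of that difference, using a made-up one-line HTML snippet and an example.com URL that are not part of the original question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = '<p>Before <a href="https://example.com">link text</a> after</p>'

# extract() removes the tag from the tree and returns it,
# so the removed link can still be inspected afterwards.
soup = BeautifulSoup(html, "html.parser")
removed = soup.a.extract()
removed_text = removed.get_text()
removed_href = removed["href"]
text_after_extract = soup.get_text()

# decompose() removes the tag and destroys it; it returns None.
soup = BeautifulSoup(html, "html.parser")
result = soup.a.decompose()
text_after_decompose = soup.get_text()

print(repr(removed_text), removed_href)
print(repr(text_after_extract))
print(result, repr(text_after_decompose))
```

If you want to keep the removed links around (for example, to log their URLs), .extract() is probably the better fit; if you only want them gone, .decompose() should be enough.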

You can also use .decompose(). Note that soup.a refers only to the first <a> tag in the document, so this version removes a single link:

from bs4 import BeautifulSoup

html = '''<div class="post-text" itemprop="text">
<p>I'm using beautiful soup to get some cleaned up text from a webpage - no html, just the text that's shown to the user. However I don't really want the code to see text that has a link attached as visible text. To make clear what I mean here: </p>
<p><a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noreferrer">This text is the problem</a></p>
<p>The above text links to the Beautiful soup documentation. At present I cut out the actually link, but the text 'This text is the problem' remains. Ideally I would like to remove that text also.</p>
    </div>'''

soup = BeautifulSoup(html, 'html.parser')

# soup.a is the first <a> tag; decompose() removes it and destroys it.
soup.a.decompose()

p_tags = soup.find_all('p')

for each in p_tags:
    print(each.text)
...
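
Since soup.a.decompose() only touches the first <a> tag, a page with several links needs a loop. As a rough sketch, the two approaches above could be combined into one helper. The name text_without_links and the sample markup are hypothetical and not part of Beautiful Soup or the original answer:

```python
from bs4 import BeautifulSoup


def text_without_links(html: str) -> str:
    """Return the text of `html` with every <a href=...> tag and its text removed.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list-like result, so removing tags while looping is safe.
    for a in soup.find_all("a", href=True):
        a.decompose()
    # Strip whitespace around each text node and join the pieces with a space.
    return soup.get_text(separator=" ", strip=True)


# Made-up sample: two real links and one anchor without an href.
sample = (
    '<p>First <a href="https://example.com/1">one</a> paragraph.</p>'
    '<p><a href="https://example.com/2">two</a></p>'
    '<p>Named anchor: <a name="top">kept</a></p>'
)
cleaned = text_without_links(sample)
print(cleaned)
```

Because of href=True, an <a> tag without an href attribute is left alone; drop that argument if you want every <a> tag removed regardless.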