I am using urllib3 to fetch the HTML of some pages.
I want to get the text of the paragraph that contains a given link, with the text before the link and the text after the link stored separately.
For example:
import urllib3
from bs4 import BeautifulSoup

# Fetch the page and parse it
http = urllib3.PoolManager()
r = http.request('GET', "https://www.snopes.com/fact-check/michael-novenche/")
body = r.data
soup = BeautifulSoup(body, 'lxml')

# Find the link by its exact href, then the paragraph that contains it
for a in soup.find_all('a'):
    if a.has_attr('href'):
        if a['href'] == "http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general":
            link_text = a
            link_para = a.find_parent("p")
            print(link_text)
            print(link_para)
Paragraph
<p>The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy to treat a brain tumor, was real, but keeping up with
all the changes in his condition proved a challenge. The message quoted above
stated that Michael had a large tumor in his brain, was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery. An <nobr>October 2000</nobr> article in <a
href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"
onmouseout="window.status='';return true" onmouseover="window.status='The
Local Albany Weekly';return true" target="_blank"><i>The Local Albany
Weekly</i></a> didn’t mention anything about little Michael’s medical
condition but said that his family was “in need of funds to help pay for the
transportation to the hospital and other costs not covered by their
insurance.” A June 2000 message posted to the <a
href="http://www.ecunet.org/whatisecupage.html"
onmouseout="window.status='';return true"
onmouseover="window.status='Ecunet';return true" target="_blank">Ecunet</a>
mailing list indicated that Michael had just turned <nobr>3 years</nobr> old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:</p>
Link
<a href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"
onmouseout="window.status='';return true" onmouseover="window.status='The
Local Albany Weekly';return true" target="_blank"><i>The Local Albany
Weekly</i></a>
Text to extract (2 parts)
The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy ... was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery. An October 2000 article in
didn’t mention anything about little Michael’s medical
condition but said that his family was ... turned 3 years old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:
I can't simply call get_text() and then split on the link text, because the link text may occur more than once in the paragraph.
I thought about adding a counter to see how many times the link text occurs, calling split(), and then looping to reassemble the parts I need.
I would appreciate a better, less messy method.
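One direction that might avoid split() entirely is to walk the paragraph's text nodes in document order and sort them by position relative to the link, rather than by their content. The following is only a sketch of that idea: the helper name split_around_link, the sample HTML, and the example.com URLs are made up for illustration, and html.parser is used here just to keep the sample free of the lxml dependency.

```python
from bs4 import BeautifulSoup

def split_around_link(link):
    """Return (text_before, text_after) for the <p> containing `link`.

    Hypothetical helper: position in the tree decides the split, so
    repeated link text elsewhere in the paragraph does not matter.
    """
    para = link.find_parent("p")
    if para is None:
        return None
    before, after = [], []
    seen_link = False
    # find_all(string=True) yields the text nodes in document order
    for node in para.find_all(string=True):
        # Identity check: skip text nodes that live inside the link itself
        if any(parent is link for parent in node.parents):
            seen_link = True
            continue
        (after if seen_link else before).append(str(node))
    return "".join(before), "".join(after)

# Made-up sample: the link text "Weekly" also appears earlier in the paragraph
html = (
    '<p>See <i>Weekly</i> first. An article in '
    '<a href="http://example.com/x"><i>Weekly</i></a> said more. '
    'Also <a href="http://example.com/y">Other</a> end.</p>'
)
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", href="http://example.com/x")
before, after = split_around_link(link)
print(repr(before))
print(repr(after))
```

Text from other links and inline tags such as <nobr> is kept in the two parts, which seems to match the desired output above. A link with no text at all would never flip seen_link, so that case would need separate handling.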