I'm using BeautifulSoup4 to do some HTML cleaning. I'm trying to extract the important information, such as the title, metadata, paragraphs, and list items.
My problem is that I can only get the paragraphs, like this:
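import urllib.request

from bs4 import BeautifulSoup

def main():
    # Download the page and parse it with Python's built-in HTML parser
    response = urllib.request.urlopen('https://ecir2019.org/industry-day/')
    html = response.read()
    soup = BeautifulSoup(html, features="html.parser")
    # Collect the text of every <p> element, one per line
    text = [e.get_text() for e in soup.find_all('p')]
    article = '\n'.join(text)
    print(article)

main()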
But if the page I'm scraping has bullet points in its text and I change p to li or ul to pick them up, the output also includes the navigation bar.
For example, this is the output I want:
The Industry Day's objectives are three-fold:
The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.
The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.
Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.
What I actually get:
The Industry Day's objectives are three-fold:
The tags in the HTML source:
<p>The Industry Day's objectives are three-fold:</p>
<ol>
<li>The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.</li>
<li>The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.</li>
<li>Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.</li>
</ol>
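
To make the goal concrete, here is a minimal sketch of the kind of extraction I'm after. It rests on assumptions I haven't verified against the real page: that the navigation bar sits inside a `<nav>`, `<header>` or `<footer>` element, and that list items are not nested inside one another. The function name `extract_text` and the `<nav>` block in `sample` are made up for illustration. Only the `<p>` and `<ol>` markup comes from the page source above.

```python
from bs4 import BeautifulSoup

# Assumption: navigation markup lives inside one of these containers.
NAV_CONTAINERS = ["nav", "header", "footer"]


def extract_text(html):
    """Return <p> and <li> text in document order, skipping navigation items."""
    soup = BeautifulSoup(html, features="html.parser")
    lines = []
    # Passing a list of tag names to find_all yields matches in document order.
    for element in soup.find_all(["p", "li"]):
        # Skip anything that has a navigation-like ancestor.
        if element.find_parent(NAV_CONTAINERS) is not None:
            continue
        text = element.get_text().strip()
        if text:
            lines.append(text)
    return "\n".join(lines)


# Hypothetical <nav> block followed by the markup quoted in the question.
sample = """
<nav><ul><li>Home</li><li>Program</li></ul></nav>
<p>The Industry Day's objectives are three-fold:</p>
<ol>
<li>The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.</li>
<li>The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.</li>
<li>Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.</li>
</ol>
"""

print(extract_text(sample))
```

If the real site marks its navigation differently (for example with a `div` carrying a particular class), the ancestor check would have to be adapted to match.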