How do I extract the text between <hr> tags with BeautifulSoup4?
0 votes
/ April 10, 2020

I am struggling to find a way to extract, as separate pieces, all of the text between each pair of hr tags in this document:

<html>
<head>
<!--Created 6-11-96 by Dan Axtell-->
<meta content="101 Elementary School Mission Statements compiled from the Web 11 June 1996" name="DESCRIPTION"/>
</head>
<body><a name="TOP">
<h2>101 Elementary School Mission Statements</h2>
11 June 1996
<p>
This list was compiled when the web was young. Most links don't work now. Note that some of these mission statements may be copyrighted. All material was pasted verbatim from the web pages, which accounts for the odd formating.
</p><hr/>
                                 WINDSOR Elementary School,
                            in partnership with its children, families,
                             community and Richland District Two,
                           guarantees each child a superior education
              by providing quality instruction and challenging learning experiences
                                in a safe and orderly environment
                  which will foster life-long learning and responsible citizenship.
</a>
<a href="http://www.scsn.net/users/rich2/elem/windsor/text.htm">http://www.scsn.net/users/rich2/elem/windsor/text.htm</a>
<hr/>

This We Believe...
Yokayo Elementary School provides a nurturing environment committed to achiving excellence. All students are challenged to
reach their maximum potential by learning at their functional level to provide a solid foundation of skills, knowledge and values.
This foundation enables each student to become a well-educated, productive adult able to cope with an ever changing world.

We believe that all learners must become:

     Effective Communicators who will use verbal, written, artistic and technological forms of communication to give,
     send, and receive information.
     Inspired Learners who are accountable for demonstrating, assessing, and directing their present and life-long
     intellectual growth.
     Productive Workers who perform collaboratively and independently to create quality products and services that
     reflect personal pride and responsiblility.
     Responsible Citizens who have a global and multi-cultural perspective, and who take the initiative for improving the
     quality of life for self and others.
     Resourceful Thinkers who independently and creatively strive to solve complex problems through reflection, risk
     taking, and critical evaluation.
<a href="http://happy.yokayo.uusd.k12.ca.us/Goals.html">http://happy.yokayo.uusd.k12.ca.us/Goals.html</a>
<hr/>
University Elementary School

                                    Mission Statement

  At University Elementary School, students should be accepted, appreciated, nurtured, and
challanged according to their individual needs.

  Through their education at school, students should gain the skills, strategies, and desire
necessary for continued learning.  They should also develop a strong sense of responsibility for
themselves and toward each other, their community, and the earth's resources.

 To this end, faculty and staff should create a rich multicultural environment for learning; design
an integrated curriculum with strong science, fine arts, and social studies components; provide for
children to become self-directed learners; and share their enthusiasm for learning, in an
atmosphere of mutual respect and appreciation.
<a href="http://www.intersource.com/~wmorales/ue/mission.html">http://www.intersource.com/~wmorales/ue/mission.html</a>
<hr/>

The document contains 100 excerpts, and this is just a sample, but the formatting is the same throughout. I tried using .nextSibling like this:

for i in soup.find_all('hr'):
    print(i.nextSibling)

and got this output:

                                 WINDSOR Elementary School,



University Elementary School

Altamont Elementary School

...

How can I extend this to include everything up to the next hr tag, so that I can extract each full statement, like this:

WINDSOR Elementary School,
                            in partnership with its children, families,
                             community and Richland District Two,
                           guarantees each child a superior education
              by providing quality instruction and challenging learning experiences
                                in a safe and orderly environment
                  which will foster life-long learning and responsible citizenship.
</a>
<a href="http://www.scsn.net/users/rich2/elem/windsor/text.htm">http://www.scsn.net/users/rich2/elem/windsor/text.htm</a>

Answers [ 3 ]

0 votes
/ April 10, 2020

Sticking with Python and BeautifulSoup, try this:

for i in soup.select('hr'):
    print(i.next_element)
    print(i.next_element.next_element)
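
The two next_element calls above only reach the first couple of nodes after each hr. A minimal sketch of one way to keep walking until the next hr is shown here; it assumes the html.parser backend, and the function name text_between_hr and the inline sample markup are made up for illustration. It follows .next_elements (document order) rather than siblings, because in the question's document the first hr sits inside an unclosed <a name="TOP"> element and so is not a sibling of the later ones.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def text_between_hr(html):
    """Return one string per <hr>, holding the text that follows it
    up to the next <hr> (or to the end of the document)."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for hr in soup.find_all("hr"):
        parts = []
        # .next_elements yields every following node in document order
        for node in hr.next_elements:
            if isinstance(node, Tag) and node.name == "hr":
                break  # reached the next separator
            if isinstance(node, NavigableString) and not isinstance(node, Comment):
                text = node.strip()
                if text:
                    parts.append(text)
        if parts:  # skip an <hr> that has no text after it
            sections.append("\n".join(parts))
    return sections


# Hypothetical miniature of the page layout from the question
sample = "<p>intro</p><hr/>A <a href='x'>link</a><hr/>B<hr/>"
for section in text_between_hr(sample):
    print(repr(section))
```

Any text that follows the last hr (a page footer, for example) would be returned as a final section, so it may need to be dropped afterwards.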
0 votes
/ April 11, 2020

To get each full statement, try this:

import requests
from bs4 import BeautifulSoup, element
import pprint

html = requests.get('http://danaxtell.com/missions101.html').content
soup = BeautifulSoup(html, 'html.parser')
res = []
tmp = []
started = False
for elem in soup.find_all():
    if elem.name == "hr":
        if started:
            res.append(" ".join(tmp))
            tmp = []
        started = True
    if elem.text is not None and started:
        tmp.append(elem.text.strip())
    if elem.nextSibling is not None and isinstance(elem.nextSibling, element.NavigableString) and started:
        tmp.append(elem.nextSibling.strip())

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(res)

It iterates over all elements. Collection starts at the first hr element: for each element after it, the code appends the element's text, if any, and also the text of the element's next sibling if that sibling is a text node.

UPDATE:

To make this work in the more general case, you can use ordered sets:

import requests
from bs4 import BeautifulSoup, element
import pprint
from orderedset import OrderedSet

html = requests.get('http://danaxtell.com/missions101.html').content
soup = BeautifulSoup(html, 'html.parser')
res = []
tmp = OrderedSet()
started = False
for elem in soup.find_all():
    if elem.name == "hr":
        if started:
            res.append(" ".join(tmp))
            tmp = OrderedSet()
        started = True
    if elem.text is not None and started:
        tmp.add(elem.text.strip())
    if elem.nextSibling is not None and isinstance(elem.nextSibling, element.NavigableString) and started:
        tmp.add(elem.nextSibling.strip())

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(res)

This prevents the duplicated text that may occur on other web pages.
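
If installing the orderedset package is not an option, the same order-preserving de-duplication can be sketched with the standard library alone. This assumes Python 3.7 or later, where dict keeps insertion order; the helper name dedupe_keep_order and the sample list are hypothetical.

```python
def dedupe_keep_order(items):
    """Drop repeated strings while keeping the first occurrence of each."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Hypothetical fragments collected between two hr tags
tmp = ["Mission Statement", "http://example.com", "Mission Statement"]
print(" ".join(dedupe_keep_order(tmp)))
# prints: Mission Statement http://example.com
```

In the loop above, tmp would stay a plain list and be passed through this helper just before the " ".join call.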

0 votes
/ April 10, 2020

You can wrap the content that follows the hr tag in a custom tag and then grab its inner HTML. That said, your code has a number of problems and needs work.

Just be sure to include a dash "-" in the name of any custom tag you use.

var hrcustom = document.getElementsByTagName('hr-custom')[0].innerHTML;

console.log(hrcustom);
<html>
<head>
<!--Created 6-11-96 by Dan Axtell-->
<meta content="101 Elementary School Mission Statements compiled from the Web 11 June 1996" name="DESCRIPTION"/>
</head>
<body><a name="TOP">
<h2>101 Elementary School Mission Statements</h2>
11 June 1996
<p>
This list was compiled when the web was young. Most links don't work now. Note that some of these mission statements may be copyrighted. All material was pasted verbatim from the web pages, which accounts for the odd formating.
</p><hr/><hr-custom>
                                 WINDSOR Elementary School,
                            in partnership with its children, families,
                             community and Richland District Two,
                           guarantees each child a superior education
              by providing quality instruction and challenging learning experiences
                                in a safe and orderly environment
                  which will foster life-long learning and responsible citizenship.
</hr-custom>
</a>
<a href="http://www.scsn.net/users/rich2/elem/windsor/text.htm">http://www.scsn.net/users/rich2/elem/windsor/text.htm</a>
<hr/>

This We Believe...
Yokayo Elementary School provides a nurturing environment committed to achiving excellence. All students are challenged to
reach their maximum potential by learning at their functional level to provide a solid foundation of skills, knowledge and values.
This foundation enables each student to become a well-educated, productive adult able to cope with an ever changing world.

We believe that all learners must become:

     Effective Communicators who will use verbal, written, artistic and technological forms of communication to give,
     send, and receive information.
     Inspired Learners who are accountable for demonstrating, assessing, and directing their present and life-long
     intellectual growth.
     Productive Workers who perform collaboratively and independently to create quality products and services that
     reflect personal pride and responsiblility.
     Responsible Citizens who have a global and multi-cultural perspective, and who take the initiative for improving the
     quality of life for self and others.
     Resourceful Thinkers who independently and creatively strive to solve complex problems through reflection, risk
     taking, and critical evaluation.
<a href="http://happy.yokayo.uusd.k12.ca.us/Goals.html">http://happy.yokayo.uusd.k12.ca.us/Goals.html</a>
<hr/>
University Elementary School

                                    Mission Statement

  At University Elementary School, students should be accepted, appreciated, nurtured, and
challanged according to their individual needs.

  Through their education at school, students should gain the skills, strategies, and desire
necessary for continued learning.  They should also develop a strong sense of responsibility for
themselves and toward each other, their community, and the earth's resources.

 To this end, faculty and staff should create a rich multicultural environment for learning; design
an integrated curriculum with strong science, fine arts, and social studies components; provide for
children to become self-directed learners; and share their enthusiasm for learning, in an
atmosphere of mutual respect and appreciation.
<a href="http://www.intersource.com/~wmorales/ue/mission.html">http://www.intersource.com/~wmorales/ue/mission.html</a>
<hr/>