How to extract text that has no HTML tag with Python - PullRequest
0 votes
/ January 13, 2020

How can I extract each sentence that is not wrapped in an HTML tag and then add the sentences to a list?

For example:

without_bracket = ['Jomi Jomi, okuroro ni i soni da', 'Joosua, ajooko bi eni wogbe.', ...]

with_bracket = ["Insisting that one's children act like one makes one a wicked person", 'Joshua, a name that sounds like an act of jumping into the bush']
<div class='post-body entry-content' id='post-body-627561819859082887' itemprop='description articleBody'>
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).<br />
- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).<br />
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).<br />
-Kaka ko san fun alajapa, pipa lori igun n pa. (Instead of things to get better for the trader, he is turning bald like a vulture).<br />
- Kini apari wa de iso onigbajamo.( what is a bald man doing in a barber's shop?)<br />
-Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.(Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth).<br />
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)<br />
-Kosi ohun to kan baalu pelu pe ona moto ko dara.( The aeroplane has no business with a bad road).<br />
<div style='clear: both;'></div>
</div>

Answers [ 2 ]

0 votes
/ January 13, 2020

A solution using simplified_scrapy:

from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''<div class='post-body entry-content' id='post-body-627561819859082887' itemprop='description articleBody'>
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).<br />
- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).<br />
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).<br />
-Kaka ko san fun alajapa, pipa lori igun n pa. (Instead of things to get better for the trader, he is turning bald like a vulture).<br />
- Kini apari wa de iso onigbajamo.( what is a bald man doing in a barber's shop?)<br />
-Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.(Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth).<br />
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)<br />
-Kosi ohun to kan baalu pelu pe ona moto ko dara.( The aeroplane has no business with a bad road).<br />
<div style='clear: both;'></div>
</div>'''
doc = SimplifiedDoc(html)
lst = doc.div.getText('\n').split('\n')
# lst = doc.getElement('div',attr='id',value='post-body-627561819859082887')
# lst = doc.getElement('div',attr='class',value='post-body entry-content')
# lst = doc.getElement('div',attr='itemprop',value='description articleBody')
without_bracket = []
with_bracket = []
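for line in lst:
  # Split each line at the first '(' into the proverb and its translation.
  tmp = line.split('(', 1)
  if len(tmp) < 2:
    continue  # skip lines that have no bracketed translation
  without_bracket.append(tmp[0].strip('-').strip())
  with_bracket.append(tmp[1].strip('.)').strip())
print(without_bracket)
print(with_bracket)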

Result:

['Jomi Jomi, okuroro ni i soni da..', 'Joosua, ajooko bi eni wogbe.', 'Ka gbekun yile, kii se egbe aja laelae', 'Kaka ko san fun alajapa, pipa lori igun n pa.', 'Kini apari wa de iso onigbajamo.', "Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.", 'Ko si iru kaun lawujo okuta.', 'Kosi ohun to kan baalu pelu pe ona moto ko dara.']
["Insisting that one's children act like one makes one a wicked person", 'Joshua, a name that sounds like an act of jumping into the bush', 'The fall of a leopard does not mean he can be likened to a dog', 'Instead of things to get better for the trader, he is turning bald like a vulture', "what is a bald man doing in a barber's shop?", 'Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth', 'there is no stone like potash, it is matchless', 'The aeroplane has no business with a bad road']
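If installing a third-party package is not an option, the text nodes can probably be collected with the standard library alone. The following is a minimal sketch, assuming Python 3's `html.parser`; the `TextCollector` class and the shortened `sample` markup are illustrative names introduced here, not part of the answer above.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text chunks found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; ignore whitespace-only chunks.
        text = data.strip()
        if text:
            self.chunks.append(text)


# Shortened, hypothetical sample in the same shape as the question's HTML.
sample = """<div class='post-body'>
- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).<br />
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)<br />
<div style='clear: both;'></div>
</div>"""

collector = TextCollector()
collector.feed(sample)
collector.close()
lines = collector.chunks  # one entry per line that was separated by <br />
print(lines)
```

Each entry in `lines` still contains the leading dash and the bracketed translation, so it can be split the same way as in the loop above.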
0 votes
/ January 13, 2020

Try something like this:

from bs4 import BeautifulSoup
import re

html = """
<div class='post-body entry-content' id='post-body-627561819859082887' itemprop='description articleBody'>
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).<br />
- Joosua, ajooko bi eni wogbe. (Joshua, a name that sounds like an act of jumping into the bush).<br />
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).<br />
-Kaka ko san fun alajapa, pipa lori igun n pa. (Instead of things to get better for the trader, he is turning bald like a vulture).<br />
- Kini apari wa de iso onigbajamo.( what is a bald man doing in a barber's shop?)<br />
-Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade.(Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth).<br />
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)<br />
-Kosi ohun to kan baalu pelu pe ona moto ko dara.( The aeroplane has no business with a bad road).<br />
<div style='clear: both;'></div>
</div> 
       """
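soup = BeautifulSoup(html, 'html.parser')
text = soup.find('div').text.rstrip()

# Capture the text inside each pair of parentheses (raw strings avoid invalid-escape warnings).
with_bracket = re.findall(r'\(([^)]+)', text)
print(with_bracket)
# Drop the bracketed translations, then split what is left on the leading dashes.
without_bracket = re.sub(r'\([^)]*\)', '', text)
without_bracket = without_bracket.split('-')
without_bracket = [s.rstrip() for s in without_bracket]
without_bracket.remove('')
print(without_bracket)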

Result:

["Insisting that one's children act like one makes one a wicked person", 'Joshua, a name that sounds like an act of jumping into the bush', ' The fall of a leopard does not mean he can be likened to a dog', 'Instead of things to get better for the trader, he is turning bald like a vulture', " what is a bald man doing in a barber's shop?", 'Hanging upside down is the unique nature of a bat, any bird that tries to imitate this unique nature will see blood running down its mouth', ' there is no stone like potash, it is matchless.', ' The aeroplane has no business with a bad road']
[' Jomi Jomi, okuroro ni i soni da.. .', ' Joosua, ajooko bi eni wogbe. .', ' Ka gbekun yile, kii se egbe aja laelae .', 'Kaka ko san fun alajapa, pipa lori igun n pa. .', ' Kini apari wa de iso onigbajamo.', "Ko seye to le dori kodo bi adan, afi eyi ti eje yio t'enu re jade..", 'Ko si iru kaun lawujo okuta.', 'Kosi ohun to kan baalu pelu pe ona moto ko dara..']
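The lists above still carry leading spaces and stray trailing periods, and splitting on '-' would also cut any proverb that contains a hyphen. One possible way around both is to match each line separately. This is only a sketch, assuming one proverb per line; `LINE_RE` and the shortened `text` sample are hypothetical names standing in for the string returned by `soup.find('div').text`.

```python
import re

# Hypothetical sample standing in for the text extracted from the <div>.
text = """
- Jomi Jomi, okuroro ni i soni da.. (Insisting that one's children act like one makes one a wicked person).
- Ka gbekun yile, kii se egbe aja laelae ( The fall of a leopard does not mean he can be likened to a dog).
-Ko si iru kaun lawujo okuta.( there is no stone like potash, it is matchless.)
"""

# Optional leading dash, the proverb, then the translation in parentheses,
# optionally followed by a period at the end of the line.
LINE_RE = re.compile(
    r'^\s*-?\s*(?P<proverb>[^(]*?)\s*\(\s*(?P<translation>.*?)\s*\)\s*\.?\s*$'
)

without_bracket = []
with_bracket = []
for line in text.splitlines():
    match = LINE_RE.match(line)
    if match is None:
        continue  # blank line, or a line without a bracketed translation
    without_bracket.append(match.group('proverb'))
    with_bracket.append(match.group('translation'))

print(without_bracket)
print(with_bracket)
```

Because only the dash at the start of a line is consumed, hyphens inside a proverb are left alone, and the surrounding whitespace is trimmed by the pattern itself.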