BeautifulSoup: replacing line breaks with a period and a space - PullRequest
0 votes
/ April 9, 2019

I am scraping several links with BeautifulSoup.

Here is the relevant part of the page source at the URL I am scraping:

<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 
</div>

Here is my BeautifulSoup code (relevant part only) that gets the text inside the description tags:

quote_page = sys.argv[1]
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')

description_box = soup.find('div', {'class':'description'})
description = description_box.get_text(separator=" ").strip()
print description

Running the script with python script.py https://example.com/page/2000 produces the following output:

Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 

How can I replace the line break with a period followed by a space, so that the output looks like this:

Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.

Any ideas how I can do this?

Answers [ 4 ]

1 vote
/ April 9, 2019

Based on this:

from bs4 import BeautifulSoup

html = '''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.
</div>'''
n = 2                                # occurrence i.e. 2nd in this case
sep = '\n'                           # sep i.e. newline
cells = html.split(sep)

# insert ". " in place of the n-th newline, then parse as usual
html = sep.join(cells[:n]) + ". " + sep.join(cells[n:])
soup = BeautifulSoup(html, 'html.parser')
title_box = soup.find('div', attrs={'class': 'description'})
title = title_box.get_text().strip()
print(title)

OUTPUT

Planet Nine was initially proposed to explain the clustering of orbits. Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.

EDIT

import requests
from bs4 import BeautifulSoup

page = requests.get("https://blablabla.com")
soup = BeautifulSoup(page.content, 'html.parser')
description_box = soup.find('div', attrs={'class': 'description'})
description = description_box.get_text().strip()

# After strip() the text no longer starts with a newline, so the line break
# between the two sentences is the 1st occurrence here (not the 2nd).
n = 1                                # occurrence i.e. 1st in this case
sep = '\n'                           # sep i.e. newline
cells = description.split(sep)
desired = sep.join(cells[:n]) + ". " + sep.join(cells[n:])

print(desired)
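
The split-and-join trick above can be wrapped in a small helper. This is only a sketch: the name replace_nth and its parameters are hypothetical (not part of BeautifulSoup), and it works on a plain string such as the one returned by get_text().strip().

```python
def replace_nth(text, sep, n, replacement):
    """Replace the n-th occurrence (1-based) of sep in text with replacement."""
    cells = text.split(sep)
    # text contains len(cells) - 1 occurrences of sep; leave it unchanged
    # when the requested occurrence does not exist.
    if n < 1 or n >= len(cells):
        return text
    return sep.join(cells[:n]) + replacement + sep.join(cells[n:])


text = "first line\nsecond line"
print(replace_nth(text, "\n", 1, ". "))  # first line. second line
```

Returning the text unchanged for an out-of-range n is a design choice made for this sketch; raising an error would be equally reasonable.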
0 votes
/ April 9, 2019

Use split and join with select_one:

from bs4 import BeautifulSoup as bs

html = '''
<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 
</div>
'''
soup = bs(html, 'lxml')
text = ' '.join(soup.select_one('.description').text.split('\n'))
print(text)
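
Joining with a single space leaves the leading and trailing whitespace in place and does not add the period the question asks for. A possible variation, shown on a plain string so it runs without fetching a page, is to drop empty lines and join the rest with ". ". The helper name join_lines is hypothetical, and the sample text is a shortened stand-in for what soup.select_one('.description').text would return.

```python
def join_lines(text, joiner=". "):
    """Join the non-empty, stripped lines of text with joiner."""
    lines = (line.strip() for line in text.splitlines())
    return joiner.join(line for line in lines if line)


# Shortened stand-in for the text extracted from the description div
sample = "\nPlanet Nine was initially proposed\nOf Planet Nine's other effects \n"
print(join_lines(sample))
```

Note that a line which already ends with a period would get a second one from this sketch, so it fits input like the sample, where only the last line ends a sentence.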
0 votes
/ April 9, 2019

Split the markup into lines and join them back together before parsing.

from bs4 import BeautifulSoup

htmldata='''<div class="description">
Planet Nine was initially proposed to explain the clustering of orbits
Of Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four. 
</div>'''
htmldata="".join(item.strip() for item in htmldata.split("\n"))
soup=BeautifulSoup(htmldata,'html.parser')
description_box = soup.find('div', class_='description')
print(description_box.text)

Output:

Planet Nine was initially proposed to explain the clustering of orbitsOf Planet Nine's other effects, one was unexpected, the perpendicular orbits, and the other two were found after further analysis. Although other mechanisms have been offered for many of these peculiarities, the gravitational influence of Planet Nine is the only one that explains all four.

EDIT:

import requests
from bs4 import BeautifulSoup

htmldata=requests.get("url here").text

htmldata="".join(item.strip() for item in htmldata.split("\n"))
soup=BeautifulSoup(htmldata,'html.parser')
description_box = soup.find('div', class_='description')
print(description_box.text.strip())
0 votes
/ April 9, 2019

Try this:

description = description_box.get_text(separator=" ").rstrip("\n")
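
Note that rstrip("\n") only removes newlines at the end of the string, so the line break between the two sentences stays in place. One way to handle the inner break is a regular expression. The following is a sketch under that assumption; the helper name newlines_to_periods is hypothetical, and it takes the string returned by get_text().

```python
import re


def newlines_to_periods(text):
    """Turn each inner line break into '. ' without doubling periods."""
    # Strip outer whitespace first, then replace every whitespace run that
    # contains a newline (plus an optional period just before it) with ". ".
    return re.sub(r"\.?\s*\n\s*", ". ", text.strip())


print(newlines_to_periods("\nfirst line\nsecond line. \n"))  # first line. second line.
```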