How to link scraped paragraphs with the last scraped heading from Wikipedia - PullRequest
1 vote
/ January 10, 2020

I'm currently iterating over Wikipedia pages to find every paragraph, and I'm also iterating over all the headings so that I can put the two together. I then send them through a summarizer to extract the important information.

I'm trying to link each heading to its corresponding paragraphs. However, if a heading has several paragraphs, the code has no way of knowing that, and when I write all the information to a text file it puts one heading followed by one paragraph, regardless of whether they are related. I'm not sure whether what I need is clear, so feel free to ask questions.

The code I'm using:

from bs4 import BeautifulSoup
import requests
from summarizer import summarize
# Here, we're just importing both Beautiful Soup and the Requests library

page_link = 'https://en.wikipedia.org/wiki/England'
# this is the url that we've already determined is safe and legal to scrape from.#

page_response = requests.get(page_link, timeout=5)
# here, we fetch the content from the url, using the requests library

page_content = BeautifulSoup(page_response.content, "html.parser")
#we use the html parser to parse the url content and store it in a variable.

# VVV this is where i find the paragraphs and the headings.
textContent = []
for i in range(0,100):
    paragraphs = page_content.find_all("p")[i].text
    while True:
        try:
            headings = page_content.find_all("h2")[i].text
            textContent.append(headings)
            break
        except IndexError:
            break
    textContent.append(paragraphs)
# this is the summariser
for i in range(len(textContent)):
    textContent[i] = summarize("{}".format(i),textContent[i], count=2)
# write to file here
with open('test.txt', 'w') as f:
    for item in textContent:
        f.write("%s\n" % item)
        f.write("\n")

The current output I'm getting is:

['Toponymy ']

[' - \xa0in Europe\xa0(green &\xa0dark grey) - \xa0in the United Kingdom\xa0(green) ']

['History ']

['[5][6][7] It shares land borders with Wales to the west and Scotland to the north.', 'England is separated from continental Europe by the North Sea to the east and the English Channel to the south.']

And so on. Then, at the end, there is just a run of paragraphs that could not be matched to any heading.

Thanks.
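The mismatch described above seems to come from indexing headings and paragraphs with the same counter, so the i-th h2 is written next to the i-th p whether or not they belong together. Below is a minimal sketch of that pairing on plain lists; the list contents and the function name are made up for illustration and do not come from the real page.

```python
def pair_by_index(headings, paragraphs):
    """Mimic the loop in the question: one heading, then one paragraph, per index."""
    result = []
    for i in range(len(paragraphs)):
        if i < len(headings):
            # Same effect as the try/except IndexError around find_all("h2")[i].
            result.append(headings[i])
        result.append(paragraphs[i])
    return result


# Hypothetical data: one intro paragraph, two under "Toponymy", one under "History".
headings = ["Toponymy", "History"]
paragraphs = ["intro", "toponymy 1", "toponymy 2", "history 1"]

paired = pair_by_index(headings, paragraphs)
print(paired)
```

Here "Toponymy" ends up next to the intro paragraph and "History" next to the first Toponymy paragraph, and once the headings run out the remaining paragraphs follow with no heading at all, which matches the symptom in the output above.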

1 Answer

0 votes
/ January 10, 2020

Try the following code: call find_all('h2'), then use find_next_siblings('p') to collect the p tags that follow each h2, up to the next h2.

from bs4 import BeautifulSoup
import requests

page_link = 'https://en.wikipedia.org/wiki/England'
page_response = requests.get(page_link, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
textContent = []
# Skip the first h2, then pair every remaining heading with its paragraphs.
for tag in page_content.find_all('h2')[1:]:
    texth2 = tag.text.strip()
    textContent.append(texth2)
    # find_next_siblings('p') returns every later sibling <p>, including those
    # under later headings, so keep only the paragraphs whose nearest
    # preceding h2 is the current heading.
    for item in tag.find_next_siblings('p'):
        if texth2 in item.find_previous_siblings('h2')[0].text.strip():
            textContent.append(item.text.strip())

print(textContent)

Console output:

 ['Toponymy', 'The name "England" is derived from the Old English name Englaland, which means "land of the Angles".[15] The Angles were one of the Germanic tribes that settled in Great Britain during the Early Middle Ages. The Angles came from the Anglia peninsula in the Bay of Kiel area (present-day German state of Schleswig–Holstein) of the Baltic Sea.[16] The earliest recorded use of the term, as "Engla londe", is in the late-ninth-century translation into Old English of Bede\'s Ecclesiastical History of the English People. The term was then used in a different sense to the modern one, meaning "the land inhabited by the English", and it included English people in what is now south-east Scotland but was then part of the English kingdom of Northumbria. The Anglo-Saxon Chronicle recorded that the Domesday Book of 1086 covered the whole of England, meaning the English kingdom, but a few years later the Chronicle stated that King Malcolm III went "out of Scotlande into Lothian in Englaland", thus using it in the more ancient sense.[17]', 'The earliest attested reference to the Angles occurs in the 1st-century work by Tacitus, Germania, in which the Latin word Anglii is used.[18] The etymology of the tribal name itself is disputed by scholars; it has been suggested that it derives from the shape of the Angeln peninsula, an angular shape.[19] How and why a term derived from the name of a tribe that was less significant than others, such as the Saxons, came to be used for the entire country and its people is not known, but it seems this is related to the custom of calling the Germanic people in Britain Angli Saxones or English Saxons to distinguish them from continental Saxons (Eald-Seaxe) of Old Saxony between the Weser and Eider rivers in Northern Germany.[20] In Scottish Gaelic, another language which developed on the island of Great Britain, the Saxon tribe gave their name to the word for England (Sasunn);[21] similarly, the Welsh name for the English language is 
"Saesneg". A romantic name for England is Loegria, related to the Welsh word for England, Lloegr, and made popular by its use in Arthurian legend. Albion is also applied to England in a more poetic capacity,[22] though its original meaning is the island of Britain as a whole.', 'History', 'The earliest known evidence of human presence in the area now known as England was that of Homo antecessor, dating to approximately 780,000 years ago. The oldest proto-human bones discovered in England date from 500,000\xa0years ago.[23] Modern humans are known to have inhabited the area during the Upper Paleolithic period, though permanent settlements were only established within the last 6,000 years.[24][25]\nAfter the last ice age only large mammals such as mammoths, bison and woolly rhinoceros remained. Roughly 11,000\xa0years ago, when the ice sheets began to recede, humans repopulated the area; genetic research suggests they came from the northern part of the Iberian Peninsula.[26] The sea level was lower than now and Britain was connected by land bridge to Ireland and Eurasia.[27]\nAs the seas rose, it was separated from Ireland 10,000\xa0years ago and from Eurasia two millennia later.', 
    ....so on]
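As a possible alternative to the sibling lookup, the page can be walked once in document order, attaching each paragraph to the most recent heading seen. The sketch below is only an illustration under stated assumptions: the function name group_paragraphs_by_heading, the "(lead)" placeholder title, and the sample HTML are all hypothetical, and it runs on a local string rather than on the live Wikipedia page.

```python
from bs4 import BeautifulSoup


def group_paragraphs_by_heading(html, heading_tag="h2", lead_title="(lead)"):
    """Return a list of (heading, [paragraphs]) pairs in document order.

    Headings that have no paragraphs are dropped.
    """
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    # Paragraphs that appear before the first heading go under lead_title.
    current = (lead_title, [])
    # find_all with a list of names yields matching tags in document order.
    for tag in soup.find_all([heading_tag, "p"]):
        text = tag.get_text().strip()
        if tag.name == heading_tag:
            # A new heading closes the previous section and opens a new one.
            if current[1]:
                sections.append(current)
            current = (text, [])
        elif text:
            current[1].append(text)
    if current[1]:
        sections.append(current)
    return sections


# Hypothetical sample markup standing in for a downloaded page.
sample_html = """
<p>Intro paragraph.</p>
<h2>Toponymy</h2>
<p>First toponymy paragraph.</p>
<p>Second toponymy paragraph.</p>
<h2>History</h2>
<div><p>Nested history paragraph.</p></div>
<h2>See also</h2>
"""

sections = group_paragraphs_by_heading(sample_html)
for heading, paragraphs in sections:
    print(heading)
    for paragraph in paragraphs:
        print("    " + paragraph)
```

Because the grouping follows document order instead of sibling relationships, it should also cope with paragraphs or headings wrapped in extra containers. Each (heading, paragraphs) pair can then be summarized or written to the text file as one unit, so a heading is always followed by all of its own paragraphs.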