Wikipedia web scraping: problem with elements between the heading and the list - PullRequest
0 votes
/ April 10, 2019
from bs4 import BeautifulSoup as Soup
import requests

url = "https://en.wikipedia.org/wiki/January_1"

r = requests.get(url)
soup = Soup(r.content, "html.parser")

# Locate the "Births" heading span, then take the element that
# immediately follows its parent heading (expected to be a <ul>).
temple_span = soup.find("span", {"id": "Births"})
temples_ul = temple_span.parent.find_next_sibling()

# Print the text of every list item in that element.
for item in temples_ul.find_all("li"):
    print(item.text)

But this does not work when there are other elements between the span and the li items. Example: https://en.wikipedia.org/wiki/Lists_of_tourist_attractions

HTML from that page:

</span></span></h3>
<div class="thumb tright"><div class="thumbinner" style="width:222px;"><a href="/wiki/File:Schwerin_Castle_Aerial_View_Island_Luftbild_Schweriner_Schloss_Insel_See.jpg" class="image"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Schwerin_Castle_Aerial_View_Island_Luftbild_Schweriner_Schloss_Insel_See.jpg/220px-Schwerin_Castle_Aerial_View_Island_Luftbild_Schweriner_Schloss_Insel_See.jpg" decoding="async" width="220" height="275" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Schwerin_Castle_Aerial_View_Island_Luftbild_Schweriner_Schloss_Insel_See.jpg/330px-Schwerin_Castle_Aerial_View_Island_Luftbild_Schweriner_Schloss_Insel_See.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Schwerin_Castle_Aerial_View_Island_Luftbild_Schweriner_Schloss_Insel_See.jpg/440px-Schwerin_Castle_Aerial_View_Island_Luftbild_Schweriner_Schloss_Insel_See.jpg 2x" data-file-width="2400" data-file-height="3000" /></a>  <div class="thumbcaption"><div class="magnify"><a href="/wiki/File:Schwerin_Castle_Aerial_View_Island_Luftbild_Schweriner_Schloss_Insel_See.jpg" class="internal" title="Enlarge"></a></div><a 

    href="/wiki/Tourism_in_Germany" title="Tourism in Germany">Tourism in Germany</a> (<a href="/wiki/Schwerin_Palace" title="Schwerin Palace">Schwerin Palace</a>)</div></div></div>
    <div role="note" class="hatnote navigation-not-searchable">Main article: <a href="/wiki/Tourism_in_Germany" title="Tourism in Germany">Tourism in Germany</a></div>
    <ul><li><a href="/wiki/List_of_sights_in_Berlin" title="List of sights in Berlin">List of sights in Berlin</a>
    <ul><li><a href="/wiki/List_of_sights_of_Potsdam" class="mw-redirect" title="List of sights of Potsdam">List of sights of Potsdam</a></li></ul></li>
    <li><a href="/wiki/List_of_castles_in_Germany" title="List of castles in Germany">List of castles in Germany</a></li>
    <li><a href="/wiki/List_of_cathedrals_in_Germany" title="List of cathedrals in Germany">List of cathedrals in Germany</a></li>
    <li><a href="/wiki/List_of_museums_in_Germany" title="List of museums in Germany">List of museums in Germany</a></li>
    <li><a href="/wiki/List_of_tallest_structures_in_Germany" title="List of tallest structures in Germany">List of tallest structures in Germany</

The code above does not work here because a div comes before the list. How can I get the same output as above, but with only the li items?

1 Answer

0 votes
/ April 10, 2019

Try:

from bs4 import BeautifulSoup as Soup
import requests

url = "https://en.wikipedia.org/wiki/Lists_of_tourist_attractions"

r = requests.get(url)
soup = Soup(r.text, "html.parser")

# Walk every div on the page: print its text, then the text of each li inside it.
for div in soup.find_all("div"):
    print(div.text)
    for li in div.find_all("li"):
        print(li.text)
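
If only the list under one heading is wanted, a narrower approach may be to pass a tag name to find_next_sibling so that the div elements between the heading and the list are skipped. The following is a minimal sketch, not a tested solution for the live page: SAMPLE_HTML is a hypothetical fragment that imitates the structure quoted in the question, the id "Germany" and the helper name list_items_after_heading are assumptions made for illustration, and the real Wikipedia markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the question's snippet: a heading,
# then two div elements, then the list we actually want.
SAMPLE_HTML = """
<h3><span class="mw-headline" id="Germany">Germany</span></h3>
<div class="thumb tright">thumbnail</div>
<div role="note" class="hatnote">Main article: Tourism in Germany</div>
<ul>
  <li><a href="/wiki/List_of_sights_in_Berlin">List of sights in Berlin</a>
    <ul><li><a href="/wiki/List_of_sights_of_Potsdam">List of sights of Potsdam</a></li></ul>
  </li>
  <li><a href="/wiki/List_of_castles_in_Germany">List of castles in Germany</a></li>
</ul>
"""


def list_items_after_heading(html, span_id):
    """Return the li labels of the first <ul> sibling after the heading span."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", {"id": span_id})
    if span is None:
        return []
    # Passing "ul" makes find_next_sibling skip siblings that are not <ul>,
    # such as the thumbnail and hatnote div elements.
    ul = span.parent.find_next_sibling("ul")
    if ul is None:
        return []
    labels = []
    for li in ul.find_all("li"):
        # Use the first link's text so a parent li does not repeat
        # the text of the items nested inside it.
        link = li.find("a")
        if link is not None:
            labels.append(link.get_text(strip=True))
        else:
            labels.append(li.get_text(" ", strip=True))
    return labels


if __name__ == "__main__":
    for label in list_items_after_heading(SAMPLE_HTML, "Germany"):
        print(label)
```

To try this against the real page, r.text from requests.get(url) could be passed in place of SAMPLE_HTML, with span_id set to whatever id the target heading actually carries.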