Extract every list item from a ul with Beautiful Soup
0 votes
/ January 18, 2020

I'm trying to get every link out of an unordered list using Python. How would I go about extracting the href from each list item (i.e. pulling href="al/bessemer/4921-promenade-parkway")?

from urllib.request import urlopen
from bs4 import BeautifulSoup

uri = 'https://locations.fivebelow.com/al'
html = urlopen(uri)
soup = BeautifulSoup(html, 'lxml')
soup.find_all('ul', class_ = 'Directory-listLinks')

It returns this:

[<ul class="Directory-listLinks"><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/bessemer/4921-promenade-parkway"><span class="Directory-listLinkText">Bessemer</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(3)" data-ya-track="todirectory" href="al/birmingham"><span class="Directory-listLinkText">Birmingham</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/cullman/1230-cullman-shopping-ctr-nw"><span class="Directory-listLinkText">Cullman</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/daphne/6850-13-highway-90"><span class="Directory-listLinkText">Daphne</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/decatur/1241-pointe-mallard-parkway"><span class="Directory-listLinkText">Decatur</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/dothan/3500-ross-clark-cir"><span class="Directory-listLinkText">Dothan</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/florence/390-cox-creek-parkway"><span class="Directory-listLinkText">Florence</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/foley/2528-s-mckenzie-street"><span class="Directory-listLinkText">Foley</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/fultondale/3453-lowery-parkway"><span class="Directory-listLinkText">Fultondale</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" 
href="al/gadsden/526-meighan-blvd-east"><span class="Directory-listLinkText">Gadsden</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(2)" data-ya-track="todirectory" href="al/huntsville"><span class="Directory-listLinkText">Huntsville</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/montgomery/7670-east-chase-parkway"><span class="Directory-listLinkText">Montgomery</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/oxford/50-commons-way"><span class="Directory-listLinkText">Oxford</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/prattville/1472-cotton-exchange"><span class="Directory-listLinkText">Prattville</span></a></li><li class="Directory-listItem"><a class="Directory-listLink" data-count="(1)" data-ya-track="todirectory" href="al/tuscaloosa/1451-dr-edward-hillard-drive"><span class="Directory-listLinkText">Tuscaloosa</span></a></li></ul>]

This returns a list with a single element, with everything under one index. I was wondering how I could create a separate list entry for each list item and then extract the href links from them.

Thanks!

Answers [ 2 ]

0 votes
/ January 19, 2020

Here is my solution with bs4 and urllib.request:

from bs4 import BeautifulSoup
from urllib.request import urlopen

uri = 'https://locations.fivebelow.com/al'
html = urlopen(uri)
soup = BeautifulSoup(html, 'lxml')
li_list = (soup.find('ul', class_='Directory-listLinks')).find_all("li")
urls = []
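for n in range(len(li_list)):
    # Slice off the markup before href, then take the first quoted value (depends on the exact attribute layout)
    urls.append("https://locations.fivebelow.com/" + str(str(li_list[n])[105:]).split('"')[1])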

print(urls)

Results:

['https://locations.fivebelow.com/al/bessemer/4921-promenade-parkway', 
'https://locations.fivebelow.com/al/birmingham', 
'https://locations.fivebelow.com/al/cullman/1230-cullman-shopping-ctr-nw', 
'https://locations.fivebelow.com/al/daphne/6850-13-highway-90', 
'https://locations.fivebelow.com/al/decatur/1241-pointe-mallard-parkway', 
'https://locations.fivebelow.com/al/dothan/3500-ross-clark-cir', 
'https://locations.fivebelow.com/al/florence/390-cox-creek-parkway', 
'https://locations.fivebelow.com/al/foley/2528-s-mckenzie-street', 
'https://locations.fivebelow.com/al/fultondale/3453-lowery-parkway', 
'https://locations.fivebelow.com/al/gadsden/526-meighan-blvd-east', 
'https://locations.fivebelow.com/al/huntsville', 
'https://locations.fivebelow.com/al/montgomery/7670-east-chase-parkway', 
'https://locations.fivebelow.com/al/oxford/50-commons-way', 
'https://locations.fivebelow.com/al/prattville/1472-cotton-exchange', 
'https://locations.fivebelow.com/al/tuscaloosa/1451-dr-edward-hillard-drive']
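
The `[105:]` slice above only works while every `<li>` carries exactly the same attributes in front of `href`. A less layout-dependent variant is to read the `href` attribute straight from each `<a>` tag and resolve it with `urljoin`. This is only a sketch under a few assumptions: `SAMPLE_HTML` is a trimmed copy of the markup from the question (so it runs without a network request), `extract_links` and `BASE_URL` are names made up for illustration, and `html.parser` stands in for `lxml` so nothing extra has to be installed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (three of the fifteen items).
SAMPLE_HTML = """
<ul class="Directory-listLinks">
<li class="Directory-listItem"><a class="Directory-listLink" href="al/bessemer/4921-promenade-parkway"><span class="Directory-listLinkText">Bessemer</span></a></li>
<li class="Directory-listItem"><a class="Directory-listLink" href="al/birmingham"><span class="Directory-listLinkText">Birmingham</span></a></li>
<li class="Directory-listItem"><a class="Directory-listLink" href="al/cullman/1230-cullman-shopping-ctr-nw"><span class="Directory-listLinkText">Cullman</span></a></li>
</ul>
"""

# Hypothetical constant: the same prefix the answer above prepends by hand.
BASE_URL = "https://locations.fivebelow.com/"


def extract_links(html, base_url=BASE_URL):
    """Return absolute URLs for every <a href> inside ul.Directory-listLinks."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: any <a> that has an href, nested under the target <ul>.
    anchors = soup.select("ul.Directory-listLinks a[href]")
    # Tag attributes are read like dictionary keys; urljoin makes them absolute.
    return [urljoin(base_url, a["href"]) for a in anchors]


urls = extract_links(SAMPLE_HTML)
print(urls)
```

Against the live page, the same function should work if you pass it the result of `urlopen(uri).read()` instead of `SAMPLE_HTML`, assuming the site still serves the markup shown in the question.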
0 votes
/ January 19, 2020

Try a solution with SimplifiedDoc.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
uri = 'https://locations.fivebelow.com/al'
html = req.get(uri)
doc = SimplifiedDoc(html)
lstA = doc.getElementByClass('Directory-listLinks').listA(url=uri)
print([a.url for a in lstA])

Result:

[u'https://locations.fivebelow.com/al/foley/2528-s-mckenzie-street', u'https://locations.fivebelow.com/al/oxford/50-commons-way', u'https://locations.fivebelow.com/al/decatur/1241-pointe-mallard-parkway', u'https://locations.fivebelow.com/al/prattville/1472-cotton-exchange', u'https://locations.fivebelow.com/al/bessemer/4921-promenade-parkway', u'https://locations.fivebelow.com/al/tuscaloosa/1451-dr-edward-hillard-drive', u'https://locations.fivebelow.com/al/daphne/6850-13-highway-90', u'https://locations.fivebelow.com/al/fultondale/3453-lowery-parkway', u'https://locations.fivebelow.com/al/dothan/3500-ross-clark-cir', u'https://locations.fivebelow.com/al/montgomery/7670-east-chase-parkway', u'https://locations.fivebelow.com/al/huntsville', u'https://locations.fivebelow.com/al/birmingham', u'https://locations.fivebelow.com/al/florence/390-cox-creek-parkway', u'https://locations.fivebelow.com/al/cullman/1230-cullman-shopping-ctr-nw', u'https://locations.fivebelow.com/al/gadsden/526-meighan-blvd-east']