Question

I'm playing with a web page that lists MTG cards and trying to extract some information about them. The following program works fine: I can crawl and scrape the page and get all the information I need:

import re
from math import ceil
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

def NumOfNextPages(TotalCardNum, CardsPerPage):
    pages = ceil(TotalCardNum / CardsPerPage)
    return pages

URL = "xyz.com"
NumOfCrawledPages = 0

UClient = uReq(URL)  # downloading the url
page_html = UClient.read()
UClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")


# Finds all the cards that exist in the webpage and stores them as a bs4 object
cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
CardsPerPage = len(cards)


# Selects the card names, Power and Toughness, Set that they belong
for card in cards:

    card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")

    if len(card.div.contents) > 3:
        cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
    else:
        cardP_T = "Does not exist"

    cardType = card.contents[3].text
    print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

# Try to extract the URL of the next page. There is not always a next page, so
# indexing [0] on an empty result list raises IndexError.
try:
    URL_Next = "xyz.com/" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
except IndexError:
    # IndexError means there is no next page to crawl, so crawling ends here
    print("Crawling process completed! No more information to retrieve!")
else:
    print("The next URL is: " + URL_Next + "\n")
    NumOfCrawledPages += 1
finally:
    print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")

# We need the overall number of available cards to work out how many pages
# we have to crawl. That information comes from a "div" tag with class "summary".

OverallCardInfo = (page_soup.find("div", {"class": "summary"})).text
TotalCardNum = int(re.findall(r"\d+", OverallCardInfo)[2])
NumOfPages = NumOfNextPages(TotalCardNum, CardsPerPage)

With this I can crawl the first page, which I supply manually, and extract the total number of pages I need to crawl as well as the next URL.
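The page-count step can be tried on its own, without downloading anything. This is a minimal sketch: `total_pages` and the example summary text are hypothetical, and it assumes, as the `re.findall(...)[2]` call above does, that the third number in the summary is the overall card count.

```python
import re
from math import ceil


def total_pages(summary_text, cards_per_page):
    """Return how many pages are needed to list every card."""
    numbers = re.findall(r"\d+", summary_text)
    # Assumption: the third number in the summary is the overall card count,
    # mirroring the [2] index used in the program above.
    total_cards = int(numbers[2])
    return ceil(total_cards / cards_per_page)


# Hypothetical summary text; the real wording on the site may differ.
example_summary = "Showing 1 to 25 of 230 cards"
print(total_pages(example_summary, 25))  # prints 10
```

If the summary ever contains fewer than three numbers, `numbers[2]` raises IndexError, so the real wording of the summary is worth checking before relying on the index.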

Ultimately, I would like to give a starting point (a web page) and have the crawler move on to the other pages by itself. So I used the following for loop:

for i in range(0, NumOfPages):
    # The number of items shown by the search option on xyz.com cannot
    # be more than 10000
    if ((NumOfCrawledPages + 1) * CardsPerPage) >= 10000:
        print("Number of results provided can not exceed 10000!\nEnd of the crawling!")
        break

    if i == 0:
        Url = InitURL
    else:
        Url = URL_Next

    # opening up the connection and grabbing the page
    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # Finds all the cards that exist in the webpage and stores them as a bs4 object
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    # Selects the card names, Power and Toughness, Set that they belong
    for card in cards:

        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")

        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"

        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

    # Try to extract the URL of the next page. There is not always a next page,
    # so indexing [0] on an empty result list raises IndexError.
    try:
        URL_Next = "xyz.com" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
    except IndexError:
        # IndexError means there is no next page to crawl
        print("Crawling process completed! No more information to retrieve!")
    else:
        print("The next URL is: " + URL_Next + "\n")
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")

The second piece of code, with the additional for loop, runs without errors, but the result is not what I expect. It returns the crawl results for the first page, which I enter manually, and does not move on to the other pages ...

Why does this happen?
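One thing that could be checked here is how the next URL is put together: the first program concatenates "xyz.com/" with the href, while the loop uses "xyz.com" without the slash. The sketch below is not a diagnosis, only a way to compare plain concatenation with `urllib.parse.urljoin`. The page URL and the href values are hypothetical stand-ins, and `next_url` is a name introduced for illustration.

```python
from urllib.parse import urljoin

# Hypothetical values: this URL stands in for the real page address, and the
# hrefs are examples of what a "next" link might contain.
current_url = "https://xyz.com/cards?page=1"


def next_url(current_url, href):
    """Resolve the href of the "next" link against the page it came from."""
    return urljoin(current_url, href)


# An absolute-path href and a relative href both resolve to the same URL.
print(next_url(current_url, "/cards?page=2"))  # https://xyz.com/cards?page=2
print(next_url(current_url, "cards?page=2"))   # https://xyz.com/cards?page=2

# Plain concatenation depends on whether the href starts with a slash.
print("xyz.com" + "cards?page=2")              # xyz.comcards?page=2
```

Printing the href returned by `.get("href")` on each iteration would show which of these cases applies to the real site.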

The expected result looks something like this:

Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman

Dragonstalker P/T: 3/3 Creature - Bird Soldier

The next URL is: xyz.com/......

Moving to page : 2

--------------------------------------------- end of crawling the first page

Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman

Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman

Dragonstalker P/T: 3/3 Creature - Bird Soldier

The next URL is: xyz.com/......

Moving to page : 3

After getting this information from the manually supplied web page, it should continue with the next page, which is stored in the variable Url inside the loop. Instead, it keeps crawling the same page over and over. The counter works fine, since it counts the number of crawled pages, but the Url variable does not seem to change its value.
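To see whether the URL really changes between iterations, the page-following logic can be separated from the scraping and tried offline. This is a sketch under stated assumptions, not a fix for the program above: `crawl`, `fetch_page`, `find_next_href` and `fake_site` are hypothetical names. In the real crawler, `fetch_page` would wrap the `uReq(...).read()` call and `find_next_href` the `page_soup.findAll("li", {"class": "next"})` lookup.

```python
from urllib.parse import urljoin


def crawl(start_url, fetch_page, find_next_href, max_pages=100):
    """Follow "next" links from start_url and return the URLs visited, in order.

    fetch_page(url) -> html and find_next_href(html) -> href or None are
    hypothetical helpers supplied by the caller.
    """
    visited = []
    url = start_url
    while url is not None and len(visited) < max_pages:
        if url in visited:
            # The "next" link led back to a page already crawled; stop
            # instead of scraping the same page again.
            break
        visited.append(url)
        html = fetch_page(url)
        href = find_next_href(html)
        # No "next" link means this was the last page.
        url = urljoin(url, href) if href else None
    return visited


# Tiny in-memory stand-in for the site, so the loop can be tried offline.
# Each "page" is reduced to the href of its "next" link.
fake_site = {
    "https://xyz.com/cards?page=1": "/cards?page=2",
    "https://xyz.com/cards?page=2": "/cards?page=3",
    "https://xyz.com/cards?page=3": None,
}
pages = crawl("https://xyz.com/cards?page=1", fake_site.get, lambda href: href)
print(pages)
```

Because the loop is driven by the URL itself rather than by a precomputed page count, it ends as soon as a page has no "next" link, and the `visited` list makes a repeated page visible immediately.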

Python crawler malfunctions (bs4, urlopen)
