Missing values when parsing - PullRequest
1 vote
/ July 12, 2020

I have this simple scraping code.

The problem is that it works on some pages with the same CONTENT but not on others. Why?

[Image: To be scraped]

I have an image here. The underlined location is what I am trying to scrape, and next to it is the HTML code. Note that this scraping process works on pages with the same content and the same HTML as in the image, but on some pages it does not work. I noticed this when I printed the block of HTML where that span should be and searched for it in Sublime Text: the span was not there. So the block is missing on some pages, while on others it is present and the code works.

I hope that is clear. Here is the source code. Take a look and try it, and you'll see what I mean.

from bs4 import BeautifulSoup
import requests

url2 = "https://www.ebay.com.au/itm/Auth-Bell-Ross-Black-PVD-Mens-Wrist-Watch-46MM-BR01-92-S-Box-Docs/383600988865?hash=item59506692c1:g:-0wAAOSw1BBeVzyE"
url = "https://www.ebay.com.au/itm/Bell-Ross-BR03-94-Ceramic-Desert-Type-Chronograph-Automatic-Watch/174344419661?hash=item2897bcb94d:g:7M4AAOSwb~Ve7y0M"
rawdata = requests.get(url2)
soup = BeautifulSoup(rawdata.content,"html.parser")#try xml parser


product_block = soup.find("div",{"id":"CenterPanelInternal"})

#print(product_block)

product_name = product_block.find("h1",class_="it-ttl").text
product_condition = product_block.find("div",class_="condText").text
product_price = product_block.find("span",class_="notranslate").text
product_seller = product_block.find("div",class_="bdg-90").text.replace("\n",'')
product_loc = product_block.find("div",class_="iti-eu-bld-gry")#.text.replace("\n",'')
product_postTo = product_block.find("div",class_="vi-shp-pdg-rt")#.text.replace("\n",'')
product_img = product_block.find("img",class_="img").get("src")

print(product_name)
print(product_condition)
print(product_price)
print(product_seller)
print(product_loc)
print(product_postTo)
print(product_img)

After running the code, this is the result. The location is None because that block of HTML does not exist in what was parsed.

[Image: None result]
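A minimal sketch of how a missing block shows up in code: `find` returns `None` when no tag matches, so chaining `.text` onto it would raise `AttributeError`. The helper name `text_or_none` is hypothetical, and the HTML string is a made-up stand-in for the listing page, not the real eBay markup.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the listing page: it has a title but no location block.
html = '<div id="CenterPanelInternal"><h1 class="it-ttl">Some watch</h1></div>'
soup = BeautifulSoup(html, "html.parser")
product_block = soup.find("div", {"id": "CenterPanelInternal"})


def text_or_none(block, name, class_):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = block.find(name, class_=class_)
    return tag.get_text(strip=True) if tag is not None else None


product_name = text_or_none(product_block, "h1", "it-ttl")
product_loc = text_or_none(product_block, "div", "iti-eu-bld-gry")

print(product_name)  # the title text
print(product_loc)   # None, because no tag with that class was parsed
```

A guard like this only keeps the script from crashing; it does not explain why the tag is absent from the parsed tree.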

Now, after changing the URL to url2, which has the same kind of content (a different page with different data, but the same classes and ids in the HTML), I get the correct result.

[Image: Correct]

Honestly, this is so weird. Please help me :( Am I missing something? Is there something I have not understood? Please let me know. You can copy the links from the code, by the way :) Thank you very much! Thanks!

1 Answer

1 vote
/ July 12, 2020

Change the parser to html5lib:

import requests
from bs4 import BeautifulSoup


url2 = "https://www.ebay.com.au/itm/Bell-Ross-BR03-94-Ceramic-Desert-Type-Chronograph-Automatic-Watch/174344419661?hash=item2897bcb94d:g:7M4AAOSwb~Ve7y0M"

rawdata = requests.get(url2)
soup = BeautifulSoup(rawdata.content, "html5lib")  # <--- change to "html5lib" here


product_block = soup.find("div",{"id":"CenterPanelInternal"})

#print(product_block)

product_name = product_block.find("h1",class_="it-ttl").text
product_condition = product_block.find("div",class_="condText").text
product_price = product_block.find("span",class_="notranslate").text
product_seller = product_block.find("div",class_="bdg-90").text.replace("\n",'')
product_loc = product_block.find("div",class_="iti-eu-bld-gry")#.text.replace("\n",'')
product_postTo = product_block.find("div",class_="vi-shp-pdg-rt")#.text.replace("\n",'')
product_img = product_block.find("img",class_="img").get("src")

print(product_name)
print(product_condition)
print(product_price)
print(product_seller)
print('-' * 80)
print(product_loc)
print('-' * 80)
print(product_postTo)
print('-' * 80)
print(product_img)

Prints:

...

--------------------------------------------------------------------------------
<div class="iti-eu-bld-gry">
            <span itemprop="availableAtOrFrom">Melbourne, Victoria, Australia</span>
        </div>
--------------------------------------------------------------------------------
<div class="iti-eu-bld-gry vi-shp-pdg-rt" id="vi-acc-shpsToLbl-cnt">
            <span itemprop="areaServed">
            Worldwide</span>
        </div>
--------------------------------------------------------------------------------

...
...
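
The answer does not say why the parser matters. A likely explanation, offered here as an assumption rather than something confirmed for this eBay page, is that the page's HTML is not perfectly valid and each parser recovers from the errors differently, so a tag that one parser drops can survive in another. The sketch below feeds the same deliberately invalid snippet to whichever parsers are installed; `parse_with` is a hypothetical helper name, and `html5lib` and `lxml` are separate packages that have to be installed before Beautiful Soup can use them.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup: a stray closing </p> inside an unclosed <a>.
markup = "<a></p>"


def parse_with(parser_name, text):
    """Return the parsed tree as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(text, parser_name))
    except FeatureNotFound:
        return None


# Parse the same text with each parser and keep the resulting trees.
results = {
    name: parse_with(name, markup)
    for name in ("html.parser", "html5lib", "lxml")
}

for name, tree in results.items():
    print(f"{name:12} -> {tree if tree is not None else 'not installed'}")
```

With `html.parser` the stray `</p>` is simply ignored, while `html5lib` rebuilds the document the way a browser would, which can change which tags end up in the tree. Running the same comparison on the real page (for example, checking `soup.find("div", class_="iti-eu-bld-gry")` under each parser) would show whether that is what happens here.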