BeautifulSoup returns None even though the element definitely exists
0 votes
/ June 17, 2020

I have been trying to scrape Amazon.com with the Python requests and BeautifulSoup libraries, but I ran into a problem. I know I could use Selenium; I tried it and it worked. I am still curious why this happens and whether there is a fix. Here is my code:

import requests
from bs4 import BeautifulSoup

# Searching python on Amazon
url = "https://www.amazon.com/s?k=Python"
# Deceiving Amazon that I am trying to reach them from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
}

page = requests.get(url, headers=headers)

soup = BeautifulSoup(page.content, "html.parser")
# Trying to get the element I need but prints "None"
print(soup.find("div", class_="s-main-slot s-result-list s-search-results sg-row"))

Thanks in advance.

Answers [ 3 ]

3 votes
/ June 17, 2020

A working solution using requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup as bs

headers = {
    'authority': 'www.amazon.com',
    'cache-control': 'max-age=0',
    'rtt': '300',
    'downlink': '1.35',
    'ect': '3g',
    'sec-ch-ua': '"Google Chrome"; v="83"',
    'sec-ch-ua-mobile': '?0',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'cookie': 'aws-priv=eyJ2IjoxLCJldSI6MCwic3QiOjB9; session-id=139-7350741-1081713; ubid-main=135-9894765-6184621; lc-main=en_US; s_fid=0A4730DDD06B62E4-1DB478AB62143F35; regStatus=pre-register; x-main=hd2N9IEBuVL7il1dbkhEEHTQSf4Q7uviwjc2eikr0hRGGOyI2RYIiRsk3GvDKLSx; at-main=Atza|IwEBIJdoAZ4Y6j2IIGvC29t1ha634aK-p2kAl8rHhQRCSGMSU_nwQvM6fakAbYEjpVLPU4Jj0TwKvX70d6QnlouKPh0QwpHJG8rHUNVb-gmhS9shHM8fCJk45r1XW2FOSpLoM1iAO9kYIpOoW2M5We9xfdqlLuQBB-D5fQeO5Vqew4RnHesPNZuF4DQNlcqL7wrGjDY1JQKzlzARfATAuwaCy4jMD5bNmxpcWtTgNGrTtLpGv1Y-4Mnx2axxQYFgwpRNv_sPNZrMAfHdU7MX67HbyPyV3V21KAl8QNl0xE-lNl3myxnfyWH68Z5D-j501S7HWzkKxopy3SfGuwwZTjSVSVlnH4RmTwvEnW8W3tndcX6X1ETysYYXmO7TudIjtq7aUZqPBJe_MViePcWL3OV4q2b5; sess-at-main="TjcvTeXAA2dP6HOMGcG/n+Cdkr+peDBlNMOvfBz6oE0="; sst-main=Sst1|PQGR5AF9x4yS-iMft3B9aBzJC8v-e4M1kmB_3KS0pxtVTj1cH8hl3fajgigt6xEYhan-kUJuY5KNbteBgbiyDIRCs4ISve5MdRhDdoy7XKrVD1g5McZTyvdwYLfbTJbTUov51hOyPcE8BKpFL1bGpJiiJbZ0TV7Pyc6tkndogjneZATDErc4U08WE4LwPJxCiF-I-7Av4-JEfwH1ZQ81mz6rqy-K1o6bCMRRZ8kWuzrl0wobKsr4Sz0-m1K0waguIewhXNm4V4DLe8mn-_6I8_k9p9v3NiFRpp04v0Ptzw8V1ARo2U18t5f2nx54EXwHzvzOQlpeBVY2U0WpXDcKsU3C8Q; session-id-time=2082787201l; i18n-prefs=USD; x-wl-uid=1MwJyD7dRnGiVdHw1PKiwmoNP9S/0xy+3KAKCJl2fM5VOthLzEW3dzyeW4zdKAepcIxkXpJFkxWcafUXXcS0MeSyLyFoBkl3xnNPLiRK0Rq33AHw0gL3W1FDBUn9OcakOzJGVGKZRc5E=; s_vn=1614974634531%26vn%3D4; s_nr=1590823888871-Repeat; s_vnum=2022823888872%26vn%3D1; s_dslv=1590823888874; sp-cdn="L5Z9:FR"; session-token=3AIPjoIrP8ITt1e/KXLZGSlnOPpirrWotNpCpCEfNRCY9mCfAV169URMcAX8XECtxt/qJujUn66Oyz8KIFDMieNmSdzEKA0K8I4AqbzplslzVGtZ6rNg+XsX/Bdc3hxnB7tUqQhrbrtVUncdzUMN1c95vhL7p+AEog3iiDkhLch0VO+Sl8HkAdZ/63xrp0stAaUsYo1GgsOFGI8+3wJUp4CHrJnoj/0lqjCJCpgXTZfxJcfWy9KarcGAPkno+fuMQqMoShJdi8R+DZ9XmIMib1bsLwXnerZa; csm-hit=tb:GVY0F2K4G05TXW59KB9M+s-GVY0F2K4G05TXW59KB9M|1592424615451&t:1592424615452&adb:adblk_yes',
}

params = (
    ('k', 'Python'),
    ('ref', 'nb_sb_noss'),
)

response = requests.get('https://www.amazon.com/s', headers=headers, params=params)
soup = bs(response.text,'lxml')
print(soup.find('div',class_='s-main-slot s-result-list s-search-results sg-row'))
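
Before swapping headers or parsers, it can help to check whether the class name appears in the raw response text at all. If it does not, the server returned a different page and no parser will find the element. The following is only a sketch: the helper name `diagnose` and the two sample strings are invented for illustration and stand in for `response.status_code` and `response.text`.

```python
def diagnose(status_code, html, marker="s-main-slot"):
    """Classify a response before handing it to a parser.

    `marker` is a distinctive fragment of the class name being searched for.
    """
    if status_code != 200:
        return f"request failed with HTTP {status_code}"
    if marker not in html:
        return "marker missing from raw HTML: the server sent a different page"
    return "marker present in raw HTML: look at the parser or the selector"


# Hypothetical response bodies, not real Amazon output.
blocked = "<html><body><p>Please verify you are a human.</p></body></html>"
normal = '<html><body><div class="s-main-slot s-result-list"></div></body></html>'

print(diagnose(503, ""))
print(diagnose(200, blocked))
print(diagnose(200, normal))
```

With a live request you would call `diagnose(response.status_code, response.text)` right after `requests.get(...)`.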

0 votes
/ June 17, 2020

Change the parser to lxml and it should work.

url = "https://www.amazon.com/s?k=Python"
# Deceiving Amazon that I am trying to reach them from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'
}

page = requests.get(url, headers=headers)

soup = BeautifulSoup(page.content, "lxml")
# With the lxml parser this printed the element for me (see the output below)
print(soup.find("div", class_="s-main-slot s-result-list s-search-results sg-row"))

Output on my console:

<div class="s-main-slot s-result-list s-search-results sg-row">
<div class="sg-col-20-of-24 s-result-item s-asin sg-col-0-of-12 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-12-of-16 sg-col-24-of-28" data-asin="1593279280" data-component-type="s-search-result" data-index="0" data-uuid="ae6080d7-b07e-4558-b38f-613931584787"><div class="sg-col-inner">
<span cel_widget_id="MAIN-SEARCH_RESULTS" class="celwidget slot=MAIN template=SEARCH_RESULTS widgetId=search-results">
<div class="s-include-content-margin s-border-bottom s-latency-cf-section">
<div class="a-section a-spacing-medium">
<div class="sg-row">
<div class="a-section a-spacing-micro s-min-height-small">
<a class="a-link-normal" href="/gp/bestsellers/books/285856/ref=sr_bs_0_285856_1">
<span class="rush-component" data-component-props='{"badgeType":"best-seller","asin":"1593279280"}' data-component-type="s-status-badge-component">
<div class="a-row a-badge-region"><span aria-labelledby="1593279280-best-seller-label 1593279280-best-seller-supplementary" class="a-badge" data-a-badge-supplementary-position="right" data-a-badge-type="status" id="1593279280-best-seller" tabindex="0"><span aria-hidden="true" class="a-badge-label" data-a-badge-color="sx-orange" id="1593279280-best-seller-label"><span class="a-badge-label-inner a-text-ellipsis">
<span class="a-badge-text" data-a-badge-color="sx-cloud">Best Seller</span>
</span></span><span aria-hidden="true" class="a-badge-supplementary-text a-text-ellipsis" id="1593279280-best-seller-supplementary">in Python Programming</span></span></div>
</span>
</a>
</div>
</div>
<div class="sg-row">
<div class="sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32"><div class="sg-col-inner">
<div class="a-section a-spacing-none">
<span class="rush-component" data-component-type="s-product-image">
<a class="a-link-normal" href="/Python-Crash-Course-2nd-Edition/dp/1593279280/ref=sr_1_1?dchild=1&amp;keywords=Python&amp;qid=1592423942&amp;sr=8-1">
<div class="a-section aok-relative s-image-fixed-height">
<img alt="Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming" class="s-image" data-image-index="0" data-image-latency="s-product-image" data-image-load="" data-image-source-density="1" src="https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY218_.jpg" srcset="https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY218_.jpg 1x, https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY327_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY436_QL65_.jpg 2x, https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY545_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/81f8XACISAL._AC_UY654_QL65_.jpg 3x"/>
</div>
</a>
</span>
</div>
</div></div>
<div class="sg-col-4-of-12 sg-col-8-of-16 sg-col-16-of-24 sg-col-12-of-20 sg-col-24-of-32 sg-col sg-col-28-of-36 sg-col-20-of-28"><div class="sg-col-inner">
<div class="sg-row">
<div class="sg-col-4-of-12 sg-col-8-of-16 sg-col-12-of-32 sg-col-12-of-20 sg-col-12-of-36 sg-col sg-col-12-of-24 sg-col-12-of-28"><div class="sg-col-inner">
<div class="a-section a-spacing-none">
<h2 class="a-size-mini a-spacing-none a-color-base s-line-clamp-2">
<a class="a-link-normal a-text-normal" href="/Python-Crash-Course-2nd-Edition/dp/1593279280/ref=sr_1_1?dchild=1&amp;keywords=Python&amp;qid=1592423942&amp;sr=8-1">
<span class="a-size-medium a-color-base a-text-normal" dir="auto">Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming</span>
</a>
</h2>
<div class="a-row a-size-base a-color-secondary"><span class="a-size-base" dir="auto">by </span>
<a class="a-size-base a-link-normal" href="/Eric-Matthes/e/B01DPU378I?ref=sr_ntt_srch_lnk_1&amp;qid=1592423942&amp;sr=8-1">



            Eric Matthes


</a>
<span class="a-letter-space"></span><span class="a-size-base a-color-secondary" dir="auto"> | </span><span class="a-letter-space"></span><span class="a-size-base a-color-secondary a-text-normal" dir="auto">May 3, 2019</span></div>
</div>
<div class="a-section a-spacing-none a-spacing-top-micro">
<div class="a-row a-size-small">
<span aria-label="4.6 out of 5 stars">
<span class="a-declarative" data-a-popover='{"max-width":"700","closeButton":false,"position":"triggerBottom","url":"/review/widgets/average-customer-review/popover/ref=acr_search__popover?ie=UTF8&amp;asin=1593279280&amp;ref=acr_search__popover&amp;contextId=search"}' data-action="a-popover">
<a class="a-popover-trigger a-declarative" href="javascript:void(0)"><i class="a-icon a-icon-star-small a-star-small-4-5 aok-align-bottom"><span class="a-icon-alt">4.6 out of 5 stars</span></i><i class="a-icon a-icon-popover"></i></a>
</span>
</span>
<span aria-label="555">
<a class="a-link-normal" href="/Python-Crash-Course-2nd-Edition/dp/1593279280/ref=sr_1_1?dchild=1&amp;keywords=Python&amp;qid=1592423942&amp;sr=8-1#customerReviews">
<span class="a-size-base" dir="auto">555</span>
</a>
</span>
</div>
</div>
</div></div>
</div>
<div class="sg-row">
<div class="sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32"><div class="sg-col-inner">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
<a class="a-size-base a-link-normal a-text-bold" href="/Python-Crash-Course-2nd-Edition/dp/1593279280/ref=sr_1_1?dchild=1&amp;keywords=Python&amp;qid=1592423942&amp;sr=8-1">



            Paperback


</a>
</div><div class="a-row a-size-base a-color-base"><div class="a-row">
<a class="a-size-base a-link-normal a-text-normal" href="/Python-Crash-Course-2nd-Edition/dp/1593279280/ref=sr_1_1?dchild=1&amp;keywords=Python&amp;qid=1592423942&amp;sr=8-1">
<span class="a-price" data-a-color="base" data-a-size="l"><span class="a-offscreen">$22.99</span><span aria-hidden="true"><span class="a-price-symbol">$</span><span class="a-price-whole">22<span class="a-price-decimal">.</span></span><span class="a-price-fraction">99</span></span></span>
<span class="a-price a-text-price" data-a-color="secondary" data-a-size="b" data-a-strike="true"><span class="a-offscreen">$39.95</span><span aria-hidden="true">$39.95</span></span>
</a>
</div></div><div class="a-row a-size-small a-color-secondary"><span dir="auto">Get 3 for the price of 2</span></div>
</div>
<div class="a-section a-spacing-none a-spacing-top-micro">
<div class="a-row a-size-base a-color-secondary s-align-children-center"><span class="a-size-small a-color-secondary" dir="auto">Ships to United Kingdom</span></div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span class="a-size-base a-color-secondary" dir="auto">More Buying Choices</span><br/><span class="a-color-base" dir="auto">$22.82</span><span class="a-letter-space"></span>
<a class="a-link-normal" href="/gp/offer-listing/1593279280/ref=sr_1_1?keywords=Python&amp;qid=1592423942&amp;sr=8-1&amp;dchild=1">



            (39 used &amp; new offers)


</a>
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row"><div class="a-row a-spacing-mini"><hr aria-hidden="true" class="a-spacing-mini a-divider-normal"/><div class="a-row a-size-base a-color-base">
<a class="a-size-base a-link-normal a-text-bold" href="/Python-Crash-Course-Eric-Matthes-ebook/dp/B07J4521M3/ref=sr_1_1?keywords=Python&amp;qid=1592423942&amp;sr=8-1">



            Kindle


</a>
</div><div class="a-row a-size-base a-color-base"><div class="a-row">
<a class="a-size-base a-link-normal a-text-normal" href="/Python-Crash-Course-Eric-Matthes-ebook/dp/B07J4521M3/ref=sr_1_1?keywords=Python&amp;qid=1592423942&amp;sr=8-1">
<span class="a-price" data-a-color="base" data-a-size="l"><span class="a-offscreen">$23.99</span><span aria-hidden="true"><span class="a-price-symbol">$</span><span class="a-price-whole">23<span class="a-price-decimal">.</span></span><span class="a-price-fraction">99</span></span></span>
<span class="a-price a-text-price" data-a-color="secondary" data-a-size="b" data-a-strike="true"><span class="a-offscreen">$39.95</span><span aria-hidden="true">$39.95</span></span>
</a>
</div></div></div></div>
</div>
</div></div>
<div class="sg-col-4-of-12 sg-col-8-of-28 sg-col-4-of-16 sg-col-8-of-32 sg-col sg-col-8-of-20 sg-col-8-of-36 sg-col-8-of-24"><div class="sg-col-inner">
</div></div>
</div>
<div class="sg-row">
<div class="sg-col-20-of-24 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-8-of-12 sg-col-12-of-16 sg-col-24-of-28"><div class="sg-col-inner">
</div></div>
</div>
<div class="sg-row">
<div class="sg-col-20-of-24 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-8-of-12 sg-col-12-of-16 sg-col-24-of-28"><div class="sg-col-inner">
</div></div>
</div>
</div></div>
</div>
</div>
</div>
</span>
</div></div>
<div class="sg-col-20-of-24 s-result-item s-asin sg-col-0-of-12 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-12-of-16 sg-col-24-of-28" data-asin="1449355730" data-component-type="s-search-result" data-index="1" data-uuid="047b9c10-2a93-4895-97f7-83778651c3f6"><div class="sg-col-inner">
<span cel_widget_id="MAIN-SEARCH_RESULTS" class="celwidget slot=MAIN template=SEARCH_RESULTS widgetId=search-results">
<div class="s-include-content-margin s-border-bottom s-latency-cf-section">
<div class="a-section a-spacing-medium">
<div class="sg-row">
<div class="a-section a-spacing-micro s-min-height-small">
<a class="a-link-normal" href="/gp/bestsellers/books/132561011/ref=sr_bs_1_132561011_1">
<span class="rush-component" data-component-props='{"badgeType":"best-seller","asin":"1449355730"}' data-component-type="s-status-badge-component">
<div class="a-row a-badge-region"><span aria-labelledby="1449355730-best-seller-label 1449355730-best-seller-supplementary" class="a-badge" data-a-badge-supplementary-position="right" data-a-badge-type="status" id="1449355730-best-seller" tabindex="0"><span aria-hidden="true" class="a-badge-label" data-a-badge-color="sx-orange" id="1449355730-best-seller-label"><span class="a-badge-label-inner a-text-ellipsis">
<span class="a-badge-text" data-a-badge-color="sx-cloud">Best Seller</span>
</span></span><span aria-hidden="true" class="a-badge-supplementary-text a-text-ellipsis" id="1449355730-best-seller-supplementary">in Functional Software Programming</span></span></div>
</span>
</a>
</div>
</div>
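
Once the container is found, the individual results can be read from the attributes visible in the output above (`data-component-type="s-search-result"`, `data-asin`, the `h2` title and the `a-offscreen` price). This is a rough sketch against a hand-trimmed sample, not a live response: the function name `extract_results` is made up, and the second item's title and link are placeholders because the output above is cut off before them.

```python
from bs4 import BeautifulSoup

# Hand-trimmed sample shaped like the console output above.
sample = """
<div class="s-main-slot s-result-list s-search-results sg-row">
  <div class="s-result-item s-asin" data-asin="1593279280" data-component-type="s-search-result">
    <h2 class="a-size-mini"><a class="a-link-normal" href="/Python-Crash-Course-2nd-Edition/dp/1593279280/">
      <span class="a-size-medium">Python Crash Course, 2nd Edition</span>
    </a></h2>
    <span class="a-price"><span class="a-offscreen">$22.99</span></span>
  </div>
  <div class="s-result-item s-asin" data-asin="1449355730" data-component-type="s-search-result">
    <h2 class="a-size-mini"><a class="a-link-normal" href="/placeholder/">
      <span class="a-size-medium">Placeholder title</span>
    </a></h2>
  </div>
</div>
"""


def extract_results(html):
    """Return one dict per search result found in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for item in soup.find_all("div", attrs={"data-component-type": "s-search-result"}):
        title = item.find("h2")
        price = item.find("span", class_="a-offscreen")
        results.append({
            "asin": item.get("data-asin"),
            # Guard against missing pieces: not every result has a price.
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return results


for row in extract_results(sample):
    print(row)
```

Selecting on `data-component-type` rather than on the long `sg-col-...` class list keeps the lookup independent of layout classes, which may change between responses.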

0 votes
/ June 17, 2020

An alternative using Selenium with Python also solves the problem.

With selenium.webdriver, a real browser loads the page for you. The example below uses the Google Chrome webdriver.

You then get the HTML of the results page from driver.page_source.

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
from bs4 import BeautifulSoup as Soup

options = ChromeOptions()
options.add_argument("headless") # to hide window in 'background'
driver = Chrome(options=options)

driver.get("https://www.amazon.com/s?k=Python")
html = driver.page_source
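soup = Soup(html, "lxml")  # name the parser explicitly; bs4 warns when none is given
print(soup.find("div", class_="s-main-slot s-result-list s-search-results sg-row"))
driver.quit()  # close the headless browser when done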

Output:

<div class="s-main-slot s-result-list s-search-results sg-row">
<div class="sg-col-20-of-24 s-result-item s-asin sg-col-0-of-12 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-12-of-16 sg-col-24-of-28" data-asin="1593279280" data-component-id="6" data-component-type="s-search-result" data-index="0" data-uuid="c5f5837a-1f2e-4243-a520-a1936aac014e"><div class="sg-col-inner">
... etc.

See the Selenium Python installation instructions.
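
A side note on the lookup itself, independent of how the page is fetched: when `class_` is given a string with several class names, BeautifulSoup compares it with the exact value of the `class` attribute, order included. If the site ever reorders or adds classes, `find` returns None even though the element is there. A small offline sketch with a made-up one-line document; a CSS selector via `select_one` is one possible alternative, not the only one.

```python
from bs4 import BeautifulSoup

# Made-up document: same classes as the real container, different order.
html = '<div class="sg-row s-main-slot s-result-list">results</div>'
soup = BeautifulSoup(html, "html.parser")

# Multi-class string: must equal the attribute value exactly, so this misses.
exact = soup.find("div", class_="s-main-slot s-result-list sg-row")

# A single class name matches if it is one of the element's classes.
single = soup.find("div", class_="s-main-slot")

# A CSS selector requires both classes but ignores their order.
css = soup.select_one("div.s-main-slot.s-result-list")

print(exact)
print(single.get_text())
print(css.get_text())
```

The first print shows None; the other two find the element.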

...