Scraping Google search result titles and URLs with Python - PullRequest
0 votes
/ May 31, 2019

I'm working on a project using Python (3.7) in which I need to scrape the titles and URLs of the first few Google results. I tried doing it with BeautifulSoup, but it doesn't work.

Here's what I tried:

import requests
from my_fake_useragent import UserAgent
from bs4 import BeautifulSoup

ua = UserAgent()

google_url = "https://www.google.com/search?q=python" + "&num=" + str(5)
response = requests.get(google_url, {"User-Agent": ua.random})
soup = BeautifulSoup(response.text, "html.parser")

result_div = soup.find_all('div', attrs={'class': 'g'})

links = []
titles = []
descriptions = []
for r in result_div:
    # Checks if each element is present, else, raise exception
    try:
        link = r.find('a', href=True)
        title = r.find('h3', attrs={'class': 'r'}).get_text()
        description = r.find('span', attrs={'class': 'st'}).get_text()

        # Check to make sure everything is present before appending
        if link != '' and title != '' and description != '':
            links.append(link['href'])
            titles.append(title)
            descriptions.append(description)
    # Next loop if one element is not present
    except:
        continue

print(titles)

But this returns nothing.
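
One likely contributor is worth noting: the second positional argument of requests.get is params, so the dictionary in the call above is sent as a query-string parameter rather than as a request header. A minimal sketch of the difference, built offline with requests.Request(...).prepare() so that no network call is made; the agent string "ExampleAgent" is a placeholder, not a real browser identifier:

```python
import requests

google_url = "https://www.google.com/search?q=python&num=5"
agent = {"User-Agent": "ExampleAgent"}  # placeholder value for illustration

# What the positional call effectively builds: the dict lands in the query string.
as_params = requests.Request("GET", google_url, params=agent).prepare()

# Passing the dict by keyword puts it in the request headers instead.
as_headers = requests.Request("GET", google_url, headers=agent).prepare()

print(as_params.url)
print(as_headers.url)
print(as_headers.headers["User-Agent"])
```

In a real request this would be requests.get(google_url, headers={...}). Even with a proper header, Google may still serve markup whose class names differ from 'g', 'r' and 'st', so the selectors are not guaranteed to match.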

When I try to fetch the HTML like this:

url = 'https://google.com/search?q=python'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.prettify())

this is what it returns (a sample of the returned HTML is included):

<div id="main">
   <div class="ZINbbc xpd O9g5cc uUPGi">
    <div>
     <div class="jfp3ef">
      <a href="/url?q=https://www.python.org/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQFjAAegQIBxAB&amp;usg=AOvVaw0nCy-teBd7nOrThY5YGQ4o">
       <div class="BNeawe vvjwJb AP7Wnd">
        Python.org
       </div>
       <div class="BNeawe UPmit AP7Wnd">
        https://www.python.org
       </div>
      </a>
     </div>
     <div class="NJM3tb">
     </div>
     <div class="jfp3ef">
      <div>
       <div class="BNeawe s3v9rd AP7Wnd">
        <div>
         <div>
          <div class="Ap5OSd">
           <div class="BNeawe s3v9rd AP7Wnd">
            The official home of the Python Programming Language.
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/downloads/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwAXoECAcQAw&amp;usg=AOvVaw0TKe6ApGOQcWuHcXIkvAT0">
              <span class="XLloXe AP7Wnd">
               Download Python
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/about/gettingstarted/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwAnoECAcQBQ&amp;usg=AOvVaw03o9Qt-KFSbwECm8-wmUZS">
              <span class="XLloXe AP7Wnd">
               Python For Beginners
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/doc/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwA3oECAcQBw&amp;usg=AOvVaw3Yz3mO8HXGJoaf35qhyb3V">
              <span class="XLloXe AP7Wnd">
               Documentation
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://docs.python.org/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBHoECAcQCQ&amp;usg=AOvVaw0nY6NyZm0wErJJ1RIgTiPm">
              <span class="XLloXe AP7Wnd">
               Python Docs
              </span>
             </a>
            </span>
           </div>
          </div>
          <div class="v9i61e">
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/psf/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBXoECAcQCw&amp;usg=AOvVaw3HoEDHmdRBcufXuwakPCAz">
              <span class="XLloXe AP7Wnd">
               Python Software Foundation
              </span>
             </a>
            </span>
           </div>
          </div>
          <div>
           <div class="BNeawe s3v9rd AP7Wnd">
            <span class="BNeawe">
             <a href="/url?q=https://www.python.org/downloads/release/python-373/&amp;sa=U&amp;ved=2ahUKEwiCrK7AvsXiAhWxq1kKHTknCuoQjBAwBnoECAcQDQ&amp;usg=AOvVaw3HsJpvpsCvYikd_mP7ndN3">
              <span class="XLloXe AP7Wnd">
               Python 3.7.3
              </span>
             </a>
            </span>
           </div>
          </div>
         </div>
        </div>
       </div>
      </div>
     </div>
    </div>
   </div>
</div>
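
Note that the links in this markup are not direct: each href has the form /url?q=<target>&sa=..., so the destination has to be read from the q parameter. A minimal sketch using only the standard library, assuming the redirect format stays as shown in the sample; extract_target is a hypothetical helper name, and the ved/usg values are shortened placeholders:

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the destination of a '/url?q=...' redirect href, or None."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None


# href as an HTML parser would return it (with &amp; already decoded to &)
href = "/url?q=https://www.python.org/&sa=U&ved=PLACEHOLDER&usg=PLACEHOLDER"
print(extract_target(href))
print(extract_target("/search?q=python"))
```

The same function could be applied to every a tag with an href found in the parsed page, keeping only the entries for which it does not return None.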

1 Answer

2 votes
/ May 31, 2019

It's worth trying the Selenium automation library. It lets you scrape pages whose content is rendered dynamically (via JavaScript or AJAX).

from selenium import webdriver
from bs4 import BeautifulSoup
import time
from bs4.element import Tag

driver = webdriver.Chrome('/usr/bin/chromedriver')
google_url = "https://www.google.com/search?q=python" + "&num=" + str(5)
driver.get(google_url)
time.sleep(3)

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()  # close the browser once the HTML has been captured
result_div = soup.find_all('div', attrs={'class': 'g'})


links = []
titles = []
descriptions = []
for r in result_div:
    # Look up each element; find() returns None when nothing matches
    try:
        link = r.find('a', href=True)
        title = r.find('h3')

        if isinstance(title, Tag):
            title = title.get_text()

        description = r.find('span', attrs={'class': 'st'})

        if isinstance(description, Tag):
            description = description.get_text()

        # Note: a missing element is None, not '', so this check does not
        # filter out missing titles or descriptions
        if link != '' and title != '' and description != '':
            links.append(link['href'])
            titles.append(title)
            descriptions.append(description)
    # Report the error and move on, e.g. when link is None and link['href'] fails
    except Exception as e:
        print(e)
        continue

print(titles)
print(links)
print(descriptions)

Output:

['Welcome to Python.org', 'Download Python | Python.org', 'Python Tutorial - W3Schools', 'Introduction to Python - W3Schools', 'Python Programming Language - GeeksforGeeks', 'Python: 7 Important Reasons Why You Should Use Python - Medium', 'Python: 7 Important Reasons Why You Should Use Python - Medium', 'Python Tutorial - Tutorialspoint', 'Python Download and Installation Instructions', 'Python vs C++ - Find Out The 9 Important Differences - eduCBA', None, 'Description']
['https://www.python.org/', 'https://www.python.org/downloads/', 'https://www.w3schools.com/python/', 'https://www.w3schools.com/python/python_intro.asp', 'https://www.geeksforgeeks.org/python-programming-language/', 'https://medium.com/@mindfiresolutions.usa/python-7-important-reasons-why-you-should-use-python-5801a98a0d0b', 'https://medium.com/@mindfiresolutions.usa/python-7-important-reasons-why-you-should-use-python-5801a98a0d0b', 'https://www.tutorialspoint.com/python/', 'https://www.ics.uci.edu/~pattis/common/handouts/pythoneclipsejava/python.html', 'https://www.educba.com/python-vs-c-plus-plus/', '/search?num=5&q=Python&stick=H4sIAAAAAAAAAONgFuLQz9U3MK0yjFeCs7SEs5Ot9JPzc3Pz86yKM1NSyxMri1cxsqVZOQZ4Fi9iZQuoLMnIzwMAlVPV1j0AAAA&sa=X&ved=2ahUKEwigvcqKx8XiAhUOSX0KHdtmBgoQzTooADAQegQIChAC', 'mailto:?body=Python%20https%3A%2F%2Fwww.google.com%2Fsearch%3Fkgmid%3D%2Fm%2F05z1_%26hl%3Den-IN%26kgs%3De1764a9f31831e11%26q%3DPython%26shndl%3D0%26source%3Dsh%2Fx%2Fkp%26entrypoint%3Dsh%2Fx%2Fkp']
['The official home of the Python Programming Language.', 'Looking for Python 2.7? See below for specific releases. Contribute to the PSF by Purchasing a PyCharm License. All proceeds benefit the PSF. Donate Now\xa0...', 'Python can be used on a server to create web applications. ... Our "Show Python" tool makes it easy to learn Python, it shows both the code and the result.', 'What is Python? Python is a popular programming language. It was created by Guido van Rossum, and released in 1991. It is used for: web development\xa0...', 'Python is a widely used general-purpose, high level programming language. It was initially designed by Guido van Rossum in 1991 and developed by Python\xa0...', None, None, None, None, None, None, None]
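
The lists above end with entries that are not search results (a None title, a /search?... link and a mailto: link). That happens because find() returns None rather than '', so the `!= ''` check never rejects anything. A minimal sketch of a stricter filter, assuming that a usable result needs a non-empty title and an absolute http(s) link; is_result is a hypothetical helper name:

```python
def is_result(title, link):
    """Keep an entry only if it has a title and an absolute http(s) link."""
    return bool(title) and bool(link) and link.startswith(("http://", "https://"))


# A few (title, link) pairs taken from the output above; long links are shortened
pairs = [
    ("Welcome to Python.org", "https://www.python.org/"),
    (None, "/search?num=5&q=Python"),
    ("Description", "mailto:?body=Python"),
]

kept = [(title, link) for title, link in pairs if is_result(title, link)]
print(kept)
```

Descriptions are deliberately left out of the check, since the output shows that several genuine results have no description.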

Here '/usr/bin/chromedriver' is the path to the Selenium web driver.

Download the Selenium web driver for the Chrome browser:

http://chromedriver.chromium.org/downloads

Install the web driver for the Chrome browser:

https://christopher.su/2015/selenium-chromedriver-ubuntu/

Selenium tutorial:

https://selenium-python.readthedocs.io/

...