Scraping user ratings of The Hunger Games with Selenium and BeautifulSoup
0 votes
/ February 9, 2019

I'm trying to scrape all the user ratings (out of 5) for the first Hunger Games book from goodreads.com. The biggest problem is that there are several pages of reviews, but the URL doesn't change when a different page of reviews is displayed. That's why I use Selenium to navigate and pick up each new batch of ratings.

You can see my code below:

# initiating the chromedriver
path_to_chromedriver = r'./chromedriver.exe'

#launch url
url = "https://www.goodreads.com/book/show/2767052-the-hunger-games"

# create a new Chrome session
driver = webdriver.Chrome(executable_path=path_to_chromedriver)
driver.implicitly_wait(30)
driver.get(url)

# initiating the beautifulsoup
soup_1=BeautifulSoup(driver.page_source, 'lxml')

# finding the table that includes all the book reviews
user = soup_1.find('div', {'id': 'bookReviews'})

# finding all the individual ratings from that table
user = user.find_all('div',{'class':'friendReviews elementListBrown'})

# locating the next button on the page which is indicated with 'next »'
elm = driver.find_element_by_partial_link_text('next »')


for i in range(9): # since there are 10 pages of reviews

    for row in user: # finding for each separate rating

        rating = {}
        try: # try and except is needed because not all the users have a rating
            rating['name'] = row.find('a',{'class': 'user'}).text # grabbing the username
            rating['rating'] = row.find('span',{'class':'staticStars'})['title'] # grabbing user rating out of 5

            ratings.append(rating)

        except:

            pass


    elm.click() # clicking on the next button to scrape the other page

df_rev = pd.DataFrame(ratings) # merging all the results to build a data frame
df_rev

In the end I want to get every user who rated the book along with their rating. Instead, I get a data frame that contains only the users and ratings from the first page, repeated over and over, each time running from the first user to the last user on that page.

Result:

name    rating
0   Kiki    liked it
1   Saniya  it was amazing
2   Khanh   it was amazing
3   Dija    it was amazing
4   Nataliya    really liked it
5   Jana    did not like it
6   Cecily  it was ok
7   Kiki    liked it
8   Saniya  it was amazing
9   Khanh   it was amazing
10  Dija    it was amazing
11  Nataliya    really liked it
12  Jana    did not like it
13  Cecily  it was ok
14  Kiki    liked it
15  Saniya  it was amazing
16  Khanh   it was amazing
17  Dija    it was amazing
18  Nataliya    really liked it
19  Jana    did not like it
20  Cecily  it was ok
21  Kiki    liked it
22  Saniya  it was amazing
23  Khanh   it was amazing
24  Dija    it was amazing
25  Nataliya    really liked it
26  Jana    did not like it
27  Cecily  it was ok
...

1 Answer

0 votes
/ February 9, 2019

Well, as far as I can see, you never even initialized ratings.

But I made a few small changes, and it seems to work. There are some structural things I would change in your code. Quite a lot, actually. But I don't think that's necessary for answering your question.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import os
import time

import pandas as pd
from bs4 import BeautifulSoup

# Drive letter of the current working directory (Windows-specific)
driveletter = os.getcwd().split(':')[0]

options = Options()
# Raw strings keep the backslashes in the Windows paths literal
options.binary_location = driveletter + r":\PortableApps\GoogleChromePortable\App\Chrome-bin\chrome.exe"
options.add_argument('--headless')
driver = webdriver.Chrome(options=options, executable_path=driveletter + r":\PortableApps\GoogleChromePortable\App\Chrome-bin\chromedriver.exe")

#launch url
url = "https://www.goodreads.com/book/show/2767052-the-hunger-games"

# create a new Chrome session
driver.get(url)

ratings = list()

last_page_source = ''

while True:
    page_changed = False # It's useful to declare whether the page has changed or not
    attempts = 0
    while not page_changed:
        if last_page_source != driver.page_source:
            page_changed = True
        elif attempts > 5:  # Give up after a fixed number of attempts.
            break
        else:
            time.sleep(3)  # Give the new page time to load; the interval could be shorter.
            attempts += 1
    if page_changed:
        soup_1 = BeautifulSoup(driver.page_source, 'lxml')
        user = soup_1.find('div', {'id': 'bookReviews'})
        user = user.find_all('div',{'class':'friendReviews elementListBrown'})

        for row in user: # finding for each separate rating

            rating = {}
            try:
                # try and except is needed because not all the users have a rating
                rating['name'] = row.find('a',{'class': 'user'}).text # grabbing the username
                rating['rating'] = row.find('span',{'class':'staticStars'})['title'] # grabbing user rating out of 5
                ratings.append(rating)

            except (AttributeError, TypeError, KeyError):
                # Skip reviews that have no user link or no star rating
                pass
        last_page_source = driver.page_source
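        # Note: the find_element_by_* helpers were removed in Selenium 4.3; there, use
        # driver.find_element(By.CLASS_NAME, 'next_page') with By from selenium.webdriver.common.by
        next_page_element = driver.find_element_by_class_name('next_page')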
        driver.execute_script("arguments[0].click();", next_page_element) # clicking on the next button to scrape the other page
    else:
        df_rev = pd.DataFrame(ratings) # merging all the results to build a data frame
        print(df_rev.drop_duplicates())
        break

Output:

                                            name           rating
0                                           Kiki         liked it
1                                         Saniya   it was amazing
2    Khanh, first of her name, mother of bunnies   it was amazing
3                                           Dija   it was amazing
4                                       Nataliya  really liked it
5                                           Jana  did not like it
6                                         Cecily        it was ok
7                                Meredith Holley   it was amazing
8                                         Jayson  really liked it
9                               Chelsea Humphrey  really liked it
10                                 Miranda Reads  really liked it
11                                       ~Poppy~  really liked it
12                                        elissa   it was amazing
13                               Colleen Venable  really liked it
14                                         Betsy   it was amazing
15                                     Emily May  really liked it
16                                       Lyndsey   it was amazing
17                                      Morgan F   it was amazing
18                                    Huda Yahya         liked it
19                                Nilesh Kashyap        it was ok
20                                         Buggy   it was amazing
21                                         Tessa         liked it
22                                         Jamie   it was amazing
23                                 Richard Derus  did not like it
24                             Maggie Stiefvater   it was amazing
25                                         karen   it was amazing
26                                         James   it was amazing
27                                           Kai   it was amazing
28                                        Brandi  did not like it
29                                   Will Byrnes         liked it
..                                           ...              ...
263                                       shre ♡   it was amazing
264                                        Diane  really liked it
265                               Margaret Stohl   it was amazing
266                           Athena Shardbearer   it was amazing
267                                       Ashley         liked it
268                                Geo Marcovici   it was amazing
269                                        Pinky   it was amazing
270                                       Mariel  really liked it
271                                          Jim         liked it
272                                  Frannie Pan   it was amazing
273                                        Zanna  really liked it
274                                      Χαρά Ζ.  really liked it
275                     Anzu The Great Destroyer  really liked it
276                                         Beth   it was amazing
277                                        Karla  really liked it
278                                        Carla  did not like it
279                                       Shawna   it was amazing
280                             Susane Colasanti   it was amazing
281                                       Cherie  really liked it
283                                David Firmage         liked it
284                                       Farith   it was amazing
285                              Tony DiTerlizzi   it was amazing
286                                      Christy   it was amazing
287                                      Emerald   it was amazing
288                                       Sandra   it was amazing
289                           Chiara Pagliochini  really liked it
290                                       Argona   it was amazing
291                                      NZLisaM   it was amazing
292                                       Vinaya   it was amazing
293                                    Mac  Ross   it was amazing

[292 rows x 2 columns]
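Since the question asks for ratings out of 5 while the scraped title attribute holds a text label, a small lookup could convert one into the other. This is only a sketch: the mapping below is an assumption based on the five labels that appear in the output above, and RATING_SCORES and to_score are hypothetical names, not part of the original code.

```python
# Assumed mapping from the star-label text to a 1-5 score.
RATING_SCORES = {
    "did not like it": 1,
    "it was ok": 2,
    "liked it": 3,
    "really liked it": 4,
    "it was amazing": 5,
}


def to_score(label):
    """Return the numeric score for a rating label, or None if it is unknown."""
    return RATING_SCORES.get(label.strip().lower())


# A few rows in the same shape as the scraped `ratings` list.
sample = [
    {"name": "Kiki", "rating": "liked it"},
    {"name": "Saniya", "rating": "it was amazing"},
    {"name": "Jana", "rating": "did not like it"},
]

# Add a numeric "score" key to a copy of each row.
scored = [{**row, "score": to_score(row["rating"])} for row in sample]
print(scored)
```

With the data frame from the answer, the same lookup could probably be applied as df_rev['score'] = df_rev['rating'].map(RATING_SCORES).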

Explanation: You initialized your BeautifulSoup object from the page source of the original link. You never updated it after the clicks you made to change that page source.
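To make the stale-snapshot problem concrete, here is a minimal sketch that needs no browser. FakeDriver and parse_names are hypothetical stand-ins invented for this illustration (they are not Selenium or BeautifulSoup APIs), and the two tiny "pages" are made-up HTML.

```python
import re


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, used only for this demo."""

    def __init__(self, pages):
        self._pages = pages
        self._index = 0

    @property
    def page_source(self):
        # Always reflects the page that is currently displayed.
        return self._pages[self._index]

    def click_next(self):
        # Advance to the next page if there is one.
        if self._index < len(self._pages) - 1:
            self._index += 1


def parse_names(html):
    # Toy parser for the demo; the real code uses BeautifulSoup.
    return re.findall(r'<a class="user">(.*?)</a>', html)


pages = [
    '<a class="user">Kiki</a><a class="user">Saniya</a>',
    '<a class="user">Jayson</a><a class="user">Betsy</a>',
]

# Pattern from the question: parse once, then loop over the same snapshot.
driver = FakeDriver(pages)
snapshot = parse_names(driver.page_source)
stale = []
for _ in range(len(pages)):
    stale.extend(snapshot)
    driver.click_next()

# Pattern from the answer: re-read page_source on every iteration.
driver = FakeDriver(pages)
fresh = []
for _ in range(len(pages)):
    fresh.extend(parse_names(driver.page_source))
    driver.click_next()

print(stale)  # ['Kiki', 'Saniya', 'Kiki', 'Saniya']
print(fresh)  # ['Kiki', 'Saniya', 'Jayson', 'Betsy']
```

The first loop repeats the first page's users, which matches the duplicated rows in the question's result.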

Edit: I had to make some changes, since I made mistakes in my initial answer.
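The wait-until-the-page-changes logic in the code above could also be pulled out into a small helper. The following is just a sketch under my own assumptions: wait_for_change, get_source, and the sleep parameter are hypothetical names, and the defaults only roughly mirror the retry count and the 3-second interval used above.

```python
import time


def wait_for_change(get_source, previous, attempts=6, delay=3.0, sleep=time.sleep):
    """Poll get_source() until its result differs from `previous`.

    Returns the new source, or None if nothing changed after `attempts` checks.
    """
    for attempt in range(attempts):
        current = get_source()
        if current != previous:
            return current
        if attempt < attempts - 1:
            sleep(delay)  # Wait before the next check.
    return None


# Demo with a fake source that changes on the third read.
responses = iter(["page 1", "page 1", "page 2"])
result = wait_for_change(lambda: next(responses), "page 1", delay=0)
print(result)  # page 2
```

With a real driver, the call would look something like wait_for_change(lambda: driver.page_source, last_page_source), and a None result would signal that the last page has been reached.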
