Trying to extract a review, but running into some problems - PullRequest
0 votes
/ November 21, 2018

The script I'm using to extract the reviews for one of the books:

URL: www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird

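from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
time.sleep(3)

driver.get('https://www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird')

time.sleep(5)

# find_elements(By.CSS_SELECTOR, ...) replaces find_elements_by_css_selector,
# which newer Selenium 4 releases no longer provide
reviews = driver.find_elements(By.CSS_SELECTOR, "div.reviewText")
for r in reviews:
    # Try to read the second span inside each review container
    spanText = r.find_element(By.CSS_SELECTOR, "span.readable:nth-child(2)").text
    print("Span text:", spanText)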

My problem is that I can't extract the full text from div.reviewText > span. That div > span contains two nested spans: the first holds only a truncated text (you have to click the ...more link to see the full text), and the second holds the full text, so I want to get the text from the second span. Can anyone help me, please?

HTML (or you can visit the site at the URL above)

<div class="reviewText stacked">
    <span id="reviewTextContainer35272288" class="readable">
        <span id="freeTextContainer13558188749606170457">If I could give this no stars, I would. This is possibly one of my least favorite books in the world, one that I would happily take off of shelves and stow in dark corners where no one would ever have to read it again.
            <br>
                <br>I think that To Kill A Mockingbird has such a prominent place in (American) culture because it is a naive, idealistic piece of writing in which naivete and idealism are ultimately rewarded. It's a saccharine, rose-tinted eulogy for the nineteen thirties from an orator who comes not
                </span>
                <span id="freeText13558188749606170457" style="display:none">If I could give this no stars, I would. This is possibly one of my least favorite books in the world, one that I would happily take off of shelves and stow in dark corners where no one would ever have to read it again.
                    <br>
                        <br>I think that To Kill A Mockingbird has such a prominent place in (American) culture because it is a naive, idealistic piece of writing in which naivete and idealism are ultimately rewarded. It's a saccharine, rose-tinted eulogy for the nineteen thirties from an orator who comes not to bury, but to praise. Written in the late fifties, TKAM is free of the social changes and conventions that people at the time were (and are, to some extent) still grating at. The primary dividing line in TKAM is not one of race, but is rather one of good people versus bad people -- something that, of course, Atticus and the children can discern effortlessly. 
                            <br>
                                <br>The characters are one dimensional. Calpurnia is the Negro who knows her place and loves the children; Atticus is a good father, wise and patient; Tom Robinson is the innocent wronged; Boo is the kind eccentric; Jem is the little boy who grows up; Scout is the precocious, knowledgable child. They have no identity outside of these roles. The children have no guile, no shrewdness--there is none of the delightfully subversive slyness that real children have, the sneakiness that will ultimately allow them to grow up. Jem and Scout will be children forever, existing in a world of black and white in which lacking knowledge allows people to see the truth in all of its simple, nuanceless glory. 
                                    <br>
                                        <br>I think that's why people find it soothing: TKAM privileges, celebrates, even, the child's point of view. Other YA classics--Huckleberry Finn; Catcher in the Rye; A Wrinkle in Time; The Day No Pigs Would Die; Are You There, God? It's Me, Margaret; Bridge to Terabithia--feature protagonists who are, if not actively fighting to become adults, at least fighting to find themselves as people. There is an active struggle throughout each of those books to make sense of the world, to define the world as something larger than oneself, as something that the protagonist can somehow be a part of. To Kill A Mockingbird has no struggle to become part of the world--in it, the children *are* the world, and everything else is just only relevant in as much as it affects them. There's no struggle to make sense of things, because to them, it already makes sense; there's no struggle to be a part of something, because they're already a part of everything. There's no sense of maturation--their world changes, but it leaves them, in many ways, unchanged, and because of that, it fails as a story for me. The whole point of a coming of age story--which is what TKAM is generally billed as--is that the characters come of age, or at least mature in some fashion, and it just doesn't happen. 
                                            <br>
                                                <br>All thematic issues aside, I think that the writing is very, er, uneven, shall we say? Overwhelmingly episodic, not terribly consistent, and largely as dimensionless as the characters.
                                                    <br>
                                                    </span>
                                                    <a data-text-id="13558188749606170457" href="#" onclick="swapContent($(this));; return false;">...more</a>
                                                </span>
                                            </div>

Answers [ 2 ]

0 votes
/ November 21, 2018

Use get_attribute() to extract the hidden content. The sleep calls are also unnecessary.

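from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# With the default page load strategy, get() blocks until the page has loaded
driver.get('https://www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird')

# The second child span of span.readable is the hidden one with the full review
reviews = driver.find_elements(By.CSS_SELECTOR, "span.readable span:nth-child(2)")
for r in reviews:
    # textContent is available even for display:none elements, unlike .text
    spanText = r.get_attribute('textContent')
    print("Span text:", spanText)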
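
One thing to watch for with textContent: it returns the raw text nodes, so the indentation and line breaks from the page source come along with the review, and the <br> tags contribute nothing. A minimal sketch of tidying that up, assuming that collapsing all whitespace to single spaces is acceptable for your use (normalize_review_text is a made-up helper name, not part of Selenium):

```python
import re


def normalize_review_text(raw):
    """Collapse runs of whitespace left over from the HTML source."""
    if raw is None:  # defensive: get_attribute() can return None
        return ""
    return re.sub(r"\s+", " ", raw).strip()


# Sample shaped like the textContent of the hidden span in the question
sample = (
    "If I could give this no stars, I would.\n"
    "                    \n"
    "                        I think that To Kill A Mockingbird has such a prominent place"
)
print(normalize_review_text(sample))
```

In the loop above you would call it as normalize_review_text(r.get_attribute('textContent')). Note that paragraph breaks are lost this way; if you need them, you would have to handle the <br> elements separately.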

0 votes
/ November 21, 2018

The second span is hidden (style="display:none"), so you can't get its content through the text property, which returns only visible text.

Try this inside your loop instead:

# needs: from selenium.webdriver.common.by import By
spanText = r.find_elements(By.CSS_SELECTOR, "span.readable > span")[-1].get_attribute('textContent')

This gets the content of the hidden element.
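
The [-1] index is a deliberate choice. If some reviews are short enough to have only one inner span (an assumption about the page, the HTML in the question shows only the long case), the last span is still the right one, whereas a fixed "second span" lookup would find nothing. A small sketch of that logic with a guard for an empty result; last_span_text and FakeSpan are hypothetical names, and FakeSpan only stands in for a Selenium element so the snippet runs without a browser:

```python
def last_span_text(spans):
    """Return the textContent of the last span, or "" if there are none."""
    if not spans:
        return ""
    text = spans[-1].get_attribute("textContent")
    return (text or "").strip()


class FakeSpan:
    """Stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, text_content):
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


truncated = FakeSpan("a eulogy from an orator who comes not")
full = FakeSpan("a eulogy from an orator who comes not to bury, but to praise.")

print(last_span_text([truncated, full]))  # long review: hidden full text
print(last_span_text([truncated]))        # short review: the only span
print(repr(last_span_text([])))           # no spans found
```

With a real driver, spans would be the list returned by the find_elements call on each review element.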
