Web scraping with BeautifulSoup: finding text inside a span inside a td, ignoring child spans - PullRequest
0 votes
/ February 8, 2020

I'm trying to scrape a website to get specific information, and I'm having a hard time.

Sample HTML file:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
    <form>
        <table>
            <tbody>
                <tr id="dontMatter"></tr>
                <tr id="td_important_id_1">
                    <div class="dontCare"></div>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 1"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 2"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 3"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 4"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                </tr>
            </tbody>

        </table>
    </form>
</body>
</html>

Essentially, I want to get all of the "Text That I want #" strings, but none of the text from the child elements of that span.

I'm trying to filter by the id "td_important_id_1" and by its child spans that have the class "important_class_1", and then get the text inside each of those spans without the text of any child spans.

What I have right now is:

from bs4 import BeautifulSoup
from selenium import webdriver

# 'path to driver' and 'website_link' are placeholders
driver = webdriver.Chrome(executable_path='path to driver')
driver.get('website_link')
soup = BeautifulSoup(driver.page_source, features="html.parser")

# Look the container up by id only: in the sample HTML the id is on a <tr>, not a <td>
container = soup.find(id="td_important_id_1")
for item in container.find_all("span", {"class": "important_class_1"}, recursive=False):
    # .text also includes the text of every nested element
    print(item.text)

driver.quit()

But the output is kind of garbage. If anyone can help with this, that would be great.

Answers [ 2 ]

0 votes
/ February 8, 2020

Here is another solution, just for reference.

from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<tr id="dontMatter"></tr>
<tr id="td_important_id_1">
    <div class="dontCare"></div>
    <span onClick="blah" class="important_class_1">
        ::before
        <input type="checkBox" name="">
        "Text That I want 1"
        <div class="label">
            <span class="garbagbe">Text that I dont want</span>
            <span class="garbagbe1">Text that I dont want</span>
            <span class="garbagbe2">Text that I dont want</span>
            <span class="garbagbe3">Text that I dont want</span>
        </div>
    </span>
    <span onClick="blah" class="important_class_1">
        ::before
        <input type="checkBox" name="">
        "Text That I want 2"
        <div class="label">
            <span class="garbagbe">Text that I dont want</span>
            <span class="garbagbe1">Text that I dont want</span>
            <span class="garbagbe2">Text that I dont want</span>
            <span class="garbagbe3">Text that I dont want</span>
        </div>
    </span>
</tr>
'''
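doc = SimplifiedDoc(html)
# Select the spans that are direct children of the target row
items = doc.selects('tr#td_important_id_1>span.important_class_1')
for item in items:
    # Text that follows the <input> element
    print(item.input.nextText())
    # Texts of the nested spans, printed for comparison
    print([s.text for s in item.selects('div.label>span')])

Result: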

"Text That I want 1"
['Text that I dont want', 'Text that I dont want', 'Text that I dont want', 'Text that I dont want']
"Text That I want 2"
['Text that I dont want', 'Text that I dont want', 'Text that I dont want', 'Text that I dont want']
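
For comparison, the same idea (select the direct child spans, then read the text node that follows the `<input>`) can be sketched with BeautifulSoup alone. This is only a sketch under assumptions: it uses the `html.parser` backend, a Beautiful Soup version whose `select()` supports the `>` child combinator, and a trimmed copy of the sample markup; the names `html` and `texts` are illustrative and not from the original post.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the sample markup from the question (assumption: two spans are enough to illustrate)
html = '''
<tr id="dontMatter"></tr>
<tr id="td_important_id_1">
    <div class="dontCare"></div>
    <span onClick="blah" class="important_class_1">
        ::before
        <input type="checkBox" name="">
        "Text That I want 1"
        <div class="label">
            <span class="garbagbe">Text that I dont want</span>
        </div>
    </span>
    <span onClick="blah" class="important_class_1">
        ::before
        <input type="checkBox" name="">
        "Text That I want 2"
        <div class="label">
            <span class="garbagbe">Text that I dont want</span>
        </div>
    </span>
</tr>
'''

soup = BeautifulSoup(html, "html.parser")

texts = []
# Only spans that are direct children of the target row
for span in soup.select("tr#td_important_id_1 > span.important_class_1"):
    box = span.find("input")
    node = box.next_sibling if box is not None else None
    # Keep the sibling only if it is a text node, not another tag
    if isinstance(node, NavigableString):
        texts.append(node.strip())

print(texts)
```

The `isinstance` check is there because `next_sibling` can also be a tag or `None` if the real page is laid out differently from the sample.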

0 votes
/ February 8, 2020

You can use .previous_sibling to move between page elements that are on the same level of the parse tree:

from bs4 import BeautifulSoup

data = '''
<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
    <form>
        <table>
            <tbody>
                <tr id="dontMatter"></tr>
                <tr id="td_important_id_1">
                    <div class="dontCare"></div>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 1"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 2"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 3"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                    <span onClick="blah" class="important_class_1">
                        ::before
                        <input type="checkBox" name="">
                        "Text That I want 4"
                        <div class="label">
                            <span class="garbagbe">Text that I dont want</span>
                            <span class="garbagbe1">Text that I dont want</span>
                            <span class="garbagbe2">Text that I dont want</span>
                            <span class="garbagbe3">Text that I dont want</span>
                        </div>
                    </span>
                </tr>
            </tbody>

        </table>
    </form>
</body>
</html>
'''

soup = BeautifulSoup(data, 'html.parser')

# The wanted text is the node immediately before each <div class="label">
[i.previous_sibling.strip() for i in soup.find_all('div', class_='label')]

and you will get:

['"Text That I want 1"',
 '"Text That I want 2"',
 '"Text That I want 3"',
 '"Text That I want 4"']
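
If the position of the `<div class="label">` is not reliable, another option is to collect only the text nodes that are direct children of each span, which is what the question literally asks for. The following is a minimal sketch, not a definitive implementation: `own_text` is a hypothetical helper, the markup is a trimmed copy of the sample, and it assumes that `::before` appears as literal text as in the sample (in a live page it is normally a CSS pseudo-element shown by browser dev tools, so it may not be in the source at all), which is why only the last direct text piece is kept.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the sample markup from the question
html = '''
<tr id="td_important_id_1">
    <div class="dontCare"></div>
    <span onClick="blah" class="important_class_1">
        ::before
        <input type="checkBox" name="">
        "Text That I want 1"
        <div class="label">
            <span class="garbagbe">Text that I dont want</span>
            <span class="garbagbe1">Text that I dont want</span>
        </div>
    </span>
    <span onClick="blah" class="important_class_1">
        ::before
        <input type="checkBox" name="">
        "Text That I want 2"
        <div class="label">
            <span class="garbagbe">Text that I dont want</span>
            <span class="garbagbe1">Text that I dont want</span>
        </div>
    </span>
</tr>
'''


def own_text(tag):
    """Return the non-empty text nodes that are direct children of tag (hypothetical helper)."""
    return [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString) and child.strip()
    ]


soup = BeautifulSoup(html, "html.parser")
row = soup.find(id="td_important_id_1")
# recursive=False keeps the nested "garbagbe" spans out of the result
spans = row.find_all("span", class_="important_class_1", recursive=False)

pieces = [own_text(span) for span in spans]
# Each list looks like ['::before', '"Text That I want N"']; keep the last piece
wanted = [p[-1] for p in pieces if p]

print(wanted)  # ['"Text That I want 1"', '"Text That I want 2"']
```

Because the helper never descends into child tags, the text inside `div.label` is skipped without having to name that element.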