Python web scraping - trying to find rows in a table - PullRequest
0 votes
/ 23 October 2018

I am trying to build a table by pulling most of the data out of an HTML table. I can get some of the tr rows, but I cannot get the individual td cells properly. What do I need to do to extract the td data? Do I need to select the tds whose class has a particular value, e.g. standing-table__cell, or can I just grab the data from all tds and sort it out afterwards?

Sample output -

[<tr class="standing-table__row">
<th class="standing-table__cell standing-table__header-cell" data-index="0" data-label="pos" title="Position">#</th>
<th class="standing-table__cell standing-table__header-cell standing-table__cell--name" data-index="1" title="Team">Team</th>
<th class="standing-table__cell standing-table__header-cell" data-index="2" data-label="pld" title="Played">Pl</th>
<th class="standing-table__cell standing-table__header-cell" data-index="9" data-label="pts" data-sort-value="use-attribute">Pts</th>
<th class="standing-table__cell standing-table__header-cell is-hidden--bp15 is-hidden--bp35 " data-index="10" data-sort-value="use-attribute">Last 6</th>
</tr>, <tr class="standing-table__row" data-item-id="345">
<td class="standing-table__cell">1</td>
<td class="standing-table__cell standing-table__cell--name" data-long-name="Manchester City" data-short-name="Manchester City">
<a class="standing-table__cell--name-link" href="/manchester-city">Manchester City</a>
</td>
<td class="standing-table__cell">9</td>
<td class="standing-table__cell is-hidden--bp15 is-hidden--bp35 " data-sort-value="16313333">
<div class="standing-table__form">
<span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 2-1 Newcastle United"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 3-0 Fulham"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Cardiff City 0-5 Manchester City"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 2-0 Brighton and Hove Albion"> </span><span class="standing-table__form-cell standing-table__form-cell--draw" title="Liverpool 0-0 Manchester City"> </span><span class="standing-table__form-cell standing-table__form-cell--win" title="Manchester City 5-0 Burnley"> </span> </div>
</td>
</tr>, <tr class="standing-table__row" data-item-id="155">
<td class="standing-table__cell">2</td>
<td class="standing-table__cell standing-table__cell--name" data-long-name="Liverpool" data-short-name="Liverpool">
  File "C:\Users\scrape.py", line 18, in <module>
    for td in premier_soup_tr.find_all('td', {'class': 'standing-table__cell'}):
  File "C:\Python\Python36\lib\site-packages\bs4\element.py", line 1884, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
>>> 

My code -

import requests
from bs4 import BeautifulSoup
url = 'https://www.skysports.com/premier-league-table'
premier_r = requests.get(url)
print(premier_r.status_code)
premier_soup = BeautifulSoup(premier_r.text, 'html.parser')
premier_soup_tr = premier_soup.find_all('tr', {'class': 'standing-table__row'})
print(premier_soup_tr)
for td in premier_soup_tr.find_all('td', {'class': 'standing-table__cell'}):
    print(td)

The HTML source looks like this -

    <tr class="standing-table__row" data-item-id="345">
  <td class="standing-table__cell">1</td>
  <td class="standing-table__cell standing-table__cell--name" data-short-name="Manchester City" data-long-name="Manchester City">

            <a href="/manchester-city" class="standing-table__cell--name-link">Manchester City</a>

  </td>
  <td class="standing-table__cell">9</td>
  <td class="standing-table__cell">23</td>
  <td class="standing-table__cell is-hidden--bp15 is-hidden--bp35 " data-sort-value="16313333">
          <div class="standing-table__form">
      <span title="Manchester City 2-1 Newcastle United" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Manchester City 3-0 Fulham" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Cardiff City 0-5 Manchester City" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Manchester City 2-0 Brighton and Hove Albion" class="standing-table__form-cell standing-table__form-cell--win"> </span><span title="Liverpool 0-0 Manchester City" class="standing-table__form-cell standing-table__form-cell--draw"> </span><span title="Manchester City 5-0 Burnley" class="standing-table__form-cell standing-table__form-cell--win"> </span>        </div>
        </td>

</tr>
    <tr class="standing-table__row" data-item-id="155">
  <td class="standing-table__cell">2</td>
  <td class="standing-table__cell standing-table__cell--name" data-short-name="Liverpool" data-long-name="Liverpool">

            <a href="/liverpool" class="standing-table__cell--name-link">Liverpool</a>

  </td>

1 Answer

0 votes
/ 24 October 2018

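You did everything right up to the last step, but you need to do something with what you got back. find_all returns a ResultSet, which is a list-like collection of tags, so you cannot call premier_soup_tr.find_all on it directly. The right way is to call it on a single element, premier_soup_tr[position].find_all, or to loop over the rows.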
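To illustrate the point about the ResultSet, here is a minimal sketch that runs without a network request. SAMPLE_HTML is a hand-written stand-in for the page markup (an assumption for illustration, trimmed to two columns), and the names rows and cells_per_row are hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample mimicking the page's markup (an assumption, not the live page).
SAMPLE_HTML = """
<table>
<tr class="standing-table__row">
<th class="standing-table__cell standing-table__header-cell">#</th>
<th class="standing-table__cell standing-table__header-cell">Team</th>
</tr>
<tr class="standing-table__row" data-item-id="345">
<td class="standing-table__cell">1</td>
<td class="standing-table__cell standing-table__cell--name">
<a class="standing-table__cell--name-link" href="/manchester-city">Manchester City</a>
</td>
</tr>
<tr class="standing-table__row" data-item-id="155">
<td class="standing-table__cell">2</td>
<td class="standing-table__cell standing-table__cell--name">
<a class="standing-table__cell--name-link" href="/liverpool">Liverpool</a>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# find_all returns a ResultSet: list-like, with no find_all of its own.
rows = soup.find_all('tr', {'class': 'standing-table__row'})

cells_per_row = []
for row in rows:
    # Each element of the ResultSet is a Tag, and a Tag does have find_all().
    cells = row.find_all('td', {'class': 'standing-table__cell'})
    cells_per_row.append([cell.get_text(strip=True) for cell in cells])

print(cells_per_row)
# [[], ['1', 'Manchester City'], ['2', 'Liverpool']]
```

The first list is empty because the header row contains th cells rather than td cells, which is why it is reasonable to skip that row when collecting data.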

Here is what I did.

import requests
from bs4 import BeautifulSoup
url = 'https://www.skysports.com/premier-league-table'
premier_r = requests.get(url)
print(premier_r.status_code)
premier_soup = BeautifulSoup(premier_r.text, 'html.parser')
premier_soup_tr = premier_soup.find_all('tr', {'class': 'standing-table__row'})
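# Skip the header row ([1:]) and drop the last cell of each row ([:-1]),
# which holds the "Last 6" form icons rather than text.
result = [[cell.text.strip() for cell in row.find_all('td', {'class': 'standing-table__cell'})][:-1] for row in premier_soup_tr[1:]]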
print(result)

Output:

[['1', 'Manchester City', '9', '7', '2', '0', '26', '3', '23', '23'], ['2', 'Liverpool', '9', '7', '2', '0', '16', '3', '13', '23'], ['3', 'Chelsea', '9', '6', '3', '0', '20', '7', '13', '21'], ['4', 'Arsenal', '9', '7', '0', '2', '22', '11', '11', '21'], ['5', 'Tottenham Hotspur', '9', '7', '0', '2', '16', '7', '9', '21'], ['6', 'Bournemouth', '9', '5', '2', '2', '16', '12', '4', '17'],
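If positional lists are hard to read, one possible variation is to pair each value with its column header taken from the th cells of the first row. This is only a sketch under assumptions: SAMPLE_HTML is a hand-written, shortened stand-in for the page (four data columns plus the form column), and the names headers, record and table are hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written, shortened sample of the markup (an assumption, not the live page).
SAMPLE_HTML = """
<table>
<tr class="standing-table__row">
<th class="standing-table__cell standing-table__header-cell" title="Position">#</th>
<th class="standing-table__cell standing-table__header-cell standing-table__cell--name" title="Team">Team</th>
<th class="standing-table__cell standing-table__header-cell" title="Played">Pl</th>
<th class="standing-table__cell standing-table__header-cell">Pts</th>
<th class="standing-table__cell standing-table__header-cell">Last 6</th>
</tr>
<tr class="standing-table__row" data-item-id="345">
<td class="standing-table__cell">1</td>
<td class="standing-table__cell standing-table__cell--name">
<a class="standing-table__cell--name-link" href="/manchester-city">Manchester City</a>
</td>
<td class="standing-table__cell">9</td>
<td class="standing-table__cell">23</td>
<td class="standing-table__cell is-hidden--bp15 is-hidden--bp35 ">
<div class="standing-table__form"><span title="Manchester City 5-0 Burnley"> </span></div>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = soup.find_all('tr', {'class': 'standing-table__row'})

# Column names come from the th cells of the header row.
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]

table = []
for row in rows[1:]:  # skip the header row
    values = [td.get_text(strip=True)
              for td in row.find_all('td', {'class': 'standing-table__cell'})]
    record = dict(zip(headers, values))
    # Drop the form column by name instead of by position.
    record.pop('Last 6', None)
    table.append(record)

print(table)
# [{'#': '1', 'Team': 'Manchester City', 'Pl': '9', 'Pts': '23'}]
```

Dropping the column by its header name instead of slicing with [:-1] keeps working if the site reorders its columns, at the cost of depending on the header text staying the same.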