Scraping multiple pages with Beautiful Soup - only the last page is shown
0 votes
/ October 9, 2018

My friends and I hold a FIFA draft every year, and I was looking for a quick way to update the player information.

What I want is to collect player information from several pages. The pages are:

https://sofifa.com/players?offset=0
https://sofifa.com/players?offset=51
https://sofifa.com/players?offset=101

This is my code:

import requests
from bs4 import BeautifulSoup
import re

pages = []

collection = ['0', '51', '101']
for i in collection:
    url = 'https://sofifa.com/players?offset=' + str(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

table = soup.find('table', {'class': 'table-hover'})
tbody = table.find('tbody')
info = tbody.find_all('tr')


records = []
for infos in info:
    nome = infos.find("a", href=re.compile("/player/")).string
    numero = infos.find('img').get('id')
    idade = infos.find('div' , {'class': 'col-ae'}).string
    ponto = infos.find('div' , {'class': 'col-oa'}).string
    potencial = infos.find('div' , {'class': 'col-pt'}).string
    records.append((nome, numero, idade, ponto, potencial))

import pandas as pd
df = pd.DataFrame(records, columns=['nome', 'numero', 'idade', 'ponto', 'potencial'])

df.to_excel('jogadores.xls', index=False, encoding='utf-8')

The problem is that I only get results from the "101" page; the first two pages are not parsed. I have searched everywhere here, but without success.

What confuses me is that if I print(soup), it shows me the HTML from all three pages!

How can I solve this?

Here is a sample of the HTML:

<table class="table table-hover persist-area">
    <thead>...</thead>
    <tbody>         <tr>
            <td>                        <figure class="avatar">
                        <img alt="" data-src="https://cdn.sofifa.org/players/4/19/158023.png" data-srcset="https://cdn.sofifa.org/players/4/19/158023@2x.png 2x, https://cdn.sofifa.org/players/4/19/158023@3x.png 3x" src="https://cdn.sofifa.org/players/4/19/158023.png" width="48" height="48" id="158023" class="player-check loaded" srcset="https://cdn.sofifa.org/players/4/19/158023@2x.png 2x, https://cdn.sofifa.org/players/4/19/158023@3x.png 3x" data-was-processed="true"></figure></td>
            <td>
                <div class="col-name text-ellipsis rtl"><a href="/players?na=52" rel="nofollow" title="Argentina"><img alt="" src="https://cdn.sofifa.org/flags/52.png" data-src="https://cdn.sofifa.org/flags/52.png" data-srcset="https://cdn.sofifa.org/flags/52@2x.png 2x, https://cdn.sofifa.org/flags/52@3x.png 3x" class="flag loaded" style="width:23px;height:17px" srcset="https://cdn.sofifa.org/flags/52@2x.png 2x, https://cdn.sofifa.org/flags/52@3x.png 3x" data-was-processed="true"></a> <a href="/player/158023" title="Lionel Messi">L. Messi</a>                    <div class="text-ellipsis rtl"><a rel="nofollow" href="/players?pn=21"><span class="pos pos21">CF</span></a> <a rel="nofollow" href="/players?pn=23"><span class="pos pos23">RW</span></a> <a rel="nofollow" href="/players?pn=25"><span class="pos pos25">ST</span></a></div>
                </div>
            </td>                   <td class="col text-center" data-col="ae">
                    <div class="col-digit col-ae">31</div>
                    </td>                   <td class="col text-center" data-col="oa">
                    <div class="col-digit col-oa"><span class="label p94">94</span></div>
                    </td>                   <td class="col text-center" data-col="pt">
                    <div class="col-digit col-pt"><span class="label p94">94</span></div>
                    </td>           <td>
                <div class="col-name text-ellipsis rtl">                        <figure class="avatar avatar-sm transparent">
                        <img alt="" class="team loaded" data-src="https://cdn.sofifa.org/teams/2/19/light/241.png" data-srcset="https://cdn.sofifa.org/teams/2/19/light/241@2x.png 2x, https://cdn.sofifa.org/teams/2/19/light/241@3x.png 3x" src="https://cdn.sofifa.org/teams/2/19/light/241.png" width="24" height="24" srcset="https://cdn.sofifa.org/teams/2/19/light/241@2x.png 2x, https://cdn.sofifa.org/teams/2/19/light/241@3x.png 3x" data-was-processed="true">
                        </figure>
                        <a href="/team/241">FC Barcelona</a>                    <div class="subtitle text-ellipsis rtl">2004 ~ 2021</div>
                </div>
            </td><th class="gap"></th>                      <td class="col text-center" data-col="vl">
                        <div class="col-digit col-vl">€110.5M</div>
                        </td>                       <td class="col text-center" data-col="wg">
                        <div class="col-digit col-wg">€565K</div>
                        </td><th class="gap"></th>                      <td class="col text-center" data-col="tt">
                        <div class="col-digit col-tt">2195</div>
                        </td>           <td class="gap"></td>
            <td>                    <div class="col-comments text-right text-ellipsis rtl">0.8K / 32.9K                 </div></td></tr>            <tr>
            <td>                        <figure class="avatar">
                        <img alt="" data-src="https://cdn.sofifa.org/players/4/19/20801.png" data-srcset="https://cdn.sofifa.org/players/4/19/20801@2x.png 2x, https://cdn.sofifa.org/players/4/19/20801@3x.png 3x" src="https://cdn.sofifa.org/players/4/19/20801.png" width="48" height="48" id="20801" class="player-check loaded" srcset="https://cdn.sofifa.org/players/4/19/20801@2x.png 2x, https://cdn.sofifa.org/players/4/19/20801@3x.png 3x" data-was-processed="true"></figure></td>
            <td>
                <div class="col-name text-ellipsis rtl"><a href="/players?na=38" rel="nofollow" title="Portugal"><img alt="" src="https://cdn.sofifa.org/flags/38.png" data-src="https://cdn.sofifa.org/flags/38.png" data-srcset="https://cdn.sofifa.org/flags/38@2x.png 2x, https://cdn.sofifa.org/flags/38@3x.png 3x" class="flag loaded" style="width:23px;height:17px" srcset="https://cdn.sofifa.org/flags/38@2x.png 2x, https://cdn.sofifa.org/flags/38@3x.png 3x" data-was-processed="true"></a> <a href="/player/20801" title="C. Ronaldo dos Santos Aveiro">Cristiano Ronaldo</a>                 <div class="text-ellipsis rtl"><a rel="nofollow" href="/players?pn=25"><span class="pos pos25">ST</span></a> <a rel="nofollow" href="/players?pn=27"><span class="pos pos27">LW</span></a></div>
                </div>
            </td>                   <td class="col text-center" data-col="ae">
                    <div class="col-digit col-ae">33</div>
                    </td>                   <td class="col text-center" data-col="oa">
                    <div class="col-digit col-oa"><span class="label p94">94</span></div>
                    </td>                   <td class="col text-center" data-col="pt">
                    <div class="col-digit col-pt"><span class="label p94">94</span></div>
                    </td>           <td>
                <div class="col-name text-ellipsis rtl">                        <figure class="avatar avatar-sm transparent">
                        <img alt="" class="team loaded" data-src="https://cdn.sofifa.org/teams/2/19/light/45.png" data-srcset="https://cdn.sofifa.org/teams/2/19/light/45@2x.png 2x, https://cdn.sofifa.org/teams/2/19/light/45@3x.png 3x" src="https://cdn.sofifa.org/teams/2/19/light/45.png" width="24" height="24" srcset="https://cdn.sofifa.org/teams/2/19/light/45@2x.png 2x, https://cdn.sofifa.org/teams/2/19/light/45@3x.png 3x" data-was-processed="true">
                        </figure>
                        <a href="/team/45">Juventus</a>                 <div class="subtitle text-ellipsis rtl">2018 ~ 2022</div>
                </div>
            </td><th class="gap"></th>                      <td class="col text-center" data-col="vl">
                        <div class="col-digit col-vl">€77M</div>
                        </td>                       <td class="col text-center" data-col="wg">
                        <div class="col-digit col-wg">€405K</div>
                        </td><th class="gap"></th>                      <td class="col text-center" data-col="tt">
                        <div class="col-digit col-tt">2228</div>
                        </td>           <td class="gap"></td>
            <td>                    <div class="col-comments text-right text-ellipsis rtl">0.8K / 40.1K                 </div></td></tr></tbody>
</table>

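1 Answer

0 votes
/ October 9, 2018

It's because of the indentation, as G_M already pointed out: the table parsing sits outside the for loop over the pages, so it runs only once, on the soup from the last page. Try this instead: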

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

pages = []
records = []  # created once, before the loop, so rows from every page accumulate

# Build one URL per offset
collection = ['0', '51', '101']
for i in collection:
    url = 'https://sofifa.com/players?offset=' + str(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Everything below is indented into the loop, so it runs for each page
    table = soup.find('table', {'class': 'table-hover'})
    tbody = table.find('tbody')
    info = tbody.find_all('tr')

    for infos in info:
        nome = infos.find("a", href=re.compile("/player/")).string
        numero = infos.find('img').get('id')
        idade = infos.find('div', {'class': 'col-ae'}).string
        ponto = infos.find('div', {'class': 'col-oa'}).string
        potencial = infos.find('div', {'class': 'col-pt'}).string
        records.append((nome, numero, idade, ponto, potencial))
        print(nome, numero, idade, ponto, potencial)

df = pd.DataFrame(records, columns=['nome', 'numero', 'idade', 'ponto', 'potencial'])

# Recent pandas releases no longer accept encoding= here or write .xls files,
# so save as .xlsx (requires the openpyxl package)
df.to_excel('jogadores.xlsx', index=False)
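
To see why the indentation matters without hitting the site, here is a minimal sketch that uses plain lists as stand-ins for the downloaded pages. The names fake_pages, rows_after_loop and rows_in_loop are made up for this illustration and are not part of the original script. It may also explain the print(soup) observation in the question: if that call was placed inside the loop, it would run once per page and so show all three.

```python
# Stand-ins for the three downloaded pages; each list plays the role of one page's rows.
fake_pages = {
    '0': ['row A', 'row B'],
    '51': ['row C'],
    '101': ['row D', 'row E'],
}

# Shape of the code in the question: the parsing step sits after the loop.
for offset in fake_pages:
    soup = fake_pages[offset]      # rebound on every pass, like soup = BeautifulSoup(...)
rows_after_loop = list(soup)       # only the value from the last pass survives

# Shape of the code in the answer: the parsing step is indented into the loop.
rows_in_loop = []
for offset in fake_pages:
    soup = fake_pages[offset]
    rows_in_loop.extend(soup)      # runs once per page

print(rows_after_loop)  # ['row D', 'row E']
print(rows_in_loop)     # ['row A', 'row B', 'row C', 'row D', 'row E']
```

The accumulator list has to be created before the loop for the same reason: creating it inside would reset it on every page.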
...
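
One possible way to make this kind of mistake harder to repeat is to move the per-page parsing into a function and call it inside the loop. The sketch below is only an illustration: parse_players, PLAYER_LINK and SAMPLE_HTML are hypothetical names, and the HTML is a trimmed copy of one row from the sample in the question. It also skips rows where a field is missing instead of raising an AttributeError, which is an added assumption about the desired behavior.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of one row from the sample HTML in the question.
SAMPLE_HTML = """
<table class="table table-hover persist-area">
  <tbody>
    <tr>
      <td><figure class="avatar"><img alt="" id="158023" class="player-check"></figure></td>
      <td><div class="col-name"><a href="/players?na=52" title="Argentina"></a>
          <a href="/player/158023" title="Lionel Messi">L. Messi</a></div></td>
      <td><div class="col-digit col-ae">31</div></td>
      <td><div class="col-digit col-oa"><span class="label p94">94</span></div></td>
      <td><div class="col-digit col-pt"><span class="label p94">94</span></div></td>
    </tr>
  </tbody>
</table>
"""

PLAYER_LINK = re.compile(r"^/player/")


def parse_players(html):
    """Hypothetical helper: return (nome, numero, idade, ponto, potencial) tuples for one page."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'table-hover'})
    if table is None:  # no matching table in this HTML
        return []
    records = []
    for row in table.find_all('tr'):
        link = row.find('a', href=PLAYER_LINK)
        img = row.find('img')
        cells = [row.find('div', {'class': name}) for name in ('col-ae', 'col-oa', 'col-pt')]
        if link is None or img is None or any(cell is None for cell in cells):
            continue  # skip header rows or rows missing a field
        idade, ponto, potencial = (cell.get_text(strip=True) for cell in cells)
        records.append((link.get_text(strip=True), img.get('id'), idade, ponto, potencial))
    return records


# One list shared by all pages; in the real script each html would be requests.get(url).text.
all_records = []
for html in [SAMPLE_HTML, SAMPLE_HTML]:
    all_records.extend(parse_players(html))

print(all_records)
```

Because the function receives the HTML as a string, it can be checked against a saved page without making any requests.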