How to scrape Wikipedia tables with Python - PullRequest
0 votes
/ March 19, 2019

I want to extract the URLs from the table at https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia, but my code returns no data. How can I get them?

Code:

import requests
from bs4 import BeautifulSoup as bs
url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
html = requests.get(url).text
soup = bs(html, 'html.parser')
ta=soup.find_all('table',class_="wikitable sortable jquery-tablesorter")
print(ta)

Answers [ 6 ]

0 votes
/ March 19, 2019

When I am pulling a table and see <table> tags, I always try pandas' .read_html() first. It iterates over the rows for you. Most of the time you get exactly what you need, or at worst you only have to tidy up the DataFrame a little. In this case it gives you the full table:

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
table = pd.read_html(url)[1]

Output:

print (table.to_string())
                                   0                   1                                  2                  3        4                                                  5
0                               Name            Industry                             Sector       Headquarters  Founded                                              Notes
1                  Airfast Indonesia   Consumer services                           Airlines          Tangerang     1971                                    Private airline
2                       Angkasa Pura         Industrials            Transportation services            Jakarta     1962                               State-owned airports
3                Astra International       Conglomerates                                  -            Jakarta     1957    Automotive, financials, industrials, technology
4                  Bank Central Asia          Financials                              Banks            Jakarta     1957                                               Bank
5                       Bank Danamon          Financials                              Banks            Jakarta     1956                                               Bank
6                       Bank Mandiri          Financials                              Banks            Jakarta     1998                                               Bank
7              Bank Negara Indonesia          Financials                              Banks            Jakarta     1946                                               Bank
8              Bank Rakyat Indonesia          Financials                              Banks            Jakarta     1895                                 Micro-finance bank
9                     Bumi Resources     Basic materials                     General mining            Jakarta     1973                                             Mining
10                            Djarum      Consumer goods                            Tobacco  Kudus and Jakarta     1951                                            Tobacco
11   Dragon Computer & Communication          Technology                  Computer hardware            Jakarta     1980                                  Computer hardware
12             Elex Media Komputindo   Consumer services                         Publishing            Jakarta     1985                                          Publisher
13                            Femina   Consumer services                              Media            Jakarta     1972                                    Weekly magazine
14                  Garuda Indonesia   Consumer services                   Travel & leisure          Tangerang     1949                                State-owned airline
15                      Gudang Garam      Consumer goods                            Tobacco             Kediri     1958                                            Tobacco
16                      Gunung Agung   Consumer services                Specialty retailers            Jakarta     1953                                         Bookstores
17       Indocement Tunggal Prakarsa         Industrials      Building materials & fixtures            Jakarta     1985         Cement, part of HeidelbergCement (Germany)
18                          Indofood      Consumer goods                      Food products            Jakarta     1968                                    Food production
19              Indonesian Aerospace         Industrials                          Aerospace            Bandung     1976                        State-owned aircraft design
20    Indonesian Bureau of Logistics      Consumer goods                      Food products            Jakarta     1967                                  Food distribution
21                           Indosat  Telecommunications      Fixed line telecommunications            Jakarta     1967                         Telecommunications network
22               Infomedia Nusantara   Consumer services                         Publishing            Jakarta     1975                                Directory publisher
23      Jalur Nugraha Ekakurir (JNE)         Industrials                  Delivery services            Jakarta     1990                                  Express logistics
24                       Kalbe Farma         Health care                    Pharmaceuticals            Jakarta     1966                                    Pharmaceuticals
25              Kereta Api Indonesia         Industrials                          Railroads            Bandung     1945                                State-owned railway
26                       Kimia Farma         Health care                    Pharmaceuticals            Jakarta     1971                                 State-owned pharma
27             Kompas Gramedia Group   Consumer services                     Media agencies            Jakarta     1965                                      Media holding
28                    Krakatau Steel     Basic materials                       Iron & steel            Cilegon     1970                                  State-owned steel
29                          Lion Air   Consumer services                           Airlines            Jakarta     2000                                   Low-cost airline
30                       Lippo Group          Financials  Real estate holding & development            Jakarta     1950                                        Development
31                          Matahari   Consumer services                Broadline retailers          Tangerang     1982                                  Department stores
32                       MedcoEnergi           Oil & gas           Exploration & production            Jakarta     1980                                Energy, oil and gas
33             Media Nusantara Citra   Consumer services       Broadcasting & entertainment            Jakarta     1997                                              Media
34                   Panin Sekuritas          Financials                Investment services            Jakarta     1989                                             Broker
35                         Pegadaian          Financials                   Consumer finance            Jakarta     1901                     State-owned financial services
36                             Pelni         Industrials              Marine transportation            Jakarta     1952                                           Shipping
37                     Pos Indonesia         Industrials                  Delivery services            Bandung     1995                         State-owned postal service
38                         Pertamina           Oil & gas               Integrated oil & gas            Jakarta     1957                    State-owned oil and natural gas
39             Perusahaan Gas Negara           Oil & gas           Exploration & production            Jakarta     1965                                                Gas
40             Perusahaan Gas Negara           Utilities                   Gas distribution            Jakarta     1965             State-owned natural gas transportation
41         Perusahaan Listrik Negara           Utilities           Conventional electricity            Jakarta     1945                State-owned electrical distribution
42  Phillip Securities Indonesia, PT          Financials                Investment services            Jakarta     1989                                 Financial services
43                            Pindad         Industrials                            Defense            Bandung     1808                                State-owned defense
44                PT Lapindo Brantas           Oil & gas           Exploration & production            Jakarta     1996                                        Oil and gas
45   PT Metro Supermarket Realty Tbk   Consumer services       Food retailers & wholesalers            Jakarta     1955                                       Supermarkets
46                       Salim Group       Conglomerates                                  -            Jakarta     1972            Industrials, financials, consumer goods
47                         Sampoerna      Consumer goods                            Tobacco           Surabaya     1913                                            Tobacco
48                   Semen Indonesia         Industrials      Building materials & fixtures             Gresik     1957                                             Cement
49                          Susi Air   Consumer services                           Airlines        Pangandaran     2004                                    Charter airline
50                  Telkom Indonesia  Telecommunications      Fixed line telecommunications            Bandung     1856                         Telecommunication services
51                         Telkomsel  Telecommunications          Mobile telecommunications            Jakarta     1995           Mobile network, part of Telkom Indonesia
52                        Trans Corp       Conglomerates                                  -            Jakarta     2006  Media, consumer services, real estate, part of...
53                Unilever Indonesia      Consumer goods                  Personal products            Jakarta     1933  Personal care products, part of Unilever (Neth...
54                   United Tractors         Industrials       Commercial vehicles & trucks            Jakarta     1972                                    Heavy equipment
55                           Waskita         Industrials                 Heavy construction            Jakarta     1961                           State-owned construction
0 votes
/ March 19, 2019

If you want to analyze the table data, pandas does this very efficiently. If you want to manipulate the data, you can work with the table as a pandas DataFrame:

import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
table = pd.read_html(url,header=0)
print(table[1])
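
Picking the table by position ([1]) breaks as soon as the page gains or loses a table. A minimal sketch of a less position-dependent variant, using the `match` parameter of `pd.read_html` together with `header=0`. The `SAMPLE_HTML` fragment and its values are hypothetical stand-ins for the downloaded page, so that the snippet runs offline; it assumes an HTML parser such as lxml is installed for pandas.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: two tables, only one of
# which is the company list.
SAMPLE_HTML = """
<table class="wikitable"><tr><th>Rank</th><th>Name</th></tr>
<tr><td>1</td><td>Example Co</td></tr></table>
<table class="wikitable sortable">
<tr><th>Name</th><th>Industry</th><th>Headquarters</th><th>Founded</th></tr>
<tr><td>Airfast Indonesia</td><td>Consumer services</td><td>Tangerang</td><td>1971</td></tr>
<tr><td>Angkasa Pura</td><td>Industrials</td><td>Jakarta</td><td>1962</td></tr>
</table>
"""

# match= keeps only the tables whose text contains the given string, so the
# result no longer depends on where the table sits in the page.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Headquarters", header=0)
companies = tables[0]
print(companies[["Name", "Founded"]])
```

With the real page you would pass the URL (or the downloaded HTML wrapped in StringIO) instead of the sample string.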
0 votes
/ March 19, 2019

Try the following:

import requests
from bs4 import BeautifulSoup as bs
URL = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
html = requests.get(URL).text
soup = bs(html, 'html.parser')
ta=soup.find("table",{"class":"wikitable sortable"})
print(ta)

To get all the tables:

ta=soup.find_all("table",{"class":"wikitable sortable"})
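
As for why the code in the question finds nothing: the jquery-tablesorter class is most likely added by JavaScript in the browser, so it is not in the HTML that requests downloads. BeautifulSoup compares a multi-word class_ string against the whole class attribute, so the extra name prevents any match. A small sketch of the difference on a hypothetical fragment (`SAMPLE_HTML` is an assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the static HTML that requests receives.
SAMPLE_HTML = """
<table class="wikitable sortable"><tr><th>Name</th></tr></table>
<table class="wikitable"><tr><th>Rank</th></tr></table>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A multi-word class_ string is compared against the whole class attribute,
# so a class name that is absent from the served HTML yields no match.
as_in_question = soup.find_all(
    "table", class_="wikitable sortable jquery-tablesorter"
)
# A single class name matches every tag that carries that class.
any_wikitable = soup.find_all("table", class_="wikitable")
# A CSS selector requires both classes, regardless of their order.
sortable_only = soup.select("table.wikitable.sortable")

print(len(as_in_question), len(any_wikitable), len(sortable_only))  # prints: 0 2 1
```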
0 votes
/ March 19, 2019

This may not be exactly what you are looking for, but you can try it.

import requests
from bs4 import BeautifulSoup as bs

url = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
html = requests.get(url).text
soup = bs(html, 'html.parser')

for data in soup.find_all('table', {"class":"wikitable"}):
    for td in data.find_all('td'):
        for link in td.find_all('a'):
            print (link.text)
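
The loop above prints only the link text. If the goal is the URLs themselves, a possible sketch is to read each anchor's href and resolve it against the page address with `urllib.parse.urljoin`, since Wikipedia links are usually relative. `SAMPLE_HTML` is a hypothetical fragment used so the example runs without a network call.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"

# Hypothetical fragment standing in for the downloaded page.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Name</th><th>Headquarters</th></tr>
<tr><td><a href="/wiki/Garuda_Indonesia">Garuda Indonesia</a></td>
    <td><a href="/wiki/Tangerang">Tangerang</a></td></tr>
<tr><td>Femina</td><td><a href="/wiki/Jakarta">Jakarta</a></td></tr>
</table>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

links = []
for table in soup.find_all("table", class_="wikitable"):
    # Only anchors inside data cells that actually have an href attribute.
    for anchor in table.select("td a[href]"):
        absolute = urljoin(BASE_URL, anchor["href"])
        links.append((anchor.get_text(strip=True), absolute))

for text, href in links:
    print(text, href)
```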
0 votes
/ March 19, 2019

Fixes:

  1. Use URL instead of url in your code (line 4)
  2. Use the class wikitable
  3. I optimized your code a little

Hence:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia")
soup = BeautifulSoup(page.content, 'html.parser')
ta = soup.find_all('table',class_="wikitable")

print(ta)

OUTPUT:

[<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Image
</th>
<th>Name
</th>
<th>2016 Revenues (USD $M)
</th>
<th>Employees
</th>
<th>Notes
.
.
.
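
Printing the tag list gives raw markup like the dump above. A rough sketch of turning one such table into plain rows of text, assuming the same th/td layout; `SAMPLE_HTML` and its values are hypothetical, not the real page content.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the output above (placeholder values).
SAMPLE_HTML = """
<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Name
</th></tr>
<tr>
<td>1
</td>
<td>Company A
</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="wikitable")

rows = []
for tr in table.find_all("tr"):
    # Header and data cells are handled alike; strip=True drops the newlines.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, *body = rows
print(header)
print(body)
```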
0 votes
/ March 19, 2019
import requests
from bs4 import BeautifulSoup as bs
URL = "https://en.wikipedia.org/wiki/List_of_companies_of_Indonesia"
html = requests.get(URL).text
soup = bs(html, 'html.parser')
ta=soup.find_all('table',{'class':'wikitable'})
print(ta)

You can look up the table by class name the old way, by passing a dictionary of attributes. It still seems to work.
