This code gets all the data and saves it to CSV files. To keep things simple, I select only the nested tables.

The problem is that the tables Sales per Business, Sales per region and Equities have nested columns. That gives fewer headers than columns and produces an invalid CSV file, so you have to add the missing headers before saving to get a correct CSV.

For Sales per Business and Sales per region the headers are spread over two rows, so I join them using zip() (and use del to remove the second row).

import requests
from bs4 import BeautifulSoup
import csv

url = 'https://www.marketscreener.com/MICROSOFT-CORPORATION-4835/company/'
r = requests.get(url) #, headers={'user-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content, 'html.parser')

all_tables = []

# collect only the nested tables, each as a list of rows
for table in soup.select("table table.nfvtTab"):
    table_rows = []
    for tr in table.select('tr'):
        row = []
        for td in tr.select('td'):
            #print(td)
            item = td.get_text(strip=True, separator=' ')
            #print(item)
            row.append(item)
        table_rows.append(row)
    all_tables.append(table_rows)

# add headers for nested columns

# Sales per Business
all_tables[0][0].insert(2, '2018')
all_tables[0][0].insert(4, '2019')
all_tables[0][1].insert(0, '')
all_tables[0][1].insert(5, '')

# create one row with headers
headers = [f'{a} {b}'.strip() for a, b in zip(all_tables[0][0], all_tables[0][1])]
print('new:', headers)
all_tables[0][0] = headers  # set new headers in first row
del all_tables[0][1]        # remove second row

# Sales per region
all_tables[1][0].insert(2, '2018')
all_tables[1][0].insert(4, '2019')
all_tables[1][1].insert(0, '')
all_tables[1][1].insert(5, '')

# create one row with headers
headers = [f'{a} {b}'.strip() for a, b in zip(all_tables[1][0], all_tables[1][1])]
print('new:', headers)
all_tables[1][0] = headers  # set new headers in first row
del all_tables[1][1]        # remove second row

# Equities
all_tables[3][0].insert(4, 'Free-Float %')
all_tables[3][0].insert(6, 'Company-owned shares %')

# display all tables
for number, table in enumerate(all_tables, 1):
    print('---', number, '---')
    for row in table:
        print(row)

# save every table in a separate file
for number, table in enumerate(all_tables, 1):
    # newline='' stops the csv module from writing extra blank lines on Windows
    with open(f'table{number}.csv', 'w', newline='') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerows(table)

Result:

new: ['', '2018 USD (in Million)', '2018 %', '2019 USD (in Million)', '2019 %', 'Delta']
new: ['', '2018 USD (in Million)', '2018 %', '2019 USD (in Million)', '2019 %', 'Delta']
--- 1 ---
['', '2018 USD (in Million)', '2018 %', '2019 USD (in Million)', '2019 %', 'Delta']
['More Personal Computing', '42,276', '38.4%', '45,698', '36.4%', '+8.09%']
['Productivity and Business Processes', '35,865', '32.6%', '41,160', '32.8%', '+14.76%']
['Intelligent Cloud', '32,219', '29.2%', '38,985', '31.1%', '+21%']
--- 2 ---
['', '2018 USD (in Million)', '2018 %', '2019 USD (in Million)', '2019 %', 'Delta']
['United States', '55,926', '50.8%', '64,199', '51.2%', '+14.79%']
['Other Countries', '54,434', '49.4%', '61,644', '49.1%', '+13.25%']
--- 3 ---
['Name', 'Age', 'Since', 'Title']
['Satya Nadella', '52', '2014', 'Chief Executive Officer & Non-Independent Director']
['Bradford Smith', '60', '2015', 'President & Chief Legal Officer']
['John Thompson', '69', '2014', 'Independent Chairman']
['Kirk Koenigsbauer', '51', '2020', 'COO & VP-Experiences & Devices Group']
['Amy E. Hood', '47', '2013', 'Chief Financial Officer & Executive Vice President']
['James Kevin Scott', '54', '-', 'Chief Technology Officer & Executive VP']
['John W. Stanton', '64', '2014', 'Independent Director']
['Teri L. List-Stoll', '57', '2014', 'Independent Director']
['Charles Scharf', '53', '2014', 'Independent Director']
['Sandra E. Peterson', '60', '2015', 'Independent Director']
--- 4 ---
['', 'Vote', 'Quantity', 'Free-Float', 'Free-Float %', 'Company-owned shares', 'Company-owned shares %', 'Total Float']
['Stock A', '1', '7,583,440,247', '7,475,252,172', '98.6%', '0', '0.0%', '98.6%']
--- 5 ---
['Name', 'Equities', '%']
['The Vanguard Group, Inc.', '603,109,511', '7.95%']
['Capital Research & Management Co.', '556,573,400', '7.34%']
['SSgA Funds Management, Inc.', '314,771,248', '4.15%']
['Fidelity Management & Research Co.', '221,883,722', '2.93%']
['BlackRock Fund Advisors', '183,455,207', '2.42%']
['T. Rowe Price Associates, Inc. (Investment Management)', '172,056,401', '2.27%']
['Capital Research & Management Co. (World Investors)', '139,116,236', '1.83%']
['Putnam LLC', '121,797,960', '1.61%']
['Geode Capital Management LLC', '115,684,966', '1.53%']
['Capital Research & Management Co. (International Investors)', '103,523,946', '1.37%']
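
The two header-fixing blocks in the script differ only in the table index, so they could be folded into one helper. The following is a minimal sketch, not part of the original script: `merge_header_rows` is a hypothetical name, and the sample rows are made up under the assumption that the raw scraped header rows look like this before padding.

```python
def merge_header_rows(table, first_inserts, second_inserts):
    """Pad the two header rows of `table` in place, then join them into one row.

    first_inserts / second_inserts are (index, value) pairs applied in order
    with list.insert(), like the manual insert() calls in the script.
    """
    for index, value in first_inserts:
        table[0].insert(index, value)
    for index, value in second_inserts:
        table[1].insert(index, value)
    # join the rows pairwise; strip() removes the space left by an empty part
    table[0] = [f'{a} {b}'.strip() for a, b in zip(table[0], table[1])]
    del table[1]  # the second header row is no longer needed
    return table[0]


# Made-up sample shaped like the "Sales per ..." tables (assumption).
sample = [
    ['', '2018', '2019', 'Delta'],
    ['USD (in Million)', '%', 'USD (in Million)', '%'],
    ['Segment A', '10', '40%', '15', '60%', '+50%'],
]

headers = merge_header_rows(
    sample,
    first_inserts=[(2, '2018'), (4, '2019')],
    second_inserts=[(0, ''), (5, '')],
)
print(headers)
```

In the script itself, the same call could be made once with all_tables[0] and once with all_tables[1] in place of the sample.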

Code I used to test the CSV files:

import pandas as pd

df = pd.read_csv('table1.csv', index_col=0) #, header=[0,1])
print(df)

df = pd.read_csv('table2.csv', index_col=0) #, header=[0,1])
print(df)

df = pd.read_csv('table3.csv') #, index_col=0)
print(df)

df = pd.read_csv('table4.csv', index_col=0)
print(df)

df = pd.read_csv('table5.csv') #, index_col=0)
print(df)
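
If pandas is not at hand, the "fewer headers than columns" problem can also be checked with the standard library alone by comparing each row's field count against the header. This is only a sketch: `find_ragged_rows` is a hypothetical helper, and the data below is made up and held in memory so that the example runs without the generated files.

```python
import csv
import io


def find_ragged_rows(lines):
    """Return (row_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(lines)
    header = next(reader)
    return [
        (number, len(row))
        for number, row in enumerate(reader, start=2)  # row 1 is the header
        if len(row) != len(header)
    ]


# In-memory stand-ins for the files (made-up data). With real files, pass
# open('table1.csv', newline='') instead of io.StringIO(...).
bad = io.StringIO(',2018,2019,Delta\nSegment A,10,40%,15,60%,+50%\n')
good = io.StringIO(',2018 USD,2018 %,2019 USD,2019 %,Delta\nSegment A,10,40%,15,60%,+50%\n')

bad_rows = find_ragged_rows(bad)
good_rows = find_ragged_rows(good)
print(bad_rows)   # rows that do not match the header width
print(good_rows)  # empty list when every row matches
```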