A better approach would be to loop over a different element, for example <li>, and then find the required elements inside it.

To get the cosponsors, first test whether there are any by checking the number. If it is not 0, get the link to the subpage and request it, using a separate BeautifulSoup object. The table containing the cosponsors can then be parsed and all of the cosponsors added to a list. You could add extra processing here if needed. The list is then joined into a single string so that it can be saved in a single column in the CSV file.

from bs4 import BeautifulSoup
import csv
import requests
import string

headers = None

with open('115congress.csv', 'w', newline='') as f:
    fwriter = csv.writer(f, delimiter=';')
    fwriter.writerow(['SPONS', 'PARTY', 'NBILL', 'TITLE', 'COSPONSORS'])

    for j in range(1, 3): #114):
        print(f'Getting page {j}')

        hrurl = 'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22115%22%2C%22type%22%3A%22bills%22%7D&page='+str(j)
        hrpage = requests.get(hrurl, headers=headers)
        soup = BeautifulSoup(hrpage.content, 'lxml')

        for li in soup.find_all('li', class_='expanded'):
            bill_or_law = li.span.text
            sponsor = li.find('span', class_='result-item').a.text
            title = li.find('span', class_='result-title').text
            nbill = li.find('a').text.strip(string.ascii_uppercase + ' .')

            if '[R' in sponsor:
                party = 'Republican'
            elif '[D' in sponsor:
                party = 'Democratic'
            else:
                party = 'Unknown'

            # Any cosponsors?
            cosponsor_link = li.find_all('a')[2]

            if cosponsor_link.text == '0':
                cosponsors = "No cosponsors"
            else:
                print(f'Getting cosponsors for {sponsor}')
                # Get the subpage containing the cosponsors
                hr_cosponsors = requests.get(cosponsor_link['href'], headers=headers)
                soup_cosponsors = BeautifulSoup(hr_cosponsors.content, 'lxml')
                table = soup_cosponsors.find('table', class_="item_table")

                # Create a list of the cosponsors
                cosponsor_list = []

                for tr in table.tbody.find_all('tr'):
                    cosponsor_list.append(tr.td.a.text)

                # Join them together into a single string
                cosponsors = ' - '.join(cosponsor_list)

            fwriter.writerow([sponsor, party, nbill, f'{bill_or_law} - {title}', cosponsors])

This gives you an output CSV file starting:

SPONS;PARTY;NBILL;TITLE;COSPONSORS
Rep. Ellison, Keith [D-MN-5];Democratic;7401;BILL - Strengthening Refugee Resettlement Act;No cosponsors
Rep. Wild, Susan [D-PA-15];Democratic;7400;BILL - Making continuing appropriations for the Coast Guard.;No cosponsors
Rep. Scanlon, Mary Gay [D-PA-7];Democratic;7399;BILL - Inaugural Fund Integrity Act;No cosponsors
Rep. Foster, Bill [D-IL-11];Democratic;7398;BILL - SPA Act;No cosponsors
Rep. Hoyer, Steny H. [D-MD-5];Democratic;7397;BILL - To provide further additional continuing appropriations for fiscal year 2019, and for other purposes.;No cosponsors
Rep. Torres, Norma J. [D-CA-35];Democratic;7396;BILL - Border Security and Child Safety Act;Rep. Vargas, Juan [D-CA-51]* - Rep. McGovern, James P. [D-MA-2]*
Rep. Meadows, Mark [R-NC-11];Republican;7395;BILL - To direct the Secretary of Health and Human Services to allow delivery of medical supplies by unmanned aerial systems, and for other purposes.;No cosponsors
Rep. Luetkemeyer, Blaine [R-MO-3];Republican;7394;"BILL - To prohibit the Federal financial regulators from requiring compliance with the accounting standards update of the Financial Accounting Standards Board related to current expected credit loss (""CECL""), to require the Securities and Exchange Commission to take certain impacts of a proposed accounting principle into consideration before accepting the principle, and for other purposes.";Rep. Budd, Ted [R-NC-13]*
Rep. Faso, John J. [R-NY-19];Republican;7393;BILL - Medicaid Quality Care Act;No cosponsors
Rep. Babin, Brian [R-TX-36];Republican;7392;BILL - TRACED Act;No cosponsors
Rep. Arrington, Jodey C. [R-TX-19];Republican;7391;BILL - Rural Hospital Freedom and Flexibility Act of 2018;No cosponsors
Rep. Jackson Lee, Sheila [D-TX-18];Democratic;7390;BILL - Violence Against Women Extension Act of 2018;Rep. Hoyer, Steny H. [D-MD-5] - Rep. Clyburn, James E. [D-SC-6]
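
If you later need the cosponsors as separate names again, the joined column can be split back apart. The following is only a sketch: split_cosponsors is a hypothetical helper name, the two data rows are copied from the sample output, and it assumes that no cosponsor name itself contains ' - ' (space, hyphen, space).

```python
import csv
import io

# Two rows copied from the sample output, held in memory for illustration.
sample = (
    "SPONS;PARTY;NBILL;TITLE;COSPONSORS\n"
    "Rep. Foster, Bill [D-IL-11];Democratic;7398;BILL - SPA Act;No cosponsors\n"
    "Rep. Torres, Norma J. [D-CA-35];Democratic;7396;"
    "BILL - Border Security and Child Safety Act;"
    "Rep. Vargas, Juan [D-CA-51]* - Rep. McGovern, James P. [D-MA-2]*\n"
)


def split_cosponsors(field):
    """Turn the joined COSPONSORS column back into a list of names.

    Hypothetical helper: assumes ' - ' never occurs inside a single name.
    """
    if field == "No cosponsors":
        return []
    return field.split(" - ")


# The file was written with ';' as the delimiter, so read it the same way.
reader = csv.DictReader(io.StringIO(sample), delimiter=";")
rows = [(row["NBILL"], split_cosponsors(row["COSPONSORS"])) for row in reader]

for nbill, names in rows:
    print(nbill, len(names), names)
```

For a real file you would pass the object returned by open('115congress.csv', newline='') to csv.DictReader instead of the in-memory string.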

When using csv.writer(), the file should always be opened with the newline='' parameter. This avoids getting double-spaced rows in the CSV file.
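
To see what newline='' does, you can write a small file and look at the raw bytes. This is a minimal sketch: demo.csv is a made-up file name in a temporary directory, and the rows are just sample values. By default csv.writer ends each row with \r\n itself; without newline='' on Windows, the file object would additionally translate the \n into \r\n, giving \r\r\n, which shows up as a blank line between rows.

```python
import csv
import os
import tempfile

rows = [["SPONS", "PARTY"], ["Rep. Foster, Bill [D-IL-11]", "Democratic"]]

# demo.csv is a hypothetical file name inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")

    # newline='' stops the file object from translating line endings,
    # so the writer's own '\r\n' terminator is written unchanged.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerows(rows)

    # Read the raw bytes to see exactly which line endings were written.
    with open(path, "rb") as f:
        raw = f.read()

print(raw)
```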

I suggest looking for [D or [R in the text, as there would probably already be a D or R in the rest of the text.
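
As an illustration of why the bracket matters, here is the same test pulled out into a small function. The name party_from_sponsor is made up for this sketch, and the sponsor strings are taken from the sample output. Every House sponsor starts with "Rep.", so a plain 'R' test would match all of them.

```python
def party_from_sponsor(sponsor):
    """Guess the party from a sponsor string such as 'Rep. Wild, Susan [D-PA-15]'.

    Hypothetical helper mirroring the if/elif/else in the script.
    """
    if '[R' in sponsor:
        return 'Republican'
    elif '[D' in sponsor:
        return 'Democratic'
    else:
        return 'Unknown'


democrat = 'Rep. Ellison, Keith [D-MN-5]'
republican = 'Rep. Meadows, Mark [R-NC-11]'

print(party_from_sponsor(democrat))
print(party_from_sponsor(republican))

# A bare 'R' test is also true for the Democrat, because of 'Rep.'
print('R' in democrat)
```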