Imports:
from bs4 import BeautifulSoup as soup
import requests as r
import pandas as pd
import re
Fetch the page:
url = 'http://www.crb.state.ri.us/verify_CRB.php?page=0&letter='
data = r.get(url)
page_data = soup(data.text, 'html.parser')
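The page parameter in the query string suggests the listing is paginated. If more than the first page is needed, a minimal sketch for walking the pages could look like this (the page count of 5 is only a placeholder assumption, not something taken from the site):

# Sketch: iterate over paginated results.
# range(5) is a placeholder; the real number of pages would have to be
# determined from the site itself.
pages = []
for page_num in range(5):
    resp = r.get(f'http://www.crb.state.ri.us/verify_CRB.php?page={page_num}&letter=')
    pages.append(soup(resp.text, 'html.parser'))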
Pick out the links:
# keep only the link text (the license number)
links = [link.text for link in page_data.table.tr.find_all('a') if re.search('licensedetail.php', str(link))]
# links[0] -> '32922'

# or: keep the whole <a> tag
links = [link for link in page_data.table.tr.find_all('a') if re.search('licensedetail.php', str(link))]
# links[0] -> <a href="licensedetail.php?link=32922&type=Resid">32922</a>

# or: keep only the href attribute
links = [link['href'] for link in page_data.table.tr.find_all('a') if re.search('licensedetail.php', str(link))]
# links[0] -> 'licensedetail.php?link=32922&type=Resid'
# or: build the full URL (prepend the scheme and host)
links = ['http://www.crb.state.ri.us/' + link['href'] for link in page_data.table.tr.find_all('a') if re.search('licensedetail.php', str(link))]
# links[0] -> 'http://www.crb.state.ri.us/licensedetail.php?link=32922&type=Resid'
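Instead of concatenating the host by hand, the standard-library urljoin can build the absolute URL from the page URL that was already fetched, which avoids hard-coding the scheme and host; a small sketch:

from urllib.parse import urljoin

# Resolve each relative href against the URL of the page it came from.
links = [urljoin(url, link['href'])
         for link in page_data.table.tr.find_all('a')
         if re.search('licensedetail.php', str(link))]
# links[0] -> 'http://www.crb.state.ri.us/licensedetail.php?link=32922&type=Resid'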
Finishing touch, save the result:
df = pd.DataFrame(links, columns=['LicenseURL'])
df.to_csv('RI_License_urls.csv', index=False)
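As a quick sanity check, the saved file can be read back with pandas:

check = pd.read_csv('RI_License_urls.csv')  # read the file back
print(check.head())                         # inspect the first few rows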
Please don't forget to put a checkmark next to the solution.