I have a CSV file containing links that I need to scrape. I also have a setup that attaches to the same Chrome browser I'm logged into (the elements I need are only available when logged in). When I scrape a single page outside of a loop, I get the results I want from that page. When I put the same code inside a loop to scrape all of the links, I get different results. I think it has something to do with 'source =' and/or 'soup ='.
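For context, I start Chrome ahead of time with remote debugging enabled so the script can attach to the already-logged-in session; roughly like this (the executable path and profile directory here are just examples from my machine, and the port must match the "debuggerAddress" below):

import subprocess

# Launch Chrome with remote debugging enabled so Selenium can attach to it.
# Path and profile directory are machine-specific examples.
subprocess.Popen([
    r"C:\Program Files\Google\Chrome\Application\chrome.exe",
    "--remote-debugging-port=9015",
    r"--user-data-dir=C:\ChromeDebugProfile",
])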
The CSV file contains 3 links:
https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264
https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505
https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987
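(Side note: assuming one URL per row, csv.reader already returns each row as a one-element list, so the URL can be pulled out directly instead of the str(row).replace(...) trick I use in the loop code further down:)

import csv

# Each row of UTlinks.csv is a one-element list; row[0] is the URL,
# so no string mangling is needed.
with open('UTlinks.csv') as file:
    for row in csv.reader(file):
        url = row[0]
        print(url)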
Single-page code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
#####################################
chrome_driver = "C:/chromedriver.exe"
Chrome_options = Options()
Chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9015")
driver = webdriver.Chrome(chrome_driver, options=Chrome_options)
#####################################
driver.get("https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987")
source = driver.page_source
soup = BeautifulSoup(source, "html.parser")
#####################################
address = soup.find('span', class_='street-address').text
print(" Address: " + address)
city = soup.find('span', class_='locality').text
print(" City: " + city)
state = soup.find('span', class_='region').text
print(" State: " + state)
zipcode = soup.find('span', class_='postal-code').text
print(" ZipCode: " + zipcode)
soldPrice = soup.find('div', class_='price-col number').text
print(" Sold Price: " + soldPrice)
ln = soup.find('div', class_='listing-agent-item')
Lname = ln.find_all('span')[1].text
print("Listing Agent: " + Lname)
bn = soup.find('div', class_='buyer-agent-item')
Bname = bn.find_all('span')[1].text
print(" Buying Agent: " + Bname)
date = soup.find('div',attrs={"class":"col-4"})
sDate = date.find_all('p')[0].text
print(" Date: " + sDate)
mls = soup.find('div', class_='sourceContent').text
print(" MLS Source: " + mls)
for span in soup.find_all('span'):
    if span.find(text='MLS#'):
        mlsNum = span.nextSibling.text
        print(" MLS#: " + mlsNum)
driver.quit()
The single-page results come out perfectly:
Address: 4551 S 200 E
City: Murray,
State: UT
ZipCode: 84107
Sold Price: $262,000
Listing Agent: Jerold Ivie
Buying Agent: Zac Eldridge
Date: Dec 20, 2019
MLS Source: WFRMLS
MLS#: 1635000
[Finished in 3.3s]
Loop code with 'source =' and 'soup =' before the loop:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import csv
#####################################
chrome_driver = "C:/chromedriver.exe"
Chrome_options = Options()
Chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9015")
driver = webdriver.Chrome(chrome_driver, options=Chrome_options)
#####################################
#driver.get("https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987")
source = driver.page_source
soup = BeautifulSoup(source, "html.parser")
#####################################
with open('UTlinks.csv') as file:
    readCSV = csv.reader(file)
    for row in readCSV:
        url = str(row).replace("['","").replace("']","")
        print("_________________________________")
        print("Scraping: " + url)
        driver.get(url)
        #source = driver.page_source
        #soup = BeautifulSoup(source, "html.parser")
        ####################################
        try:
            address = soup.find('span', class_='street-address').text
            print(" Address: " + address)
        except:
            print(" Address: " + "NA")
        try:
            city = soup.find('span', class_='locality').text
            print(" City: " + city)
        except:
            print(" City: " + "NA")
        try:
            state = soup.find('span', class_='region').text
            print(" State: " + state)
        except:
            print(" State: " + "NA")
        try:
            zipcode = soup.find('span', class_='postal-code').text
            print(" ZipCode: " + zipcode)
        except:
            print(" ZipCode: " + "NA")
        try:
            soldPrice = soup.find('div', class_='price-col number').text
            print(" Sold Price: " + soldPrice)
        except:
            print(" Sold Price: " + "NA")
        try:
            ln = soup.find('div', class_='listing-agent-item')
            Lname = ln.find_all('span')[1].text
            print("Listing Agent: " + Lname)
        except:
            print("Listing Agent: " + "NA")
        try:
            bn = soup.find('div', class_='buyer-agent-item')
            Bname = bn.find_all('span')[1].text
            print(" Buying Agent: " + Bname)
        except:
            print(" Buying Agent: " + "NA")
        try:
            date = soup.find('div', attrs={"class": "col-4"})
            sDate = date.find_all('p')[0].text
            print(" Date: " + sDate)
        except:
            print(" Date: " + "NA")
        try:
            mls = soup.find('div', class_='sourceContent').text
            print(" MLS Source: " + mls)
        except:
            print(" MLS Source: " + "NA")
        try:
            for span in soup.find_all('span'):
                if span.find(text='MLS#'):
                    mlsNum = span.nextSibling.text
                    print(" MLS#: " + mlsNum)
        except:
            print(" MLS#: " + "NA")
Loop results: you can see it print each URL from the file, but it scrapes whatever page is currently open in the browser all 3 times... it grabs all of the info I need, but only for the URL that was already open.
_________________________________
Scraping: https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264
Address: 4551 S 200 E
City: Murray,
State: UT
ZipCode: 84107
Sold Price: $262,000
Listing Agent: Jerold Ivie
Buying Agent: Zac Eldridge
Date: Dec 20, 2019
MLS Source: WFRMLS
MLS#: 1635000
_________________________________
Scraping: https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505
Address: 4551 S 200 E
City: Murray,
State: UT
ZipCode: 84107
Sold Price: $262,000
Listing Agent: Jerold Ivie
Buying Agent: Zac Eldridge
Date: Dec 20, 2019
MLS Source: WFRMLS
MLS#: 1635000
_________________________________
Scraping: https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987
Address: 4551 S 200 E
City: Murray,
State: UT
ZipCode: 84107
Sold Price: $262,000
Listing Agent: Jerold Ivie
Buying Agent: Zac Eldridge
Date: Dec 20, 2019
MLS Source: WFRMLS
MLS#: 1635000
[Finished in 6.9s]
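Looking at it now, I suspect the cause: page_source is just a string snapshot, and BeautifulSoup parses that string once, so navigating the browser afterwards never updates the soup object. A minimal illustration of what I think is happening (using the same objects as above):

# 'source' is a snapshot of whatever page is open at this moment;
# parsing it fixes the soup's contents for good.
source = driver.page_source
soup = BeautifulSoup(source, "html.parser")
driver.get(url)                                # the browser moves on...
soup.find('span', class_='street-address')     # ...but soup still holds the old page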
If I put 'source =' and 'soup =' inside the loop:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import csv
#####################################
chrome_driver = "C:/chromedriver.exe"
Chrome_options = Options()
Chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9015")
driver = webdriver.Chrome(chrome_driver, options=Chrome_options)
#####################################
#driver.get("https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987")
#source = driver.page_source
#soup = BeautifulSoup(source, "html.parser")
#####################################
with open('UTlinks.csv') as file:
    readCSV = csv.reader(file)
    for row in readCSV:
        url = str(row).replace("['","").replace("']","")
        print("_________________________________")
        print("Scraping: " + url)
        driver.get(url)
        source = driver.page_source
        soup = BeautifulSoup(source, "html.parser")
        ####################################
        try:
            address = soup.find('span', class_='street-address').text
            print(" Address: " + address)
        except:
            print(" Address: " + "NA")
        try:
            city = soup.find('span', class_='locality').text
            print(" City: " + city)
        except:
            print(" City: " + "NA")
        try:
            state = soup.find('span', class_='region').text
            print(" State: " + state)
        except:
            print(" State: " + "NA")
        try:
            zipcode = soup.find('span', class_='postal-code').text
            print(" ZipCode: " + zipcode)
        except:
            print(" ZipCode: " + "NA")
        try:
            soldPrice = soup.find('div', class_='price-col number').text
            print(" Sold Price: " + soldPrice)
        except:
            print(" Sold Price: " + "NA")
        try:
            ln = soup.find('div', class_='listing-agent-item')
            Lname = ln.find_all('span')[1].text
            print("Listing Agent: " + Lname)
        except:
            print("Listing Agent: " + "NA")
        try:
            bn = soup.find('div', class_='buyer-agent-item')
            Bname = bn.find_all('span')[1].text
            print(" Buying Agent: " + Bname)
        except:
            print(" Buying Agent: " + "NA")
        try:
            date = soup.find('div', attrs={"class": "col-4"})
            sDate = date.find_all('p')[0].text
            print(" Date: " + sDate)
        except:
            print(" Date: " + "NA")
        try:
            mls = soup.find('div', class_='sourceContent').text
            print(" MLS Source: " + mls)
        except:
            print(" MLS Source: " + "NA")
        try:
            for span in soup.find_all('span'):
                if span.find(text='MLS#'):
                    mlsNum = span.nextSibling.text
                    print(" MLS#: " + mlsNum)
        except:
            print(" MLS#: " + "NA")
'source =' & 'soup =' в l oop результаты:
Scraping: https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264
Address: 875 E Arrow Head Ln S #44
City: Salt Lake City,
State: UT
ZipCode: 84107
Sold Price: NA
Listing Agent: Joe Olschewski
Buying Agent: James Corey
Date: NA
MLS Source: WFRMLS
MLS#: 1654937
_________________________________
Scraping: https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505
Address: 35 American Ave
City: Murray,
State: UT
ZipCode: 84107
Sold Price: NA
Listing Agent: Dana Conway
Buying Agent: Rich Varga
Date: NA
MLS Source: WFRMLS
MLS#: 1660023
_________________________________
Scraping: https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987
Address: 4551 S 200 E
City: Murray,
State: UT
ZipCode: 84107
Sold Price: NA
Listing Agent: Jerold Ivie
Buying Agent: Zac Eldridge
Date: NA
MLS Source: WFRMLS
MLS#: 1635000
[Finished in 8.6s]
Now it runs fine, but it doesn't grab "Sold Price:" or "Sold Date:". If I turn off the error handling, it throws this error:
soldPrice = soup.find('div', class_='price-col number').text
AttributeError: 'NoneType' object has no attribute 'text'
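My guess is that the sold-price section is filled in by JavaScript after the initial page load, so page_source gets captured before that block exists. A wait-based variant I'm considering (untested sketch; the 'price-col' class name is taken from my selector above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver.get(url)
# Wait up to 10 seconds for the price block to render before grabbing
# the page source; only then hand the HTML to BeautifulSoup.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "price-col"))
)
source = driver.page_source
soup = BeautifulSoup(source, "html.parser")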
What am I doing wrong here?