Scraping results differ when inside a loop - Python / BS4 / Selenium
0 votes
/ April 20, 2020

I have a CSV file containing links that I need to scrape. I also have a setup that reuses the same Chrome browser I am logged into (the elements I need are only available when logged in). When I scrape a single page outside the loop, I get the results I want from the page. When I put the same code into a loop to scrape all the links, I get different results. I think it is related to "source =" and/or "soup =".
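In other words, I suspect the pattern has to be to re-read and re-parse the page after every driver.get — a minimal sketch of what I mean:

from bs4 import BeautifulSoup

# given an already-created Selenium driver and a list of urls
for url in urls:
    driver.get(url)
    # re-parse here; otherwise soup keeps holding the previous page's HTML
    soup = BeautifulSoup(driver.page_source, "html.parser")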

The CSV file contains 3 links:

https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264
https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505
https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987
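
(As an aside, since csv.reader yields each row as a list, indexing row[0] would avoid the str(row).replace(...) workaround used in the loop code below — a sketch:)

import csv

with open('UTlinks.csv') as file:
    for row in csv.reader(file):
        url = row[0].strip()  # the first column of each row is the link itself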

Single-page code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
#####################################
chrome_driver = "C:/chromedriver.exe"
Chrome_options = Options()
Chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9015")
driver = webdriver.Chrome(chrome_driver, options=Chrome_options)
#####################################
driver.get("https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987")
source = driver.page_source
soup = BeautifulSoup(source, "html.parser")
#####################################
address = soup.find('span', class_='street-address').text
print("      Address: " + address)
city = soup.find('span', class_='locality').text
print("         City: " + city)
state = soup.find('span', class_='region').text
print("        State: " + state)
zipcode = soup.find('span', class_='postal-code').text
print("      ZipCode: " + zipcode)
soldPrice = soup.find('div', class_='price-col number').text
print("   Sold Price: " + soldPrice)
ln = soup.find('div', class_='listing-agent-item')
Lname = ln.find_all('span')[1].text
print("Listing Agent: " + Lname)
bn = soup.find('div', class_='buyer-agent-item')
Bname = bn.find_all('span')[1].text
print(" Buying Agent: " + Bname)
date = soup.find('div',attrs={"class":"col-4"})
sDate = date.find_all('p')[0].text
print("         Date: " + sDate)
mls = soup.find('div', class_='sourceContent').text
print("   MLS Source: " + mls)
for span in soup.find_all('span'):
    if span.find(text='MLS#'):
            mlsNum = span.nextSibling.text
print("         MLS#: " + mlsNum)
driver.quit()

The single-page results print fine:

      Address: 4551 S 200 E 
         City: Murray, 
        State: UT
      ZipCode: 84107
   Sold Price: $262,000 
Listing Agent: Jerold Ivie
 Buying Agent: Zac Eldridge
         Date: Dec 20, 2019
   MLS Source: WFRMLS
         MLS#: 1635000
[Finished in 3.3s]

Loop code with 'source =' and 'soup =' before the loop:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import csv
#####################################
chrome_driver = "C:/chromedriver.exe"
Chrome_options = Options()
Chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9015")
driver = webdriver.Chrome(chrome_driver, options=Chrome_options)
#####################################
#driver.get("https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987")
source = driver.page_source
soup = BeautifulSoup(source, "html.parser")
#####################################
with open('UTlinks.csv') as file:
    readCSV = csv.reader(file)
    for row in readCSV:
        url = str(row).replace("['","").replace("']","")
        print("_________________________________")
        print("Scraping: " + url)        
        driver.get(url)
        #source = driver.page_source
        #soup = BeautifulSoup(source, "html.parser")
####################################
        try:
                address = soup.find('span', class_='street-address').text
                print("      Address: " + address)
        except:
                print("      Address: " + "NA")
        try:
                city = soup.find('span', class_='locality').text
                print("         City: " + city)
        except:
                print("         City: " + "NA")
        try:
                state = soup.find('span', class_='region').text
                print("        State: " + state)
        except:
                print("        State: " + "NA")
        try:
                zipcode = soup.find('span', class_='postal-code').text
                print("      ZipCode: " + zipcode)
        except:
                print("      ZipCode: " + "NA")
        try:
                soldPrice = soup.find('div', class_='price-col number').text
                print("   Sold Price: " + soldPrice)
        except:
                print("   Sold Price: " "NA")            
        try:
                ln = soup.find('div', class_='listing-agent-item')
                Lname = ln.find_all('span')[1].text
                print("Listing Agent: " + Lname)
        except:
                print("Listing Agent: " + "NA")
        try:
                bn = soup.find('div', class_='buyer-agent-item')
                Bname = bn.find_all('span')[1].text
                print(" Buying Agent: " + Bname)
        except:
                print(" Buying Agent: " + "NA")
        try:
                date = soup.find('div',attrs={"class":"col-4"})
                sDate = date.find_all('p')[0].text
                print("         Date: " + sDate)
        except:
                print("         Date: " + "NA")
        try:
                mls = soup.find('div', class_='sourceContent').text
                print("   MLS Source: " + mls)
        except:
                print("   MLS Source: " + "NA")
        try:
                for span in soup.find_all('span'):
                        if span.find(text='MLS#'):
                                mlsNum = span.nextSibling.text
                print("         MLS#: " + mlsNum)
        except:
                print("         MLS#: " + "NA")

Loop results: you can see it prints the URLs from the file, but then scrapes the currently open browser page 3 times... it grabs all the info, but only for the URL that was already open.

_________________________________
Scraping: https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264
      Address: 4551 S 200 E 
         City: Murray, 
        State: UT
      ZipCode: 84107
   Sold Price: $262,000 
Listing Agent: Jerold Ivie
 Buying Agent: Zac Eldridge
         Date: Dec 20, 2019
   MLS Source: WFRMLS
         MLS#: 1635000
_________________________________
Scraping: https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505
      Address: 4551 S 200 E 
         City: Murray, 
        State: UT
      ZipCode: 84107
   Sold Price: $262,000 
Listing Agent: Jerold Ivie
 Buying Agent: Zac Eldridge
         Date: Dec 20, 2019
   MLS Source: WFRMLS
         MLS#: 1635000
_________________________________
Scraping: https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987
      Address: 4551 S 200 E 
         City: Murray, 
        State: UT
      ZipCode: 84107
   Sold Price: $262,000 
Listing Agent: Jerold Ivie
 Buying Agent: Zac Eldridge
         Date: Dec 20, 2019
   MLS Source: WFRMLS
         MLS#: 1635000
[Finished in 6.9s]

If I put 'source =' and 'soup =' inside the loop:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import csv
#####################################
chrome_driver = "C:/chromedriver.exe"
Chrome_options = Options()
Chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9015")
driver = webdriver.Chrome(chrome_driver, options=Chrome_options)
#####################################
#driver.get("https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987")
#source = driver.page_source
#soup = BeautifulSoup(source, "html.parser")
#####################################
with open('UTlinks.csv') as file:
    readCSV = csv.reader(file)
    for row in readCSV:
        url = str(row).replace("['","").replace("']","")
        print("_________________________________")
        print("Scraping: " + url)        
        driver.get(url)
        source = driver.page_source
        soup = BeautifulSoup(source, "html.parser")
####################################
        try:
                address = soup.find('span', class_='street-address').text
                print("      Address: " + address)
        except:
                print("      Address: " + "NA")
        try:
                city = soup.find('span', class_='locality').text
                print("         City: " + city)
        except:
                print("         City: " + "NA")
        try:
                state = soup.find('span', class_='region').text
                print("        State: " + state)
        except:
                print("        State: " + "NA")
        try:
                zipcode = soup.find('span', class_='postal-code').text
                print("      ZipCode: " + zipcode)
        except:
                print("      ZipCode: " + "NA")
        try:
                soldPrice = soup.find('div', class_='price-col number').text
                print("   Sold Price: " + soldPrice)
        except:
                print("   Sold Price: " "NA")            
        try:
                ln = soup.find('div', class_='listing-agent-item')
                Lname = ln.find_all('span')[1].text
                print("Listing Agent: " + Lname)
        except:
                print("Listing Agent: " + "NA")
        try:
                bn = soup.find('div', class_='buyer-agent-item')
                Bname = bn.find_all('span')[1].text
                print(" Buying Agent: " + Bname)
        except:
                print(" Buying Agent: " + "NA")
        try:
                date = soup.find('div',attrs={"class":"col-4"})
                sDate = date.find_all('p')[0].text
                print("         Date: " + sDate)
        except:
                print("         Date: " + "NA")
        try:
                mls = soup.find('div', class_='sourceContent').text
                print("   MLS Source: " + mls)
        except:
                print("   MLS Source: " + "NA")
        try:
                for span in soup.find_all('span'):
                        if span.find(text='MLS#'):
                                mlsNum = span.nextSibling.text
                print("         MLS#: " + mlsNum)
        except:
                print("         MLS#: " + "NA")

'source =' & 'soup =' inside the loop results:

_________________________________
Scraping: https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264
      Address: 875 E Arrow Head Ln S #44 
         City: Salt Lake City, 
        State: UT
      ZipCode: 84107
   Sold Price: NA
Listing Agent: Joe Olschewski
 Buying Agent: James Corey
         Date: NA
   MLS Source: WFRMLS
         MLS#: 1654937
_________________________________
Scraping: https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505
      Address: 35 American Ave 
         City: Murray, 
        State: UT
      ZipCode: 84107
   Sold Price: NA
Listing Agent: Dana Conway
 Buying Agent: Rich Varga
         Date: NA
   MLS Source: WFRMLS
         MLS#: 1660023
_________________________________
Scraping: https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987
      Address: 4551 S 200 E 
         City: Murray, 
        State: UT
      ZipCode: 84107
   Sold Price: NA
Listing Agent: Jerold Ivie
 Buying Agent: Zac Eldridge
         Date: NA
   MLS Source: WFRMLS
         MLS#: 1635000
[Finished in 8.6s]

Now it runs fine, but it doesn't grab "Sold Price:" or "Sold Date:". If I turn off the error handling, it throws this error:

soldPrice = soup.find('div', class_='price-col number').text
AttributeError: 'NoneType' object has no attribute 'text'
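
My guess (unverified) is that inside the loop driver.page_source gets read before Redfin's JavaScript has rendered the price/date blocks, so those divs simply aren't in the HTML yet. A sketch of what I could try — an explicit wait before parsing, with the selector taken from the class names used above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver.get(url)
try:
    # wait up to 10 seconds for the sold-price block to appear before grabbing page_source
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div.price-col.number'))
    )
except TimeoutException:
    print("price block never rendered for " + url)
source = driver.page_source
soup = BeautifulSoup(source, "html.parser")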

What am I doing wrong here?

1 Answer

0 votes
/ April 21, 2020

There is an API for getting the data, though it's a bit tricky. I can get the data without registering/logging in (although the one thing I couldn't find in the html or json responses, for some reason, was the buying agents). But if you log in, it does appear to provide that data. Everything else (and MORE) seems to be there.

import requests
from bs4 import BeautifulSoup
import json

links = ['https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264',
'https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505',
'https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987']

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

payload = {
'email':'username@email.com',
'pwd':'thisIsThePassword'}

# log in once and keep the session so its cookies are sent with the page requests below
s = requests.Session()
login = s.post('https://www.redfin.com/stingray/do/api-login', headers=headers, params=payload)

for url in links:
    response = s.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    scripts = soup.find_all('script')
    for script in scripts:
        if '_tLAB.wait(function()' in script.text and '/stingray/api/home/details/belowTheFold' in script.text:
            jsonStr = script.text
            jsonStr = '{' + jsonStr.split('{',2)[-1].rsplit(')',2)[0]

            jsonData2 = json.loads(jsonStr)
            jsonData2 = json.loads(jsonData2['res']['text'].split('&&')[-1])

            address = jsonData2['payload']['amenitiesInfo']['addressInfo']['street']
            city = jsonData2['payload']['amenitiesInfo']['addressInfo']['city']
            state = jsonData2['payload']['amenitiesInfo']['addressInfo']['state']
            zipcode = jsonData2['payload']['amenitiesInfo']['addressInfo']['zip']

            mlsSource = jsonData2['payload']['amenitiesInfo']['provider']
            listingAgent = jsonData2['payload']['amenitiesInfo']['mlsDisclaimerInfo']['listingAgentName']

        if 'InitialContext = ' in script.text:
            jsonStr = script.text.split('InitialContext = ')[-1].split('root.__reactServerState.Config')[0].rsplit(';',1)[0]
            jsonData = json.loads(jsonStr)

            dataAPIs = jsonData['ReactServerAgent.cache']['dataCache']
            jsonData2 = json.loads(dataAPIs['/stingray/api/home/details/aboveTheFold']['res']['text'].split('&&')[-1])
            soldPrice = jsonData2['payload']['addressSectionInfo']['priceInfo']['amount']
            soldDate = jsonData2['payload']['mediaBrowserInfo']['sashes'][0]['lastSaleDate']

            jsonData2 = json.loads(dataAPIs['/stingray/api/home/details/initialInfo']['res']['text'].split('&&')[-1])
            mls = jsonData2['payload']['mlsId']

            jsonData2 = json.loads(dataAPIs['/stingray/api/home/details/mainHouseInfoPanelInfo']['res']['text'].split('&&')[-1])
            buyingAgents = jsonData2['payload']['mainHouseInfo']['buyingAgents'][0]['agentInfo']['agentName']

    print("_________________________________")
    print("Scraping: " + url)  
    print('%15s: %s' %('Address',address))
    print('%15s: %s' %('City',city))
    print('%15s: %s' %('State',state))
    print('%15s: %s' %('Zipcode',zipcode))
    print('%15s: $' %('Sold Price') + f'{soldPrice:,}')
    print('%15s: %s' %('Listing Agent',listingAgent))
    print('%15s: %s' %('Buying Agent',buyingAgents))
    print('%15s: %s' %('Date',soldDate))
    print('%15s: %s' %('MLS Source',mlsSource))
    print('%15s: %s' %('MLS#',mls))

Output:

_________________________________
Scraping: https://www.redfin.com/UT/Murray/875-E-Arrowhead-Ln-84107/unit-44/home/77418264
        Address: 875 E Arrow Head Lane South Unit 44
           City: Salt Lake City
          State: UT
        Zipcode: 84107
     Sold Price: $179,900
  Listing Agent: Joe Olschewski
   Buying Agent: James Corey
           Date: MAR 12, 2020
     MLS Source: WFRMLS
           MLS#: 1654937
_________________________________
Scraping: https://www.redfin.com/UT/Murray/35-W-American-Ave-84107/home/86446505
        Address: 35 American Ave
           City: Murray
          State: UT
        Zipcode: 84107
     Sold Price: $317,500
  Listing Agent: Dana Conway
   Buying Agent: Rich Varga
           Date: MAR 25, 2020
     MLS Source: WFRMLS
           MLS#: 1660023
_________________________________
Scraping: https://www.redfin.com/UT/Murray/4551-S-200-E-84107/home/86457987
        Address: 4551 South 200 E
           City: Murray
          State: UT
        Zipcode: 84107
     Sold Price: $262,000
  Listing Agent: Jerold Ivie
   Buying Agent: Zac Eldridge
           Date: DEC 20, 2019
     MLS Source: WFRMLS
           MLS#: 1635000
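
One caveat on the code above: since buyingAgents only seems to be present when logged in, a listing without it would raise a KeyError/IndexError at the buyingAgents lookup. A defensive variant (a sketch, assuming the payload shape shown above):

# guard the lookup so a listing with no buying-agent data doesn't kill the run
agents = jsonData2['payload']['mainHouseInfo'].get('buyingAgents') or []
buyingAgents = agents[0]['agentInfo']['agentName'] if agents else 'NA'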