Удаление нескольких страниц на веб-странице - PullRequest
1 голос
/ 12 июля 2020

Я пытаюсь извлечь данные из inte rnet. Мой код плавно проходит через первый l oop, печатает и загружает данные в файл, но не печатает данные для следующих страниц. Нет, я использую ноутбук python 3. Вот мой код python.

    import urllib3
    from bs4 import BeautifulSoup as soup
    from time import sleep
    from random import randint
    import pandas as pd
    http = urllib3.PoolManager()

filename = "GautengForSale.csv"
f = open(filename, "w")
headers = "Description, Location, Price, Bedrooms, Bathrooms, Parking, FloorSize\n"
f.write(headers)


for page in range(1, 5):
    
    url = 'https://www.property24.com/for-sale/gauteng/1/p'+str(page)+'?PropertyCategory=House%2cApartmentOrFlat%2cTownhouse'
    page_html = http.request('GET', url)
    page_soup = soup(page_html.data)
    containers = page_soup.findAll("div", {"class": "p24_content"})
    
    sleep(randint(2,10))
    
    for container in containers:
        
        description_container = container.findAll("div", {"class": "p24_description"})
        if not description_container:
            continue
        else:
            description = description_container[0].text
    
        location_container = container.findAll("span", {"class": "p24_location"})
        location = location_container[0].text
   
        price_container = container.findAll("div", {"class": "p24_price"})
        price = price_container[0].text.strip()
        
        bedrooms_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Bedrooms"})
        if not bedrooms_container:
            bedrooms = 0
        else:
            bedrooms = bedrooms_container[0].text.strip()
        
        bathrooms_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Bathrooms"})
        if not bathrooms_container:
            bathrooms = 1
        else:
            bathrooms = bathrooms_container[0].text.strip()
        
        parking_container = container.findAll("span", {"class": "p24_featureDetails", "title": "Parking Spaces"})
        if not parking_container:
            parking = 0
        else:
            parking = parking_container[0].text.strip()
        
        floor_size_container = container.findAll("span", {"class": "p24_size", "title": "Floor Size"})
        if not floor_size_container:
            floor_size = "n/a"
        else:
            floor_size = floor_size_container[0].text.strip()

        print(str(description) + "," + str(location) + "," + str(price) + "," + str(bedrooms) + "," + str(bathrooms) + "," + str(parking) + "," + str(floor_size) + "\n")
        f.write(str(description) + "," + str(location) + "," + str(price) + "," + str(bedrooms) + "," + str(bathrooms) + "," + str(parking) + "," + str(floor_size) + "\n")

f.close()

Я не уверен, где я ошибся.

Ответы [ 2 ]

1 голос
/ 12 июля 2020

Есть 2 проблемы:

1.) page_soup.findAll("div", {"class": "p24_content"}) должно быть page_soup.select(".p24_content"):, потому что страница изменяет теги <div> и <span> с этим классом

2.) container.findAll("div", {"class": "p24_description"}) должно быть container.select_one(".p24_description, .p24_title"), потому что класс p24_description присутствует только на некоторых страницах

import requests
from bs4 import BeautifulSoup


for page in range(1, 5):
    url = 'https://www.property24.com/for-sale/gauteng/1/p'+str(page)+'?PropertyCategory=House%2cApartmentOrFlat%2cTownhouse'

    page_soup = BeautifulSoup( requests.get(url).content, 'html.parser' )

    for container in page_soup.select(".p24_content"):
        description_container = container.select_one(".p24_description, .p24_title")
        if not description_container:
            continue
        else:
            description = description_container.get_text(strip=True)

        location_container = container.select_one(".p24_location")
        location = location_container.get_text(strip=True)

        price_container = container.select_one(".p24_price")
        price = price_container.text.strip()

        bedrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bedrooms"})
        if not bedrooms_container:
            bedrooms = 0
        else:
            bedrooms = bedrooms_container.text.strip()

        bathrooms_container = container.find("span", {"class": "p24_featureDetails", "title": "Bathrooms"})
        if not bathrooms_container:
            bathrooms = 1
        else:
            bathrooms = bathrooms_container.text.strip()

        parking_container = container.find("span", {"class": "p24_featureDetails", "title": "Parking Spaces"})
        if not parking_container:
            parking = 0
        else:
            parking = parking_container.text.strip()

        floor_size_container = container.find("span", {"class": "p24_size", "title": "Floor Size"})
        if not floor_size_container:
            floor_size = "n/a"
        else:
            floor_size = floor_size_container.text.strip()

        print('{},{},{},{},{},{},{}'.format(description, location, price, bedrooms, bathrooms, parking, floor_size))

Печатает:

5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
1 Bedroom Apartment inGrand Central,Grand Central,R 450 000,1,1,0,n/a
5 Bedroom House inWilro Park,Wilro Park,R 1 595 000,5,3,4,n/a
1 Bedroom Apartment inProtea Glen,Protea Glen,R 413 000,1,1,0,n/a
3 Bedroom Townhouse inWillowbrook,Willowbrook,R 1 350 000,3,2,4,n/a
2 Bedroom Apartment inWinchester Hills,Winchester Hills,R 650 000,2,1,1,69 m²
2 Bedroom Townhouse inElarduspark,Elarduspark,R 960 000,2,2,2,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
2 Bedroom Townhouse inProtea Glen,Protea Glen,R 565 000,2,1,1,50 m²
4 Bedroom House inSunninghill,Sunninghill,R 3 245 000,4,3.5,1,240 m²
1 Bedroom Apartment inRandpark Ridge,Randpark Ridge,R 807 700,1,1,1,51 m²
3 Bedroom House inGlenvista,Glenvista,R 2 500 000,3,2,3,n/a
4 Bedroom House inMeyersdal Nature Estate,Meyersdal Nature Estate,R 2 695 000,4,3,2,n/a
House,Geduld,R 750 000,0,1,0,n/a
3 Bedroom House,The Orchards,R 750 000,3,2,1,n/a
1 Bedroom Apartment,Kempton Park Central,POA,1,1,1,n/a
Apartment,Fourways,R 889 000,0,1,0,n/a
2 Bedroom Townhouse,Highveld,R 1 195 000,2,1.5,1,n/a
3 Bedroom House,Delville,R 1 300 000,3,1,5,n/a
5 Bedroom House,Northcliff,R 3 450 000,5,3.5,6,n/a
1 Bedroom House,Langaville,R 180 000,1,1,0,n/a
1 Bedroom House,Vlakfontein,R 170 000,1,1,1,n/a
5 Bedroom Townhouse inFourways,Fourways,R 5 890 000,5,5.5,2,457 m²
3 Bedroom Apartment,Andeon,R 860 000,3,2,2,n/a
2 Bedroom Apartment,Vereeniging Central,R 435 000,2,1.5,1,77 m²
3 Bedroom House,Eldoraigne,R 1 750 000,3,2,3,n/a
3 Bedroom House,Moreleta Park,R 2 990 000,3,2.5,2,n/a
2 Bedroom Apartment,Kyalami Hills,R 1 235 000,2,2,1,97 m²

... and so on.
1 голос
/ 12 июля 2020

Похоже, что класс p24_content применяется к тегу span, начиная со второй страницы. Решение может быть таким:

containers = page_soup.findAll(["div", "span"], {"class": "p24_content"})

... если я правильно прочитал bs4 документацию .

Может быть, изменится еще больше вещей. Не проверял :)

...