Unable to parse this HTML page with BeautifulSoup
0 votes
/ July 10, 2020

I downloaded the HTML from this page MANUALLY (CTRL + S):

view-source:https://streeteasy.com/for-sale/nyc/area:112,115,110,103,117,104,158,113,116,108,109,162,107,106,105,157,121,120,123,122,124,143,141,137?page=2

I loaded the saved file with the following code:

from bs4 import BeautifulSoup

# Page saved from the browser with CTRL + S (note the .mhtml extension)
with open('/content/drive/My Drive/Colab Notebooks/Projects/20200710_StreetEasy_WebScraping/a.mhtml') as f:
    contents = f.read()

# Parsers tried so far: 'lxml-xml', 'lxml', 'html5lib', 'html'
soup = BeautifulSoup(contents, 'html')
print(soup)
Output (a single line):
<!-- saved from url=(0143)https://streeteasy.com/for-sale/nyc/area:112,115,110,103,117,104,158,113,116,108,109,162,107,106,105,157,121,120,123,122,124,143,141,137?page=2 --><html><head><meta content="text/html; ch

Finding all a tags works:

a=soup.find_all('a')
a
[<a class='3D"html-attribute-value' href='=3D"https://cdn-assets-s3.streeteasy.com/assets/manifest-c93475b02bd2409b4a=' html-resource-link="" noop='ener"' rel='3D"noreferrer' target='3D"_blank"'>//cdn-assets-s3.streeteasy.com/assets/manifest-c93475b02bd2409b4a52e2=
 1af023e5d5f489f19500d234a3660fe4d35069bbac.json</a>,
 <a class='3D"html-attrib=' href='3D"https://browser.sen=' html-resource-link="" noopener="" rel='3D"noreferrer' target='3D"_blank"' try-cdn.com="" ute-value="">https://brows=
 er.sentry-cdn.com/5.19.0/bundle.min.js</a>,
...

Searching for div, script, meta, ... returns nothing:

div=soup.find_all('div')
div
[]

Is this a parsing problem?

1 Answer

1 vote
/ July 10, 2020

The website in question is a pretty good site to work with. I opened it, opened View Source, copied the HTML, and pasted it into a file.

Link: view-source:https://streeteasy.com/for-sale/nyc/area:112,115,110,103,117,104,158,113,116,108,109,162,107,106,105,157,121,120,123,122,124,143,141,137?page=2
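
The 3D" fragments and the trailing = characters in the question's output look like quoted-printable encoding, which is how browsers usually store the page inside an .mhtml file. As an alternative to copying the source by hand, here is a minimal sketch that pulls the decoded HTML part out of such a file with the standard email module. The helper name html_from_mhtml and the tiny sample message are made up for illustration, and I have not run this against the actual a.mhtml from the question. If the saved file is the browser's view-source rendering (the html-attribute-value classes hint at that), the decoded HTML would still be the viewer's markup rather than the original page.

```python
import email

def html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML (MIME) document, decoded."""
    msg = email.message_from_bytes(raw)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes the Content-Transfer-Encoding (e.g. quoted-printable)
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    raise ValueError("no text/html part found")

# Hypothetical miniature MHTML file, only to show the decoding step.
sample = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="----demo"\r\n'
    b"\r\n"
    b"------demo\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<html><body><div class=3D"listing"><a href=3D"https://exam=\r\n'
    b'ple.com/">link</a></div></body></html>\r\n'
    b"------demo--\r\n"
)

html = html_from_mhtml(sample)
print(html)
```

The returned string can then be passed to BeautifulSoup(html, "lxml") in place of the raw file contents.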

The listing information is embedded in the page as JSON, and I extracted it like this:

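from bs4 import BeautifulSoup
import json

# html.html holds the page source copied from view-source
with open("html.html") as f:
    html = f.read()

soup = BeautifulSoup(html, "lxml")
# The listings sit in a JSON-LD <script> tag
json_text = soup.find("script", {"type": "application/ld+json", "async": "async"}).text.strip()
# Start one character before the first "{" (the opening "[") and
# drop the last 6 characters, which are not part of the JSON on this page
json_obj = json.loads(json_text[json_text.index("{")-1:-6])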
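
The fixed offsets -1 and -6 in the slice depend on the exact characters around the JSON in that script tag. A sketch that is less sensitive to them, assuming the script text holds one JSON array or object and that no stray [ or { appears before it, could use json.JSONDecoder.raw_decode, which stops at the end of the first complete JSON value. The helper name extract_json and the sample text are made up for illustration.

```python
import json

def extract_json(text: str):
    """Decode the first JSON array or object found in text, ignoring what follows it."""
    decoder = json.JSONDecoder()
    # Position of the earliest "[" or "{" in the text
    starts = [i for i in (text.find("["), text.find("{")) if i != -1]
    if not starts:
        raise ValueError("no JSON array or object found")
    # raw_decode returns the decoded value and the index where it ended
    obj, _end = decoder.raw_decode(text, min(starts))
    return obj

# Made-up script text: a JSON array followed by characters that are not JSON.
script_text = '\n  [{"@type": "ApartmentComplex", "address": {"postalCode": "10016"}}] //]]>\n'
listings = extract_json(script_text)
print(listings[0]["address"]["postalCode"])
```

With the code above, this would be called as extract_json(json_text) instead of slicing by hand.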

Output:

[{'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$3,475,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '15 East 30th Street',
   'postalCode': '10016',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/2/381345902.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$849,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '463 West 57th Street',
   'postalCode': '10019',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/55/394819655.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$1,475,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '160 West 66th Street',
   'postalCode': '10023',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/7/396195007.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$2,799,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '470 West 24th Street',
   'postalCode': '10011',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/25/396194325.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$795,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '420 East 55th Street',
   'postalCode': '10022',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/29/396194129.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$816,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '258 West 93rd Street',
   'postalCode': '10025',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/34/396194034.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$849,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '464 West 44th Street',
   'postalCode': '10036',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/96/396192696.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$1,495,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '310 West 52nd Street',
   'postalCode': '10019',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/45/396191645.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$2,725,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '50 Riverside Boulevard',
   'postalCode': '10069',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/48/396190448.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$1,298,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '325 Fifth Avenue',
   'postalCode': '10016',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/31/396187231.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$670,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '303 East 57th Street',
   'postalCode': '10022',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/7/396187207.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$629,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '520 East 76th Street',
   'postalCode': '10021',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/50/396186150.jpg'}},
 {'@context': 'http://schema.org',
  '@type': 'ApartmentComplex',
  'additionalProperty': {'@type': 'PropertyValue', 'value': '$20,500,000'},
  'address': {'@type': 'PostalAddress',
   'addressRegion': 'NY',
   'addressLocality': 'Manhattan',
   'streetAddress': '435 Broome Street',
   'postalCode': '10013',
   'addressCountry': {'@type': 'Country', 'name': 'USA'}},
  'photo': {'@type': 'CreativeWork',
   'image': 'https://cdn-img-feed.streeteasy.com/nyc/image/98/396186098.jpg'}}]
...
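
Once the list is loaded, each entry is a plain dict, so the fields can be read with ordinary indexing. A small sketch of flattening the entries into (street, postal code, price) rows, using the first two records from the output above as sample data. The helper price_to_int is made up for illustration and assumes prices always look like '$3,475,000'.

```python
# First two records from the output above, trimmed to the fields used here.
listings = [
    {"additionalProperty": {"@type": "PropertyValue", "value": "$3,475,000"},
     "address": {"streetAddress": "15 East 30th Street", "postalCode": "10016"}},
    {"additionalProperty": {"@type": "PropertyValue", "value": "$849,000"},
     "address": {"streetAddress": "463 West 57th Street", "postalCode": "10019"}},
]

def price_to_int(price: str) -> int:
    """Convert a string such as '$3,475,000' to the integer 3475000."""
    return int(price.replace("$", "").replace(",", ""))

rows = [
    (
        item["address"]["streetAddress"],
        item["address"]["postalCode"],
        price_to_int(item["additionalProperty"]["value"]),
    )
    for item in listings
]

for row in rows:
    print(row)
```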