Why doesn't BeautifulSoup scrape the whole web page? - PullRequest
4 votes
/ May 7, 2019

Background: I am completely new to Python and web scraping. I am trying to scrape brand data from this page: https://www.interbrand.com/best-brands/best-global-brands/2018/ranking/ but BeautifulSoup only retrieves the HTML up to a certain point. There is nothing unusual in the HTML at that point: it is preceded by five nearly identical tags, which BeautifulSoup retrieves without problems.

I have already tried three different parsers (the built-in one, lxml, and html5lib), but I always get the same result.

Here is the code:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.interbrand.com/best-brands/best-global-brands/2018/ranking/")
soup = BeautifulSoup(page.content, 'html5lib')
print(soup.prettify())
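
Since all three parsers give the same result, one way to narrow the problem down is to count the target elements in the raw response, independently of BeautifulSoup, instead of reading the printed dump. The sketch below uses only the standard library's `html.parser`. The helper names `ClassCounter` and `count_divs` are hypothetical, `sample_html` is a stand-in for `page.text`, and the class name `brand-name` is taken from the selector used in the answer below.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count <div> start tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.css_class in classes:
            self.count += 1


def count_divs(html, css_class):
    """Return how many <div class="..."> tags with css_class appear in html."""
    counter = ClassCounter(css_class)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical stand-in for page.text from the request above.
sample_html = """
<div class="brand-name" title="Brand name: Apple"></div>
<div class="brand-name" title="Brand name: Google"></div>
<div class="brand-country" title="Country: United States"></div>
"""

print(count_divs(sample_html, "brand-name"))
```

If the count for the real response is already lower than expected, the missing markup never arrived in the response and no choice of parser will recover it. If the count is complete, the cut-off is more likely to come from how the long `prettify()` dump is displayed than from the parsing itself.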

1 Answer

1 vote
/ May 7, 2019

Use CSS selectors to get the output.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.interbrand.com/best-brands/best-global-brands/2018/ranking/")
soup = BeautifulSoup(page.content, 'lxml')

# Each value is read from the title attribute of a div with the matching class.
Brand = [div['title'] for div in soup.select('div.brand-name')]
Region = [div['title'] for div in soup.select('div.brand-region')]
Country = [div['title'] for div in soup.select('div.brand-country')]
Sector = [div['title'] for div in soup.select('div.brand-sector')]

print(Brand)
print(Region)
print(Country)
print(Sector)

Output:

['Brand name: Apple', 'Brand name: Google', 'Brand name: Amazon', 'Brand name: Microsoft', 'Brand name: Coca-Cola', 'Brand name: Samsung', 'Brand name: Toyota', 'Brand name: Mercedes-Benz', 'Brand name: Facebook', "Brand name: McDonald's", 'Brand name: Intel', 'Brand name: IBM', 'Brand name: BMW', 'Brand name: Disney', 'Brand name: Cisco', 'Brand name: GE', 'Brand name: Nike', 'Brand name: Louis Vuitton', 'Brand name: Oracle', 'Brand name: Honda', 'Brand name: SAP', 'Brand name: Pepsi', 'Brand name: Chanel', 'Brand name: American Express', 'Brand name: Zara', 'Brand name: J.P. Morgan', 'Brand name: IKEA', 'Brand name: Gillette', 'Brand name: UPS', 'Brand name: H&M', 'Brand name: Pampers', 'Brand name: Hermès', 'Brand name: Budweiser', 'Brand name: Accenture', 'Brand name: Ford', 'Brand name: Hyundai', 'Brand name: NESCAFÉ', 'Brand name: eBay', 'Brand name: Gucci', 'Brand name: Nissan', 'Brand name: Volkswagen', 'Brand name: Audi', 'Brand name: Philips', 'Brand name: Goldman Sachs', 'Brand name: Citi', 'Brand name: HSBC', 'Brand name: AXA', "Brand name: L'Oréal", 'Brand name: Allianz', 'Brand name: adidas', 'Brand name: Adobe', 'Brand name: Porsche', "Brand name: Kellogg's", 'Brand name: HP', 'Brand name: Canon', 'Brand name: Siemens', 'Brand name: Starbucks', 'Brand name: Danone', 'Brand name: Sony', 'Brand name: 3M', 'Brand name: Visa', 'Brand name: Nestlé', 'Brand name: Morgan Stanley', 'Brand name: Colgate', 'Brand name: Hewlett Packard Enterprise', 'Brand name: Netflix', 'Brand name: Cartier', 'Brand name: Huawei', 'Brand name: Banco Santander', 'Brand name: Mastercard', 'Brand name: Kia', 'Brand name: FedEx', 'Brand name: PayPal', 'Brand name: LEGO', 'Brand name: Salesforce.com', 'Brand name: Panasonic', 'Brand name: Johnson & Johnson', 'Brand name: Land Rover', 'Brand name: DHL', 'Brand name: Ferrari', 'Brand name: Discovery', 'Brand name: Caterpillar', 'Brand name: Tiffany & Co.', "Brand name: Jack Daniel's", 'Brand name: Corona', 'Brand name: KFC', 'Brand 
name: Heineken', 'Brand name: John Deere', 'Brand name: Shell', 'Brand name: MINI', 'Brand name: Dior', 'Brand name: Spotify', 'Brand name: Harley-Davidson', 'Brand name: Burberry', 'Brand name: Prada', 'Brand name: Sprite', 'Brand name: Johnnie Walker', 'Brand name: Hennessy', 'Brand name: Nintendo', 'Brand name: Subaru']
['Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Asia Pacific', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Asia Pacific', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Asia Pacific', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Asia Pacific', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Asia Pacific', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & 
Africa', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: The Americas', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: The Americas', 'Region: Europe & Africa', 'Region: Europe & Africa', 'Region: Asia Pacific', 'Region: Asia Pacific']
['Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: South Korea', 'Country: Japan', 'Country: Germany', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: Germany', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: France', 'Country: United States', 'Country: Japan', 'Country: Germany', 'Country: United States', 'Country: France', 'Country: United States', 'Country: Spain', 'Country: United States', 'Country: Sweden', 'Country: United States', 'Country: United States', 'Country: Sweden', 'Country: United States', 'Country: France', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: South Korea', 'Country: Switzerland', 'Country: United States', 'Country: Italy', 'Country: Japan', 'Country: Germany', 'Country: Germany', 'Country: Netherlands', 'Country: United States', 'Country: United States', 'Country: United Kingdom', 'Country: France', 'Country: France', 'Country: Germany', 'Country: Germany', 'Country: United States', 'Country: Germany', 'Country: United States', 'Country: United States', 'Country: Japan', 'Country: Germany', 'Country: United States', 'Country: France', 'Country: Japan', 'Country: United States', 'Country: United States', 'Country: Switzerland', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: France', 'Country: China', 'Country: Spain', 'Country: United States', 'Country: South Korea', 'Country: United States', 'Country: United States', 'Country: Denmark', 'Country: United States', 'Country: Japan', 'Country: United States', 'Country: United Kingdom', 'Country: United States', 'Country: Italy', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: United States', 'Country: Mexico', 'Country: United 
States', 'Country: Netherlands', 'Country: United States', 'Country: Netherlands', 'Country: United Kingdom', 'Country: France', 'Country: Sweden', 'Country: United States', 'Country: United Kingdom', 'Country: Italy', 'Country: United States', 'Country: United Kingdom', 'Country: France', 'Country: Japan', 'Country: Japan']
['Sector: Technology', 'Sector: Technology', 'Sector: Retail', 'Sector: Technology', 'Sector: Beverages', 'Sector: Technology', 'Sector: Automotive', 'Sector: Automotive', 'Sector: Technology', 'Sector: Restaurants', 'Sector: Technology', 'Sector: Business Services', 'Sector: Automotive', 'Sector: Media', 'Sector: Technology', 'Sector: Diversified', 'Sector: Sporting Goods', 'Sector: Luxury', 'Sector: Technology', 'Sector: Automotive', 'Sector: Technology', 'Sector: Beverages', 'Sector: Luxury', 'Sector: Financial Services', 'Sector: Apparel', 'Sector: Financial Services', 'Sector: Retail', 'Sector: FMCG', 'Sector: Logistics', 'Sector: Apparel', 'Sector: FMCG', 'Sector: Luxury', 'Sector: Alcohol', 'Sector: Business Services', 'Sector: Automotive', 'Sector: Automotive', 'Sector: Beverages', 'Sector: Retail', 'Sector: Luxury', 'Sector: Automotive', 'Sector: Automotive', 'Sector: Automotive', 'Sector: Electronics', 'Sector: Financial Services', 'Sector: Financial Services', 'Sector: Financial Services', 'Sector: Financial Services', 'Sector: FMCG', 'Sector: Financial Services', 'Sector: Sporting Goods', 'Sector: Technology', 'Sector: Automotive', 'Sector: FMCG', 'Sector: Technology', 'Sector: Electronics', 'Sector: Diversified', 'Sector: Restaurants', 'Sector: FMCG', 'Sector: Electronics', 'Sector: Diversified', 'Sector: Financial Services', 'Sector: FMCG', 'Sector: Financial Services', 'Sector: FMCG', 'Sector: Technology', 'Sector: Media', 'Sector: Luxury', 'Sector: Technology', 'Sector: Financial Services', 'Sector: Financial Services', 'Sector: Automotive', 'Sector: Logistics', 'Sector: Financial Services', 'Sector: FMCG', 'Sector: Business Services', 'Sector: Electronics', 'Sector: FMCG', 'Sector: Automotive', 'Sector: Logistics', 'Sector: Automotive', 'Sector: Media', 'Sector: Diversified', 'Sector: Luxury', 'Sector: Alcohol', 'Sector: Alcohol', 'Sector: Restaurants', 'Sector: Alcohol', 'Sector: Diversified', 'Sector: Energy', 'Sector: Automotive', 'Sector: 
Luxury', 'Sector: Media', 'Sector: Automotive', 'Sector: Luxury', 'Sector: Luxury', 'Sector: Beverages', 'Sector: Alcohol', 'Sector: Alcohol', 'Sector: Electronics', 'Sector: Automotive']
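
Every value in the lists above still carries its label ("Brand name: ", "Region: ", and so on), and the four lists are separate. A minimal sketch of how they might be cleaned up and combined into one record per brand follows. The helper names `strip_label` and `to_records` are hypothetical, and the sample lists hold only the first three entries of the output above.

```python
def strip_label(value):
    """Drop a leading 'Label: ' prefix, e.g. 'Brand name: Apple' -> 'Apple'."""
    _, sep, rest = value.partition(": ")
    return rest if sep else value


def to_records(brands, regions, countries, sectors):
    """Combine the four parallel lists into one dict per brand.

    zip() stops at the shortest list, so mismatched lengths are
    silently truncated; compare the lengths first if that matters.
    """
    keys = ("brand", "region", "country", "sector")
    return [
        dict(zip(keys, map(strip_label, row)))
        for row in zip(brands, regions, countries, sectors)
    ]


# First three entries of each list from the output above.
brands = ['Brand name: Apple', 'Brand name: Google', 'Brand name: Amazon']
regions = ['Region: The Americas', 'Region: The Americas', 'Region: The Americas']
countries = ['Country: United States', 'Country: United States', 'Country: United States']
sectors = ['Sector: Technology', 'Sector: Technology', 'Sector: Retail']

records = to_records(brands, regions, countries, sectors)
for record in records:
    print(record)
```

Keeping one dictionary per brand makes it straightforward to write the result to CSV or load it into a table later, at the cost of assuming that the four lists line up by position.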