Problem extracting relevant data from the HTML of a Wikipedia page using Beautiful Soup
0 votes
/ 02 May 2019

I scraped the Wikipedia page "Cuisine of New York City" using Beautiful Soup. Now I'm having trouble extracting the data I need.

The desired output should look something like this:

Place1             Place2               Cuisine

The Bronx        Bedford Park      Mexican, Puerto Rican, Dominican
.
.
.
Manhattan       Upper East Side    German, Czech, Hungarian

Code:

import wikipedia as wp  # assumption: wp refers to the `wikipedia` package
from bs4 import BeautifulSoup

html = wp.page("Cuisine_of_New_York_City").html().encode("UTF-8")
soup = BeautifulSoup(html, 'lxml')

article = soup.find('div', class_="div-col columns column-width")
array = article.text.split('\n')[1:-1]
array

I tried this, but I only got the first of the entries I was looking for.

Answers [ 2 ]

1 vote
/ 02 May 2019

You just need to change the method: find returns only the first matching element, so use find_all instead:

from bs4 import BeautifulSoup
import requests

page = requests.get('https://en.wikipedia.org/wiki/Cuisine_of_New_York_City')

soup = BeautifulSoup(page.text, 'html.parser')

# find_all returns every matching <div>, not just the first one
articles = soup.find_all('div', class_="div-col columns column-width")
for article in articles:
    # Skip the first and last fragments of the split
    array = article.text.split('\n')[1:-1]
    print(array)

OUTPUT:

['Bedford Park – Mexican, Puerto Rican, Dominican, Korean (on 204th St.)', 'Belmont – Italian, Albanian (also known as "Arthur Avenue," "Little Italy")', 'City Island – Italian, Seafood', 'Morris Park – Italian, Albanian', 'Norwood – Filipino (formerly Irish, less so today)', 'Riverdale – Jewish', 'South Bronx – Puerto Rican, Dominican', 'Wakefield – Jamaican, West Indian', 'Woodlawn – Irish']
['Astoria – Greek, Italian, Eastern European, Brazilian, Egyptian and other Arabic', 'Bellerose – Indian and Pakistani', 'Flushing – Chinese and Korean', 'Forest Hills; Kew Gardens Hills; Rego Park – Jewish, Russian and Uzbek', 'Howard Beach; Ozone Park – Italian', 'Glendale – German and Polish', 'Jackson Heights – Indian, Pakistani, Bangladeshi, Colombian, Ecuadorian, Peruvian, Korean, Filipino and Mexican', 'Jamaica – Bangladeshi, Caribbean; African-American; African; Creole', 'Little Neck – Arab, Chinese, and Italian', 'Richmond Hill – Indian, Guyanese, West Indian, Pakistani, Bangladeshi', 'The Rockaways - Irish, Jewish', 'Woodhaven – Irish, Dominican, Mexican, Guyanese', 'Woodside; Sunnyside – Filipino, Irish, Mexican, and Romanian']
['Bay Ridge – Irish, Italian, Greek, Turkish, Lebanese, Palestinian, Yemeni and other Arabic', 'Bedford-Stuyvesant – African-American, Jamaican, Trinidadian, Puerto Rican and West Indian', 'Bensonhurst; – Italian, Chinese, Turkish, Russian, Mexican, Uzbek', 'Borough Park – Jewish, Italian, Mexican, Chinese', 'Brighton Beach – Russian, Georgian, Turkish, Pakistani and Ukrainian', 'Bushwick – Puerto Rican, Mexican, Dominican, and Ecuadorian', 'Canarsie – Jamaican, West Indian, African-American', 'Carroll Gardens – Italian', 'Crown Heights – Jamaican, West Indian, and Jewish', 'East New York – African-American, Dominican, and Puerto Rican', 'Flatbush – Jamaican, Haitian, and Creole', 'Greenpoint – Polish and Ukrainian', 'Kensington – Bengali, Pakistani, Mexican, Uzbek, and Polish', 'Midwood – Jewish, Italian, Russian, and Pakistani', 'Park Slope – Italian, Irish, French, and Puerto Rican (formerly)', 'Red Hook – Puerto Rican, African-American, and Italian', 'Sheepshead Bay – Seafood, Russian, and Italian', 'Sunset Park – Puerto Rican, Chinese, Arab, Mexican and Italian', 'Williamsburg – Italian, Jewish, Dominican and Puerto Rican']
['Chinatown – Chinese and Vietnamese', 'East Harlem – Puerto Rican, Mexican, Dominican, Chinese-Cuban and Italian', 'East Village – Japanese, Korean, Indian and Ukrainian', 'Greenwich Village –  Italian', 'Harlem – Italian, African-American, Latin American, West Indian, and West African', 'Koreatown – Korean', 'Little Italy – Italian', 'Lower East Side – Puerto Rican, Jewish, Italian, and Latin American', 'Murray Hill – Indian, Pakistani and Bangladeshi', 'Washington Heights – Dominican, Puerto Rican, Italian and Jewish', 'Upper East Side – German, Czech, Hungarian']
['Manhattan clam chowder', 'New York-style cheesecake', 'New York-style pizza', 'New York-style bagel', 'New York-style pastrami', 'Corned beef[4]', 'Baked pretzels', 'New York-style Italian ice', 'Knish', 'Eggs Benedict', 'Chopped Cheese', 'Lobster Newberg', 'Waldorf Salad', 'Doughnut', 'Delmonico steak', 'Black and white cookie', 'Bacon, egg and cheese sandwich on a roll']
['celery soda', 'New York-style pastrami, pastrami on rye', 'brisket[4]', 'corned beef[4]', 'tongue', 'knish[4]', 'New York-style bagels and lox (see also: appetizing)[4]', 'Bagel and cream cheese', 'cream cheese', 'whitefish with and without pike', 'Gefilte fish', 'blintzes[4]', 'potato pancake', 'bialy[4]', 'challah bread', 'matzo', 'egg cream', 'pickled cucumbers (especially dill pickles)', 'kishka', 'potato kugel', 'chopped chicken liver', 'matzo ball soup', 'lokshen soup']
['Bloody Mary', 'Chef salad', 'Chicken à la King[13]', 'Chicken and waffles', 'Chicken Divan', 'Cronut', 'Delmonico steak', 'Egg cream', 'Eggs Benedict', "General Tso's chicken", 'Ice cream cone', 'Lobster Newburg', 'Mallomars[14]', 'Manhattan', 'Manhattan Special – A type of carbonated espresso drink.', 'Pasta primavera', 'Penne alla Vodka', 'Reuben sandwich', 'Steak Diane', 'Spaghetti and meatballs', 'Vichyssoise', 'Waldorf salad']
['arepas', 'calzones', 'Chinese kebabs (chuanr)', 'churros', 'cuchifritos', 'dumplings', 'falafel', 'fried chicken', 'fried noodles', "Gray's Papaya, Papaya King – combined papaya juice/hot dog stands", 'corndogs', 'grilled chestnuts[3]', 'gyros/shawarma', 'Halal chicken/lamb over rice[15]', 'hamburgers', 'honey-roasted peanuts, almonds, cashews, and coconut', 'hot dog stands', 'Italian ice', 'Italian sausage, bratwurst', 'knishes', 'Mister Softee ice cream', 'muffins', 'piragua', 'pizza, especially New York-style pizza', 'soft pretzels[3]', 'souvlaki/shish kebab', 'stromboli', 'tacos', 'take-out soup, as Soup Kitchen International']
['A&P', 'AriZona Beverage Company', "Balducci's", "Bamonte's", 'Benihana', 'Blimpie', 'C-Town Supermarkets', 'Caffe Reggio - the first espresso bar to introduce cappuccino in America', 'Carnegie Deli', 'Carvel (restaurant)', 'Clinton St. Baking Company & Restaurant', 'Dean & DeLuca', "Dr. Brown's – sodas", "Drake's Cakes – cakes, pies, pastries", 'Domino Foods', "Entenmann's – cakes, pies, pastries", 'Fairway Market', 'Ferrara Bakery and Cafe - first Italian caffe to open up in America', 'Food Network – cable TV channel', 'Fraunces Tavern – George Washington said goodbye to his troops here. Some departments of his new federal government were originally located here.', 'Golden Krust Caribbean Bakery & Grill', 'Gray\'s Papaya – hot dog institution where there is always a "recession special"', 'Grotta Azzurra', "Grimaldi's Pizzeria", 'Häagen-Dazs', 'Hebrew National', "Junior's – The World's Most Fabulous Cheesecake", "Katz's Deli", 'Kesté', 'Key Food supermarket', 'L&B Spumoni Gardens', "Lindy's", "Lombardi's – first pizzeria in America", "Nathan's", 'Now and Later candy', 'Papaya King', 'PepsiCo, Inc.', 'Peter Luger Steak House', "Ray's Pizza – a fierce debate over which was the original", 'Russian Tea Room', 'Second Avenue Deli', 'Serendipity 3', 'Sbarro', 'Shake Shack', 'Snapple', "Stella D'oro – biscuits, cookies", "T.G.I. Friday's – originally a NYC bar", "Totonno's - first pizzeria to open up in Brooklyn", 'The Halal Guys', 'Vitamin Water', 'Yoo-hoo – chocolate drink', "Zabar's"]
['New York Food Anywhere', 'Who Cooked That Up?', 'New York Gastronomic & Cultural Food Tours', "Explore Manhattan's Unique Neighborhoods and Foods", 'The Best Of Brooklyn Multicultural Ethnic Neighborhood Food Tasting and Culture Tour', 'Find NYC street food vendors', 'Great Eating In Flushing']
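
To make the difference between find and find_all concrete, here is a minimal sketch that runs without a network connection. The SAMPLE_HTML string is a hypothetical, heavily simplified stand-in for the Wikipedia markup, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the Wikipedia page.
SAMPLE_HTML = """
<div class="div-col columns column-width"><ul><li>Riverdale – Jewish</li><li>Woodlawn – Irish</li></ul></div>
<div class="div-col columns column-width"><ul><li>Koreatown – Korean</li></ul></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
css_class = "div-col columns column-width"

# find() returns only the first matching tag (or None when nothing matches).
first = soup.find("div", class_=css_class)
first_items = [li.get_text() for li in first.find_all("li")]

# find_all() returns every matching tag, in document order.
all_divs = soup.find_all("div", class_=css_class)
all_items = [[li.get_text() for li in div.find_all("li")] for div in all_divs]

print(first_items)
print(all_items)
```

Reading the li elements directly, rather than splitting the text on newlines, also avoids having to trim the first and last fragments.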

EDIT:

Here is a snippet that finds Place1 (the borough) and stores the data in a dict:

from pprint import pprint

from bs4 import BeautifulSoup
import requests

page = requests.get('https://en.wikipedia.org/wiki/Cuisine_of_New_York_City')

soup = BeautifulSoup(page.text, 'html.parser')


results = {}
articles = soup.find_all('div', class_="div-col columns column-width")
for article in articles:
    # Keep only the lists under the "Enclaves reflecting national cuisines" section
    if article.find_previous_sibling('h2').find('span').get('id') == 'Enclaves_reflecting_national_cuisines':
        # The nearest preceding <h3> holds the borough name
        category = article.find_previous_sibling('h3')
        title_key = category.find('span', {'class': 'mw-headline'}).get_text()
        # Skip the first and last fragments of the split
        results[title_key] = article.text.split('\n')[1:-1]

# pprint produces the wrapped, key-sorted layout shown below
pprint(results)

OUTPUT:

{'Brooklyn': ['Bay Ridge – Irish, Italian, Greek, Turkish, Lebanese, '
              'Palestinian, Yemeni and other Arabic',
              'Bedford-Stuyvesant – African-American, Jamaican, Trinidadian, '
              'Puerto Rican and West Indian',
              'Bensonhurst; – Italian, Chinese, Turkish, Russian, Mexican, '
              'Uzbek',
              'Borough Park – Jewish, Italian, Mexican, Chinese',
              'Brighton Beach – Russian, Georgian, Turkish, Pakistani and '
              'Ukrainian',
              'Bushwick – Puerto Rican, Mexican, Dominican, and Ecuadorian',
              'Canarsie – Jamaican, West Indian, African-American',
              'Carroll Gardens – Italian',
              'Crown Heights – Jamaican, West Indian, and Jewish',
              'East New York – African-American, Dominican, and Puerto Rican',
              'Flatbush – Jamaican, Haitian, and Creole',
              'Greenpoint – Polish and Ukrainian',
              'Kensington – Bengali, Pakistani, Mexican, Uzbek, and Polish',
              'Midwood – Jewish, Italian, Russian, and Pakistani',
              'Park Slope – Italian, Irish, French, and Puerto Rican '
              '(formerly)',
              'Red Hook – Puerto Rican, African-American, and Italian',
              'Sheepshead Bay – Seafood, Russian, and Italian',
              'Sunset Park – Puerto Rican, Chinese, Arab, Mexican and Italian',
              'Williamsburg – Italian, Jewish, Dominican and Puerto Rican'],
 'Manhattan': ['Chinatown – Chinese and Vietnamese',
               'East Harlem – Puerto Rican, Mexican, Dominican, Chinese-Cuban '
               'and Italian',
               'East Village – Japanese, Korean, Indian and Ukrainian',
               'Greenwich Village –  Italian',
               'Harlem – Italian, African-American, Latin American, West '
               'Indian, and West African',
               'Koreatown – Korean',
               'Little Italy – Italian',
               'Lower East Side – Puerto Rican, Jewish, Italian, and Latin '
               'American',
               'Murray Hill – Indian, Pakistani and Bangladeshi',
               'Washington Heights – Dominican, Puerto Rican, Italian and '
               'Jewish',
               'Upper East Side – German, Czech, Hungarian'],
 'Queens': ['Astoria – Greek, Italian, Eastern European, Brazilian, Egyptian '
            'and other Arabic',
            'Bellerose – Indian and Pakistani',
            'Flushing – Chinese and Korean',
            'Forest Hills; Kew Gardens Hills; Rego Park – Jewish, Russian and '
            'Uzbek',
            'Howard Beach; Ozone Park – Italian',
            'Glendale – German and Polish',
            'Jackson Heights – Indian, Pakistani, Bangladeshi, Colombian, '
            'Ecuadorian, Peruvian, Korean, Filipino and Mexican',
            'Jamaica – Bangladeshi, Caribbean; African-American; African; '
            'Creole',
            'Little Neck – Arab, Chinese, and Italian',
            'Richmond Hill – Indian, Guyanese, West Indian, Pakistani, '
            'Bangladeshi',
            'The Rockaways - Irish, Jewish',
            'Woodhaven – Irish, Dominican, Mexican, Guyanese',
            'Woodside; Sunnyside – Filipino, Irish, Mexican, and Romanian'],
 'The Bronx': ['Bedford Park – Mexican, Puerto Rican, Dominican, Korean (on '
               '204th St.)',
               'Belmont – Italian, Albanian (also known as "Arthur Avenue," '
               '"Little Italy")',
               'City Island – Italian, Seafood',
               'Morris Park – Italian, Albanian',
               'Norwood – Filipino (formerly Irish, less so today)',
               'Riverdale – Jewish',
               'South Bronx – Puerto Rican, Dominican',
               'Wakefield – Jamaican, West Indian',
               'Woodlawn – Irish']}
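
To get from this dict to the Place1 / Place2 / Cuisine layout asked for in the question, each entry still has to be split into a neighborhood and its cuisines. A minimal sketch follows, using a few entries copied from the output above. The to_rows helper and the SEPARATOR pattern are my own illustrative names, and the pattern assumes the separator is always an en dash or a hyphen with whitespace on both sides:

```python
import re

# A few entries copied from the output above, keyed by borough.
results = {
    "The Bronx": [
        "Bedford Park – Mexican, Puerto Rican, Dominican, Korean (on 204th St.)",
        "Riverdale – Jewish",
    ],
    "Queens": ["The Rockaways - Irish, Jewish"],
    "Manhattan": ["Upper East Side – German, Czech, Hungarian"],
}

# Split on the first en dash or hyphen that has whitespace on both sides,
# so names such as "Bedford-Stuyvesant" stay intact.
SEPARATOR = re.compile(r"\s+[–-]\s+")


def to_rows(results):
    """Return (borough, neighborhood, cuisines) tuples."""
    rows = []
    for borough, entries in results.items():
        for entry in entries:
            parts = SEPARATOR.split(entry, maxsplit=1)
            if len(parts) != 2:
                continue  # skip entries without a recognizable separator
            neighborhood, cuisines = parts
            rows.append((borough, neighborhood.rstrip("; "), cuisines.strip()))
    return rows


rows = to_rows(results)
print(f"{'Place1':<12}{'Place2':<18}Cuisine")
for place1, place2, cuisine in rows:
    print(f"{place1:<12}{place2:<18}{cuisine}")
```

If you need a table object rather than printed text, the same tuples can be passed to a DataFrame constructor or written out with the csv module.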
0 votes
/ 02 May 2019

You can find the relevant headings and then the corresponding locations and food types:

import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://en.wikipedia.org/wiki/Cuisine_of_New_York_City').text, 'html.parser')
headers = [i.span.text for i in d.find_all('h3') if i.find('span', {'class':'mw-headline'})]
final_result = {a:[i.text for i in b.find_all('li')] for a, b in zip(headers, d.find_all('div', {'class':'div-col columns column-width'}))}

Output:

{'The Bronx': ['Bedford Park – Mexican, Puerto Rican, Dominican, Korean (on 204th St.)', 'Belmont – Italian, Albanian (also known as "Arthur Avenue," "Little Italy")', 'City Island – Italian, Seafood', 'Morris Park – Italian, Albanian', 'Norwood – Filipino (formerly Irish, less so today)', 'Riverdale – Jewish', 'South Bronx – Puerto Rican, Dominican', 'Wakefield – Jamaican, West Indian', 'Woodlawn – Irish'], 'Queens': ['Astoria – Greek, Italian, Eastern European, Brazilian, Egyptian and other Arabic', 'Bellerose – Indian and Pakistani', 'Flushing – Chinese and Korean', 'Forest Hills; Kew Gardens Hills; Rego Park – Jewish, Russian and Uzbek', 'Howard Beach; Ozone Park – Italian', 'Glendale – German and Polish', 'Jackson Heights – Indian, Pakistani, Bangladeshi, Colombian, Ecuadorian, Peruvian, Korean, Filipino and Mexican', 'Jamaica – Bangladeshi, Caribbean; African-American; African; Creole', 'Little Neck – Arab, Chinese, and Italian', 'Richmond Hill – Indian, Guyanese, West Indian, Pakistani, Bangladeshi', 'The Rockaways - Irish, Jewish', 'Woodhaven – Irish, Dominican, Mexican, Guyanese', 'Woodside; Sunnyside – Filipino, Irish, Mexican, and Romanian'], 'Brooklyn': ['Bay Ridge – Irish, Italian, Greek, Turkish, Lebanese, Palestinian, Yemeni and other Arabic', 'Bedford-Stuyvesant – African-American, Jamaican, Trinidadian, Puerto Rican and West Indian', 'Bensonhurst; – Italian, Chinese, Turkish, Russian, Mexican, Uzbek', 'Borough Park – Jewish, Italian, Mexican, Chinese', 'Brighton Beach – Russian, Georgian, Turkish, Pakistani and Ukrainian', 'Bushwick – Puerto Rican, Mexican, Dominican, and Ecuadorian', 'Canarsie – Jamaican, West Indian, African-American', 'Carroll Gardens – Italian', 'Crown Heights – Jamaican, West Indian, and Jewish', 'East New York – African-American, Dominican, and Puerto Rican', 'Flatbush – Jamaican, Haitian, and Creole', 'Greenpoint – Polish and Ukrainian', 'Kensington – Bengali, Pakistani, Mexican, Uzbek, and Polish', 'Midwood – Jewish, 
Italian, Russian, and Pakistani', 'Park Slope – Italian, Irish, French, and Puerto Rican (formerly)', 'Red Hook – Puerto Rican, African-American, and Italian', 'Sheepshead Bay – Seafood, Russian, and Italian', 'Sunset Park – Puerto Rican, Chinese, Arab, Mexican and Italian', 'Williamsburg – Italian, Jewish, Dominican and Puerto Rican'], 'Staten Island': ['Chinatown – Chinese and Vietnamese', 'East Harlem – Puerto Rican, Mexican, Dominican, Chinese-Cuban and Italian', 'East Village – Japanese, Korean, Indian and Ukrainian', 'Greenwich Village –  Italian', 'Harlem – Italian, African-American, Latin American, West Indian, and West African', 'Koreatown – Korean', 'Little Italy – Italian', 'Lower East Side – Puerto Rican, Jewish, Italian, and Latin American', 'Murray Hill – Indian, Pakistani and Bangladeshi', 'Washington Heights – Dominican, Puerto Rican, Italian and Jewish', 'Upper East Side – German, Czech, Hungarian'], 'Manhattan': ['Manhattan clam chowder', 'New York-style cheesecake', 'New York-style pizza', 'New York-style bagel', 'New York-style pastrami', 'Corned beef[4]', 'Baked pretzels', 'New York-style Italian ice', 'Knish', 'Eggs Benedict', 'Chopped Cheese', 'Lobster Newberg', 'Waldorf Salad', 'Doughnut', 'Delmonico steak', 'Black and white cookie', 'Bacon, egg and cheese sandwich on a roll'], 'Food associated with or popularized in New York City': ['celery soda', 'New York-style pastrami, pastrami on rye', 'brisket[4]', 'corned beef[4]', 'tongue', 'knish[4]', 'New York-style bagels and lox (see also: appetizing)[4]', 'Bagel and cream cheese', 'cream cheese', 'whitefish with and without pike', 'Gefilte fish', 'blintzes[4]', 'potato pancake', 'bialy[4]', 'challah bread', 'matzo', 'egg cream', 'pickled cucumbers (especially dill pickles)', 'kishka', 'potato kugel', 'chopped chicken liver', 'matzo ball soup', 'lokshen soup'], 'Dishes invented or claimed in New York City': ['Bloody Mary', 'Chef salad', 'Chicken à la King[13]', 'Chicken and waffles', 'Chicken 
Divan', 'Cronut', 'Delmonico steak', 'Egg cream', 'Eggs Benedict', "General Tso's chicken", 'Ice cream cone', 'Lobster Newburg', 'Mallomars[14]', 'Manhattan', 'Manhattan Special – A type of carbonated espresso drink.', 'Pasta primavera', 'Penne alla Vodka', 'Reuben sandwich', 'Steak Diane', 'Spaghetti and meatballs', 'Vichyssoise', 'Waldorf salad']}
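
Note that in the output above the Manhattan neighborhoods end up under 'Staten Island' and the list of dishes under 'Manhattan'. That is presumably because zip pairs headings and lists purely by position, so a heading without a list of its own shifts everything after it. One way around this is to let each list look up its nearest preceding heading. Here is a minimal sketch on hypothetical, simplified markup (SAMPLE_HTML is not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: "Staten Island" has a heading but no list.
SAMPLE_HTML = """
<h3><span class="mw-headline">Brooklyn</span></h3>
<div class="div-col columns column-width"><ul><li>Carroll Gardens – Italian</li></ul></div>
<h3><span class="mw-headline">Staten Island</span></h3>
<p>No list here.</p>
<h3><span class="mw-headline">Manhattan</span></h3>
<div class="div-col columns column-width"><ul><li>Koreatown – Korean</li></ul></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
divs = soup.find_all("div", class_="div-col columns column-width")

# Positional pairing: the second list is attached to the second heading.
headers = [h3.get_text() for h3 in soup.find_all("h3")]
by_position = {
    h: [li.get_text() for li in d.find_all("li")] for h, d in zip(headers, divs)
}

# Structural pairing: each list looks back to its nearest preceding <h3>.
by_heading = {}
for div in divs:
    heading = div.find_previous("h3")
    if heading is None:
        continue  # list appears before any heading
    by_heading[heading.get_text(strip=True)] = [
        li.get_text() for li in div.find_all("li")
    ]

print(by_position)
print(by_heading)
```

With the positional approach the Koreatown entry lands under "Staten Island"; with find_previous it stays under "Manhattan".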