The data on your page is populated dynamically by JavaScript from a JSON object. That JSON is not loaded via XHR, however: it is embedded directly in the HTML. You can therefore extract it with a regular expression and convert it to a dictionary.
import ast
import re

import requests

url = 'https://www.msccruisesusa.com/webapp/wcs/stores/servlet/MSC_SearchCruiseManagerRedirectCmd?storeId=12264&langId=-1004&catalogId=10001&monthsResult=&areaFilter=MED%40NOR%40&embarkFilter=&lengthFilter=&departureFrom=01.11.2020&departureTo=04.11.2020&ships=&category=&onlyAvailableCruises=true&packageTrf=false&packageTpt=false&packageCrol=false&packageCrfl=false&noAdults=2&noChildren=0&noJChildren=0&noInfant=0&dealsInput=false&tripSpecificationPanel=true&shipPreferencesPanel=false&dealsPanel=false'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
response = requests.get(url, headers=headers)
response.raise_for_status()

# Extract the body of the `_ports` object literal from the HTML.
match = re.search(r"_ports = \{\n\s\s(.+?)\n\s\s\};", response.text)
if match is None:
    raise ValueError("_ports object not found in the page")

# Convert the string to a dictionary. ast.literal_eval accepts only literals,
# so, unlike eval, it cannot execute code that arrives with the page.
json_data = ast.literal_eval('{' + match.group(1) + '}')
print(json_data.values())
Output:
dict_values(['Aqaba, Jordan', 'Valencia, Spain', 'Cairs, Australia', 'Venice, Italy', 'Shekou, China', 'Shanghai, China', 'Goteborg, Sweden', 'Darwin, Australia', 'George Town, Cayman Islands', 'Siracusa, Italy', 'Genoa, Italy', 'Reykjavik, Iceland', 'Havana, Cuba', 'Singapore, Republic of Singapore', 'Arica, Chile', 'Hamburg,Germany', 'Kusadasi, Turkey', 'Yokohama, Japan', 'Valparaiso,Chile', 'Copenhagen, Denmark', 'Civitavecchia, Italy', 'Barcelona, Spain', 'Auckland, New Zealand', 'Livorno, Italy', 'Montevideo, Uruguay', 'Brindisi, Italy', 'Kiel,Germany', 'San Juan, Puerto Rico', 'Callao, Peru', 'Funchal, Portugal', 'Haifa, Israel', 'Lisbon, Portugal', 'Papeete, Tahiti', 'Trieste, Italy', 'Piraeus, Greece', 'Rio de Janeiro, Brazil', 'Keelung, Taiwan', 'Buenos Aires, Argentina', 'New York, United States', 'Salvador, Brazil', 'Tianjin, China', 'Valletta, Malta', 'Santos, Brazil', 'Cannes, France', 'Naples, Italy', 'Fukuoka, Japan', 'Ushuaia,Argentina', 'Philipsburg, St. Maarten', 'Zeebrugge, Belgium', 'Durban, South Africa', 'Istanbul, Turkey', 'Cagliari, Italy', 'Vigo, Spain', 'Dubai,U.Arab Emirates', 'Amsterdam, Netherlands', 'Tampa, United States', 'Doha, Qatar', 'Abu Dhabi,U.Arab Emirates', 'Itajai, Brazil', 'Port Kembla, Australia', 'Tokyo, Japan', 'Cartagena, Spain', 'Nassau, Bahamas', 'Messina, Italy', 'Benoa/Bali, Indonesia', 'Nansha,China', 'Heraklion, Greece', 'Mumbai/Bombay, India', 'Muscat, Oman', 'Wellington, New Zealand', 'Warnemunde,Germany', 'Fort de France, Martinique', 'Isafjordur, Iceland', 'Bridgetown, Barbados', 'Marseille, France', 'Sydney, Australia', 'Miami, Florida', 'Cozumel, Mexico', 'Rotterdam, Netherlands', 'Izmir, Turkey', 'Cape Town, South Africa', 'Qingdao, China', 'Palma de Mallorca, Spain', 'San Francisco, United states', 'Hobart, Australia', 'Malaga, Spain', 'Palermo, Italy', 'St Nazaire, France', 'Mindelo, Cape Verde', 'Pointe-a-Pitre, Guadeloupe', 'Hong Kong,Hong Kong', 'Le Havre, France', 'Ocean Cay MSC Marine Reserve', 'St Petersburg, Russian Fed.', 'Ilhabela, Brazil', 'Ancona, Italy', ......., 'Corner Brook, Canada', 'Brunsbuttel,Germany', 'Newcastle, Australia', 'Busan, Korea, Republic of', 'Maputo, Mozambique'])
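The extraction step can also be tried offline, without hitting the site. The following is only a minimal sketch: `SAMPLE_HTML`, its port codes, and the helper name `extract_ports` are hypothetical stand-ins, and it assumes the real page lays out the `_ports` object the same way (opening brace, one indented line of key/value pairs, closing brace).

```python
import ast
import re

# Hypothetical stand-in for response.text; the real page's markup may differ.
SAMPLE_HTML = """<script>
var _ports = {
  "AQJ":"Aqaba, Jordan","VLC":"Valencia, Spain","VCE":"Venice, Italy"
  };
</script>"""

# Same pattern as in the answer: capture the single line between the braces.
PORTS_PATTERN = re.compile(r"_ports = \{\n\s\s(.+?)\n\s\s\};")


def extract_ports(html):
    """Return the embedded _ports mapping as a dict, or {} if it is absent."""
    match = PORTS_PATTERN.search(html)
    if match is None:
        return {}
    # literal_eval parses literals only, so page content is never executed.
    return ast.literal_eval("{" + match.group(1) + "}")


ports = extract_ports(SAMPLE_HTML)
print(list(ports.values()))
# ['Aqaba, Jordan', 'Valencia, Spain', 'Venice, Italy']
```

Returning an empty dict when the pattern does not match is a design choice for the sketch; raising an error may be preferable if a missing `_ports` object should stop the scrape.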
Alternatively, you can use Selenium with ChromeDriver, which executes the JavaScript and renders its results into the HTML. You can then extract the data with lxml.
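from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from lxml import etree

# Selenium 4 takes the driver path through a Service object;
# the executable_path keyword of webdriver.Chrome was removed in recent releases.
driver = webdriver.Chrome(service=Service(executable_path=r"***YOUR_CHROME-DRIVER_PATH***"))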
driver.get('https://www.msccruisesusa.com/webapp/wcs/stores/servlet/MSC_SearchCruiseManagerRedirectCmd?storeId=12264&langId=-1004&catalogId=10001&monthsResult=&areaFilter=MED%40NOR%40&embarkFilter=&lengthFilter=&departureFrom=01.11.2020&departureTo=04.11.2020&ships=&category=&onlyAvailableCruises=true&packageTrf=false&packageTpt=false&packageCrol=false&packageCrfl=false&noAdults=2&noChildren=0&noJChildren=0&noInfant=0&dealsInput=false&tripSpecificationPanel=true&shipPreferencesPanel=false&dealsPanel=false')
html = etree.HTML(driver.page_source)
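driver.quit()  # End the browser session and stop the ChromeDriver process.
# Select every element whose class attribute is exactly "cr-city-name".
portList = html.xpath('//*[@class="cr-city-name"]')
for port in portList:
    print(port.xpath('string()'), end=' | ')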
Output:
Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | MSC Grandiosa | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | MSC Grandiosa | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | MSC Grandiosa | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | MSC Grandiosa | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | MSC Grandiosa | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | MSC Grandiosa | Civitavecchia, Italy | Genoa, Italy | Malaga, Spain | Funchal, Portugal | Santa Cruz de Tenerife, Spain | Tangier, Morocco | Cartagena, Spain | Civitavecchia, Italy | MSC Opera | Civitavecchia, Italy | Genoa, Italy | Malaga, Spain | Funchal, Portugal | Santa Cruz de Tenerife, Spain | Tangier, Morocco | Cartagena, Spain | Civitavecchia, Italy | MSC Opera | $PORT |
You can download ChromeDriver from its official download page.
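The XPath step can be checked without launching a browser by feeding lxml a static snippet. This is a rough sketch only: `SAMPLE_HTML` and the helper name `port_names` are hypothetical, and the markup is an assumed approximation of what `driver.page_source` contains, not a copy of the real page.

```python
from lxml import etree

# Hypothetical stand-in for driver.page_source; the real markup may differ.
SAMPLE_HTML = """
<ul>
  <li><span class="cr-city-name">Genoa, Italy</span></li>
  <li><span class="cr-city-name"> Valletta, <b>Malta</b> </span></li>
  <li><span class="other">ignored</span></li>
</ul>
"""


def port_names(page_source):
    """Return the stripped text of every element whose class is exactly cr-city-name."""
    tree = etree.HTML(page_source)
    nodes = tree.xpath('//*[@class="cr-city-name"]')
    # string() concatenates all descendant text, so nested tags are flattened.
    return [node.xpath('string()').strip() for node in nodes]


print(port_names(SAMPLE_HTML))
# ['Genoa, Italy', 'Valletta, Malta']
```

Note that `@class="cr-city-name"` matches only elements whose class attribute is exactly that string; an element carrying additional classes would need a `contains(...)` test instead.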