How can I get all the values of one class with XPath?
0 votes
/ January 15, 2020

I'm trying to learn scraping with XPath, but I can't get this to work.

When I use the XPath Helper plugin in Chrome, I get data like this: about 99 ports, the last of which is "$PORT".

(Screenshot of the XPath Helper result)

import requests
import csv
from lxml import etree

url = 'https://www.msccruisesusa.com/webapp/wcs/stores/servlet/MSC_SearchCruiseManagerRedirectCmd?storeId=12264&langId=-1004&catalogId=10001&monthsResult=&areaFilter=MED%40NOR%40&embarkFilter=&lengthFilter=&departureFrom=01.11.2020&departureTo=04.11.2020&ships=&category=&onlyAvailableCruises=true&packageTrf=false&packageTpt=false&packageCrol=false&packageCrfl=false&noAdults=2&noChildren=0&noJChildren=0&noInfant=0&dealsInput=false&tripSpecificationPanel=true&shipPreferencesPanel=false&dealsPanel=false'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
source = requests.get(url,headers=headers).content.decode('UTF-8')

html = etree.HTML(source)

portList = html.xpath('//*[@class="cr-city-name"]')

for port in portList:
    print(port.xpath('string()'))

This code returns only "$PORT". Why can't I get the other 98 ports with this XPath?

1 Answer

0 votes
/ January 15, 2020

The data on your page is filled in dynamically by JavaScript from JSON, which is why the HTML that requests downloads contains only the "$PORT" template placeholder. The JSON is not loaded via XHR, though: it is embedded in the HTML itself, so you can extract it with a regex and convert it to a dictionary.

import ast
import re
import requests

url = 'https://www.msccruisesusa.com/webapp/wcs/stores/servlet/MSC_SearchCruiseManagerRedirectCmd?storeId=12264&langId=-1004&catalogId=10001&monthsResult=&areaFilter=MED%40NOR%40&embarkFilter=&lengthFilter=&departureFrom=01.11.2020&departureTo=04.11.2020&ships=&category=&onlyAvailableCruises=true&packageTrf=false&packageTpt=false&packageCrol=false&packageCrfl=false&noAdults=2&noChildren=0&noJChildren=0&noInfant=0&dealsInput=false&tripSpecificationPanel=true&shipPreferencesPanel=false&dealsPanel=false'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
response = requests.get(url, headers=headers)

# Extract the body of the `_ports` object from the HTML.
json_data = re.findall(r"_ports = {\n\s\s(.+?)\n\s\s};", response.text)

# Convert the string to a dictionary. ast.literal_eval parses only literals,
# so unlike eval it cannot execute arbitrary code taken from the page.
json_data = ast.literal_eval('{' + json_data[0] + '}')

print(json_data.values())

Output:

dict_values(['Aqaba, Jordan', 'Valencia, Spain', 'Cairs, Australia', 'Venice, Italy', 'Shekou, China', 'Shanghai, China', 'Goteborg, Sweden', 'Darwin, Australia', 'George Town, Cayman Islands', 'Siracusa, Italy', 'Genoa, Italy', 'Reykjavik, Iceland', 'Havana, Cuba', 'Singapore, Republic of Singapore', 'Arica, Chile', 'Hamburg,Germany', 'Kusadasi, Turkey', 'Yokohama, Japan', 'Valparaiso,Chile', 'Copenhagen, Denmark', 'Civitavecchia, Italy', 'Barcelona, Spain', 'Auckland, New Zealand', 'Livorno, Italy', 'Montevideo, Uruguay', 'Brindisi, Italy', 'Kiel,Germany', 'San Juan, Puerto Rico', 'Callao, Peru', 'Funchal, Portugal', 'Haifa, Israel', 'Lisbon, Portugal', 'Papeete, Tahiti', 'Trieste, Italy', 'Piraeus, Greece', 'Rio de Janeiro, Brazil', 'Keelung, Taiwan', 'Buenos Aires, Argentina', 'New York, United States', 'Salvador, Brazil', 'Tianjin, China', 'Valletta, Malta', 'Santos, Brazil', 'Cannes, France', 'Naples, Italy', 'Fukuoka, Japan', 'Ushuaia,Argentina', 'Philipsburg, St. Maarten', 'Zeebrugge, Belgium', 'Durban, South Africa', 'Istanbul, Turkey', 'Cagliari, Italy', 'Vigo, Spain', 'Dubai,U.Arab Emirates', 'Amsterdam, Netherlands', 'Tampa, United States', 'Doha, Qatar', 'Abu Dhabi,U.Arab Emirates', 'Itajai, Brazil', 'Port Kembla, Australia', 'Tokyo, Japan', 'Cartagena, Spain', 'Nassau, Bahamas', 'Messina, Italy', 'Benoa/Bali, Indonesia', 'Nansha,China', 'Heraklion, Greece', 'Mumbai/Bombay, India', 'Muscat, Oman', 'Wellington, New Zealand', 'Warnemunde,Germany', 'Fort de France, Martinique', 'Isafjordur, Iceland', 'Bridgetown, Barbados', 'Marseille, France', 'Sydney, Australia', 'Miami, Florida', 'Cozumel, Mexico', 'Rotterdam, Netherlands', 'Izmir, Turkey', 'Cape Town, South Africa', 'Qingdao, China', 'Palma de Mallorca, Spain', 'San Francisco, United states', 'Hobart, Australia', 'Malaga, Spain', 'Palermo, Italy', 'St Nazaire, France', 'Mindelo, Cape Verde', 'Pointe-a-Pitre, Guadeloupe', 'Hong Kong,Hong Kong', 'Le Havre, France', 'Ocean Cay MSC Marine Reserve', 'St Petersburg, Russian Fed.', 'Ilhabela, Brazil', 'Ancona, Italy', ......., 'Corner Brook, Canada', 'Brunsbuttel,Germany', 'Newcastle, Australia', 'Busan, Korea, Republic of', 'Maputo, Mozambique'])
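
The extraction step can be tried offline. The sketch below is only an illustration: `SAMPLE_HTML`, the port codes `GOA` and `MRS`, and the helper name `extract_ports` are all made up, and it assumes the embedded object is flat, strict JSON (double-quoted keys and strings), which may not hold for the real page. Under that assumption `json.loads` can replace evaluating the string as Python.

```python
import json
import re

# Hypothetical stand-in for the page source: the static template placeholder
# plus an inline script that holds the port names. Real markup may differ.
SAMPLE_HTML = """
<span class="cr-city-name">$PORT</span>
<script>
var _ports = {
  "GOA": "Genoa, Italy", "MRS": "Marseille, France"
  };
</script>
"""


def extract_ports(html):
    """Return the `_ports` object embedded in `html` as a dict, or {} if absent."""
    # Non-greedy match up to the first "}" that is followed by ";".
    # This is enough for a flat object but not for nested ones.
    match = re.search(r"_ports\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
    if match is None:
        return {}
    # json.loads accepts strict JSON only and never executes code.
    return json.loads(match.group(1))


ports = extract_ports(SAMPLE_HTML)
print(list(ports.values()))
```

Returning an empty dict when the pattern is missing avoids the `IndexError` that indexing an empty `re.findall` result would raise if the page layout changes.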

Alternatively, you can use Selenium with ChromeDriver, which runs the page's JavaScript so that the rendered HTML contains the port names. You can then extract the data with lxml.

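from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from lxml import etree

# Selenium 4 takes the driver path through a Service object;
# 3.x releases accepted executable_path=... directly.
driver = webdriver.Chrome(service=Service(r"***YOUR_CHROME-DRIVER_PATH***"))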
driver.get('https://www.msccruisesusa.com/webapp/wcs/stores/servlet/MSC_SearchCruiseManagerRedirectCmd?storeId=12264&langId=-1004&catalogId=10001&monthsResult=&areaFilter=MED%40NOR%40&embarkFilter=&lengthFilter=&departureFrom=01.11.2020&departureTo=04.11.2020&ships=&category=&onlyAvailableCruises=true&packageTrf=false&packageTpt=false&packageCrol=false&packageCrfl=false&noAdults=2&noChildren=0&noJChildren=0&noInfant=0&dealsInput=false&tripSpecificationPanel=true&shipPreferencesPanel=false&dealsPanel=false')

html = etree.HTML(driver.page_source)
driver.quit()  # end the whole browser session, not just the current window

portList = html.xpath('//*[@class="cr-city-name"]')

for port in portList:
    print(port.xpath('string()'), end=' | ')

Output:

Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | MSC Grandiosa | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | MSC Grandiosa | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | MSC Grandiosa | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | MSC Grandiosa | Palermo, Italy | Valletta, Malta | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | MSC Grandiosa | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | MSC Grandiosa | Barcelona, Spain | Marseille, France | Genoa, Italy | Civitavecchia, Italy | Palermo, Italy | Valletta, Malta | Barcelona, Spain | MSC Grandiosa | Civitavecchia, Italy | Genoa, Italy | Malaga, Spain | Funchal, Portugal | Santa Cruz de Tenerife, Spain | Tangier, Morocco | Cartagena, Spain | Civitavecchia, Italy | MSC Opera | Civitavecchia, Italy | Genoa, Italy | Malaga, Spain | Funchal, Portugal | Santa Cruz de Tenerife, Spain | Tangier, Morocco | Cartagena, Spain | Civitavecchia, Italy | MSC Opera | $PORT | 
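
The output above repeats each port once per itinerary and ends with the `$PORT` template placeholder. If only the distinct names are needed, a small post-processing step along these lines could help. The function name `unique_ports` and the `sample` list are made up for illustration. Ship names such as "MSC Grandiosa" share the same class and are not filtered here; dropping them would need a separate rule.

```python
def unique_ports(names, placeholder="$PORT"):
    """Drop the template placeholder and repeated names, keeping first-seen order."""
    cleaned = (name.strip() for name in names)
    # dict.fromkeys preserves insertion order (Python 3.7+),
    # so it behaves like an ordered set here.
    return list(dict.fromkeys(n for n in cleaned if n and n != placeholder))


sample = ["Marseille, France", "Genoa, Italy", "Marseille, France", "MSC Grandiosa", "$PORT"]
print(unique_ports(sample))
```

With the Selenium code above, the input would be `[port.xpath('string()') for port in portList]`.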

You can download ChromeDriver from here.

...