Question

I have the following, returned from a Python requests call:

{"error":{"ErrorMessage":"
<div>
<p>To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here 
    <a href=\\"http:\\/\\/www.southhams.gov.uk\\/wastequestion\\">www.southhams.gov.uk\\/wastequestion<\\/a><\\/p><\\/div>","CodeName":"Success","ErrorStatus":0},"calendar":{"calendar":"
        <div class=\\"wsResponse\\">To protect your privacy, this form will not display details such as a clinical or assisted collection. If you believe that the information detailed above is incomplete or incorrect, please tell us here 
            <a href=\\"http:\\/\\/www.southhams.gov.uk\\/wastequestion\\">www.southhams.gov.uk\\/wastequestion<\\/a><\\/div>"},"binCollections":{"tile":[["
                <div class=\'collectionDiv\'>
                    <div class=\'fullwidth\'>
                        <h3>Organic Collection Service (Brown Organic Bin)<\\/h3><\\/div>
                            <div class=\\"collectionImg\\">
                                <img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/brown bin.png\\" \\/><\\/div>\\n                    
                                <div class=\'wdshDetWrap\'>Your brown organic bin collection is 
                                    <b>Fortnightly<\\/b> on a 
                                        <b>Thursday<\\/b>.
                                            <br\\/> \\n                    Your next scheduled collection is 
                                            <b>Friday, 29 May 2020<\\/b>. 
                                                <br\\/>
                                                <br\\/>
                                                <a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3427\\">Read more about the Organic Collection Service &gt;<\\/a><\\/div><\\/div>"],["
                                                    <div class=\'collectionDiv\'>
                                                        <div class=\'fullwidth\'>
                                                            <h3>Recycling Collection Service (Recycling Sacks)<\\/h3><\\/div>
                                                                <div class=\\"collectionImg\\">
                                                                    <img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/SH_two_rec_sacks.png\\" \\/><\\/div>\\n                    
                                                                    <div class=\'wdshDetWrap\'>Your recycling sacks collection is 
                                                                        <b>Fortnightly<\\/b> on a 
                                                                            <b>Thursday<\\/b>.
                                                                                <br\\/> \\n                    Your next scheduled collection is 
                                                                                <b>Friday, 29 May 2020<\\/b>. 
                                                                                    <br\\/>
                                                                                    <br\\/>
                                                                                    <a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3383\\">Read more about the Recycling Collection Service &gt;<\\/a><\\/div><\\/div>"],["
                                                                                        <div class=\'collectionDiv\'>
                                                                                            <div class=\'fullwidth\'>
                                                                                                <h3>Refuse Collection Service (Grey Refuse Bin)<\\/h3><\\/div>
                                                                                                    <div class=\\"collectionImg\\">
                                                                                                        <img src=\\"https:\\/\\/southhams.fccenvironment.co.uk\\/library\\/images\\/grey bin.png\\" \\/><\\/div>\\n                    
                                                                                                        <div class=\'wdshDetWrap\'>Your grey refuse bin collection is 
                                                                                                            <b>Fortnightly<\\/b> on a 
                                                                                                                <b>Thursday<\\/b>.
                                                                                                                    <br\\/> \\n                    Your next scheduled collection is 
                                                                                                                    <b>Thursday, 04 June 2020<\\/b>. 
                                                                                                                        <br\\/>
                                                                                                                        <br\\/>
                                                                                                                        <a href=\\"https:\\/\\/www.southhams.gov.uk\\/article\\/3384\\">Read more about the Refuse Collection Service &gt;<\\/a><\\/div><\\/div>"]]}}

I would like to extract the following for each collectionDiv (3):

Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020

Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020

So far I have tried loading response.content into Python's json handler, but I am still stuck on extracting the data, so I tried BeautifulSoup with soup.find_all("div", class_="wdshDetWrap"), but I still can't get the exact data. Would lxml or something similar be an easier way?

Thanks for looking.

Request code:

import requests

url = "https://southhams.fccenvironment.co.uk/mycollections"

response = requests.request("GET", url)

cookiejar = response.cookies
for cookie in cookiejar:
    print(cookie.name, cookie.value)

url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"

payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)
headers = {
  'X-Requested-With': 'XMLHttpRequest',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
}

response = requests.request("POST", url, headers=headers, data = payload)

print(response.status_code)
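
The request code above assembles the form body and the Cookie header by hand. As a minimal sketch of the same step, the helper below builds both from a token and a UPRN. The function name `build_collection_request` and the token value `abc123` are hypothetical and used only for illustration; the field and cookie names are copied from the code above.

```python
from urllib.parse import urlencode


def build_collection_request(session_token, uprn):
    """Return (payload, headers) for the getcollectiondetails POST."""
    # urlencode escapes any special characters in the token or the UPRN.
    payload = urlencode({"fcc_session_token": session_token, "uprn": uprn})
    headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Content-Type": "application/x-www-form-urlencoded",
        "Cookie": "fcc_session_cookie={}".format(session_token),
    }
    return payload, headers


# "abc123" is a placeholder token, not a real session value.
payload, headers = build_collection_request("abc123", "100040282539")
print(payload)  # fcc_session_token=abc123&uprn=100040282539
```

The returned values can be passed to requests.post(url, headers=headers, data=payload) exactly as in the original snippet.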

chitown88 · Answer 1 · 25 May 2020

You can get the JSON directly and then pull the HTML value out of it. Once you have that, use BeautifulSoup to parse the HTML and print the text inside the tags that hold it:

import requests
from bs4 import BeautifulSoup

url = "https://southhams.fccenvironment.co.uk/mycollections"

response = requests.get(url)

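# Print the cookies set by the first request. The loop variable `cookie`
# is reused below, so this relies on the session cookie being the last
# (or only) cookie in the jar.
cookiejar = response.cookies
for cookie in cookiejar:
    print(cookie.name, cookie.value)

url = "https://southhams.fccenvironment.co.uk/ajaxprocessor/getcollectiondetails"

payload = 'fcc_session_token={}&uprn=100040282539'.format(cookie.value)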
headers = {
  'X-Requested-With': 'XMLHttpRequest',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Cookie': 'fcc_session_cookie={}'.format(cookie.value)
}

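# The endpoint returns JSON, so decode it directly instead of handling raw text.
jsonData = requests.post(url, headers=headers, data=payload).json()

# Each entry in 'tile' is a one-element list holding an HTML fragment.
data = jsonData['binCollections']['tile']
for each in data:
    soup = BeautifulSoup(each[0], 'html.parser')
    # The service name is the <h3> inside the collectionDiv wrapper.
    collection = soup.find('div', {'class': 'collectionDiv'}).find('h3').text.strip()
    # The last <b> tag in the fragment holds the next collection date.
    date = soup.find_all('b')[-1].text.strip()

    print(collection, date)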

Output:

Organic Collection Service (Brown Organic Bin) Friday, 29 May 2020
Recycling Collection Service (Recycling Sacks) Friday, 29 May 2020
Refuse Collection Service (Grey Refuse Bin) Thursday, 04 June 2020
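
For readers who want to try the same idea without network access or third-party packages, here is a minimal offline sketch. It is not what the answer uses: it swaps BeautifulSoup for the standard library's html.parser. The names `TileParser` and `extract_collections` are hypothetical, and `sample` is a shortened fragment shaped like the response in the question.

```python
import json
from html.parser import HTMLParser


class TileParser(HTMLParser):
    """Collect the <h3> heading and the text of every <b> tag in one tile."""

    def __init__(self):
        super().__init__()
        self.current = None  # tag whose text is currently being captured
        self.heading = ""
        self.bold = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h3", "b"):
            self.current = tag
            if tag == "b":
                self.bold.append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current == "h3":
            self.heading += data
        elif self.current == "b":
            self.bold[-1] += data


def extract_collections(raw_json):
    """Return (service name, next collection date) for every tile."""
    tiles = json.loads(raw_json)["binCollections"]["tile"]
    results = []
    for tile in tiles:
        parser = TileParser()
        parser.feed(tile[0])
        parser.close()
        # As in the answer, the last <b> tag is taken to be the date.
        results.append((parser.heading.strip(), parser.bold[-1].strip()))
    return results


# Shortened sample in the same shape as the real response (assumption).
sample = json.dumps({"binCollections": {"tile": [[
    "<div class='collectionDiv'><div class='fullwidth'>"
    "<h3>Organic Collection Service (Brown Organic Bin)</h3></div>"
    "<div class='wdshDetWrap'>Your brown organic bin collection is "
    "<b>Fortnightly</b> on a <b>Thursday</b>.<br/> "
    "Your next scheduled collection is <b>Friday, 29 May 2020</b>.</div></div>"
]]}})

print(extract_collections(sample))
```

The sketch assumes every tile contains at least one <b> tag; a tile without one would raise an IndexError, so real code may want a guard there.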

Abhishek Jebaraj · Answer 2 · 25 May 2020

The HTML document from this particular site is not formatted properly. I still managed to work around it (this would be inefficient at a scale of around 1000 tags).

So this can be improved.

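# Note: `soup` is not defined in this snippet. It appears to assume the raw,
# still-escaped response text was parsed, e.g.
# soup = BeautifulSoup(response.text, 'html.parser')
headers = soup.find_all('h3')
# Cut each text at the first '<', which drops the unparsed '<\/h3>' remainder.
names = [tag.text[:tag.text.find('<')] for tag in headers]
# The third <b> after each heading holds the next collection date.
dates = [tag.find_all('b')[2].text[:tag.find_all('b')[2].text.find('<')] for tag in headers]

print(names)
print(dates)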

#Output
['Organic Collection Service (Brown Organic Bin)', 'Recycling Collection Service (Recycling Sacks)', 'Refuse Collection Service (Grey Refuse Bin)']
['Friday, 29 May 2020', 'Friday, 29 May 2020', 'Thursday, 04 June 2020']
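
One plausible reason the HTML looks malformed is that the response body is JSON, where closing tags arrive escaped as `<\/h3>`. Parsing the raw text leaves those escapes in place, while decoding the JSON first turns them back into ordinary closing tags. The sketch below illustrates this with a shortened, hypothetical fragment (`raw_body`) shaped like the response in the question; it is an assumption about the cause, not something the answer states.

```python
import json

# Shortened, hypothetical fragment in the same shape as the response body.
raw_body = (
    r'{"binCollections":{"tile":[["<h3>Refuse Collection Service<\/h3>'
    r'<b>Thursday, 04 June 2020<\/b>"]]}}'
)

# In the raw text the closing tags are still escaped, so an HTML parser
# would not recognise them as closing tags.
print("<\\/h3>" in raw_body)  # True

# Decoding the JSON converts the "\/" escapes back to "/".
html = json.loads(raw_body)["binCollections"]["tile"][0][0]
print(html)  # <h3>Refuse Collection Service</h3><b>Thursday, 04 June 2020</b>
```

After decoding, each fragment is ordinary HTML, so the slicing on '<' shown above is no longer needed.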

Best way to extract specific parts from an HTML/JSON page?
