Select H3 or UL elements after a specific H2 element?
1 vote
/ July 11, 2020

I am trying to scrape the ready.gov website to find out what to do during a natural disaster, for a school project. I am using Python and Beautiful Soup for this task.

Each disaster has its own page, for example:

  1. https://www.ready.gov/active-shooter
  2. https://www.ready.gov/public-spaces

On each page, the section describing the actions to take during the event can be identified by its h2, which contains the word "During" or "DURING".

This is where I am stuck. I can request the page and parse it just fine, but I am having trouble selecting the steps under that h2 tag.

I have tried the following.

import re
import requests as rq
from bs4 import BeautifulSoup as bs

# Open page
disaster_pg = 'https://www.ready.gov/public-spaces'
req = rq.get(disaster_pg).text
# parse html using beautifulsoup and store in soup
disaster_soup = bs(req, 'html.parser')
disaster_soup

## doesn't recognize the h3 element
# h2s = disaster_soup.find_all('h2')
# for h2 in h2s:
#   if 'uring' in h2.text:
#     print (h2.text)
#     print (h2.h3)
#   elif 'URING' in h2.text:
#     print (h2.text)
#     print (h2.h3)
#   # print (h2)
#   # print(h2.h3.text)

## gets me to the correct element, but don't know how to navigate from here
# for elem in disaster_soup(text=re.compile(r'[dD]isaster')):
#     print (elem)

Any help would be greatly appreciated.

Ideally, the query should give me the sub-headings listed under the h2 tag that contains the word "during". For example, for the public-spaces page I would get: ['Stay Alert', 'Run to Safety', 'Cover and Hide', 'Defend, Disrupt, Fight', 'Help the Wounded']

and for the active-shooter page: ['RUN and escape if possible.', 'HIDE if escape is not possible.', 'FIGHT as an absolute last resort.']

Thanks in advance =]

Answers [ 2 ]

1 vote
/ July 11, 2020

You can use .find_previous('h2') to check whether the nearest preceding <h2> contains "during" or "DURING", and .find_previous('h3') to get the actual step heading.

For example:

import requests
from pprint import pprint
from bs4 import BeautifulSoup


urls = ['https://www.ready.gov/active-shooter', 'https://www.ready.gov/public-spaces']

all_data = []
for url in urls:
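    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # Walk every list inside the main content region.
    for ul in soup.select('.region-content ul'):
        # Skip lists whose nearest preceding <h2> is not the "During" section.
        h2 = ul.find_previous('h2')
        if h2 is None or 'during' not in h2.text.lower():
            continue
        # The nearest preceding <h3> is the step title; the list text is its detail.
        all_data.append({ul.find_previous('h3').get_text(strip=True): ul.get_text(strip=True, separator='\n')})
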
pprint(all_data)

Prints:

[{'RUN and escape if possible.': 'Getting away from the shooter or shooters is '
                                 'the top priority.\n'
                                 'Leave your belongings behind and get away.\n'
                                 'Help others escape, if possible, but '
                                 'evacuate regardless of whether others agree '
                                 'to follow.\n'
                                 'Warn and prevent individuals from entering '
                                 'an area where the active shooter may be.\n'
                                 'Call 9-1-1 when you are safe and describe '
                                 'the shooter, location and weapons.'},
 {'HIDE if escape is not possible.': 'Get out of the shooter’s view and stay '
                                     'very quiet.\n'
                                     'Silence all electronic devices and make '
                                     'sure they won’t vibrate.\n'
                                     'Lock and block doors, close blinds and '
                                     'turn off lights.\n'
                                     'Don’t hide in groups. Spread out along '
                                     'walls or hide separately to make it more '
                                     'difficult for the shooter.\n'
                                     'Try to communicate with police silently. '
                                     'Use text message or social media to tag '
                                     'your location or put a sign in a '
                                     'window.\n'
                                     'Stay in place until law enforcement '
                                     'gives you the all clear.\n'
                                     'Your hiding place should be out of the '
                                     "shooter's view and provide protection if "
                                     'shots are fired in your direction.'},
 {'FIGHT\xa0as an absolute last resort.': 'Commit to your actions and act as '
                                          'aggressively as possible against '
                                          'the shooter.\n'
                                          'Recruit others to ambush the '
                                          'shooter with makeshift weapons like '
                                          'chairs, fire extinguishers, '
                                          'scissors, books, etc.\n'
                                          'Be prepared to cause severe or\xa0'
                                          'lethal injury to the shooter.\n'
                                          'Throw items and improvise weapons '
                                          'to distract and disarm the '
                                          'shooter.'},
 {'Stay Alert': 'Pay attention to what is happening around you so that you can '
                'react quickly to attacks.'},
 {'Run to Safety': 'If there is an accessible escape path, attempt to evacuate '
                   'the building or area regardless of whether others agree to '
                   'follow.'},
 {'Cover and Hide': 'If evacuation is not possible find a place to hide out of '
                    'view of the attacker and if possible, put a solid barrier '
                    'between yourself and the threat.\n'
                    'Keep silent.'},
 {'Defend, Disrupt, Fight': 'As a last resort, when you can’t run or cover, '
                            'attempt to disrupt the attack or disable the '
                            'attacker.\n'
                            'Be aggressive and commit to your actions.'},
 {'Help the Wounded': 'Take care of yourself first and then, if you are able, '
                      'help the wounded get to safety and provide immediate '
                      'care.'}]
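
The question asks for a plain list of step titles rather than a list of dictionaries. A minimal sketch of that last step, assuming records shaped like the all_data entries printed above (one {title: details} dictionary per step); step_titles is an illustrative name, not part of the answer, and the sample details are shortened:

```python
def step_titles(records):
    """Return the key of each {title: details} record as a flat list of titles."""
    titles = []
    for record in records:
        for title in record:
            # Some headings contain non-breaking spaces ('\xa0'); use plain spaces.
            titles.append(title.replace('\xa0', ' ').strip())
    return titles


# Sample records modelled on the printed result (details shortened).
sample = [
    {'FIGHT\xa0as an absolute last resort.': 'Commit to your actions ...'},
    {'Stay Alert': 'Pay attention to what is happening around you ...'},
]

print(step_titles(sample))
# ['FIGHT as an absolute last resort.', 'Stay Alert']
```

Replacing '\xa0' explicitly keeps the titles comparable with ordinary strings typed with a regular space.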
0 votes
/ July 11, 2020

This is a nice problem. The data format on this page is a little different, and the approach below still needs some optimization.

Page used: https://www.ready.gov/active-shooter

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.ready.gov/active-shooter")
soup = BeautifulSoup(res.text, "lxml")
div = soup.find("div",class_="field-item even")
data = []
temp = {}  # section currently being collected


# Walk all descendant tags in document order.
for ele in div.find_all(True):
    if ele.name == "h2":
        # A new <h2> starts a new section; store the previous one first.
        if temp:
            data.append(temp)
        temp = {"heading": ele.text.strip()}
    elif ele.name == "ul":
        description = [i.text.strip() for i in ele.find_all("li")]
        temp.setdefault("description", []).append(description)
    elif ele.name == "h3":
        temp.setdefault("sub-heading", []).append(ele.text.strip())

# Store the last section, which no later <h2> would flush.
if temp:
    data.append(temp)

print(data)

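Output (as written, the code stores 'sub-heading' as a list and 'description' as a list of lists, so the flatter structure below appears to come from a variant that kept only the last value per section):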

[{'heading': 'Associated Content',
  'description': ['RUN. HIDE. FIGHT.® Surviving an Active Shooter Event - English (Video)',
   'Active Shooter Information Sheet (PDF)',
   'Department of Homeland Security (DHS) Active Shooter Preparedness Resources (Training, videos, brochures and more for individualized audiences link)',
   'Department of Homeland Security (DHS) Active Shooter Preparedness Resources Translated (Link)',
   'Conducting Security Assessments: A Guide for Schools and Houses of Worship Webinar (Link)']},
 {'heading': 'Be Informed',
  'description': ['Sign up for an active shooter training.',
   'If you see something, say something to the authorities right away.',
   'Sign up to receive local emergency alerts and register your contact information with any work-sponsored alert system.',
   'Be aware of your environment and any possible dangers.']},
 {'heading': 'Make a Plan',
  'description': ['Make a plan with your family and make sure everyone knows what to\xa0do if confronted with an active shooter.',
   'Wherever you go look for the two nearest exits, have an escape path in mind and identify places you could hide if necessary.',
   'Understand the plans for individuals with disabilities or other access and functional needs.']},
 {'heading': 'During',
  'sub-heading': 'FIGHT\xa0as an absolute last resort.',
  'description': ['Commit to your actions and act as aggressively as possible against the shooter.',
   'Recruit others to ambush the shooter with makeshift weapons like chairs, fire extinguishers, scissors, books, etc.',
   'Be prepared to cause severe or\xa0lethal injury to the shooter.',
   'Throw items and improvise weapons to distract and disarm the shooter.']},
 {'heading': 'After',
  'description': ['Keep hands visible and empty.',
   'Know that law enforcement’s first task is to end the incident and they may have to pass injured along the way.',
   'Officers may be armed with rifles, shotguns or handguns and may use pepper spray or tear gas to control the situation.',
   'Officers will shout commands and may push individuals to the ground for their safety.',
   'Follow law enforcement instructions and evacuate in the direction they come from unless otherwise instructed.',
   'Take care of yourself first, and then you may be able to help the wounded before first responders arrive.',
   'If the injured are in immediate danger, help get them to safety.',
   'While you wait for first responders to arrive, provide first aid. Apply direct pressure to wounded areas and use tourniquets if you have been trained to do so.',
   'Turn wounded people onto their sides if they are unconscious and keep them warm.',
   'Consider seeking professional help for you and your family to cope with the long-term effects of the trauma.']}]
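
In the output above, the 'During' section shows only one sub-heading, although the page lists several steps there. One possible way to keep every sub-heading paired with its own list is sketched below. It runs on an inline snippet of hypothetical, simplified markup rather than the live page, and group_sections is an illustrative name, not part of the answer:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page content.
SAMPLE_HTML = """
<div class="field-item even">
  <h2>Make a Plan</h2>
  <ul><li>Look for the two nearest exits.</li></ul>
  <h2>During</h2>
  <h3>RUN and escape if possible.</h3>
  <ul><li>Leave your belongings behind and get away.</li></ul>
  <h3>HIDE if escape is not possible.</h3>
  <ul><li>Lock and block doors.</li><li>Stay very quiet.</li></ul>
</div>
"""


def group_sections(html):
    """Map each <h2> text to a list of {'sub-heading', 'items'} steps."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    steps = None        # steps of the <h2> section currently being read
    sub_heading = None  # most recent <h3> inside that section
    # find_all with a list of names yields matching tags in document order.
    for ele in soup.find_all(["h2", "h3", "ul"]):
        if ele.name == "h2":
            steps = sections.setdefault(ele.get_text(strip=True), [])
            sub_heading = None
        elif steps is None:
            continue  # content before the first <h2> is ignored
        elif ele.name == "h3":
            sub_heading = ele.get_text(strip=True)
        else:  # a <ul>: attach its items to the current sub-heading
            items = [li.get_text(strip=True) for li in ele.find_all("li")]
            steps.append({"sub-heading": sub_heading, "items": items})
    return sections


sections = group_sections(SAMPLE_HTML)
# Keep only the steps of sections whose heading mentions "during".
during = [step
          for name, steps in sections.items() if "during" in name.lower()
          for step in steps]
print([step["sub-heading"] for step in during])
# ['RUN and escape if possible.', 'HIDE if escape is not possible.']
```

This sketch assumes flat lists. A <ul> nested inside another <ul> would be visited twice, so real pages may need an extra check.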
...