Python Beautiful Soup web scraping - Clinicaltrials.gov - detailed description (beginner question) - PullRequest
0 votes
/ September 21, 2019

I'm trying to get both the brief and the detailed description of projects on Clinicaltrials.gov. I can get the brief summary fairly easily, and I could write a bunch of patchwork/splitting code to get the detailed description, but I'm looking for something cleaner. Also, for one of the URLs (https://clinicaltrials.gov/ct2/show/study/NCT03089801), the detailed description is hidden and my code cannot extract it. I want to revise my code so that it gets the detailed description in a cleaner way, even when it is "hidden". I'm stuck and would appreciate any help.

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

out = []

allncturls = ['https://clinicaltrials.gov/ct2/show/study/NCT03089801', 
'https://clinicaltrials.gov/ct2/show/NCT02655991']


for url in allncturls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    briefdescription = soup.find(class_='ct-body3 tr-indent2').get_text()
    m = soup.find_all(headers='studyInfoColData')
    detaileddescription = soup.find_all(class_='ct-body3')
    detaileddescription = str(detaileddescription)
    detaileddescription = detaileddescription.split('Detailed Description:')[1] if 'Detailed Description:' in detaileddescription else detaileddescription
    detaileddescription = detaileddescription.split('</div>, <td class="ct-body3">')[0]
    detaileddescription = detaileddescription.split('</div>, <div class="ct-body3 tr-indent2">')[1]

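    # Note: project_name and pi are not defined anywhere in this snippet as posted.
    data = {'project_name': project_name, 'pi': pi, 'briefdescription': briefdescription, 
    'detaileddescription': detaileddescription}
    out.append(data)

# Build the DataFrame and write the file once, after all URLs are processed.
df = pd.DataFrame(out)
df.to_excel('clinicaltrialstresults.xlsx')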

Answers [ 3 ]

2 votes
/ September 21, 2019

Here is an example of how to extract the short and the long study description using the requests module and lxml.html:

import requests
import lxml.html


def scraper(url: str, timeout: int = 5) -> tuple:
    """
    Scrape the short and detailed study descriptions.

    :param url: The url of the study.
    :type url: str

    :param timeout: How long to wait for a response.
    :type timeout: int

    :return: A tuple consisting of the short and long study description.
    """
    # Add long description toggler to url
    url += "?show_desc=Y#desc"
    # Make the request and parse as tree
    response = requests.get(url=url, timeout=timeout)
    tree = lxml.html.fromstring(response.text)
    short, long = tree.find_class("ct-body3 tr-indent2")
    short, long = short.text_content(), long.text_content()
    return short, long

The key was extending the URL with show_desc=Y#desc: this toggles the long description and adds it to the HTML.
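
Appending the string directly works for the two URLs in the question, but it would produce a malformed URL if the address already had a query string. As a minimal sketch of a more defensive variant, the helper below (the name `with_detailed_description` is hypothetical, not part of the answer above) sets the same `show_desc=Y` parameter and `#desc` fragment using only the standard library:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_detailed_description(url: str) -> str:
    """Return url with show_desc=Y set and the #desc fragment attached."""
    parts = urlsplit(url)
    # Keep existing query parameters, dropping any previous show_desc value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "show_desc"
    ]
    query.append(("show_desc", "Y"))
    # Rebuild the URL; the fragment is always replaced with "desc".
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(query), "desc")
    )


print(with_detailed_description("https://clinicaltrials.gov/ct2/show/study/NCT03089801"))
# https://clinicaltrials.gov/ct2/show/study/NCT03089801?show_desc=Y#desc
```

The result can be passed to `requests.get` in place of the concatenated string.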

Here is a test run with the first URL provided:

short, long = scraper("https://clinicaltrials.gov/ct2/show/study/NCT03089801")
print('Short description:\n\n%s\n%s\n\nLong description:\n\n%s' % (short, '-' * 25, long))
# Short description:
# 
# In order to enhance access to clinical and mental health services for Veterans who have geographic, clinical, or social barriers to in-person care, VA Offices of Connected Care and Rural Health began distributing 5,000 tablets to Veterans with access barriers in 2016. The objective of this Quality Improvement evaluation is to:
# 
#     Understand characteristics of Veterans who received tablets, the frequency and ways in which they used the tablets, and the effects of tablet use on access to VA services.
#     Through a survey of Veterans, evaluate patient experiences using the tablets, and determine how tablets influenced patients' experiences with VA care, including their satisfaction, communication with providers, and access to needed services.
#     Identify implementation barriers and facilitators to tablet distribution and use through interviews with clinicians and staff in a purposive sample of VA facilities
#     Evaluate the effects of tablet use on chronic medical condition outcomes (e.g., hypertension, diabetes) and mental health treatment initiation and engagement (e.g., for depression, PTSD, and substance use).
# 
# -------------------------
# 
# Long description:
# 
# Background:
#   Telehealth is a cornerstone of enhanced access for Veterans and across a range of conditions is associated with improved disease control, quality of life, and patient satisfaction. Increasingly Veterans are able to monitor their chronic conditions and communicate with clinicians and care teams via tablets and other devices. However, this service is currently only available to Veterans with in-home Internet and video capability, or Veterans who are able to travel to a VA community based outpatient clinics to connect with providers at other facilities. In 2016, in order to address this access gap and disparity, VA launched an initiative to distribute tablets to Veterans who have clinical needs for remote care, and barriers to traditional in-person access.
#   Veterans who meet specific need-based (access, technology, and clinical) criteria may be issued one of two devices: Commercially available Off the Shelf (COTS) for basic connectivity or Healthcare Access Tablet (HAT) with a general exam camera and optional peripheral devices (i.e., stethoscope, BP monitor, pulse oximeter, thermometer, or weight scale). VA providers refer eligible patients for the devices using a consult template in VA's electronic health record. Care delivered via the tablet is indicated in the referral and may include one or more of the following: Home Based Primary Care, Palliative Care, Mental Health Intensive Case Management, Spinal Cord Injury, Mental Health Care, care for patients with marked mobility problems, care for patients with cognitive problems (these patients must have a caregiver who can assist with technology), home evaluations, and rehabilitation/prosthetics. Once the patient is issued the device, he or she will receive tablet services from trained teleproviders.
#   The VA began distributing tablets in the spring of 2016, with the plan of distributing 5,000 tablets over the following 1-2 years. Veteran eligibility criteria for tablets include the following: 1) Enrolled in VA Healthcare, 2) Does not own a device or does not have working broadband or cellular internet connection, 3) Physically and cognitively able to operate the technology (or has caregiver who can assist), 4) Barriers to access, such as a) distance or geography, b) transportation issues, c) homebound or difficulty leaving home, d) other (described by provider), and 5) Provider and patient give informed consent agreeing to utilize telehealth for care.
#   The tablet initiative and evaluation have been designated as Quality Improvement by VA's Office of Rural Health. The evaluation will include the following:
# 
#     Tablet Recipient Characteristics, Use of Tablets, and Effects on Access. The investigators will first characterize Veterans who are issued and use the devices (e.g., age, sex, medical and mental health conditions, rural location/distance from VA). Investigators will describe the frequency of tablet use and the types of services that the Veteran receives (e.g., chronic disease management, mental health therapy, palliative care, home-based primary care). Investigators will analyze rates of in-person (outpatient, emergency care), telephone, and telehealth-based care before and after tablet distribution, and compare patterns to those observed in a cohort of comparable patients to assess whether tablets influence access and patterns of use.
#     Effects on Patient Experience. For patients receiving tablets beginning in March, 2017, the investigators will administer a survey at time of tablet receipt, and 3-6 months after that time, to examine changes in patients' satisfaction with VA care and their perceived access and communication, and to evaluate their experiences using the tablets. The survey will also assess patients' needs and risk factors (e.g., social support, health literacy), and how these factors impact patients' experiences with the tablets and VA care. If resources permit, the survey may be administered to a cohort of comparable patients who have not received tablets (to be determined as of March, 2017).
#     Implementation Evaluation. The implementation evaluation will be guided by the Consolidated Framework for Implementation Research (CFIR). The investigators will first administer an online survey to Facility Telehealth Coordinators (FTCs) at facilities that are distributing tablets. The survey will query FTCs about the tablet initiative, resources that facilitated implementation, and barriers that impeded implementation. The investigators will use survey responses to identify FTCs who represent a range of VA facilities (in terms of high vs. low tablet distribution rates). Follow-up interviews will be conducted by telephone. The investigators will transcribe and code the interviews using standard content analysis methods with the goal of understanding barriers and facilitators to tablet distribution within each of the CFIR domains.
#     Effects on Chronic Disease and Mental Health Outcomes. If resources are available in FY18, the investigators will evaluate how device distribution influences clinical outcomes for Veterans with common and high-risk conditions, such as hypertension, diabetes, and PTSD (conditions to be determined based on prevalence rates in the tablet recipient population). The investigators will compare measures of disease control (e.g., blood pressure readings, hemoglobin A1C levels) at 3 and 6 months after device shipment, and compare these levels to comparable patients from other facilities, using propensity scores to match patients on the basis of sociodemographic and clinical characteristics. The investigators will use similar methods to examine treatment initiation and engagement rates among patients with common mental health conditions, such as depression, PTSD, and substance use disorder.
# 
#   The proposed project will be conducted with support from the eHealth Partnered Evaluation Initiative, a partnership between QUERI and Office of Connected Health that aims to evaluate the implementation of patient-provider technologies across VA, and understand their impacts on Veteran experience, perceived burdens and benefits to clinical teams, access to care, other care processes, and Veteran health outcomes.
# 
1 vote
/ September 21, 2019

Change your URL as shown below so that the detailed description is included in the response automatically. With bs4 4.7.1+ you can also use :has and :contains to write more targeted CSS selectors. CSS selectors should be faster, and you can reuse the connection via a Session.

import requests, re
from bs4 import BeautifulSoup as bs

codes = ['NCT03089801', 'NCT02655991']
out = []

with requests.Session() as s:
    for code in codes:
        r = s.get(f'https://clinicaltrials.gov/ct2/show/study/{code}?show_desc=Y')
        soup = bs(r.content,'lxml')
        data = {'project_name': soup.select_one('.tr-h1').text.strip(), 
                'pi': soup.select_one('[headers="name"]').text,
                'briefdescription': re.sub('\n+|\xa0','',soup.select_one('.ct-body3:contains("Brief Summary:") + div').text.strip()), 
                'detaileddescription': ' '.join([i.text for i in soup.select('div:has(#detaileddesc) + div p')])
               }
        out.append(data)
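
To see how the two description selectors behave without making a request, here is a small sketch that runs them against an inline HTML fragment. The fragment is a simplified, hypothetical stand-in for the page markup (an assumption, not a copy of the real site), and `extract_descriptions` is an illustrative name. It uses `:-soup-contains`, which newer soupsieve releases prefer over the deprecated `:contains` spelling.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page has far more structure.
SAMPLE_HTML = """
<div class="tr-indent1">
  <div class="ct-body3">Brief Summary:</div>
  <div class="ct-body3 tr-indent2">Short text.</div>
  <div class="ct-body3"><span id="detaileddesc"></span>Detailed Description:</div>
  <div class="ct-body3 tr-indent2"><p>First paragraph.</p><p>Second paragraph.</p></div>
</div>
"""


def extract_descriptions(html: str) -> tuple:
    """Return (brief, detailed) text pulled out with CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    # The div immediately following the "Brief Summary:" label.
    brief = soup.select_one('.ct-body3:-soup-contains("Brief Summary:") + div')
    # Paragraphs in the div that follows the one holding the #detaileddesc anchor.
    paragraphs = soup.select("div:has(#detaileddesc) + div p")
    detailed = " ".join(p.get_text(strip=True) for p in paragraphs)
    return brief.get_text(strip=True), detailed


print(extract_descriptions(SAMPLE_HTML))
```

If a selector matches nothing, `select_one` returns `None`, so production code would want a check before calling `get_text`.
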
0 votes
/ September 21, 2019

The detailed description is not loaded until you click to expand it, so you cannot get it this way. BeautifulSoup can only parse HTML; it cannot click any buttons.

To click it, you need a library that interacts with the page. Use the Selenium library with a Firefox or Chrome driver. You need to install those browsers before you can use them.
