Parsing a section of an HTML file into CSV
0 votes
/ January 11, 2020

I am new to Python. I am trying to extract all the answers given by the executives (listed at the top) of a web page (https://www.dropbox.com/s/uka24w7o5006ole/transcript-86-855.html?dl=0). The page is stored on my hard drive, so I am reading it from a local file rather than from a URL.

My end result should look like this:

Column 1
all the executives

Column 2
all the answers

The answers should be taken only from the "Question-and-Answer" section.

Here is what I have tried:

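from bs4 import BeautifulSoup

# The page is a local file, so it is opened directly instead of being fetched.
with open('transcript-86-855.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Parsed tag names are lowercase, so search for 'div', not 'DIV'.
article_qanda = soup.find('div', id='article_qanda')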

Can anyone help me?

1 Answer

0 votes
/ January 11, 2020

If I understood you correctly, you want to print two columns: one with the name (in this case Dror Ben Asher) and the other with his answer.

For example:

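import textwrap
from bs4 import BeautifulSoup

with open('page.html', 'r') as f_in:
    soup = BeautifulSoup(f_in.read(), 'html.parser')

print('{:<30} {:<70}'.format('Name', 'Answer'))
print('-' * 101)

# Select the first <p> after each <strong> naming the speaker, but only after
# the "Question-and-Answer Session" marker paragraph.
# Note: newer soupsieve releases spell this pseudo-class :-soup-contains
# and warn about the older :contains form.
for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
    txt = answer.get_text(strip=True)

    # Append the following paragraphs until the next speaker heading.
    s = answer.find_next_sibling()
    while s:
        if s.name == 'strong' or s.find('strong'):
            break
        if s.name == 'p':
            txt += ' ' + s.get_text(strip=True)
        s = s.find_next_sibling()

    # Wrap the text and indent continuation lines under the second column.
    txt = ('\n' + ' '*31).join(textwrap.wrap(txt))

    print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt))
    print()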

Prints:

Name                           Answer                                                                
-----------------------------------------------------------------------------------------------------
Dror Ben Asher - CEO           Thank you, Scott. Its a very good question indeed in January we
                               announced a new amendment and that amendment includes anti-TNF
                               patients some of them not all of them, those who qualify. And we are
                               talking about anti-TNF failures to be clear and only Remicade and
                               Humira. The idea here was to increase very significantly the patients
                               pooled of those potentially eligible for the study thus expediting
                               recruitment. Did I answer your question?

Dror Ben Asher - CEO           Right, this is one of most important tasks; right now the most
                               important item here is the divestment of non-core assets. All other
                               non-core assets, the non-core assets are those that are not within our
                               therapeutic focus of GI and inflammation. And those are specifically
                               RHB-103 RIZAPORT for migraine and RHB-101 which is a cardio drug.
                               RHB-101 is a legacy drug, we have recently announced last month, we
                               announced that we are in discussions for both of these product for
                               out-licensing, which we hope to complete in the first half of 2015. So
                               this is the highest priority, obviously discussion on other product,
                               but Redhill is in the fortunate position that we are able to complete
                               our Phase III studies with our existing results, resources and as time
                               goes by obviously the value of the assets keeps going up. So we are in
                               no rush to out-license everything else and so there is obviously in
                               track.

...and so on.
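
Since the question asks for CSV output and for more than one executive, here is a rough sketch of how the same idea might be extended. It is only an illustration under assumptions: the `SAMPLE_HTML` snippet is a made-up miniature of the transcript (speaker names in `<strong>` tags that are siblings of the `<p>` paragraphs), and `collect_answers` and `rows_to_csv` are hypothetical helper names, not part of Beautiful Soup. The real file may be laid out differently, so the traversal may need adjusting.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up miniature of the transcript layout (an assumption, not the real file).
SAMPLE_HTML = """
<div>
<p>Question-and-Answer Session</p>
<strong>Scott Henry</strong>
<p>Question one?</p>
<strong>Dror Ben Asher</strong>
<p>First part.</p>
<p>Second part.</p>
<strong>Operator</strong>
<p>Next question.</p>
<strong>Dror Ben Asher</strong>
<p>Another answer.</p>
</div>
"""


def collect_answers(soup, executives, marker="Question-and-Answer Session"):
    """Return (speaker, answer) pairs for the given executives after the marker."""
    start = next((p for p in soup.find_all("p") if marker in p.get_text()), None)
    if start is None:
        return []

    rows, speaker, parts = [], None, []
    for tag in start.find_next_siblings(True):  # True matches every tag
        if tag.name == "strong":
            # A new speaker heading closes the previous speaker's turn.
            if speaker and parts and any(name in speaker for name in executives):
                rows.append((speaker, " ".join(parts)))
            speaker, parts = tag.get_text(strip=True), []
        elif tag.name == "p":
            parts.append(tag.get_text(strip=True))

    # Flush the last turn, which has no heading after it.
    if speaker and parts and any(name in speaker for name in executives):
        rows.append((speaker, " ".join(parts)))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a two-column header."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Executive", "Answer"])
    writer.writerows(rows)
    return buf.getvalue()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = collect_answers(soup, {"Dror Ben Asher"})
print(rows_to_csv(rows))
```

To write a file instead of a string, the same `csv.writer` calls can be pointed at `open('answers.csv', 'w', newline='')`; the file name here is just a placeholder. Names are matched by substring so that a heading such as "Dror Ben Asher - CEO" would still be picked up.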