Python BeautifulSoup - How to extract text from HTML/XML tags in "linear" order
October 7, 2019

I have a block of markup like the one below, and I need to extract the text from it (this is mock data):

<text>
            <table>
              <tbody>
<tr><td>&#xA0;</td><td><content styleCode="Bold">General Adult Exam</content></td></tr>
<tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr>
<tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr>
<tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr>
<tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr>
</tbody>
            </table>
          </text>

Reading it in (note that my actual string contains no newlines):

medsoup = '<text>            <table>              <tbody><tr><td>&#xA0;</td><td><content styleCode="Bold">General Adult Exam</content></td></tr><tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody>            </table>          </text>'
medsoup  
Out[358]: '<text>            <table>              <tbody><tr><td>&#xA0;</td><td><content styleCode="Bold">General Adult Exam</content></td></tr><tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody>            </table>          </text>'

Question: How do I extract the text from each tag in "linear" order, the way a person would read it from left to right, with a space between each piece of text?

What I have tried: get_text() returns the text, but with no separators (spaces) between the individual text entries. Note how General Adult ExamConstitutional:General Appearance: all runs together, when what I need is General Adult Exam Constitutional: General Appearance:

from bs4 import BeautifulSoup
parsed_soup = BeautifulSoup(medsoup, 'lxml')
parsed_soup.get_text().strip()
Out[340]: 'General Adult ExamConstitutional:General Appearance: healthy-appearing, well-nourished, well-developedLungs:Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchiCardiovascular:Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruitsMusculoskeletal::Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation'
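
For comparison, get_text() also accepts separator and strip arguments, which may be enough to get the spacing described above. The following is only a minimal sketch: the shortened sample string, the helper name linear_text, and the use of the built-in 'html.parser' instead of 'lxml' are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup (assumption: same shape as the real data).
sample = (
    '<text> <table> <tbody>'
    '<tr><td>&#xA0;</td>'
    '<td><content styleCode="Bold">General Adult Exam</content></td></tr>'
    '<tr><td><content styleCode="Bold">Constitutional:</content></td>'
    '<td>General Appearance: healthy-appearing</td></tr>'
    '</tbody> </table> </text>'
)


def linear_text(markup, parser='html.parser'):
    """Hypothetical helper: join every text node in document order."""
    soup = BeautifulSoup(markup, parser)
    # separator=' ' puts one space between text nodes; strip=True trims each
    # node and skips nodes that are empty after trimming (including the
    # cell that holds only a non-breaking space).
    return soup.get_text(separator=' ', strip=True)


print(linear_text(sample))
```

Because each text node is stripped before joining, whitespace-only nodes between tags should not produce doubled spaces in the result.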

If I try to iterate over the individual elements of the soup, hoping to add a space after each piece of text, I get something odd: there appears to be only one element to iterate over.

for i, ele in enumerate(parsed_soup):
    print(i, ele, '\n')


0 <html><body><text> <table> <tbody><tr><td> </td><td><content stylecode="Bold">General Adult Exam</content></td></tr><tr><td><content stylecode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content stylecode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content stylecode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content stylecode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody> </table> </text></body></html> 
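
A likely explanation for the single element: iterating over a BeautifulSoup object visits only its direct children, and with the 'lxml' parser the whole fragment is wrapped in one <html> element. The sketch below contrasts that with stripped_strings, which walks every text node in document order. The short sample string and the use of 'html.parser' (which does not add the <html><body> wrapper, so the single top-level child here is <text>) are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# One-row sample (assumption: same structure as the real data).
sample = (
    '<text><table><tbody><tr>'
    '<td><content styleCode="Bold">Lungs:</content></td>'
    '<td>Respiratory effort: no dyspnea.</td>'
    '</tr></tbody></table></text>'
)

soup = BeautifulSoup(sample, 'html.parser')

# Iterating the soup yields direct children only, hence a single item.
top_level = list(soup)

# stripped_strings yields each non-empty text node, already trimmed,
# in the order it appears in the document.
pieces = list(soup.stripped_strings)
joined = ' '.join(pieces)

print(len(top_level))
print(pieces)
print(joined)
```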

I also tried next_siblings and next_element to step through the tags, but I could not get either one to work.
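
For reference, this is roughly how a next_element walk could look; it is a sketch under assumptions, not a tested solution for the real data. next_element links every node (tags and strings alike) in parse order, so the loop has to keep only the string nodes. The helper name walk_strings, the short sample, and the 'html.parser' choice are all hypothetical.

```python
from bs4 import BeautifulSoup, NavigableString

# One-row sample (assumption: same structure as the real data).
sample = (
    '<text> <table><tbody><tr>'
    '<td><content styleCode="Bold">Cardiovascular:</content></td>'
    '<td>Heart Auscultation: RRR</td>'
    '</tr></tbody></table> </text>'
)


def walk_strings(markup, parser='html.parser'):
    """Hypothetical helper: collect text nodes by following next_element."""
    soup = BeautifulSoup(markup, parser)
    pieces = []
    # Start at the first top-level node and follow the parse-order chain.
    node = soup.contents[0] if soup.contents else None
    while node is not None:
        # Tags are skipped; their text shows up later as separate string nodes.
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:
                pieces.append(text)
        node = node.next_element
    return pieces


print(' '.join(walk_strings(sample)))
```

By contrast, next_siblings only moves among nodes that share a parent, which may be why it never descends into the nested <td> and <content> tags.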

...