I have a block of markup like the one below, from which I need to extract the text (this is mock data):
<text>
<table>
<tbody>
<tr><td> </td><td><content styleCode="Bold">General Adult Exam</content></td></tr>
<tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr>
<tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr>
<tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr>
<tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr>
</tbody>
</table>
</text>
Reading it in (note that my actual string contains no newlines):
medsoup = '<text> <table> <tbody><tr><td> </td><td><content styleCode="Bold">General Adult Exam</content></td></tr><tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody> </table> </text>'
medsoup
Out[358]: '<text> <table> <tbody><tr><td> </td><td><content styleCode="Bold">General Adult Exam</content></td></tr><tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody> </table> </text>'
Question: How do I extract the text from each tag in "linear" order, the way a human would read it from left to right, with a space between each piece of text?
What I've tried: get_text() returns the text, but with no delimiters (spaces) between the individual text entries. Notice how General Adult ExamConstitutional:General Appearance:
all runs together, when what I need is General Adult Exam Constitutional: General Appearance:
from bs4 import BeautifulSoup
parsed_soup = BeautifulSoup(medsoup, 'lxml')
parsed_soup.get_text().strip()
Out[340]: 'General Adult ExamConstitutional:General Appearance: healthy-appearing, well-nourished, well-developedLungs:Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchiCardiovascular:Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruitsMusculoskeletal::Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation'
If I try to iterate over the individual elements of the soup, hoping to add a space after each chunk of text, I get a strange result: there seems to be only one element to iterate over.
for i, ele in enumerate(parsed_soup):
    print(i, ele, '\n')
0 <html><body><text> <table> <tbody><tr><td> </td><td><content stylecode="Bold">General Adult Exam</content></td></tr><tr><td><content stylecode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content stylecode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content stylecode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content stylecode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody> </table> </text></body></html>
I also tried next_siblings and next_element to iterate over the tags, but I can't get either of them to work.
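For reference, here is a minimal sketch of one approach that may give the spacing described above. It relies on the separator and strip arguments of get_text(), and on the stripped_strings generator, both of which Beautiful Soup provides. The sample string is a shortened stand-in for medsoup, and the built-in 'html.parser' is used only so the sketch runs without lxml installed; I would expect the 'lxml' parser from the question to behave the same way here, though I have not confirmed that on the full data. As far as I understand, the single element seen in the loop above appears because iterating over a soup object yields only its direct children, and lxml wraps the whole fragment in one html element.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the medsoup string in the question (assumption: same structure).
sample = (
    '<text> <table> <tbody>'
    '<tr><td> </td><td><content styleCode="Bold">General Adult Exam</content></td></tr>'
    '<tr><td><content styleCode="Bold">Constitutional:</content></td>'
    '<td>General Appearance: healthy-appearing</td></tr>'
    '</tbody> </table> </text>'
)

soup = BeautifulSoup(sample, 'html.parser')

# separator=' ' is placed between text nodes; strip=True trims each node
# and skips the ones that are only whitespace.
spaced = soup.get_text(separator=' ', strip=True)

# Equivalent spelling: walk every text node in document order and join them.
joined = ' '.join(soup.stripped_strings)

print(spaced)
```

Both spellings visit the text nodes in document order, so the result should read left to right the way the table does.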