I have some text from an XML document, and I am trying to extract the text of the tags that contain certain words.
For example, below:
search('adverse')
should return the text of all tags containing the word "adverse":
Out:
[
"<item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>"
]
and search('clinical')
should return two results, since two tags contain that word:
Out:
[
"<title>6.1 Clinical Trials Experience</title>",
"<paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>"
]
What tools should I use for this? RegEx? BS4? Any suggestions are welcome.
Sample text:
</highlight>
</excerpt>
<component>
<section id="ID40">
<id root="fbc21d1a-2fb2-47b1-ac53-f84ed1428bb4"></id>
<title>6.1 Clinical Trials Experience</title>
<text>
<paragraph id="ID41">The clinical efficacy and safety of coadministered dutasteride and tamsulosin, which are individual components of dutasteride and tamsulosin hydrochloride capsules, have been evaluated in a multicenter, randomized, double-blind, parallel group trial (the Combination with Alpha-Blocker Therapy, or CombAT, trial) </paragraph>
<list id="ID42" listtype="unordered" stylecode="Disc">
<item>The most common adverse reactions reported in subjects receiving coadministered dutasteride and tamsulosin were impotence, decreased libido, breast disorders (including breast enlargement and tenderness), ejaculation disorders, and dizziness.</item>
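To make the desired behavior concrete, here is a minimal sketch of what such a search could look like using only the standard-library `re` module. It assumes that only "leaf" tags (tags whose content is plain text with no nested tags) should be returned, that matching is case-insensitive and on whole words, and that the function takes the XML text as an extra first argument; none of this is stated in the question. The names `LEAF_ELEMENT`, `search`, and `sample` are hypothetical, and the sample is an abbreviated copy of the fragment above. Regular expressions are brittle on general XML, so a real parser such as BS4 is likely the more robust choice for full documents.

```python
import re

# Matches a "leaf" element: an opening tag, text containing no nested tags,
# and the matching closing tag. Assumption: only such elements are wanted.
LEAF_ELEMENT = re.compile(r"<(\w+)\b[^>]*>([^<]*)</\1>")


def search(xml_text, word):
    """Return the markup of every leaf element whose text contains `word`.

    Matching is case-insensitive and on whole words (an assumption).
    """
    word_re = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [
        match.group(0)
        for match in LEAF_ELEMENT.finditer(xml_text)
        if word_re.search(match.group(2))
    ]


# Abbreviated version of the sample fragment (texts shortened for brevity).
sample = """</highlight>
</excerpt>
<component>
<section id="ID40">
<id root="fbc21d1a-2fb2-47b1-ac53-f84ed1428bb4"></id>
<title>6.1 Clinical Trials Experience</title>
<text>
<paragraph id="ID41">The clinical efficacy and safety have been evaluated.</paragraph>
<list id="ID42" listtype="unordered" stylecode="Disc">
<item>The most common adverse reactions were impotence and dizziness.</item>
"""

for hit in search(sample, "clinical"):
    print(hit)
```

Because the pattern never tries to parse the whole document, it tolerates a fragment like the one above that starts with stray closing tags, which a strict XML parser would reject.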