For example, this is my string (text extracted from HTML):
html_text = """
TABLE OF CONTENTS
PART I
| ITEM 1. BUSINESS
| ITEM 1A. RISK FACTORS
| ITEM 1B. UNRESOLVED CONFLICTS
| ITEM 2. PROPERTIES
| ITEM 3. LEGAL PROCEEDINGS
We believe that relations with our employees are good; however, the competition
for such personnel is intense, and the loss of key personnel could have a
material adverse impact on our results of operations and financial condition.
ITEM 1A. | RISK FACTORS
Set forth below and elsewhere in this report and in other documents we file
with the SEC are descriptions of the risks and uncertainties that could cause
our actual results to differ materially from the results contemplated by the
forward-looking statements contained in this report.
ITEM 1B. UNRESOLVED CONFLICTS
Our future revenue, gross margins, operating results and net income are
difficult to predict and may materially"""
I wrote a regular expression to match "ITEM 1A. RISK FACTORS" (the heading in the body, not the entry in the Table of Contents):
re.search(r"(ITEM.*1A)*.+(RISK FACTORS).*\n+(?!\w)(?!.*ITEM.*1B)", html_text)
and I need another regular expression to match "ITEM 1B. UNRESOLVED CONFLICTS" (again, not the entry in the Table of Contents):
re.search(still trying to figure this out)
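One possible way to match the body headings rather than the Table of Contents rows is to anchor each pattern at the start of a line. This is only a sketch: it assumes, as in the sample above, that Table of Contents rows begin with "| " while body headings begin directly with "ITEM". The `sample` string is a shortened stand-in for `html_text`, and the names `item_1a` and `item_1b` are illustrative.

```python
import re

# Shortened stand-in for html_text with the same layout (assumption:
# table-of-contents rows start with "| ", body headings start with "ITEM").
sample = """
TABLE OF CONTENTS
| ITEM 1A. RISK FACTORS
| ITEM 1B. UNRESOLVED CONFLICTS
We believe that relations with our employees are good.
ITEM 1A. | RISK FACTORS
Set forth below are descriptions of the risks.
ITEM 1B. UNRESOLVED CONFLICTS
Our future revenue is difficult to predict.
"""

# With re.MULTILINE, ^ matches at the start of every line, so rows that
# begin with "| " cannot match. [ |]* allows an optional "|" separator
# between the item number and its title.
item_1a = re.search(r"^ITEM\s+1A\.[ |]*RISK FACTORS", sample, re.MULTILINE)
item_1b = re.search(r"^ITEM\s+1B\.[ |]*UNRESOLVED CONFLICTS", sample, re.MULTILINE)

print(item_1a.group(0))
print(item_1b.group(0))
```

If real filings indent their headings or format the Table of Contents differently, the line-start assumption would need to be adjusted.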
I want to capture all the text that occurs between these two matches.
The final text string should look like this:
final_text = """ ITEM 1A. | RISK FACTORS
Set forth below and elsewhere in this report and in other documents we file
with the SEC are descriptions of the risks and uncertainties that could cause
our actual results to differ materially from the results contemplated by the
forward-looking statements contained in this report."""
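
A minimal sketch of one way the desired result might be obtained with a single search: a lazy match that starts at the body heading for Item 1A and stops just before the body heading for Item 1B, using a lookahead. It relies on the assumption that body headings start at the beginning of a line while Table of Contents rows start with "| ", which holds for this sample but may not hold for every filing. The names `pattern` and `section` are illustrative.

```python
import re

html_text = """
TABLE OF CONTENTS
PART I
| ITEM 1. BUSINESS
| ITEM 1A. RISK FACTORS
| ITEM 1B. UNRESOLVED CONFLICTS
| ITEM 2. PROPERTIES
| ITEM 3. LEGAL PROCEEDINGS
We believe that relations with our employees are good; however, the competition
for such personnel is intense, and the loss of key personnel could have a
material adverse impact on our results of operations and financial condition.
ITEM 1A. | RISK FACTORS
Set forth below and elsewhere in this report and in other documents we file
with the SEC are descriptions of the risks and uncertainties that could cause
our actual results to differ materially from the results contemplated by the
forward-looking statements contained in this report.
ITEM 1B. UNRESOLVED CONFLICTS
Our future revenue, gross margins, operating results and net income are
difficult to predict and may materially"""

# re.MULTILINE: ^ matches at the start of each line, which skips the
#               table-of-contents rows beginning with "| ".
# re.DOTALL:    . also matches newlines, so the match can span lines.
# .*? is lazy, so it stops at the first line that starts with "ITEM 1B.".
pattern = re.compile(
    r"^ITEM\s+1A\..*?(?=^ITEM\s+1B\.)",
    re.MULTILINE | re.DOTALL,
)

match = pattern.search(html_text)
# Drop the trailing newline that precedes the ITEM 1B heading.
section = match.group(0).rstrip() if match else None
print(section)
```

The lookahead `(?=...)` checks for the Item 1B heading without consuming it, so the heading itself is left out of `section`.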