Question

I'm new to Python and would be very grateful if you could help me with my text-extraction problem.

I want to extract all the text that lies between two expressions in a text file (the beginning and the end of a letter). For both the beginning and the end of the letter there are several possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to analyze this for a number of files; below is an example of what such a text file looks like. Here I want to extract all the text from "Dear" to "Douglas". In cases where "letter_end" has no match, i.e. no letter_end expression is found, the output should start at the letter_begin match and end at the very end of the text file being analyzed.

Edit: the end of the "recorded text" should come after the "letter_end" match and before the first line with 20 or more characters (as is the case for "Random text here as well" -> len = 24).

"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""

This is my code so far, but it cannot flexibly capture the text between the expressions (there can be anything (lines, text, numbers, symbols, etc.) before "letter_begin" and after "letter_end"):

import re

letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"


with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()
    # record all text between the regex's beginning and end expressions
    output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE)
    print(output)

I am very grateful for any help!

Wiktor Stribiżew · Answer 1 · November 6, 2018

You can use

regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)

This template produces a regular expression like

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}

See the regex demo. Note that you should not use re.DOTALL with this pattern, because it relies on . not matching newlines. The re.MULTILINE option is also redundant, since the pattern contains no ^ or $ anchors.
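To illustrate why re.DOTALL gets in the way, here is a small sketch on a made-up string (the sample text and the variable names are assumptions for illustration only). With re.DOTALL the greedy .* runs to the end of the input, so the limit of two trailing lines no longer has any effect:

```python
import re

# Made-up sample text, only for illustration.
text = ("intro\nDear all\nbody line\nBest regards\nDouglas\n\n"
        "trailing line one\ntrailing line two")
pattern = r"(?:dear)[\s\S]*?(?:best regards).*(?:\n.*){0,2}"

# Without DOTALL, "." stops at each newline, so at most two extra lines follow the closing.
without_dotall = re.findall(pattern, text, re.IGNORECASE)

# With DOTALL, ".*" also consumes newlines and swallows everything up to the end of the text.
with_dotall = re.findall(pattern, text, re.IGNORECASE | re.DOTALL)

print(without_dotall)
print(with_dotall)
```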

Details

(?:dear|to our|estimated) - any of the three alternatives
[\s\S]*? - any 0+ characters, as few as possible
(?:sincerely|yours|best regards) - any of the three alternatives
.* - any 0+ characters other than a newline
(?:\n.*){0,2} - zero, one or two repetitions of a newline followed by any 0+ characters other than a newline.
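If the opening or closing expressions could ever contain regex metacharacters (a dot, a parenthesis and so on), it may be safer to escape each one before joining them. A minimal sketch, where build_letter_regex is a hypothetical helper name and the closing "yours." is an invented example:

```python
import re

def build_letter_regex(openings, closings):
    """Hypothetical helper: escape each alternative, then assemble the pattern described above."""
    begin = "|".join(re.escape(s) for s in openings)
    end = "|".join(re.escape(s) for s in closings)
    return re.compile(
        r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(begin, end),
        re.IGNORECASE,
    )

# "yours." contains a dot, which would match any character if left unescaped.
letter_re = build_letter_regex(["dear", "to our"], ["sincerely", "best regards", "yours."])

matched = letter_re.findall("x\nDear Bob\nhello\nYours.\nAnn\n\nmore")
not_matched = letter_re.findall("Dear Bob\nyoursX\nend")
print(matched)
print(not_matched)
```

For the three plain-word closings in the question, escaping changes nothing; it only matters once an expression contains special characters.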

Python demo code:

import re
text="""Some random text here

Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))

Output:

['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
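The pattern above always keeps up to two lines after the closing and returns nothing when no closing is found. The question also asks to fall back to the end of the file in that case, and to stop before the first line of 20 or more characters. One possible way to cover both is sketched below; extract_letter and its max_line_len parameter are hypothetical names, this is not part of the original answer, and the sample text is shortened for illustration:

```python
import re

def extract_letter(text, openings, closings, max_line_len=20):
    """Hypothetical helper: return the letter text, or None if no opening is found."""
    begin_re = re.compile("|".join(map(re.escape, openings)), re.IGNORECASE)
    end_re = re.compile("|".join(map(re.escape, closings)), re.IGNORECASE)

    start = begin_re.search(text)
    if start is None:
        return None
    end = end_re.search(text, start.end())
    if end is None:
        # No closing expression: keep everything up to the end of the text.
        return text[start.start():]

    # Keep the rest of the closing line, then the following lines
    # until one has max_line_len or more characters.
    tail = text[end.end():].split("\n")
    kept = [tail[0]]
    for line in tail[1:]:
        if len(line) >= max_line_len:
            break
        kept.append(line)
    return (text[start.start():end.end()] + "\n".join(kept)).rstrip()

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]

sample = ("Some random text here\n\nDear Shareholders We\nare pleased to report.\n"
          "Best regards \nDouglas\n\nRandom text here as well")
letter = extract_letter(sample, letter_begin, letter_end)
no_closing = extract_letter("Intro\nDear all\nno sign-off here", letter_begin, letter_end)
print(repr(letter))
print(repr(no_closing))
```

For several files, the same function could be called on the contents of each file after reading it with open(...).read().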

Python Regex - extracting text between (multiple) expressions in a text file
