I am looking for ways to extract specific paragraphs from strings. I have many documents that I want to use for topic modeling, but they contain tables, figures, headings, etc. I only want to use the summary that each document usually contains, but the summaries are not clearly marked.
I converted the PDF files to text and tried something like this, but it did not work because the summary is introduced differently in every document:
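def get_summary(text):
    """Return the lines between the SUMMARY_BEGIN and SUMMARY_END markers."""
    summary_lines = []
    copy = False
    for line in text.splitlines():
        marker = line.strip()
        if marker == 'SUMMARY_BEGIN':
            copy = True   # start collecting after the opening marker
        elif marker == 'SUMMARY_END':
            copy = False  # stop collecting at the closing marker
        elif copy:
            summary_lines.append(line)
    # Join with newlines so adjacent lines do not run together
    return "\n".join(summary_lines)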
I don't want to search for the summary using 100 possible substrings.
Edit: a similar example:
Date
21 Jun 2017
name name [abc]
name name [abc]
name name [cbd]
name name
name name
name name
name name
name name
nonsense-word1
nonsense-word1
nonsense-word1
12354
37264324
Summary:
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
32 463264
324324
324432
32424
nonsense-word2
nonsense-word2
nonsense-word2
nonsense-word2
nonsense-word2
nonsense-word2
324
24442
name name
name name
name name
name name
3244324324
Date
21 Jun 2017
Date
21 Jun 2017
Date
21 Jun 2017
electronically validated
electronically validated
electronically validated
electronically validated
electronically validated
763254 3276 4276457234
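For a sample like the one above, one heuristic that avoids hard-coding heading strings might be to keep the longest run of consecutive "prose-like" lines. The following is only a minimal sketch: the function name `extract_prose_block` and the `min_words` threshold are hypothetical, and it assumes that summary lines are long sentences while the surrounding noise (names, numbers, dates, repeated labels) consists of short lines. That assumption may not hold for every document.

```python
def extract_prose_block(text, min_words=8):
    """Return the longest run of consecutive prose-like lines, joined by spaces.

    Assumption: a line counts as prose-like if it has at least `min_words`
    whitespace-separated words. Runs are compared by total word count.
    """
    def word_count(lines):
        return sum(len(l.split()) for l in lines)

    best, current = [], []
    # The trailing empty string forces the final run to be evaluated.
    for line in text.splitlines() + [""]:
        stripped = line.strip()
        if len(stripped.split()) >= min_words:
            current.append(stripped)
            continue
        # A short line ends the current run; keep it if it is the largest so far.
        if word_count(current) > word_count(best):
            best = current
        current = []
    return " ".join(best)
```

Calling `extract_prose_block(text)` on the converted PDF text would return the block of long sentences and skip a heading line such as `Summary:`, since that line is too short to count as prose. The threshold would probably need tuning per corpus, and documents with several long paragraphs would need an additional rule to pick the right one.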