How about:
import re
# `data` holds the record text from the question.
# Note the two spaces in "TI  - " and "AB  - ".
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
    print(i.groupdict())
Output:
{'pmid': '19587274', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n be continuously acquired, interpreted, and used to guide appropriate motor\n responses. For example, when driving, a red \n', 'title': None}
{'pmid': '19583148', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n amyloidosis.\n'}
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None}
Edit
As a verbose RE, to make it easier to follow (I think verbose REs should be used for anything but the simplest expressions, but that's just my opinion!):
#!/usr/bin/python
import re
reg4 = re.compile(r'''
    ^                       # Start of a line (due to re.MULTILINE, this may match at the start of any line)
    (?:                     # Non-capturing group with multiple options, first option:
        PMID-\s             # Literal "PMID-" followed by a space
        (?P<pmid>[0-9]+)    # Then a string of one or more digits, group as 'pmid'
    |                       # Next option:
        TI\s{2}-\s          # "TI", two spaces, a hyphen and a space
        (?P<title>.*?)      # The title, a non-greedy match that will capture everything up to...
        ^PG                 # The characters PG at the start of a line
    |                       # Next option:
        AB\s{2}-\s          # "AB", two spaces, a hyphen and a space
        (?P<abstract>.*?)   # The abstract, a non-greedy match that will capture everything up to...
        ^AD                 # "AD" at the start of a line
    )
''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())
Note that you can replace ^PG and ^AD with ^\S to make it more general (you want to match everything up to the first non-whitespace character at the start of a line).
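The ^\S variant might look something like the sketch below. The SAMPLE string is a made-up record (not the question's data) that only assumes the same field layout, with wrapped lines indented by spaces; the names SAMPLE, general_re and matches are illustrative.

```python
import re

# Hypothetical sample record; only the field layout is assumed to match the real data.
SAMPLE = (
    "PMID- 12345\n"
    "TI  - A made-up title that wraps\n"
    "      onto a second line.\n"
    "PG  - 1-10\n"
    "AB  - A made-up abstract.\n"
    "AD  - Nowhere University\n"
)

general_re = re.compile(r'''
    ^                                  # Start of any line (re.MULTILINE)
    (?:
        PMID-\s (?P<pmid>[0-9]+)       # PMID field
    |
        TI\s{2}-\s (?P<title>.*?)      # Title, lazily, up to...
        ^\S                            # ...the next line that does not start with whitespace
    |
        AB\s{2}-\s (?P<abstract>.*?)   # Abstract, lazily, up to...
        ^\S                            # ...the next line that does not start with whitespace
    )
''', re.MULTILINE | re.DOTALL | re.VERBOSE)

matches = [m.groupdict() for m in general_re.finditer(SAMPLE)]
for match in matches:
    print(match)
```

One thing to keep in mind: ^\S consumes the first character of the following line, so a field that directly follows the title or abstract cannot itself be matched by this pattern. If that matters for your data, a lookahead such as (?=^\S) would end the capture at the same place without consuming anything.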
Edit 2
If you want to capture the whole record with a single regular expression, get rid of the initial (?: and the final ), and change the | characters to .*?:
#!/usr/bin/python
import re
reg4 = re.compile(r'''
    ^                       # Start of a line (due to re.MULTILINE, this may match at the start of any line)
    PMID-\s                 # Literal "PMID-" followed by a space
    (?P<pmid>[0-9]+)        # Then a string of one or more digits, group as 'pmid'
    .*?                     # Skip ahead (non-greedy) to the next part:
    TI\s{2}-\s              # "TI", two spaces, a hyphen and a space
    (?P<title>.*?)          # The title, a non-greedy match that will capture everything up to...
    ^PG                     # The characters PG at the start of a line
    .*?                     # Skip ahead (non-greedy) to the next part:
    AB\s{2}-\s              # "AB", two spaces, a hyphen and a space
    (?P<abstract>.*?)       # The abstract, a non-greedy match that will capture everything up to...
    ^AD                     # "AD" at the start of a line
''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())
This gives:
{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n be continuously acquired, interpreted, and used to guide appropriate motor\n responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n amyloidosis.\n'}
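The captured titles and abstracts still contain the line breaks and continuation indentation from the input. If you would rather have single-line strings, one possible clean-up is sketched below. The data string is a made-up two-record input in the same layout, and tidy and record_re are hypothetical names used only for this illustration.

```python
import re

# Hypothetical two-record input; only the field layout is assumed to match the real data.
data = (
    "PMID- 111\n"
    "TI  - First made-up title\n"
    "      continued on a second line.\n"
    "PG  - 1-2\n"
    "AB  - First made-up abstract.\n"
    "AD  - Somewhere\n"
    "\n"
    "PMID- 222\n"
    "TI  - Second made-up title.\n"
    "PG  - 3-4\n"
    "AB  - Second made-up abstract\n"
    "      with a wrapped line.\n"
    "AD  - Elsewhere\n"
)

# Same single-regex approach as above, written more compactly.
record_re = re.compile(r'''
    ^PMID-\s (?P<pmid>[0-9]+)          .*?
    TI\s{2}-\s (?P<title>.*?) ^PG      .*?
    AB\s{2}-\s (?P<abstract>.*?) ^AD
''', re.MULTILINE | re.DOTALL | re.VERBOSE)


def tidy(text):
    """Collapse newlines and runs of whitespace into single spaces."""
    return ' '.join(text.split())


records = [
    {name: tidy(value) for name, value in m.groupdict().items()}
    for m in record_re.finditer(data)
]
for record in records:
    print(record)
```

Each record then comes out as one dict with 'pmid', 'title' and 'abstract' as plain single-line strings.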