I'm experimenting with a regular expression in Python 2.7 to capture numbered footnotes in a text. My text, converted from a PDF, looks like this:
test_str = u"""
7. On 6 March 2013, the Appeals Chamber filed the Decision on Victim
Participation, in which it decided that the victims “may, through their legal
1
The full citation, including the ICC registration reference of all designations and abbreviations used in
this judgment are included in Annex 1.
2
A more detailed procedural history is set out in Annex 2 of this judgment.
ICC-01/04-02/12-271-Corr 07-04-2015 7/117 EK A
8/117
representatives, participate in the present appeal proceedings for the purpose of
presenting their views and concerns in respect of their personal interests in the issues
on appeal”.3
8. On 19 March 2013, the Prosecutor filed, confidentially, ex parte, available to the
Prosecutor and Mr Ngudjolo only, the Document in Support of the Appeal. The
Prosecutor filed a confidential redacted version of the Document in Support of the
Appeal on 22 March 2013, and a public redacted version of the Document in Support
of the Appeal on 3 April 2013. In the redacted version of the Document in Support of
the Appeal, the Prosecutor’s entire third ground of appeal was redacted.
"""
Note that the numbered paragraphs, which are regular content of my text, are prefixed with a number and a period (e.g. '5.'). Ideally, I would like to get something like:
[(1, "The full citation, including the ICC registration reference of all designations and abbreviations used in
this judgment are included in Annex 1."), (2, "A more detailed procedural history is set out in Annex 2 of this judgment.")]
My Python code for extracting the footnotes:
import re

regex = ur"""
(\r?\n)(?P<num>\d+)(?!\.) #first line
(?P<text>(?:\s(.|\r?\n)+?\s?(?:\n\n|\Z))) #following lines
"""
result = re.findall(regex, test_str, re.U|re.VERBOSE | re.X |re.MULTILINE)
which gives me:
[(u'\n', u'1', u'\n The full citation, including the ICC registration reference of all designations and abbreviations used in \nthis judgment are included in Annex 1. \n\n', u'.')]
i.e. only the first footnote, whereas I of course need both.
Any ideas are welcome!
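One likely reason only the first footnote is found: `re.findall` returns non-overlapping matches, and the pattern consumes the trailing `\n\n`, so the newline that `(\r?\n)` requires before the next number is already used up. A possible direction, shown here only as a minimal sketch, is to anchor the number with `^` under `re.MULTILINE` and to put the terminator in a lookahead so it is not consumed. The names `sample`, `FOOTNOTE_RE` and `extract_footnotes` are made up for illustration, the sample text is shortened, and treating a line starting with `ICC-` as the page footer is an assumption based on the excerpt above. Any body line that holds only a number would be misread as a footnote.

```python
import re

# Shortened sample shaped like the text in the question (hypothetical data).
sample = (
    u"Participation, in which it decided that the victims may, through their legal\n"
    u"1\n"
    u"The full citation is included in\n"
    u"Annex 1.\n"
    u"2\n"
    u"A more detailed procedural history is set out in Annex 2.\n"
    u"ICC-01/04-02/12-271-Corr 07-04-2015 7/117 EK A\n"
    u"8/117\n"
    u"representatives, participate in the present appeal proceedings.\n"
    u"8. On 19 March 2013, the Prosecutor filed the Document.\n"
)

FOOTNOTE_RE = re.compile(r"""
    ^(?P<num>\d+)[ \t]*\r?\n        # a line holding only the footnote number
    (?P<text>.+?)                   # footnote body, possibly several lines
    (?=                             # stop, without consuming, before:
        \r?\n[ \t]*\r?\n            #   a blank line
      | \r?\n\d+[ \t]*\r?\n         #   the next number-only line
      | \r?\nICC-                   #   the page footer (assumed prefix)
      | \s*\Z                       #   the end of the text
    )
""", re.UNICODE | re.VERBOSE | re.MULTILINE | re.DOTALL)


def extract_footnotes(text):
    """Return [(number, text), ...] with inner whitespace collapsed to single spaces."""
    return [
        (int(m.group("num")), u" ".join(m.group("text").split()))
        for m in FOOTNOTE_RE.finditer(text)
    ]


footnotes = extract_footnotes(sample)
```

Because the terminator sits inside a lookahead, the line break before the next number stays available for the next match, and numbered paragraphs such as `8.` or page markers such as `8/117` are skipped since the number is not alone on its line.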