совпадение строк заменить на регистр с учетом регистра - PullRequest
2 голосов
/ 29 июля 2011

Я новичок в Python и пытаюсь сделать что-то новое. У меня есть два списка в словаре. Допустим,

List1:                              List2:
Anterior                            cord
cuneate nucleus                     Medulla oblongata
nucleus                             Spinal cord
Intermediolateral nucleus           Spinal 
                                    sksdsj
british                             7

И у меня есть несколько строк текста, как показано ниже:

<s id="5239778-2">The name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="3691284-1">In the medulla oblongata, the arcuate nucleus is a group of neurons located on the anterior surface of the medullary pyramids.</s>
<s id="21120-99">Anterior horn cells, motoneurons located in the spinal.</s>
<s id="1053949-16">The Anterior cord syndrome results from injury to the anterior part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>
<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>

Мне нужно вернуть те строки, которым принадлежит строка, как из list1, так и из list2.Так я пробовал использовать следующий код:

result = ""
if list1 in line and list2 in line:
    i1 = re.sub('(?i)(\s+)(%s)(\s+)'%list1, '\\1<e1>\\2</e1>\\3', line)
    i2 = re.sub('(?i)(\s+)(%s)(\s+)'%list2, '\\1<e2>\\2</e2>\\3', i1)
    result = result + i2 + "\n"
    continue

Но я получаю следующий результат:

<s id="5239778-2">The name refers collectively to the <e1>cuneate nucleus</e1> and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="3691284-1">In the medulla oblongata, the arcuate <e1>nucleus</e1> is a group of neurons located on the anterior surface of the medullary pyramids.</s>
<s id="21120-99">Anterior horn cells, motoneurons located in the spinal.</s>
<s id="1053949-16">The <e1>Anterior</e1> <e2>cord</e2> syndrome results from injury to the <e1>anterior</e1> part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>
<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>

Здесь, только строка результата-4, которую я получил, совпадает со строкой из обоих списков, и это то, что я хочу. Но я не хочу получать те строки, которые соответствуют только одной строке или ее нет (например, строка результата - 1 и 3). Также, если совпадает строка из обоих списков, она должна пометить их (например, строка результата-2).

Любая помощь будет принята с благодарностью.

Ответы [ 2 ]

5 голосов
/ 29 июля 2011

По сути, вы хотите поместить некоторые слова в теги <e1> и другие слова в теги <e2>.Это правильно?

Если так, то что-то вроде этого подойдет:

#!/usr/bin/python

from __future__ import print_function
import re

text = '''\
<s id="5239778-2">The name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="3691284-1">In the medulla oblongata, the arcuate nucleus is a group of neurons located on the anterior surface of the medullary pyramids.</s>
<s id="21120-99">Anterior horn cells, motoneurons located in the spinal cord.</s>
<s id="1053949-16">The Anterior cord syndrome results from injury to the anterior part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>'''

list1 = ('Anterior', 'cuneate nucleus', 'Intermediolateral nucleus')
list2 = ('cord', 'Medulla oblongata', 'Spinal cord')

# put phrases in \b so that they match the whole words
re1 = re.compile("(%s)" % "|".join(r"\b%s\b" % i for i in list1), re.IGNORECASE)
re2 = re.compile("(%s)" % "|".join(r"\b%s\b" % i for i in list2), re.IGNORECASE)

for line in text.split("\n"):
    line = re1.sub(r"<e1>\1</e1>", line)
    line = re2.sub(r"<e2>\1</e2>", line)
    print(line)

Вывод:

<s id="5239778-2">The name refers collectively to the <e1>cuneate nucleus</e1> and gracile nucleus, which are present at the junction between the <e2>spinal cord</e2> and the <e2>medulla oblongata</e2>.</s>
<s id="3691284-1">In the <e2>medulla oblongata</e2>, the arcuate nucleus is a group of neurons located on the <e1>anterior</e1> surface of the medullary pyramids.</s>
<s id="21120-99"><e1>Anterior</e1> horn cells, motoneurons located in the <e2>spinal cord</e2>.</s>
<s id="1053949-16">The <e1>Anterior</e1> <e2>cord</e2> syndrome results from injury to the <e1>anterior</e1> part of the <e2>spinal cord</e2>, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the <e2>spinal cord</e2>.</s>
1 голос
/ 29 июля 2011

Как насчет этого:

result = ""
lines = ['<s id="5239778-2">The name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>',
'<s id="3691284-1">In the medulla oblongata, the arcuate nucleus is a group of neurons located on the anterior surface of the medullary pyramids.</s>',
'<s id="21120-99">Anterior horn cells, motoneurons located in the spinal cord.</s>',
'<s id="1053949-16">The Anterior cord syndrome results from injury to the anterior part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>']

for line in lines:
    for item1 in list1:
        if line.find(item1) != -1:
            for item2 in list2:
                if line.find(item2) != -1:
                      result = result + line + '\n'
                      break
            break
print result
Добро пожаловать на сайт PullRequest, где вы можете задавать вопросы и получать ответы от других членов сообщества.
...