Question

I have two lists, and I want to find the matching elements using Python's difflib SequenceMatcher. It goes like this:

from difflib import SequenceMatcher
def match_seq(list1,list2):
    output=[]
    s = SequenceMatcher(None, list1, list2)
    blocks=s.get_matching_blocks()
    for bl in blocks:
        #print(bl, bl.a, bl.b, bl.size)
        for bi in range(bl.size):
            cur_a=bl.a+bi
            cur_b=bl.b+bi
            output.append((cur_a,cur_b))
    return output

So when I run it on two lists like these:

list1=["orange","apple","lemons","grapes"]
list2=["pears", "orange","apple", "lemons", "cherry", "grapes"]
for a,b in match_seq(list1,list2):
    print(a,b, list1[a],list2[b])

I get this output:

0 1 orange orange
1 2 apple apple
2 3 lemons lemons
3 5 grapes grapes

But suppose I don't want to match only identical elements, and would rather use a matching function (for example, a function that can match orange to oranges or vice versa, or match an equivalent word in another language).

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

Is there any option in difflib's SequenceMatcher, or in any other built-in Python library, that provides this, so that I can match list3 with list4, and also list3 with list5, the same way I did for list1 and list2?

More generally, can you think of a solution for this? I thought of replacing each word in the target list with the possible equivalents that I want to match, but this can be problematic because I may need several equivalents for each word, which may disturb the sequence.

jferard · Answer 1 · March 12, 2019

You have three main options: 1) write your own diff implementation; 2) hack the difflib module; 3) find a workaround.

Your own implementation

In case 1), you can look at this question and read a few books, such as CLRS or Robert Sedgewick's books.
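
For reference, here is a minimal sketch of what option 1 could look like. It is not difflib's algorithm but a classic longest-common-subsequence table, and it takes the matching function as a parameter. The names match_with and same_fruit are made up for this illustration, and the plural-stripping rule is only a toy assumption:

```python
def match_with(list1, list2, is_match):
    """Return index pairs of a longest common subsequence of list1 and
    list2, where is_match(x, y) decides whether two elements match."""
    n, m = len(list1), len(list2)
    # table[i][j] = length of the best alignment of list1[i:] and list2[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if is_match(list1[i], list2[j]):
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if is_match(list1[i], list2[j]):
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def same_fruit(x, y):
    # Toy rule (assumption): words match if they are equal up to trailing "s".
    return x.rstrip("s") == y.rstrip("s")


list3 = ["orange", "apple", "lemons", "grape"]
list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
pairs = match_with(list3, list4, same_fruit)
for a, b in pairs:
    print(a, b, list3[a], list4[b])
```

This runs in time and memory proportional to len(list1) * len(list2), so it is only meant for short lists.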

Hack the `difflib` module

In case 2), take a look at the source code: get_matching_blocks calls find_longest_match (line 479). At the core of find_longest_match is the dictionary b2j, which maps each element of list b to the indices where it occurs in b, and which is looked up with the elements of list a. If you overwrite this dictionary, you can achieve what you want. Here is the standard version:

>>> import difflib
>>> from difflib import SequenceMatcher
>>> list3 = ["orange","apple","lemons","grape"]
>>> list4 = ["pears", "oranges","apple", "lemon", "cherry", "grapes"]
>>> s = SequenceMatcher(None, list3, list4)
>>> s.get_matching_blocks()
[Match(a=1, b=2, size=1), Match(a=4, b=6, size=0)]
>>> [(b.a+i, b.b+i, list3[b.a+i], list4[b.b+i]) for b in s.get_matching_blocks() for i in range(b.size)]
[(1, 2, 'apple', 'apple')]

Here is the hacked version:

>>> s = SequenceMatcher(None, list3, list4)
>>> s.b2j
{'pears': [0], 'oranges': [1], 'apple': [2], 'lemon': [3], 'cherry': [4], 'grapes': [5]}
>>> s.b2j = {**s.b2j, 'orange':s.b2j['oranges'], 'lemons':s.b2j['lemon'], 'grape':s.b2j['grapes']}
>>> s.b2j
{'pears': [0], 'oranges': [1], 'apple': [2], 'lemon': [3], 'cherry': [4], 'grapes': [5], 'orange': [1], 'lemons': [3], 'grape': [5]}
>>> s.get_matching_blocks()
[Match(a=0, b=1, size=3), Match(a=3, b=5, size=1), Match(a=4, b=6, size=0)]
>>> [(b.a+i, b.b+i, list3[b.a+i], list4[b.b+i]) for b in s.get_matching_blocks() for i in range(b.size)]
[(0, 1, 'orange', 'oranges'), (1, 2, 'apple', 'apple'), (2, 3, 'lemons', 'lemon'), (3, 5, 'grape', 'grapes')]

This is not hard to automate, but I wouldn't recommend this solution, since there is a very simple workaround.
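
To give an idea of what automating the hack might look like, here is a rough sketch. The helper matcher_with_aliases and the aliases dictionary are hypothetical, and the code leans on the b2j attribute exactly as the session above does; as far as I know b2j is an implementation detail rather than a documented interface, so treat this as fragile:

```python
from difflib import SequenceMatcher


def matcher_with_aliases(list1, list2, aliases):
    """Build a SequenceMatcher whose b2j also answers for aliased words.

    aliases maps a word of list1 to the word of list2 it should match
    (hypothetical helper, not part of difflib).
    """
    s = SequenceMatcher(None, list1, list2)
    # Reuse the index lists of the target words for their aliases.
    extra = {word: s.b2j[target]
             for word, target in aliases.items() if target in s.b2j}
    # Must happen before get_matching_blocks() is first called.
    s.b2j = {**s.b2j, **extra}
    return s


list3 = ["orange", "apple", "lemons", "grape"]
list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
aliases = {"orange": "oranges", "lemons": "lemon", "grape": "grapes"}

s = matcher_with_aliases(list3, list4, aliases)
pairs = [(b.a + i, b.b + i) for b in s.get_matching_blocks() for i in range(b.size)]
print(pairs)
```

Note that an alias which is itself a word of list2 would overwrite that word's own entry, which is one more reason to prefer the workaround.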

The workaround

The idea is to group the words into families:

families = [{"pears", "peras"}, {"orange", "oranges", "naranjas"}, {"apple", "manzana"}, {"lemons", "lemon", "limón"}, {"cherry", "cereza"}, {"grape", "grapes"}]

Now it's easy to build a dictionary that maps each word of a family to one of those words (let's call it the main word):

>>> d = {w:main for main, *alternatives in map(list, families) for w in alternatives}
>>> d
{'pears': 'peras', 'orange': 'naranjas', 'oranges': 'naranjas', 'manzana': 'apple', 'lemon': 'lemons', 'limón': 'lemons', 'cherry': 'cereza', 'grape': 'grapes'}

Note that main, *alternatives in map(list, families) unpacks each family into a main word (the first of the list) and a list of alternatives, using the star operator (since the families are sets, which word comes first is arbitrary and may change from one run to the next):

>>> head, *tail = [1,2,3,4,5]
>>> head
1
>>> tail
[2, 3, 4, 5]

Then you can convert the lists to use only the main words:

>>> list3=["orange","apple","lemons","grape"]
>>> list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
>>> list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]
>>> [d.get(w, w) for w in list3]
['naranjas', 'apple', 'lemons', 'grapes']
>>> [d.get(w, w) for w in list4]
['peras', 'naranjas', 'apple', 'lemons', 'cereza', 'grapes']
>>> [d.get(w, w) for w in list5]
['peras', 'naranjas', 'apple', 'lemons', 'cereza', 'uvas']

The expression d.get(w, w) returns d[w] if w is a key, otherwise w itself. Hence, words that belong to a family are converted to the main word of that family, and the other words are left untouched.

These lists are easy to compare with difflib.

Important: the time complexity of converting the lists is negligible compared to that of the diff algorithm, so you shouldn't see a difference.
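
Because sets are unordered, the main word chosen for each family can differ between runs. If a reproducible mapping matters, one possible variant (a sketch, not part of the original approach) is to map every word to the alphabetically smallest member of its family. The convert helper here is just an illustrative name:

```python
families = [
    {"pears", "peras"},
    {"orange", "oranges", "naranjas"},
    {"apple", "manzana"},
    {"lemons", "lemon", "limón"},
    {"cherry", "cereza"},
    {"grape", "grapes"},
]

# Map every word (the main word included) to the smallest word of its
# family, so the result does not depend on set iteration order.
d = {w: min(family) for family in families for w in family}


def convert(words):
    # Words outside every family are left untouched.
    return [d.get(w, w) for w in words]


list3 = ["orange", "apple", "lemons", "grape"]
list5 = ["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]
print(convert(list3))
print(convert(list5))
```

Which word serves as the main word is irrelevant to the diff; it only has to be the same for every member of a family.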

Full code

As a bonus, the full code:

from difflib import SequenceMatcher

def match_seq(list1, list2):
    """A generator that yields matches of list1 vs list2"""
    s = SequenceMatcher(None, list1, list2)
    for block in s.get_matching_blocks():
        for i in range(block.size):
            yield block.a + i, block.b + i  # no need to store the matches, just yield them

def create_convert(*families):
    """Return a converter function that converts a list
    to the same list with only main words"""
    d = {w:main for main, *alternatives in map(list, families) for w in alternatives}
    return lambda L: [d.get(w, w) for w in L]

families = [{"pears", "peras"}, {"orange", "oranges", "naranjas"}, {"apple", "manzana"}, {"lemons", "lemon", "limón"}, {"cherry", "cereza"}, {"grape", "grapes", "uvas"}]
convert = create_convert(*families)

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]
list5=["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

print ("list3 vs list4")
for a,b in match_seq(convert(list3), convert(list4)):
    print(a,b, list3[a],list4[b])

#  list3 vs list4
# 0 1 orange oranges
# 1 2 apple apple
# 2 3 lemons lemon
# 3 5 grape grapes

print ("list3 vs list5")
for a,b in match_seq(convert(list3), convert(list5)):
    print(a,b, list3[a],list5[b])

# list3 vs list5
# 0 1 orange naranjas
# 1 2 apple manzana
# 2 3 lemons limón
# 3 5 grape uvas

cody · Answer 2 · March 12, 2019

Here is an approach that uses a class that inherits from UserString and overrides __eq__() and __hash__() so that strings considered synonyms evaluate as equal:

import collections
from difflib import SequenceMatcher


class SynonymString(collections.UserString):
    def __init__(self, seq, synonyms, inverse_synonyms):
        super().__init__(seq)

        self.synonyms = synonyms
        self.inverse_synonyms = inverse_synonyms

    def __eq__(self, other):
        if self.synonyms.get(other) and self.data in self.synonyms.get(other):
            return True
        return self.data == other

    def __hash__(self):
        if str(self.data) in self.inverse_synonyms:
            return hash(self.inverse_synonyms[self.data])
        return hash(self.data)


def match_seq_syn(list1, list2, synonyms):

    inverse_synonyms = {
        string: key for key, value in synonyms.items() for string in value
    }

    list1 = [SynonymString(s, synonyms, inverse_synonyms) for s in list1]
    list2 = [SynonymString(s, synonyms, inverse_synonyms) for s in list2]

    output = []
    s = SequenceMatcher(None, list1, list2)
    blocks = s.get_matching_blocks()

    for bl in blocks:
        for bi in range(bl.size):
            cur_a = bl.a + bi
            cur_b = bl.b + bi
            output.append((cur_a, cur_b))
    return output


list3 = ["orange", "apple", "lemons", "grape"]
list5 = ["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

synonyms = {
    "orange": ["oranges", "naranjas"],
    "apple": ["manzana"],
    "pears": ["peras"],
    "lemon": ["lemons", "limón"],
    "cherry": ["cereza"],
    "grape": ["grapes", "uvas"],
}

for a, b in match_seq_syn(list3, list5, synonyms):
    print(a, b, list3[a], list5[b])

Output (comparing list3 and list5):

0 1 orange naranjas
1 2 apple manzana
2 3 lemons limón
3 5 grape uvas
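
The approach depends on the rule that objects which compare equal must also hash equal, because SequenceMatcher stores the elements of the second list as dictionary keys. A more compact variant of the same idea, assuming every word has exactly one canonical form, could look like the sketch below. The Canonical class and the canon dictionary are illustrative names, not part of any library:

```python
from difflib import SequenceMatcher


class Canonical:
    """Wrap a word; equality and hash only look at its canonical form."""

    def __init__(self, word, canon):
        self.word = word
        self.key = canon.get(word, word)  # unknown words are their own key

    def __eq__(self, other):
        if not isinstance(other, Canonical):
            return NotImplemented
        return self.key == other.key

    def __hash__(self):
        return hash(self.key)


synonyms = {
    "orange": ["oranges", "naranjas"],
    "apple": ["manzana"],
    "pears": ["peras"],
    "lemon": ["lemons", "limón"],
    "cherry": ["cereza"],
    "grape": ["grapes", "uvas"],
}
# synonym -> canonical word
canon = {s: key for key, values in synonyms.items() for s in values}

list3 = ["orange", "apple", "lemons", "grape"]
list5 = ["peras", "naranjas", "manzana", "limón", "cereza", "uvas"]

s = SequenceMatcher(None,
                    [Canonical(w, canon) for w in list3],
                    [Canonical(w, canon) for w in list5])
pairs = [(m.a + i, m.b + i) for m in s.get_matching_blocks() for i in range(m.size)]
for a, b in pairs:
    print(a, b, list3[a], list5[b])
```

Computing the canonical key once in the constructor keeps __eq__ and __hash__ trivially consistent with each other.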

ALFA · Answer 3 · March 11, 2019

Say you want to fill lists with the elements that should match each other. I didn't use any library, just generators. I'm not sure about the efficiency, since I only tried this code once, but I think it should work pretty well.

orange_list = ["orange", "oranges"] # Fill this with orange matching words
pear_list = ["pear", "pears"]
lemon_list = ["lemon", "lemons"]
apple_list = ["apple", "apples"]
grape_list = ["grape", "grapes"]

lists = [orange_list, pear_list, lemon_list, apple_list, grape_list] # Put your matching lists inside this list

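def match_seq_bol(list1, list2):
    output = []
    for i, x in enumerate(list1):
        for lst in lists:
            # Lazily yield (index, word) for each word of list2 in the same group as x
            matches = ((j, y) for j, y in enumerate(list2) if x in lst and y in lst)
            for j, y in matches:
                # enumerate() gives the right index even when a word occurs more than once
                output.append((i, j, x, y))
    return output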

list3=["orange","apple","lemons","grape"]
list4=["pears", "oranges","apple", "lemon", "cherry", "grapes"]

print(match_seq_bol(list3, list4))

match_seq_bol() stands for match sequences based on lists.

The output for matching list3 and list4 will be:

[
    (0, 1, 'orange', 'oranges'),
    (1, 2, 'apple', 'apple'),
    (2, 3, 'lemons', 'lemon'),
    (3, 5, 'grape', 'grapes')
]
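
On the efficiency question, one possible refinement (a sketch only) is to precompute a lookup from each word to its group, so that list2 is scanned once instead of once per word and per group. The name match_by_group is made up here, and the code assumes every word belongs to at most one group:

```python
def match_by_group(list1, list2, groups):
    # word -> index of the group it belongs to
    group_of = {w: g for g, words in enumerate(groups) for w in words}
    # group index -> positions in list2 that hold a word of that group
    positions = {}
    for j, y in enumerate(list2):
        if y in group_of:
            positions.setdefault(group_of[y], []).append(j)
    return [
        (i, j, x, list2[j])
        for i, x in enumerate(list1)
        if x in group_of
        for j in positions.get(group_of[x], [])
    ]


groups = [
    ["orange", "oranges"],
    ["pear", "pears"],
    ["lemon", "lemons"],
    ["apple", "apples"],
    ["grape", "grapes"],
]
list3 = ["orange", "apple", "lemons", "grape"]
list4 = ["pears", "oranges", "apple", "lemon", "cherry", "grapes"]
result = match_by_group(list3, list4, groups)
print(result)
```

Unlike SequenceMatcher, this pairs a word with every member of its group in list2 regardless of order, so it does not enforce a common sequence.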

Python sequence matcher with a custom matching function
