Question

I need to compare documents stored in a database and get a similarity score between 0 and 1.

The method must be very simple: a vanilla implementation of n-grams (where the number of grams can be specified), plus simple implementations of tf-idf and cosine similarity.

Is there any program that can do this, or should I write it from scratch?

roman · Answer 1 · May 2, 2010

Check out the NLTK package: http://www.nltk.org. It has everything you need.

For cosine similarity:


import math

import numpy


def cosine_distance(u, v):
    """
    Returns the cosine of the angle between vectors v and u. This is equal to
    u.v / |u||v|.
    """
    return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))
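
When documents are kept as sparse term-count mappings rather than dense arrays, the same formula can be applied without numpy. The following is a minimal sketch, not part of NLTK: the name `cosine_similarity_sparse` and the choice to return 0.0 for an empty vector are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity_sparse(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc1 = Counter("this is a foo bar".split())
doc2 = Counter("this is a sentence".split())
score = cosine_similarity_sparse(doc1, doc2)
```

Because term counts are never negative, the score for such vectors stays between 0 and 1, which is the range the question asks for.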

For n-grams:


from itertools import chain


def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    """
    A utility that produces a sequence of ngrams from a sequence of items.
    For example:

    >>> ngrams([1,2,3,4,5], 3)
    [(1, 2, 3), (2, 3, 4), (3, 4, 5)]

    Use ingram for an iterator version of this function.  Set pad_left
    or pad_right to true in order to get additional ngrams:

    >>> ngrams([1,2,3,4,5], 2, pad_right=True)
    [(1, 2), (2, 3), (3, 4), (4, 5), (5, None)]

    @param sequence: the source data to be converted into ngrams
    @type sequence: C{sequence} or C{iterator}
    @param n: the degree of the ngrams
    @type n: C{int}
    @param pad_left: whether the ngrams should be left-padded
    @type pad_left: C{boolean}
    @param pad_right: whether the ngrams should be right-padded
    @type pad_right: C{boolean}
    @param pad_symbol: the symbol to use for padding (default is None)
    @type pad_symbol: C{any}
    @return: The ngrams
    @rtype: C{list} of C{tuple}s
    """

    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    sequence = list(sequence)

    count = max(0, len(sequence) - n + 1)
    return [tuple(sequence[i:i+n]) for i in range(count)]
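
One simple way to turn n-grams into a score between 0 and 1 is to compare the sets of n-grams two texts produce. The sketch below uses Jaccard overlap, which is a technique swapped in for illustration and not something the answer itself proposes; `word_ngrams` and `jaccard_similarity` are hypothetical helper names.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) found in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(text_a, text_b, n=2):
    """Shared n-grams divided by all distinct n-grams, between 0 and 1."""
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    if not grams_a and not grams_b:
        return 0.0  # assumption: texts too short for any n-gram score 0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


score = jaccard_similarity("this is a foo bar", "this is a sentence", n=2)
```

Here the two sentences share 2 of 5 distinct bigrams, so `score` is 0.4. Changing `n` changes how strictly word order has to match.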

For tf-idf you first need to compute the term distribution. I use Lucene for this, but you can do something similar with NLTK using FreqDist:

http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html#frequency_distribution_index_term
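
As a rough illustration of the counting step, the sketch below uses `collections.Counter` from the standard library as a stand-in for a frequency distribution. It is not NLTK's or Lucene's formula: the logarithmic idf is one common variant chosen here as an assumption, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (each a list of tokens)."""
    term_counts = [Counter(tokens) for tokens in documents]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    n_docs = len(documents)
    vectors = []
    for counts in term_counts:
        total = sum(counts.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [s.split() for s in ("this is a foo bar",
                            "foo bar bar black sheep",
                            "this is a sentence")]
weights = tfidf_vectors(docs)
```

With this variant a term that occurs in every document gets weight 0, while a term unique to one document, such as "sentence", gets the largest idf factor.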

If you prefer PyLucene, this shows how to compute tf.idf:

    # Fragment: reader, searcher, sim, tmap, fieldname, max_doc,
    # maxtokensperdoc and cosine_normalization are defined elsewhere.
    # reader = lucene.IndexReader(FSDirectory.open(index_loc))
    docs = reader.numDocs()
    for i in range(docs):
        tfv = reader.getTermFreqVector(i, fieldname)
        if tfv:
            rec = {}
            terms = tfv.getTerms()
            frequencies = tfv.getTermFrequencies()
            # zipping with a range caps the number of terms at maxtokensperdoc
            for (t, f, x) in zip(terms, frequencies, range(maxtokensperdoc)):
                df = searcher.docFreq(Term(fieldname, t))  # number of docs with the given term
                tmap.setdefault(t, len(tmap))  # give each new term an index
                rec[t] = sim.tf(f) * sim.idf(df, max_doc)  # compute TF.IDF
            # and normalize the values using cosine normalization
            if cosine_normalization:
                denom = sum([x**2 for x in rec.values()])**0.5
                for k, v in rec.items():
                    rec[k] = v / denom

Tarantula · Answer 2 · October 22, 2011

If you are interested, I have written a tutorial series (Part I and Part II) that covers tf-idf using the Scikits.learn (sklearn) Python module.

Part III covers cosine similarity.
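
The tutorials themselves are not reproduced here. As a rough idea of what the approach looks like, here is a minimal sketch that assumes a reasonably recent scikit-learn is installed; the sample documents and the `ngram_range` setting are illustrative choices, not taken from the tutorials.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "this is a foo bar",
    "foo bar bar black sheep",
    "this is a sentence",
]

# ngram_range=(1, 2) uses single words and word pairs as features.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(documents)  # sparse matrix, one row per document

# Pairwise cosine similarity; entry [i, j] compares documents i and j.
similarity = cosine_similarity(tfidf)
```

Note that the default tokenizer of `TfidfVectorizer` ignores one-character tokens such as "a", so the scores will not match a hand-rolled implementation that keeps them.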

alvas · Answer 3 · March 22, 2014

Here is a short answer using only Python + numpy:

Cosine:

def cosine_sim(u,v):
    return np.dot(u,v) / (sqrt(np.dot(u,u)) * sqrt(np.dot(v,v)))

N-grams:

def ngrams(sentence, n):
    tokens = sentence.split()
    # list() keeps the result a list on Python 3, where zip() is lazy
    return list(zip(*[tokens[i:] for i in range(n)]))

TF-IDF (it's a bit odd, but it works):

def tfidf(corpus, vocab):
    """
    INPUT:

    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), 
    ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), 
    ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]

    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence', 
    'sheep', 'this']

    OUTPUT:

    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300], 
    [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0], 
    [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]

    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try: 
            return float(len(matrix)) /sum([1 for i,_ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0

    matrix = [{k:v for k,v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc,_ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix,doc,term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf*idf

    return [[tfidf[doc][term] for term in vocab] for doc,_ in enumerate(tfidf)]
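
The weighting above (term frequency multiplied by the unlogged ratio N / df) can also be written with numpy array operations. This is only a sketch: `tfidf_matrix` is a hypothetical name, and it takes just the count rows rather than the (sentence, counts) pairs used above.

```python
import numpy as np


def tfidf_matrix(counts):
    """Weight a documents-by-terms count matrix by tf * (N / df)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Term frequency: each count divided by its document's total count.
    totals = counts.sum(axis=1, keepdims=True)
    tf = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    # Document frequency: number of documents in which each term occurs.
    df = (counts > 0).sum(axis=0).astype(float)
    idf = np.divide(n_docs, df, out=np.zeros_like(df), where=df > 0)
    return tf * idf


counts = [[1, 1, 0, 1, 1, 0, 0, 1],
          [0, 2, 1, 1, 0, 0, 1, 0],
          [1, 0, 0, 0, 1, 1, 0, 1]]
weights = tfidf_matrix(counts)
```

For the three sample rows this gives the same values as the OUTPUT shown in the docstring, up to floating-point rounding. Empty documents and unused terms come out as zeros instead of raising ZeroDivisionError.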

Here is the long answer with tests (the test functions use Python 2 print statements):

import numpy as np
from math import sqrt, log
from itertools import chain, product
from collections import defaultdict

def cosine_sim(u,v):
    return np.dot(u,v) / (sqrt(np.dot(u,u)) * sqrt(np.dot(v,v)))

def ngrams(sentence, n):
    tokens = sentence.split()
    # list() keeps the result a list on Python 3, where zip() is lazy
    return list(zip(*[tokens[i:] for i in range(n)]))

def tfidf(corpus, vocab):
    """
    INPUT:

    corpus = [('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), 
    ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), 
    ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]

    vocab = ['a', 'bar', 'black', 'foo', 'is', 'sentence', 
    'sheep', 'this']

    OUTPUT:

    [[0.300, 0.300, 0.0, 0.300, 0.300, 0.0, 0.0, 0.300], 
    [0.0, 0.600, 0.600, 0.300, 0.0, 0.0, 0.600, 0.0], 
    [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]

    """
    def termfreq(matrix, doc, term):
        try: return matrix[doc][term] / float(sum(matrix[doc].values()))
        except ZeroDivisionError: return 0
    def inversedocfreq(matrix, term):
        try: 
            return float(len(matrix)) /sum([1 for i,_ in enumerate(matrix) if matrix[i][term] > 0])
        except ZeroDivisionError: return 0

    matrix = [{k:v for k,v in zip(vocab, i[1])} for i in corpus]
    tfidf = defaultdict(dict)
    for doc,_ in enumerate(matrix):
        for term in matrix[doc]:
            tf = termfreq(matrix,doc,term)
            idf = inversedocfreq(matrix, term)
            tfidf[doc][term] = tf*idf

    return [[tfidf[doc][term] for term in vocab] for doc,_ in enumerate(tfidf)]


def corpus2vectors(corpus):
    def vectorize(sentence, vocab):
        # Lowercase here as well, so the counts match the lowercased vocab.
        tokens = sentence.lower().split()
        return [tokens.count(term) for term in vocab]
    vectorized_corpus = []
    vocab = sorted(set(chain(*[i.lower().split() for i in corpus])))
    for i in corpus:
        vectorized_corpus.append((i, vectorize(i, vocab)))
    return vectorized_corpus, vocab

def create_test_corpus():
    sent1 = "this is a foo bar"
    sent2 = "foo bar bar black sheep"
    sent3 = "this is a sentence"

    all_sents = [sent1,sent2,sent3]
    corpus, vocab = corpus2vectors(all_sents)
    return corpus, vocab

def test_cosine():
    corpus, vocab = create_test_corpus()

    for sentx, senty in product(corpus, corpus):
        print sentx[0]
        print senty[0]
        print "cosine =", cosine_sim(sentx[1], senty[1])
        print

def test_ngrams():
    corpus, vocab = create_test_corpus()
    for sentx in corpus:
        print sentx[0]
        print ngrams(sentx[0],2)
        print ngrams(sentx[0],3)
        print

def test_tfidf():
    corpus, vocab = create_test_corpus()
    print corpus
    print vocab
    print tfidf(corpus, vocab)

print "Testing cosine..."
test_cosine()
print
print "Testing ngrams..."
test_ngrams()
print
print "Testing tfidf..."
test_tfidf()
print

[output]:

Testing cosine...
this is a foo bar
this is a foo bar
cosine = 1.0

this is a foo bar
foo bar bar black sheep
cosine = 0.507092552837

this is a foo bar
this is a sentence
cosine = 0.67082039325

foo bar bar black sheep
this is a foo bar
cosine = 0.507092552837

foo bar bar black sheep
foo bar bar black sheep
cosine = 1.0

foo bar bar black sheep
this is a sentence
cosine = 0.0

this is a sentence
this is a foo bar
cosine = 0.67082039325

this is a sentence
foo bar bar black sheep
cosine = 0.0

this is a sentence
this is a sentence
cosine = 1.0


Testing ngrams...
this is a foo bar
[('this', 'is'), ('is', 'a'), ('a', 'foo'), ('foo', 'bar')]
[('this', 'is', 'a'), ('is', 'a', 'foo'), ('a', 'foo', 'bar')]

foo bar bar black sheep
[('foo', 'bar'), ('bar', 'bar'), ('bar', 'black'), ('black', 'sheep')]
[('foo', 'bar', 'bar'), ('bar', 'bar', 'black'), ('bar', 'black', 'sheep')]

this is a sentence
[('this', 'is'), ('is', 'a'), ('a', 'sentence')]
[('this', 'is', 'a'), ('is', 'a', 'sentence')]


Testing tfidf...
[('this is a foo bar', [1, 1, 0, 1, 1, 0, 0, 1]), ('foo bar bar black sheep', [0, 2, 1, 1, 0, 0, 1, 0]), ('this is a sentence', [1, 0, 0, 0, 1, 1, 0, 1])]
['a', 'bar', 'black', 'foo', 'is', 'sentence', 'sheep', 'this']
[[0.30000000000000004, 0.30000000000000004, 0.0, 0.30000000000000004, 0.30000000000000004, 0.0, 0.0, 0.30000000000000004], [0.0, 0.6000000000000001, 0.6000000000000001, 0.30000000000000004, 0.0, 0.0, 0.6000000000000001, 0.0], [0.375, 0.0, 0.0, 0.0, 0.375, 0.75, 0.0, 0.375]]

Fred Foo · Answer 4 · July 29, 2010

If you are still interested in this problem, I have done something very similar using Lucene Java and Jython. Here are some snippets from my code.

Lucene preprocesses documents and queries using so-called analyzers. This one uses Lucene's built-in n-gram filter:

class NGramAnalyzer(Analyzer):
    '''Analyzer that yields n-grams for minlength <= n <= maxlength'''
    def __init__(self, minlength, maxlength):
        self.minlength = minlength
        self.maxlength = maxlength
    def tokenStream(self, field, reader):
        lower = ASCIIFoldingFilter(LowerCaseTokenizer(reader))
        return NGramTokenFilter(lower, self.minlength, self.maxlength)
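
For readers without Lucene at hand, the effect of this analyzer chain can be approximated in plain Python. The sketch below is an assumption-laden imitation, not Lucene's behavior: `fold_to_ascii` and `char_ngrams` are hypothetical names, the accent stripping covers fewer characters than ASCIIFoldingFilter, and the order in which Lucene emits n-grams may differ between versions.

```python
import re
import unicodedata


def fold_to_ascii(text):
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def char_ngrams(text, minlength, maxlength):
    """Character n-grams of each lowercased word, minlength <= n <= maxlength."""
    # Split on anything that is not a letter, after lowercasing and folding.
    words = re.findall(r"[^\W\d_]+", fold_to_ascii(text.lower()))
    grams = []
    for word in words:
        for n in range(minlength, maxlength + 1):
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


grams = char_ngrams("Caf\u00e9 bar", 2, 3)
```

The resulting list of character n-grams can then be counted and compared like any other bag of terms.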

To turn a list of ngrams into a Document:

doc = Document()
doc.add(Field('n-grams', ' '.join(ngrams),
        Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES))

To store the document in an index:

# NGramAnalyzer takes minlength and maxlength; 2 and 3 are example values.
wr = IndexWriter(index_dir, NGramAnalyzer(2, 3), True,
                 IndexWriter.MaxFieldLength.LIMITED)
wr.addDocument(doc)

Building queries is a bit more complicated, because Lucene's QueryParser expects a query language with special operators, quotes, etc., but this can be circumvented (as explained here).

Penang · Answer 5 · March 4, 2010

In our information retrieval course we use code written by our professor in Java. Sorry, there is no Python port. "It is released for educational and research purposes only under the GNU General Public License."

You can check out the documentation: http://userweb.cs.utexas.edu/~mooney/ir-course/doc/

But more specifically, check out: http://userweb.cs.utexas.edu/users/mooney/ir-course/doc/ir/vsr/HashMapVector.html

You can download it from http://userweb.cs.utexas.edu/users/mooney/ir-course/

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python
