Edit: here is how you can do the matrix multiplication you asked about. Disclaimer: this may not be feasible for a very large corpus.
Sklearn:
from sklearn.feature_extraction.text import CountVectorizer
Doc1 = 'Wimbledon is one of the four Grand Slam tennis tournaments, the others being the Australian Open, the French Open and the US Open.'
Doc2 = 'Since the Australian Open shifted to hardcourt in 1988, Wimbledon is the only major still played on grass'
docs = [Doc1, Doc2]
# Instantiate CountVectorizer and apply it to docs
cv = CountVectorizer()
doc_cv = cv.fit_transform(docs)
# Display tokens
cv.get_feature_names_out()  # use get_feature_names() on older scikit-learn versions
# Display tokens (dict keys) and their numerical encoding (dict values)
cv.vocabulary_
# Matrix multiplication of the term matrix
token_mat = doc_cv.toarray().T @ doc_cv.toarray()
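To read the co-occurrence count for a specific pair of words out of token_mat, you can index it with the column positions stored in cv.vocabulary_. A minimal sketch (the word pair is just an illustration):
# Column indices of the two tokens in the fitted vocabulary
i = cv.vocabulary_['wimbledon']
j = cv.vocabulary_['open']
# Sum over documents of count('wimbledon') * count('open') in each document
token_mat[i, j]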
Gensim:
import gensim as gs
import numpy as np
cp = [[(0, 2),
(1, 1),
(2, 1),
(3, 1),
(4, 11),
(7, 1),
(11, 2),
(13, 3),
(22, 1),
(26, 1),
(30, 1)],
[(4, 31),
(8, 2),
(13, 2),
(16, 2),
(17, 2),
(26, 1),
(28, 4),
(29, 1),
(30, 1)]]
# Determine the vocabulary size from the largest token id in the corpus
vocab_len = max(token_id for doc in cp for token_id, _ in doc) + 1
# Convert each document to a dense row vector and stack them into a matrix
mat_1 = gs.matutils.sparse2full(cp[0], vocab_len).reshape(1, -1)
mat_2 = gs.matutils.sparse2full(cp[1], vocab_len).reshape(1, -1)
mat = np.append(mat_1, mat_2, axis=0)
# Term-term matrix via matrix multiplication
mat_product = mat.T @ mat
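In case it is unclear where cp comes from: it is simply the bag-of-words corpus gensim builds from tokenized documents. A rough sketch of how you could create it yourself, assuming a naive lowercase/whitespace tokenization (the exact token ids will depend on your preprocessing):
from gensim.corpora import Dictionary

# Tokenize the documents (naive tokenization, just for illustration)
texts = [doc.lower().split() for doc in docs]
# Map tokens to integer ids and encode each document as (token_id, count) pairs
dictionary = Dictionary(texts)
cp = [dictionary.doc2bow(text) for text in texts]
With a Dictionary at hand you can also pass len(dictionary) to sparse2full instead of deriving the length from the corpus.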
For words that appear consecutively, you can build a list of bigrams for the document set and then use a python Counter to count the bigram occurrences. Here is an example using nltk.
import nltk
from nltk.util import ngrams
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from collections import Counter
# Requires the NLTK data packages 'stopwords', 'wordnet' and 'inaugural'
# (available via nltk.download() if not already installed)
stop_words = set(stopwords.words('english'))
# Get the tokens from the built-in collection of presidential inaugural speeches
tokens = nltk.corpus.inaugural.words()
# Further text preprocessing
tokens = [t.lower() for t in tokens if t not in stop_words]
word_l = WordNetLemmatizer()
tokens = [word_l.lemmatize(t) for t in tokens if t.isalpha()]
# Create bigram list and count bigrams
bi_grams = list(ngrams(tokens, 2))
counter = Counter(bi_grams)
# Show the most common bigrams
counter.most_common(5)
Out[36]:
[(('united', 'state'), 153),
(('fellow', 'citizen'), 116),
(('let', 'u'), 99),
(('i', 'shall'), 96),
(('american', 'people'), 40)]
# Query the occurrence of a specific bigram
counter[('great', 'people')]
Out[37]: 7
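Note that bigrams are order-sensitive, so if you want the number of times two words appear next to each other in either order, sum the two lookups. The word pair here is again only an illustration:
# Occurrences of the pair in either order (Counter returns 0 for missing keys)
pair = ('great', 'people')
counter[pair] + counter[pair[::-1]]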