Doc2Vec error: need at least one array to concatenate
2 votes
April 10, 2019

I am running into an error when trying to apply a doc2vec model to some text. The tutorial I am following is here. However, I cannot reproduce its results on new text data.

I have read other SO posts about this error, and I understand that it occurs because I end up with an empty list, but I do not know why that list is empty.

Code:

import pandas as pd

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import csv
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

list_id = list(df["id"])
list_def = list(df["text"])

tagged_data = [TaggedDocument(words=word_tokenize(term_def.lower()), tags=[list_id[i]]) for i, term_def in enumerate(list_def)]

max_epochs = 500
vec_size = 100
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm=1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    if epoch % 100 == 0:
        print('iteration {0}'.format(epoch))

    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)

    model.alpha -= 0.0002
    model.min_alpha = model.alpha

doc_tags = list(model.docvecs.doctags.keys())
X = model[doc_tags]

The error is raised by the last two lines of the code.

ValueError: need at least one array to concatenate
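
For context, NumPy raises this exact message whenever a stacking function receives an empty sequence, regardless of gensim. The sketch below assumes NumPy is installed; `stack_or_none` is a hypothetical helper added only for illustration, not part of NumPy or gensim.

```python
import numpy as np


def stack_or_none(rows):
    """Stack rows with np.vstack, returning None for empty input instead of raising."""
    rows = list(rows)
    if not rows:
        return None
    return np.vstack(rows)


# Reproduce the error message with an empty list of arrays.
try:
    np.vstack([])
except ValueError as exc:
    message = str(exc)

# A non-empty list stacks normally into a 2-D array.
stacked = stack_or_none([np.zeros(3), np.ones(3)])
```

A guard like this only avoids the exception. It does not explain why the list passed to `vstack` was empty in the first place.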

Data:

d = {
    'text': [
        "Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football.[1][2] These different variations of football are known as football codes.",
        "Rugby union, commonly known in most of the world simply as rugby,[3] is a contact team sport which originated in England in the first half of the 19th century.[4] One of the two codes of rugby football, it is based on running with the ball in hand. In its most common form, a game is between two teams of 15 players using an oval-shaped ball on a rectangular field with H-shaped goalposts at each end.",
        "Tennis is a racket sport that can be played individually against a single opponent (singles) or between two teams of two players each (doubles). Each player uses a tennis racket that is strung with cord to strike a hollow rubber ball covered with felt over or around a net and into the opponent's court. The object of the game is to maneuver the ball in such a way that the opponent is not able to play a valid return. The player who is unable to return the ball will not gain a point, while the opposite player will.",
        "Formula One (also Formula 1 or F1) is the highest class of single-seater auto racing sanctioned by the Fédération Internationale de l'Automobile (FIA) and owned by the Formula One Group. The FIA Formula One World Championship has been one of the premier forms of racing around the world since its inaugural season in 1950. The word formula in the name refers to the set of rules to which all participants' cars must conform.[1] A Formula One season consists of a series of races, known as Grands Prix (French for 'grand prizes' or 'great prizes'), which take place worldwide on purpose-built circuits and on public roads.",
    ],
    'id': [123, 1234, 12345, 123456],
}
df = pd.DataFrame(data=d)

EDIT:

iteration 0
iteration 100
iteration 200
iteration 300
iteration 400
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-45c9e3dc04ad> in <module>()
     36 
     37 doc_tags = list(model.docvecs.doctags.keys())
---> 38 X = model[doc_tags]

~\Anaconda3\lib\site-packages\gensim\models\doc2vec.py in __getitem__(self, tag)
    961                 return self.docvecs[tag]
    962             return self.wv[tag]
--> 963         return vstack([self[i] for i in tag])
    964 
    965     def __str__(self):

~\Anaconda3\lib\site-packages\numpy\core\shape_base.py in vstack(tup)
    232 
    233     """
--> 234     return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    235 
    236 def hstack(tup):

ValueError: need at least one array to concatenate
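
One possible explanation, offered as an assumption about the gensim 3.x release shown in the traceback: plain integer tags are treated as direct row indices into the document-vector array and are not recorded in `model.docvecs.doctags`. In that case `doc_tags` would be an empty list, and `vstack` would receive nothing to stack. A minimal sketch of a workaround is to convert the ids to strings before building the tagged documents. `to_string_tags` is a hypothetical helper for illustration, not a gensim function.

```python
def to_string_tags(ids):
    """Return one single-element tag list per id, with every tag converted to str."""
    return [[str(doc_id)] for doc_id in ids]


# The ids from the question's data.
list_id = [123, 1234, 12345, 123456]
tags = to_string_tags(list_id)

# Each entry of `tags` could then be passed as the `tags=` argument of a
# TaggedDocument in place of `[list_id[i]]` (assumption: string tags are
# the ones that get stored in the doctags mapping).
```

If that assumption holds, rebuilding the vocabulary and retraining with string tags should leave `doc_tags` non-empty, so the final lookup has vectors to stack.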
...