I am doing text classification using word2vec.
Here is my function.
def get_w2v_features(w2v_model, sentence_group):
    """ Transform a sentence_group (containing multiple lists
    of words) into a feature vector. It averages out all the
    word vectors of the sentence_group.
    """
    words = np.concatenate(sentence_group)  # words in text
    index2word_set = set(w2v_model.wv.vocab.keys())  # words known to model

    featureVec = np.zeros(w2v_model.vector_size, dtype="float32")

    # Initialize a counter for number of words in a review
    nwords = 0
    # Loop over each word in the comment and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set:
            featureVec = np.add(featureVec, w2v_model[word])
            nwords += 1.

    # Divide the result by the number of words to get the average
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec
data['w2v_features'] = list(map(lambda sen_group:
                                get_w2v_features(W2Vmodel, sen_group),
                                data.tokenized_sentences))
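To make the averaging concrete, here is a toy sketch that runs without a trained model. It rests on assumptions: `toy_vectors` is a made-up dict standing in for the model's word vectors, and `average_vectors` is a hypothetical helper name, not part of gensim. It also loops over the nested lists directly instead of calling `np.concatenate`, so it is an illustration of the idea rather than the exact function above.

```python
import numpy as np

# Made-up 2-dimensional vectors; a stand-in for real word2vec output.
toy_vectors = {
    "good": np.array([1.0, 0.0], dtype="float32"),
    "movie": np.array([0.0, 1.0], dtype="float32"),
}

def average_vectors(vectors, sentence_group, vector_size=2):
    """Average the vectors of all known words across the sentences."""
    total = np.zeros(vector_size, dtype="float32")
    nwords = 0
    for sentence in sentence_group:
        for word in sentence:
            if word in vectors:  # skip out-of-vocabulary words
                total += vectors[word]
                nwords += 1
    # Return the zero vector when no known word was found.
    return total / nwords if nwords > 0 else total

# Known words: "good", "movie", "good"; "unknown" is skipped.
features = average_vectors(toy_vectors, [["good", "movie"], ["unknown", "good"]])
print(features)
```

Three known words contribute here, so the result is the sum of their vectors divided by three.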
The data looks like this, and the features are created correctly, as shown below.
[Image: left side of the dataframe]
[Image: right side of the dataframe]
Now, when I run it on my test data, which has the same tokenized sentences column as the training data, I get an error.
test_data['w2v_features'] = list(map(lambda sen_group:
                                     get_w2v_features(W2Vmodel, sen_group),
                                     test_data.tokenized_sentences))
ValueError                                Traceback (most recent call last)
<ipython-input-73-b32852b152cb> in <module>()
      1 test_data['w2v_features'] = list(map(lambda sen_group:
      2                                      get_w2v_features(W2Vmodel, sen_group),
----> 3                                      test_data.tokenized_sentences))

<ipython-input-73-b32852b152cb> in <lambda>(sen_group)
      1 test_data['w2v_features'] = list(map(lambda sen_group:
----> 2                                      get_w2v_features(W2Vmodel, sen_group),
      3                                      test_data.tokenized_sentences))

<ipython-input-34-cd5ee3d13b85> in get_w2v_features(w2v_model, sentence_group)
      4     word vectors of the sentence_group.
      5     """
----> 6     words = np.concatenate(sentence_group) # words in text
      7     index2word_set = set(w2v_model.wv.vocab.keys()) # words known to model
      8

ValueError: need at least one array to concatenate
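
The message can be reproduced in isolation: `np.concatenate` raises it when handed an empty sequence. One possible cause, stated as an assumption rather than a confirmed diagnosis, is that some row of `test_data.tokenized_sentences` is an empty list. A minimal sketch of how such rows could be located, using a made-up `toy_sentences` list in place of the real column:

```python
import numpy as np

# np.concatenate raises a ValueError when given an empty sequence.
try:
    np.concatenate([])
except ValueError as err:
    message = str(err)
print(message)

# Made-up stand-in for test_data.tokenized_sentences;
# the second entry has no sentences at all.
toy_sentences = [
    [["good", "movie"], ["great"]],
    [],
    [["bad", "plot"]],
]

# Collect the positions of empty sentence groups.
empty_positions = [i for i, group in enumerate(toy_sentences) if len(group) == 0]
print(empty_positions)
```

The same loop should work on the real column, since a pandas Series can be iterated with `enumerate`. If it does find empty rows, they could be dropped or mapped to a zero vector before calling `get_w2v_features`; which is appropriate depends on the data.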