I am trying to classify topics using a pickle file of my trained model, but I run into the error "CountVectorizer - Vocabulary not fit". Can anyone tell me how to fix this error?
Format of the training dataset:
Topic originalSentence
Topic1 He has arrived with his sister's two young children.
Topic2 The Lowells have been living off the Colby fortune
Topic3 Fred and Janice Gage, who live off the Lowell fortune, which would have gone to Alan Colby
My training code:
import pickle
from io import StringIO

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
# Incremental is assumed to come from dask-ml; the original post does not show this import.
from dask_ml.wrappers import Incremental

def train_model():
    df = pd.read_csv('/Users/ra51646/Desktop/classification_training.csv')
    df = df[pd.notnull(df['originalSentence'])]
    df.columns = ['topic', 'originalSentence']
    # Map each topic name to an integer id and back.
    df['category_id'] = df['topic'].factorize()[0]
    category_id_df = df[['topic', 'category_id']].drop_duplicates().sort_values('category_id')
    category_to_id = dict(category_id_df.values)
    id_to_category = dict(category_id_df[['category_id', 'topic']].values)
    tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
    features = tfidf.fit_transform(df.originalSentence).toarray()
    labels = df.category_id
    X_train, X_test, y_train, y_test = train_test_split(df['originalSentence'], df['topic'], random_state=0)
    # Turn the training sentences into token counts, then into TF-IDF weights.
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(X_train)
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
    clf_SGD = SGDClassifier().fit(X_train_tfidf, y_train)
    clf_inc = Incremental(clf_SGD)
    final_model = clf_inc.fit(X_train_tfidf, y_train, classes=np.unique(y_train))
    # Only the classifier is written to disk here.
    pickle.dump(final_model, open("/Users/ra51646/Desktop/Pickle/topic_classification.pkl", "wb"))
(Error to be resolved) Code where I use the pickle file to classify topics:
def find_topic1():
    model = pickle.load(open("/Users/ra51646/Desktop/Pickle/topic_classification.pkl", "rb"))
    count_vect = CountVectorizer()
    answer = model.predict(count_vect.transform(["Lindy and her family went camping in the Outback"]))
    print(answer[0])
    return answer
I get the error NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
in the find_topic1 function. Please help me resolve this error. How can I use my pickle file (the trained model) to classify topics?
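A likely cause is that the prediction code builds a brand-new CountVectorizer, which has no vocabulary, while the fitted count_vect and tfidf_transformer from training are never saved. Below is a minimal sketch of one possible way around this, assuming plain scikit-learn is acceptable: the vectorizer, the TF-IDF transformer, and the classifier are wrapped in a single Pipeline, and that whole object is pickled. The Incremental wrapper is left out here, and the tiny in-memory dataset, the topic4 label, and the names build_pipeline and MODEL_PATH are made up for illustration; they are not part of the original code.

```python
import os
import pickle
import tempfile

from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for classification_training.csv.
sentences = [
    "He has arrived with his sister's two young children.",
    "The Lowells have been living off the Colby fortune",
    "Fred and Janice Gage, who live off the Lowell fortune",
    "The family went camping near the river last summer",
]
topics = ["Topic1", "Topic2", "Topic3", "Topic4"]

# Reproduce the error: a fresh CountVectorizer has no vocabulary yet.
try:
    CountVectorizer().transform(["some text"])
except NotFittedError as exc:
    print("unfitted vectorizer:", exc)


def build_pipeline():
    """Bundle vectorizer, TF-IDF weighting and classifier into one object."""
    return Pipeline([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(random_state=0)),
    ])


# Hypothetical output location; a temp directory keeps the sketch self-contained.
MODEL_PATH = os.path.join(tempfile.gettempdir(), "topic_classification_sketch.pkl")

# Training: fitting the pipeline fits the vocabulary and the classifier together.
pipeline = build_pipeline().fit(sentences, topics)
with open(MODEL_PATH, "wb") as fh:
    pickle.dump(pipeline, fh)

# Prediction: the loaded pipeline carries its fitted vocabulary with it,
# so raw text can be passed straight to predict().
with open(MODEL_PATH, "rb") as fh:
    loaded = pickle.load(fh)

answer = loaded.predict(["Lindy and her family went camping in the Outback"])
print(answer[0])
```

If the training code has to stay as it is, the same idea would apply by pickling count_vect and tfidf_transformer next to final_model and calling transform (not fit_transform) on both at prediction time.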