Below is the code that extracts the most informative features for binary classification, for each class:
import atexit
from json import loads, dumps
from os import path

def most_informative_feature_for_binary_classification(vectorizer, classifier, n=100):
    """
    Identify the most important features given a vectorizer and a binary classifier.
    Set n to the number of weighted features you would like to show.
    """
    # A run counter persisted in counter.json decides which class pair is shown
    def read_counter():
        return loads(open("counter.json", "r").read()) + 1 if path.exists("counter.json") else 0

    def write_counter():
        with open("counter.json", "w") as f:
            f.write(dumps(counter))

    counter = read_counter()
    atexit.register(write_counter)
    if counter >= 7:
        counter = 0

    # Additional stopwords to be removed: open the file and read it into memory
    with open('../stopwords.txt') as f:
        additional_stopwords = f.read().split()

    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()
    feature_names = [word for word in feature_names if word not in additional_stopwords]
    topn_class1 = sorted(zip(classifier.coef_[0], feature_names))[:n]
    topn_class2 = sorted(zip(classifier.coef_[0], feature_names))[-n:]

    # class_labels = category, coef = coefficient, feat = most informative feature
    if 1 <= counter <= 6:
        for coef, feat in topn_class1:
            # class_labels -> e.g. [2 3 4 5 6 7 8], indexed 0 to 6
            print(class_labels[counter - 1], coef, feat)
        print()
        for coef, feat in reversed(topn_class2):
            print(class_labels[counter], coef, feat)
    else:
        print("=== PLEASE RUN PROGRAM AGAIN TO VIEW THE COEFFICIENTS FOR THE CHOSEN MODEL ===")
Below are the results of the most_informative_feature_for_binary_classification method:
On the first run of the program (this shows the informative features of classes 2 and 3):
2 -8.322094697329087 aaa
2 -8.322094697329087 aaa cm
2 -8.322094697329087 aaa cm underwent
2 -8.322094697329087 aaa free
2 -8.322094697329087 aaa free ivc
3 -8.010764835561018 assymetry imp giddiness
3 -8.144858449457846 admitted feb year
3 -8.164330364141858 agreeable dre brown
3 -8.172447581146958 aerobic anaerobic labeled
3 -8.180391164585233 actually body
When the program is run a second time (this shows the informative features of classes 3 and 4):
3 -8.322580751462969 aaa
3 -8.322580751462969 aaa cm
3 -8.322580751462969 aaa cm underwent
3 -8.322580751462969 aaa free
3 -8.322580751462969 aaa free ivc
4 -8.0112508896949 assymetry moving
4 -8.145344503591728 admitted feb year
4 -8.16481641827574 agreeable dre brown
4 -8.17293363528084 aerobic anaerobic labeled
4 -8.180877218719115 actually body
When the program is run a third time (this shows the informative features of classes 4 and 5):
4 -8.322337753927105 aaa
4 -8.322337753927105 aaa cm
4 -8.322337753927105 aaa cm underwent
4 -8.322337753927105 aaa free
4 -8.322337753927105 aaa free ivc
5 -8.011007892159036 assymetry imp
5 -8.145101506055864 admitted frequent falls
5 -8.164573420739876 agreeable early review
5 -8.172690637744976 af anticoagulation
5 -8.18063422118325 actually body
As you can see, the results above are all practically identical across runs. Please help me check the code of the most_informative_feature_for_binary_classification method. Thanks :((
Example counter.json:
1
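For reference, here is a minimal standalone sketch of the counter round-trip the function relies on (counter.json is the same file used above; the print line is just illustrative):

import atexit
from json import loads, dumps
from os import path

def read_counter():
    # Resume from the stored value (+1), or start at 0 on the very first run
    return loads(open("counter.json", "r").read()) + 1 if path.exists("counter.json") else 0

counter = read_counter()
# Write the current value back when the interpreter exits
atexit.register(lambda: open("counter.json", "w").write(dumps(counter)))
print("run number:", counter)

Each run prints a number one higher than the previous run, which is what drives the class pair selection above.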
Below are the classifiers that call the most_informative_feature_for_binary_classification method:
Naive Bayes classifier:
def NB_func():
    X_train, X_test, y_train, y_test = train_test_split(df['content'], df['cat_id'], random_state=0)
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(X_train)
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
    clf_NB = MultinomialNB().fit(X_train_tfidf, y_train)

    # Save the model to disk
    filename = '../dataanalysis/models/Naive_Bayes.sav'
    pickle.dump(clf_NB, open(filename, 'wb'))

    # Print the prediction of the category from the unknown document
    # For now it is not accurate due to insufficient sample data
    # print("NAIVE BAYES CLASSIFIER: ", clf_NB.predict(count_vect.transform([""])))

    print()
    print("===============================================")
    print("================= NAIVE BAYES =================")
    print("===============================================")
    most_informative_feature_for_binary_classification(count_vect, clf_NB, n=5)
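Since the fitted model is pickled, here is a minimal sketch for loading it back in a later session (same path as above; unpickling assumes the same scikit-learn version):

import pickle

# Load the fitted Naive Bayes model back from disk
with open('../dataanalysis/models/Naive_Bayes.sav', 'rb') as f:
    clf_NB = pickle.load(f)

Note that count_vect is not saved alongside the model, so the loaded classifier cannot transform raw text by itself; the vectorizer would need to be persisted too.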
Logistic regression classifier:
def LR_func():
    X_train, X_test, y_train, y_test = train_test_split(df['content'], df['cat_id'], random_state=0)
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(X_train)
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
    clf_LR = LogisticRegression().fit(X_train_tfidf, y_train)

    # Print the prediction of the category from the unknown document
    # For now it is not accurate due to insufficient sample data
    # print("LOGISTIC REGRESSION CLASSIFIER: ", clf_LR.predict(count_vect.transform([""])))

    # Save the model to disk
    filename = '../dataanalysis/models/Logistics_Regression.sav'
    pickle.dump(clf_LR, open(filename, 'wb'))

    print()
    print("===============================================")
    print("============ LOGISTIC REGRESSION ==============")
    print("===============================================")
    most_informative_feature_for_binary_classification(count_vect, clf_LR, n=5)
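As a starting point for debugging the identical outputs, it may help to inspect what coef_ actually contains: for a multiclass model it has one row per class, so coef_[0] refers to the same (first) class no matter what the run counter is. A quick check, assuming a fitted clf_LR as above:

# For a multiclass model, coef_ has shape (n_classes, n_features),
# so coef_[0] is the first class's coefficients regardless of the counter
print(clf_LR.coef_.shape)   # e.g. (7, n_features)
print(clf_LR.classes_)      # e.g. [2 3 4 5 6 7 8]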