Text clustering: interpreting Naive Bayes and SVM clustering
0 votes
/ 20 February 2020

I am trying to cluster my data; the expected output is n clusters. I found code online that successfully runs two ML algorithms on the data (as far as I can tell they do classification rather than clustering) and shows that SVM is better for this dataset, but how do I find out what a cluster looks like - not as a plot, but row by row? Can we print the prediction for each row? What are the predicted values in the code below?

#20 newsgroups dataset - pre-loaded with scikit-learn
#Loading the training set (the test set is loaded later on)
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

#You can check the target names (categories) and some data files with the following commands.
twenty_train.target_names #prints all the categories
print("\n".join(twenty_train.data[0].split("\n")[:3])) #prints first line of the first data file

#________Extracting feature / tokenization___________
#Count
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data) #returns the document-term matrix
X_train_counts.shape #output the dimensions of the document-term matrix
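#Sketch (assumption: you want to see how words map to columns): CountVectorizer stores a
#vocabulary_ dict mapping each word to its column index in the document-term matrix.
print(count_vect.vocabulary_.get('algorithm')) #column index of the word 'algorithm'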

#Raw counts give longer documents more weight than shorter ones, so we normalize further:
#TF : count(word) / total words in each document
#TF-IDF: we can also reduce the weight of very common words (the, is, an, etc.) that occur
#in almost every document. This is called TF-IDF, i.e. Term Frequency times Inverse Document Frequency.
#Both can be done in one step
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape #output the dimensions of the document-term matrix
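#Toy illustration (my own example, not part of the tutorial): TF-IDF downweights words that
#occur in every document and upweights words specific to a few documents.
from sklearn.feature_extraction.text import TfidfVectorizer
toy_docs = ["the cat sat", "the dog sat", "the cat ran"]
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)
print(toy_tfidf.toarray()) #'the' (present in all 3 docs) gets the lowest weight in every row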

#___________Running ML______
#Naive Bayes: train the NB classifier on the training data
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
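#To predict the category of new text (a sketch following the scikit-learn text tutorial;
#docs_new is an illustrative example): new documents must pass through the same
#count_vect and tfidf_transformer before calling clf.predict.
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
for doc, category in zip(docs_new, clf.predict(X_new_tfidf)):
    print('%r => %s' % (doc, twenty_train.target_names[category]))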

#__________BUILDING A PIPELINE: Can do above 3 step in one syntax
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
#The names 'vect', 'tfidf' and 'clf' are arbitrary but will be used later.

#Testing the performance on the test set: NAIVE BAYES
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target) #fraction of correct predictions, i.e. accuracy
np.mean(predicted) #average of the integer class labels - not a meaningful metric
np.mean([True, False]) #0.5 - np.mean of a boolean array is the fraction of True values
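#To answer the question above: 'predicted' is an array with one integer class index per test
#document. A minimal sketch for printing the prediction row by row (the [:10] slice just
#keeps the output short - drop it to print every row):
for i, label in enumerate(predicted[:10]):
    print(i, twenty_test.target_names[label])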

#SVM
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                               alpha=1e-3, max_iter=5, tol=None, #max_iter replaces n_iter, removed in newer scikit-learn
                                               random_state=42)),])
text_clf_svm = text_clf_svm.fit(twenty_train.data, twenty_train.target)
predicted_svm = text_clf_svm.predict(twenty_test.data)
np.mean(predicted_svm == twenty_test.target)
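#For per-class detail beyond a single accuracy number (sklearn.metrics; a common follow-up
#to this tutorial): precision/recall/F1 per category plus the confusion matrix.
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted_svm,
                                    target_names=twenty_test.target_names))
print(metrics.confusion_matrix(twenty_test.target, predicted_svm))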
...