I am new to machine learning and would like help with text clustering.
Please suggest changes to the code if you feel they are needed.
My task is to cluster the input data into several clusters. For this I use TfidfVectorizer, stemming, tokenization, and the k-means algorithm.
In the output I get identical data in two different clusters, whereas I want it to end up in one and the same cluster.
Below are sample data and the code I wrote.
Seat Allocation has been delayed, please wait sometime., Next Update Date : 03/03/2016 16:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 04/05/2018 15:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 05/06/2013 14:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 06/07/2014 13:05:21
Seat Allocation has been delayed., Next Update Date : 06/03/2018 08:44:48
Seat Allocation has been delayed., Next Update Date : 23/02/2018 15:36:18
Seat Allocation has been delayed., Next Update Date : 08/03/2018 11:19:26
Seat Allocation has been delayed., Next Update Date : 20/03/2018 09:41:21
Seat Allocation has been delayed., Next Update Date : 27/07/2018 11:13:37
Seat Allocation has been delayed., Next Update Date : 22/01/2018 13:46:25
Need background Verification
Need background Verification
Need background Verification
Need background Verification
Sent for verification
Sent for verification
Sent for verification
Sent for verification
The rows "Seat Allocation has been delayed ...." fall into two different clusters.
For example, the rows
Seat Allocation has delayed, Next Update Date : 03/03/2015 16:05:21 0
Seat Allocation has delayed, Next Update Date : 03/04/2016 16:05:22 0
go into cluster 0, while
Seat Allocation has been delayed, please wait sometime., Next Update Date : 04/05/2018 15:05:21
Seat Allocation has been delayed, please wait sometime., Next Update Date : 05/06/2013 14:05:21
go into cluster 1.
I have also tried decreasing and increasing the number of clusters, but that did not help. I am using the Python programming language.
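One thing worth noting: the two variants above are not byte-identical ("has delayed" vs. "has been delayed, please wait sometime."), so their TF-IDF vectors genuinely differ. To make that concrete, here is a small self-contained check (my own illustration, not part of the pipeline below):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = "Seat Allocation has delayed, Next Update Date : 03/03/2015 16:05:21"
b = "Seat Allocation has been delayed, please wait sometime., Next Update Date : 04/05/2018 15:05:21"

# Alphabetic tokens only, roughly matching the RegexpTokenizer used below,
# so the date/time digits never become features.
vec = TfidfVectorizer(token_pattern=r"[a-zA-Z']+")
m = vec.fit_transform([a, b])
print(cosine_similarity(m[0], m[1]))  # below 1.0: 'been', 'please', 'wait', 'sometime' occur only in b

If these two phrasings should count as the same message, they would need to be normalized to a common form before vectorizing (see the sanity-check sketch after the code).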
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
get_ipython().run_line_magic('matplotlib', 'inline')  # Jupyter only; equivalent to %matplotlib inline
data = pd.read_excel("C:\\Users\\Desktop\\project\\SampleInput.xlsx")
punc = ['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}',"%"]
stop_words = text.ENGLISH_STOP_WORDS.union(punc)
desc = data['comments_long'].values
vectorizer = TfidfVectorizer(stop_words = stop_words)
X = vectorizer.fit_transform(desc)
word_features = vectorizer.get_feature_names()
# Stemming and tokenizing
stemmer = SnowballStemmer('english')
tokenizer = RegexpTokenizer(r'[a-zA-Z\']+')
def tokenize(text):
    return [stemmer.stem(word) for word in tokenizer.tokenize(text.lower())]
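# Illustrative output (assumed; exact stems depend on SnowballStemmer):
# tokenize("Seat Allocation has been delayed") -> ['seat', 'alloc', 'has', 'been', 'delay']
# Note: the [a-zA-Z']+ pattern drops digits, so the date/time stamps never become tokens.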
# Vectorization with stop words and the stemming tokenizer
vectorizer2 = TfidfVectorizer(stop_words = stop_words, tokenizer = tokenize)
X2 = vectorizer2.fit_transform(desc)
word_features2 = vectorizer2.get_feature_names()
print(len(word_features2))
print(word_features2[:50])
vectorizer3 = TfidfVectorizer(stop_words = stop_words, tokenizer = tokenize, max_features = 1000)
X3 = vectorizer3.fit_transform(desc)
words = vectorizer3.get_feature_names()
# K-means: elbow method to choose the number of clusters (KMeans is already imported above)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X3)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.savefig('elbow.png')
plt.show()
true_k = 3  # chosen from the elbow plot above
kmeans = KMeans(n_clusters=true_k, n_init=20, random_state=0)  # n_init: number of runs with different centroid seeds (the best is kept), not iterations
kmeans.fit(X3)
# Top 25 terms closest to each cluster centroid
common_words = kmeans.cluster_centers_.argsort()[:,-1:-26:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))
data_new = pd.DataFrame()
data['Cluster_Id'] = kmeans.labels_
data_new['X']=desc
data_new['Cluster_Id'] = kmeans.labels_
data_new.to_excel('outputNew'+str(true_k)+'test.xlsx',sheet_name='All_Data', index=False)
# Re-open the workbook to append one sheet per cluster (this writer.book pattern works with older pandas/openpyxl versions)
from openpyxl import load_workbook
book = load_workbook('outputNew'+str(true_k)+'test.xlsx')
writer = pd.ExcelWriter('outputNew'+str(true_k)+'test.xlsx', engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
# Write each cluster to its own sheet of the same workbook
for i in range(true_k):
    data_new[data_new['Cluster_Id'] == i].to_excel(writer, sheet_name='cluster' + str(i), index=False)
writer.save()
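As a sanity check (a sketch of my own, using the data_new frame built above): truly identical strings always get identical TF-IDF rows, so a fitted k-means cannot split them across clusters. If the check below prints anything, the "identical" rows actually differ somewhere (wording, stray whitespace, casing):

check = data_new.groupby('X')['Cluster_Id'].nunique()
print(check[check > 1])  # texts assigned to more than one cluster id (expected: empty)

For the two wordings quoted earlier, merging them would need either a smaller k or an explicit normalization of the phrasing before vectorizing, since their token sets genuinely differ.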