Clustering documents from a document-topic matrix
July 3, 2019

A very basic question from a newcomer. I have a document-topic matrix and I want to cluster similar documents by topic. What is the best way to do this? Sample code is welcome.

Some background: I was able to convert the CSV below into a matrix using pandas.pivot_table and then ran sklearn.cluster.KMeans on it. I don't think this gives the right results: it seems to have used the numeric labels from the document names as an axis when forming the clusters.

Sample document-topic result set in CSV format:

docname,topic,proportion
doc1,081,1.0
doc2,076,1.0
doc3,035,0.894904
doc3,014,0.031128
doc3,022,0.023992
doc3,026,0.018978
doc3,083,0.018909
doc3,019,0.012089
doc4,060,0.874393
doc4,014,0.033226
doc4,010,0.022646
doc4,018,0.014915
doc4,042,0.013806
doc4,036,0.013683
doc4,086,0.009993
doc4,015,0.009027
doc4,002,0.00443
doc4,092,0.003881
doc5,002,0.915456
doc5,014,0.031435
doc5,038,0.022754
doc5,032,0.013058
doc5,039,0.007087
doc5,013,0.005142
doc5,017,0.005069
doc6,040,0.363076
doc6,014,0.099822
doc6,022,0.082712
doc6,023,0.077025
doc6,015,0.072956
doc6,009,0.065788
doc6,018,0.034809
doc6,024,0.032451
doc6,064,0.032443
doc6,019,0.031513
doc7,038,0.901264
doc7,063,0.040506
doc7,014,0.019326
doc7,027,0.01604
doc7,028,0.012413
doc7,021,0.010451
doc8,020,0.936566
doc8,018,0.012819
doc8,032,0.01087
doc8,009,0.01062
doc8,083,0.010247
doc8,014,0.005821
doc8,090,0.00462
doc8,045,0.004255
doc8,076,0.004184
...
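
One possible way to approach this, as a minimal sketch rather than a definitive answer. It assumes the odd results come from the document identifier ending up among the features, so it pivots with docname as the index and passes only the topic proportions to KMeans. The inline rows are a small subset copied from the sample above; n_clusters = 3 and the L2 row normalisation are illustrative assumptions, not values from the question.

```python
import io

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# A few rows copied from the sample CSV above (subset, for illustration only).
csv_text = """docname,topic,proportion
doc1,081,1.0
doc2,076,1.0
doc3,035,0.894904
doc3,014,0.031128
doc5,002,0.915456
doc5,014,0.031435
doc5,038,0.022754
doc7,038,0.901264
doc7,014,0.019326
"""

# Read topic IDs as strings so "081" stays a label and is never
# interpreted as the number 81.
df = pd.read_csv(io.StringIO(csv_text), dtype={"docname": str, "topic": str})

# Rows = documents, columns = topics, missing (doc, topic) pairs = 0.
# docname becomes the index, so it is not part of the feature matrix.
matrix = df.pivot_table(
    index="docname",
    columns="topic",
    values="proportion",
    aggfunc="sum",
    fill_value=0.0,
)

# Assumption: L2-normalise each row so that Euclidean k-means groups
# documents by the direction of their topic mix (cosine-like similarity).
X = normalize(matrix.to_numpy())

# Assumption: the number of clusters is a placeholder to be tuned.
n_clusters = 3
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Attach the cluster label back to each document name.
clusters = pd.Series(labels, index=matrix.index, name="cluster")
print(clusters)
```

With the full data set, n_clusters would need tuning (for example by comparing silhouette scores), and documents dominated by a single unique topic, such as doc1 and doc2 in the sample, will not be close to anything else under this representation.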