
Printing topics from LDA

June 11, 2018

I wrote the following Python code:

from pyspark.ml.feature import CountVectorizer, IDF

# Create a new CountVectorizer model without stop words
cv = CountVectorizer(inputCol="filtered", outputCol="rawFeatures", vocabSize=1000)
cvmodel = cv.fit(wordsDataFrame)
df_vect = cvmodel.transform(wordsDataFrame)
vocab = cvmodel.vocabulary

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(df_vect)
rescaledData = idfModel.transform(df_vect) # TFIDF
rescaledData.show(5, truncate=False)

It shows the following data:

+--------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
|vector                                                                                            |filtered                                                                              |rawFeatures                                           |features                                                                                                                                   |
+--------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
|[born, of, fury:, the, league:, nemesis, rising, (the, league:, nemesis, rising, series, book, 7)]|[born, of, fury:, the, league:, nemesis, rising, league:, nemesis, rising, series, 7)]|(1000,[0,1,9,213,423,601],[1.0,1.0,1.0,1.0,2.0,1.0])  |(1000,[0,1,9,213,423,601],[0.8399970566419033,1.3207393220753736,3.588772706976508,5.899551256459104,12.870815036534346,6.735710150128001])|
|[mr., beautiful, (up, in, the, air, book, 4)]                                                     |[beautiful, in, the, 4)]                                                              |(1000,[0,3,37,448],[1.0,1.0,1.0,1.0])                 |(1000,[0,3,37,448],[0.8399970566419033,2.3785215263148416,4.670066572599593,6.4924494566541915])                                           |
|[law, 101:, everything, you, need, to, know, about, american, law,, fourth, edition]              |[everything, need, to, know, about, law,, fourth, edition]                            |(1000,[4,39,74,241,371,595],[1.0,1.0,1.0,1.0,1.0,1.0])|(1000,[4,39,74,241,371,595],[2.5707044711621987,4.687844818620877,5.181231121974657,6.018783777801941,6.290064964714306,6.698960607919259])|
|[the, queen, of, four, kingdoms]                                                                  |[the, queen, of, four, kingdoms]                                                      |(1000,[0,1,183,208],[1.0,1.0,1.0,1.0])                |(1000,[0,1,183,208],[0.8399970566419033,1.3207393220753736,5.769356827395366,5.921813452946576])                                           |
|[the, long, utopia, (the, long, earth, book, 4)]                                                  |[the, long, utopia, long, earth, 4)]                                                  |(1000,[0,37,226,267],[1.0,1.0,2.0,1.0])               |(1000,[0,37,226,267],[0.8399970566419033,4.670066572599593,11.925863399752096,6.062820245655881])                                          |
+--------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------+------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows



from pyspark.ml.linalg import Vectors, SparseVector
from pyspark.ml.clustering import LDA


lda = LDA(k=2, seed=1, optimizer="em",featuresCol="features")
ldamodel = lda.fit(rescaledData)

ldamodel.isDistributed()
ldamodel.vocabSize()

ldatopics = ldamodel.describeTopics()
ldatopics.show(4, truncate=False)

I get the following output:

+-----+--------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topic|termIndices                     |termWeights                                                                                                                                                                                                                |
+-----+--------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0    |[0, 1, 2, 3, 4, 5, 6, 11, 7, 15]|[0.029602078686188613, 0.02568681483516444, 0.02150828913028423, 0.014503573554581532, 0.012339484082166114, 0.009470486581565513, 0.008819748910895639, 0.007354415224764272, 0.0067402576543246885, 0.006714357298177835]|
|1    |[0, 1, 2, 3, 4, 6, 5, 7, 10, 8] |[0.030649713015016164, 0.02664561321366636, 0.022549172400914988, 0.015334567312673663, 0.014809299873922247, 0.009715078063864171, 0.009074756787551434, 0.006890872876020724, 0.006788141630992759, 0.00675036500454095] |
+-----+--------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

How do I print the actual words for the term indices in the table above?
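One possible approach, as a rough sketch: the values in termIndices are positions in the vocabulary list that the fitted CountVectorizer model exposes as cvmodel.vocabulary, so each index can be looked up there. The helper name indices_to_terms and the sample vocabulary and topics below are hypothetical placeholders, not the real data from this question. The Spark part is left as comments because it needs a live session and the DataFrames defined above.

```python
def indices_to_terms(term_indices, vocabulary):
    """Map vocabulary positions (as found in termIndices) to their words."""
    return [vocabulary[i] for i in term_indices]


# Hypothetical stand-in for cvmodel.vocabulary (NOT the real vocabulary).
sample_vocab = ["the", "of", "a", "and", "to", "in", "for", "book"]

# Hypothetical stand-in for (topic, termIndices) rows from describeTopics().
sample_topics = [(0, [0, 1, 2, 3]), (1, [0, 2, 5, 7])]

for topic, term_indices in sample_topics:
    print(topic, indices_to_terms(term_indices, sample_vocab))

# With the objects from the question, the same lookup could be applied as a
# UDF (assumes cvmodel and ldatopics exist; not executed here):
#
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import ArrayType, StringType
#
#   vocab = cvmodel.vocabulary
#   to_terms = udf(lambda idx: indices_to_terms(idx, vocab),
#                  ArrayType(StringType()))
#   ldatopics.withColumn("terms", to_terms(ldatopics["termIndices"])) \
#            .show(truncate=False)
```

Since describeTopics() returns only a handful of rows (one per topic), collecting them to the driver and doing the lookup in a plain loop would presumably work just as well as a UDF.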

...