If you are interested in faster performance and/or using tidy data principles, then you may not want to use the tm package at all. The code below uses tidytext and stm for the topic modeling instead. Once your data is in memory (I recommend using readr::read_lines()
with text files), you would do something like this:
library(tidyverse)
library(tidytext)
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help.
#> Papers, resources, and other materials at structuraltopicmodel.com
library(janeaustenr)
austen_sparse <- austen_books() %>%     ## austen_books() stands in for the output of read_lines()
  unnest_tokens(word, text) %>%         ## one row per word
  anti_join(stop_words) %>%             ## drop common stop words
  count(book, word) %>%                 ## word counts per book
  cast_sparse(book, word, n)            ## cast_sparse() converts the counts to a sparse DTM
#> Joining, by = "word"
topic_model <- stm(austen_sparse, K = 12, verbose = FALSE, init.type = "Spectral")
summary(topic_model)
#> A topic model with 12 topics, 6 documents and a 13914 word dictionary.
#> Topic 1 Top Words:
#> Highest Prob: anne, captain, elliot, lady, wentworth, charles, time
#> FREX: elliot, wentworth, walter, anne, russell, musgrove, louisa
#> Lift: acknowledgement, lyme, benwick, henrietta, musgrove, walter, kellynch
#> Score: elliot, wentworth, walter, russell, musgrove, anne, louisa
#> Topic 2 Top Words:
#> Highest Prob: emma, miss, harriet, weston, knightley, elton, jane
#> FREX: weston, knightley, elton, woodhouse, fairfax, churchill, hartfield
#> Lift: _broke_, elton's, bates, elton, emma's, enscombe, fairfax
#> Score: emma, weston, knightley, elton, woodhouse, fairfax, harriet
#> Topic 3 Top Words:
#> Highest Prob: elinor, marianne, time, dashwood, sister, edward, mother
#> FREX: elinor, marianne, dashwood, jennings, willoughby, brandon, ferrars
#> Lift: 1811, dashwoods, jennings's, palmer, barton, berkeley, brandon
#> Score: elinor, marianne, dashwood, jennings, willoughby, lucy, brandon
#> Topic 4 Top Words:
#> Highest Prob: fanny, crawford, miss, sir, edmund, time, thomas
#> FREX: crawford, edmund, bertram, norris, rushworth, mansfield, julia
#> Lift: _allow_, bertram, crawford, crawford's, norris, rushworth, susan
#> Score: fanny, crawford, edmund, thomas, bertram, norris, rushworth
#> Topic 5 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: abbeys, average, camilla, causeless, closets, convent, cravats
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 6 Top Words:
#> Highest Prob: elizabeth, darcy, bennet, miss, jane, bingley, time
#> FREX: darcy, bennet, bingley, wickham, collins, lydia, lizzy
#> Lift: _accident_, lucas, bennet, bingley, bourgh, collins, darcy's
#> Score: darcy, elizabeth, bennet, bingley, wickham, collins, lydia
#> Topic 7 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: affrighted, andrews, average, blaize, camilla, causeless, closets
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 8 Top Words:
#> Highest Prob: anne, captain, elliot, lady, wentworth, charles, time
#> FREX: elliot, wentworth, walter, anne, russell, musgrove, louisa
#> Lift: alicia, lyme, musgrove, walter, benwick, henrietta, kellynch
#> Score: elliot, wentworth, walter, russell, musgrove, anne, louisa
#> Topic 9 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: alps, andrews, blaize, france, gloucestershire, heroic, heroine
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 10 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: antiquity, france, gloucestershire, heroic, lid, eleanor, eleanor's
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 11 Top Words:
#> Highest Prob: anne, captain, elliot, lady, wentworth, charles, time
#> FREX: elliot, wentworth, walter, anne, russell, musgrove, louisa
#> Lift: archibald, lyme, walter, benwick, henrietta, kellynch, musgrove
#> Score: elliot, wentworth, walter, russell, musgrove, anne, louisa
#> Topic 12 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, anyone's
#> Lift: anyone's, eleanor, eleanor's, heroine, northanger, thorpe's, thorpes
#> Score: catherine, tilney, thorpe, morland, allen, anyone's, isabella
tidy(topic_model)
#> # A tibble: 166,968 x 3
#>    topic term      beta
#>    <int> <chr>    <dbl>
#>  1     1 1     1.18e- 4
#>  2     2 1     1.15e-19
#>  3     3 1     5.51e- 5
#>  4     4 1     1.33e-19
#>  5     5 1     4.20e- 5
#>  6     6 1     2.68e- 5
#>  7     7 1     4.20e- 5
#>  8     8 1     1.18e- 4
#>  9     9 1     4.20e- 5
#> 10    10 1     4.20e- 5
#> # … with 166,958 more rows
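The tidy output has one row per topic-term pair, so ordinary dplyr verbs are enough to pull out the top terms for each topic. Here is a minimal sketch of that step. It is not part of the reprex output: the toy_beta table and its beta values are made up so that the snippet runs on its own, and slice_max() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical values in the same shape as tidy(topic_model):
# one row per topic-term pair. The numbers are invented for illustration.
toy_beta <- tibble(
  topic = c(1L, 1L, 1L, 2L, 2L, 2L),
  term  = c("anne", "elliot", "time", "emma", "harriet", "time"),
  beta  = c(0.50, 0.30, 0.20, 0.60, 0.25, 0.15)
)

# Keep the two highest-probability terms within each topic.
top_terms <- toy_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 2) %>%
  ungroup()

top_terms
```

With a real model you would start from tidy(topic_model) instead of toy_beta and probably ask for more than two terms per topic.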
Created on 2020-03-25 by the reprex package (v0.3.0)
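For your own files, the readr::read_lines() step could look roughly like the sketch below. The file names, the temporary directory, and the doc column are all hypothetical, chosen only so the example runs on its own; the idea is to end up with a table shaped like austen_books(), with one row per line of text plus a document identifier.

```r
library(dplyr)
library(readr)
library(tidytext)

# Two hypothetical text files written to a temporary directory so the
# sketch is self-contained; point `paths` at your real files instead.
dir <- tempfile("texts")
dir.create(dir)
write_lines(c("The captain walked to Lyme.", "Anne stayed at home."),
            file.path(dir, "doc_a.txt"))
write_lines(c("Emma visited Harriet.", "Harriet admired Emma."),
            file.path(dir, "doc_b.txt"))
paths <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

# One row per line of text, tagged with the file it came from.
texts <- bind_rows(lapply(paths, function(p) {
  tibble(doc = basename(p), text = read_lines(p))
}))

# Same pipeline as in the answer, with `doc` playing the role of `book`.
my_sparse <- texts %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(doc, word) %>%
  cast_sparse(doc, word, n)

dim(my_sparse)
```

The resulting my_sparse has one row per file and can be passed to stm() the same way austen_sparse is above. A corpus this small is only useful for checking the data shape, not for fitting a meaningful model.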