First of all, you can add document names to your corpus:
document_names <- c("doc1", "doc2", "doc3")
a_corpus <- quanteda::corpus(x = c("some corpus text of no consequence that in practice is going to be very large",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
"adding some corpus text word repeats to ensure ngrams top ten selection approaches are working"),
docnames = document_names)
a_corpus
# Corpus consisting of 3 documents and 0 docvars.
The document names are now available in subsequent calls to quanteda functions.
ngrams_dfm <- quanteda::dfm(a_corpus, tolower = TRUE, stem = FALSE, ngrams = 2)
ngrams_dfm
# Document-feature matrix of: 3 documents, 43 features (63.6% sparse).
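The `dfm()` call above passes `ngrams` straight to `dfm()`, which, as far as I know, only works in older quanteda releases. A minimal sketch of the equivalent steps, assuming quanteda 3 or later, where `dfm()` expects a tokens object and bigrams are built with `tokens_ngrams()`. The two short texts and the names `txt`, `toks`, `bigrams` and `bigram_dfm` are illustrative and not part of the original answer:

```r
library(quanteda)

# Toy texts; the vector names become the document names
txt <- c(doc1 = "some corpus text of no consequence",
         doc2 = "adding some corpus text word repeats")

toks <- tokens(corpus(txt))                   # tokenize first
bigrams <- tokens_ngrams(toks, n = 2)         # join adjacent tokens with "_"
bigram_dfm <- dfm(bigrams, tolower = TRUE)    # then build the dfm

docnames(bigram_dfm)
featnames(bigram_dfm)
```

The document names carry through to the dfm in the same way as with the `docnames` argument used above.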
You can also use the groups argument of textstat_frequency to get the document names in the frequency results:
freq <- textstat_frequency(ngrams_dfm, groups = docnames(ngrams_dfm))
head(freq)
           feature frequency rank docfreq group
1      some_corpus         1    1       1  doc1
2      corpus_text         1    2       1  doc1
3          text_of         1    3       1  doc1
4            of_no         1    4       1  doc1
5   no_consequence         1    5       1  doc1
6 consequence_that         1    6       1  doc1
If you want to get the data from ngrams_dfm into a data.frame, quanteda has a convert function:
convert(ngrams_dfm, to = "data.frame")
document some_corpus corpus_text text_of of_no no_consequence consequence_that that_in in_practice practice_is is_going going_to to_be
1 doc1 1 1 1 1 1 1 1 1 1 1 1 1
2 doc2 0 0 0 0 0 0 0 0 0 0 0 0
3 doc3 1 1 0 0 0 0 0 0 0 0 0 0
You can reshape this to get what you want. Here is an example with dplyr / tidyr:
library(dplyr)
convert(ngrams_dfm, to = "data.frame") %>%
  tidyr::gather(feature, frequency, -document) %>%
  group_by(document, feature) %>%
  summarise(frequency = sum(frequency))
# A tibble: 129 x 3
# Groups: document [?]
document feature frequency
<chr> <chr> <dbl>
1 doc1 a_very 0
2 doc1 about_top 0
3 doc1 adding_some 0
4 doc1 and_so 0
5 doc1 approaches_are 0
6 doc1 are_working 0
7 doc1 be_very 1
8 doc1 but_for 0
9 doc1 care_about 0
10 doc1 consequence_that 1
# ... with 119 more rows
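The same wide-to-long reshape can also be written with `tidyr::pivot_longer()`, the newer replacement for `gather()`. This is only a sketch: the small `wide` data frame is a hypothetical stand-in for the output of `convert(ngrams_dfm, to = "data.frame")`, and the final filter is an optional extra step that drops the zero counts:

```r
library(tidyr)

# Hypothetical stand-in for convert(ngrams_dfm, to = "data.frame")
wide <- data.frame(
  document    = c("doc1", "doc2", "doc3"),
  some_corpus = c(1, 0, 1),
  corpus_text = c(1, 0, 1),
  care_about  = c(0, 1, 0)
)

# One row per document/feature pair
long <- pivot_longer(wide, cols = -document,
                     names_to = "feature", values_to = "frequency")

# Keep only the ngrams that actually occur in each document
long <- long[long$frequency > 0, ]
long
```

Because each document/feature pair appears only once after the reshape, no grouping or summing is needed in this sketch.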
or with data.table:
library(data.table)
out <- data.table(convert(ngrams_dfm, to = "data.frame"))
melt(out, id.vars = "document",
     variable.name = "feature", value.name = "freq")
document feature freq
1: doc1 some_corpus 1
2: doc2 some_corpus 0
3: doc3 some_corpus 1
4: doc1 corpus_text 1
5: doc2 corpus_text 0
---
125: doc2 care_about 1
126: doc3 care_about 0
127: doc1 about_top 0
128: doc2 about_top 1
129: doc3 about_top 0