I am trying to create a document-term matrix (dtm), but I ran into the error below.
dtm <- CreateDtm(tokens$text,
                 stopword_vec = c(stopwords::stopwords("en")),
                 doc_names = tokens$ID,
                 ngram_window = c(1, 2),
                 lower = TRUE,
                 remove_punctuation = TRUE,
                 remove_numbers = TRUE)
Error:
Error in seq.default(1, length(tokens), 5000) : wrong sign in 'by' argument
dput(head(tokens))
gives the following:
structure(list(Index = c(0, 1, 2, 3, 4, 5), Paper = c("9201001",
"9201002", "9201003", "9201004", "9201005", "9201006"), `1` = c("combinatorics",
"inomogeneous", "intersection", "heterotic", "ward", "symmetries"
), `2` = c("modular", "quantum", "theory", "green", "identities",
"massless"), `3` = c("ii", "symmetries", "integrable", "schwarz",
"dimensional", "field"), `4` = c("", "phonons", "hierarchies",
"superstring", "string", "theories"), `5` = c("", "", "topological",
"super", "theory", ""), `6` = c("", "", "field", "worldsheet",
"", ""), `7` = c("", "", "theory", "", "", ""), `8` = c("", "",
"", "", "", ""), `9` = c("", "", "", "", "", ""), `10` = c("",
"", "", "", "", ""), `11` = c("", "", "", "", "", ""), `12` = c("",
"", "", "", "", "")), row.names = c(NA, -6L), groups = structure(list(
Paper = c("9201001", "9201002", "9201003", "9201004", "9201005",
"9201006"), .rows = list(1L, 2L, 3L, 4L, 5L, 6L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"), .drop = FALSE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
What went wrong? Thanks in advance.
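A possible lead, judging only from the dput output above: the data frame has the columns Index, Paper and `1` to `12`, but no `text` or `ID` column. If so, `tokens$text` would be NULL, its length would be 0, and a `seq(1, 0, 5000)` call fails with exactly this kind of error. The sketch below uses base R only and a shortened, hypothetical copy of the data (three rows, four word columns). The `text` column it builds and the use of `Paper` as document names are assumptions, not something the original code confirms.

```r
# Shortened, hypothetical copy of the data shown in the dput output.
tokens <- data.frame(
  Index = c(0, 1, 2),
  Paper = c("9201001", "9201002", "9201003"),
  `1` = c("combinatorics", "inomogeneous", "intersection"),
  `2` = c("modular", "quantum", "theory"),
  `3` = c("ii", "symmetries", "integrable"),
  `4` = c("", "phonons", "hierarchies"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# There is no `text` column, so `$` returns NULL (length 0).
missing_text <- is.null(tokens$text)

# seq() from 1 to 0 with a positive step is an error; capture it instead of stopping.
batch_error <- tryCatch(
  seq(1, length(tokens$text), 5000),
  error = function(e) e
)

# Assumed fix: paste the word columns into one string per paper.
word_cols <- setdiff(names(tokens), c("Index", "Paper"))
tokens$text <- trimws(do.call(paste, c(tokens[word_cols], sep = " ")))

print(missing_text)
print(inherits(batch_error, "error"))
print(tokens$text)
```

With a `text` column in place, the original call could then be tried with `doc_vec = tokens$text` and `doc_names = tokens$Paper`. On the real grouped tibble it may also help to call `dplyr::ungroup()` first, although I have not verified that the grouping matters here.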