Вы можете попробовать альтернативный пакет: quanteda . Позволяет удалять стоп-слова после токенизации или после создания матрицы элементов документа. Ниже я использовал pad = TRUE
просто, чтобы показать слоты, в которых были удалены токены, соответствующие стоп-словам.
library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
text <- "text is my text but also his text."
tokens(text) %>%
tokens_remove(stopwords("en"), pad = TRUE)
## tokens from 1 document.
## text1 :
## [1] "text" "" "" "text" "" "also" "" "text" "."
В качестве альтернативы:
dfm(text)
## Document-feature matrix of: 1 document, 7 features (0.0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
## features
## docs text is my but also his .
## text1 3 1 1 1 1 1 1
dfm(text, remove_punct = TRUE) %>%
dfm_remove(stopwords("en"))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
## features
## docs text also
## text1 3 1
Список английских стоп-слов - это просто символьный вектор, возвращаемый функцией stopwords()
(которая на самом деле происходит из пакета stopwords ). Список по умолчанию на английском языке такой же, как и tm::stopwords("en")
, за исключением того, что в пакет tm входит "will". (Если вам нужен список SMART, это stopwords("en", source = "smart")
.)
stopwords("en")
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
## [101] "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an"
## [111] "the" "and" "but" "if" "or"
## [116] "because" "as" "until" "while" "of"
## [121] "at" "by" "for" "with" "about"
## [126] "against" "between" "into" "through" "during"
## [131] "before" "after" "above" "below" "to"
## [136] "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again"
## [146] "further" "then" "once" "here" "there"
## [151] "when" "where" "why" "how" "all"
## [156] "any" "both" "each" "few" "more"
## [161] "most" "other" "some" "such" "no"
## [166] "nor" "not" "only" "own" "same"
## [171] "so" "than" "too" "very" "will"