Question

I'm creating a word cloud with the wordcloud package in R, following "Word Cloud in R".

I can do this easily enough, but I want to remove certain words from the word cloud. I have the words in a file (it is actually an Excel file, but I can change that), and I want to exclude all of them; there are a couple of hundred. Any suggestions?

require(XML)
require(tm)
require(wordcloud)
require(RColorBrewer)
ap.corpus=Corpus(DataframeSource(data.frame(as.character(data.merged2[,6]))))
ap.corpus=tm_map(ap.corpus, removePunctuation)
ap.corpus=tm_map(ap.corpus, tolower)
ap.corpus=tm_map(ap.corpus, function(x) removeWords(x, stopwords("english")))
ap.tdm=TermDocumentMatrix(ap.corpus)
ap.m=as.matrix(ap.tdm)
ap.v=sort(rowSums(ap.m),decreasing=TRUE)
ap.d=data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)

Ben · Answer 1 · December 24, 2011

@Tyler Rinker gave the answer: just add another removeWords() line. Here is a bit more detail.

Say your Excel file is called nuts.xls and contains a single column of words, like this:

stopwords
peanut
cashew
walnut
almond
macadamia

In R you can do this:

     library(gdata) # package with xls import function
     library(tm)
     # now load the excel file with the custom stoplist, note a few of the arguments here 
     # to clean the data by removing spaces that excel seems to insert and prevent it from 
     # importing the characters as factors. You can use any args from read.table(), which is
     # handy
     nuts<-read.xls("nuts.xls", header=TRUE, stringsAsFactors=FALSE, strip.white=TRUE)

     # now make some words to build a corpus to test for a two-step stopword removal process...
     words1<- c("peanut, cashew, walnut, macadamia, apple, pear, orange, lime, mandarin, and, or, but")
     words2<- c("peanut, cashew, walnut, almond, apple, pear, orange, lime, mandarin, if, then, on")
     words3<- c("peanut, walnut, almond, macadamia, apple, pear, orange, lime, mandarin, it, as, an")
     words.all<-data.frame(rbind(words1,words2,words3))
     words.corpus<-Corpus(DataframeSource((words.all)))

     # now remove the standard list of stopwords, like you've already worked out
     words.corpus.nostopwords <- tm_map(words.corpus, removeWords, stopwords("english"))
     # now remove the second set of stopwords, this time your custom set from the excel file, 
     # note that it has to be a reference to a character vector containing the custom stopwords
     words.corpus.nostopwords <- tm_map(words.corpus.nostopwords, removeWords, nuts$stopwords)

     # have a look to see if it worked
     inspect(words.corpus.nostopwords)
     A corpus with 3 text documents

     The metadata consists of 2 tag-value pairs and a data frame
     Available tags are:
          create_date creator 
     Available variables in the data frame are:
          MetaID 

     $words1
        , , , , apple, pear, orange, lime, mandarin, , , 

     $words2
        , , , , apple, pear, orange, lime, mandarin, , , 

     $words3
        , , , , apple, pear, orange, lime, mandarin, , ,

Success! The standard stopwords are gone, and so are the words in the custom list from the Excel file. There are no doubt other ways to do this.

Mano Yakandawala · Answer 2 · October 16, 2017

Convert the data you want to turn into a word cloud into a data frame. Create a CSV file with the words you want to exclude and read it in as a data frame. Then you can do an anti_join:
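
Since the asker mentions the list does not have to stay in Excel, here is a minimal sketch of the same two-step removal with the stoplist kept in a plain text file, one word per line. The file, its contents, and the names stop_file, custom_stop and texts are assumptions made up for illustration; the sketch writes a temporary file so that it runs on its own, and it uses VCorpus(VectorSource(...)) rather than the DataframeSource call above.

```r
library(tm)

# Hypothetical stoplist file, one word per line (written here so the sketch is self-contained)
stop_file <- tempfile(fileext = ".txt")
writeLines(c("peanut", "cashew", "walnut", "almond", "macadamia"), stop_file)

# Read the custom stoplist; trim stray whitespace, lower-case, drop empty lines
custom_stop <- tolower(trimws(readLines(stop_file)))
custom_stop <- custom_stop[nzchar(custom_stop)]

# Toy documents standing in for the real text column
texts <- c("peanut cashew apple pear and or",
           "walnut almond orange lime if then")
corpus <- VCorpus(VectorSource(texts))
corpus <- tm_map(corpus, content_transformer(tolower))

# Remove the standard and the custom stopwords in one pass
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), custom_stop))
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out to check the result
remaining <- trimws(sapply(corpus, as.character))
print(remaining)
```

Lower-casing both the corpus and the stoplist matters because removeWords() matches whole words case-sensitively.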

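library(dplyr)

# name the table dimension so the resulting columns are Words and Freq
allWords = as.data.frame(table(Words = bigWords$Words), stringsAsFactors = FALSE)

# the CSV needs a header row with a column called Words
wordsToAvoid = read.csv("wordsToDrop.csv", stringsAsFactors = FALSE)

finalWords = anti_join(allWords, wordsToAvoid, by = "Words")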
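
If dplyr is not available, the same filtering can be sketched in base R with %in%. The data below is invented for illustration: word_freq is a toy table shaped like ap.d from the question, and words_to_drop is a hypothetical exclusion list that in practice might come from read.csv().

```r
# Toy frequency table shaped like ap.d from the question (columns word and freq)
word_freq <- data.frame(
  word = c("apple", "peanut", "pear", "cashew", "lime"),
  freq = c(12, 9, 7, 5, 3),
  stringsAsFactors = FALSE
)

# Hypothetical exclusion list, deliberately messy (mixed case, stray spaces)
words_to_drop <- c("Peanut", " cashew ")

# Normalise both sides so case and whitespace do not cause missed matches
drop_clean <- tolower(trimws(words_to_drop))

# Keep only the rows whose word is not in the exclusion list
kept <- word_freq[!(tolower(word_freq$word) %in% drop_clean), ]
print(kept)

# Optional, needs the wordcloud package:
# wordcloud::wordcloud(kept$word, kept$freq, min.freq = 1)
```

Filtering the frequency table this way leaves the corpus untouched, so it only affects what is drawn in the cloud.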

How do I remove words from a word cloud?