Word cloud analysis of the abstracts of LI-6400 literature in Wiley Online Library

祝介东

2018/11/24

Continuing from the previous post, which covered scraping the LI-6400 literature from Wiley Online Library and a quick word cloud of the titles: the amount of text there felt a bit small, so I went ahead and scraped the abstracts of those papers as well, to see what topics their keywords actually cover. Note that I removed familiar words such as photosynthesis and may, since everyone already knows these occur at high frequency and they only add noise to the analysis. The main tool is wordcloud2, which, as its author claims, is probably the best word cloud solution available in R.

library(tm)
library(SnowballC)
library(wordcloud2)
library(RColorBrewer)

# Read the Data
abstract <- readLines("./data/wiley_abs.txt")

# convert the abstract text (possibly multibyte) to UTF-8, replacing invalid bytes
docs <- iconv(enc2utf8(abstract), sub = "byte")

docs <- Corpus(VectorSource(docs))

# replace special characters with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "-")
  
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Coerce documents back to plain text documents
docs <- tm_map(docs, PlainTextDocument)
# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# specify your stopwords as a character vector
docs <- tm_map(docs, removeWords, c("photosynthesis", "photosynthetic", "one",
              "also", "rate", "result", "suggest", "effects", "however",
              "two", "non", "less", "total", "may", "can", "using",
              "year", "supply")) 
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
#docs <- tm_map(docs, stemDocument)

# docs$content

# Build the term-document matrix and sort terms by total frequency
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 500)
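
If you only want a quick look at the high-frequency terms without scanning the full table, tm's findFreqTerms() can be applied to the same term-document matrix; a minimal sketch, where the threshold of 50 is an arbitrary choice for illustration:

# List every term whose total frequency across all abstracts is at least 50;
# adjust lowfreq to taste.
findFreqTerms(dtm, lowfreq = 50)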

letterCloud(d, word = "ECOTEK", color = "random-light",
            backgroundColor = "black", size = 0.3)

(Figure: the resulting word cloud, wordcloud.jpg)

The shape of the final word cloud uses the name of my own company, purely as an example.
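
If you do not need the letter-shaped layout, the same frequency table can be passed straight to wordcloud2(); a minimal sketch, where size and shape are illustrative values rather than the settings used above:

# Plain word cloud from the same frequency table d;
# size and shape here are only examples.
wordcloud2(d, size = 0.5, color = "random-light",
           backgroundColor = "black", shape = "circle")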