The following is not something new, but something I put together this evening, and I mainly make it available as a note to myself on what I did. If you find it useful or interesting then you are more than welcome to use and share it. You will also find lots of similar solutions on the web.
This evening I was playing around with the Text Mining (tm) package in R. So I decided to create a Word Cloud of the Advanced Analytics webpages on Oracle.com. These webpages consist of the Advanced Analytics overview webpage, the Oracle Data Mining webpages and the Oracle R Enterprise webpages.
I’ve broken the R code into a number of sections.
1. Install the required R packages
The first thing that you need to do is to install five R packages: "tm", "wordcloud", "RCurl", "XML" and "SnowballC". The first two packages are needed for the main part of the text processing and for generating the word cloud. "RCurl" and "XML" are needed by the function "htmlToText", which you can download on github. "SnowballC" is only needed if you decide to apply stemming.
install.packages (c ("tm", "wordcloud", "RCurl", "XML", "SnowballC")) # install the required packages
# load the packages and the htmlToText function
library(tm); library(wordcloud)
source("htmlToText.R") # adjust to wherever you saved the github file
2. Read in the Oracle Advanced Analytics webpages using the htmlToText function
data2 <- htmlToText("http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html")
You will need to combine each of these webpages into one character vector for processing in the later steps.
data <- c(data1, data2, data3, data4) # combine the four webpages into one character vector
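As a quick sanity check, c() simply concatenates the page texts into one character vector with one element per page. A minimal sketch using placeholder strings in place of the fetched pages:

```r
# placeholder page texts standing in for the htmlToText results
data1 <- "page one text"
data2 <- "page two text"
data3 <- "page three text"
data4 <- "page four text"

data <- c(data1, data2, data3, data4)
length(data) # one element per webpage, so 4
```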
3. Convert into a Corpus and perform Data Cleaning & Transformations
First we convert our web documents into a Corpus:
txt_corpus <- Corpus (VectorSource (data)) # create a corpus
We can use the summary function to get some of the details of the Corpus.
summary (txt_corpus)
We can see that we have 4 documents in the corpus.
A corpus with 4 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
Available variables in the data frame are:
Remove the White Space in these documents
tm_map <- tm_map (txt_corpus, stripWhitespace) # remove white space
Remove the punctuation from the documents
tm_map <- tm_map (tm_map, removePunctuation) # remove punctuation
Remove numbers from the documents
tm_map <- tm_map (tm_map, removeNumbers) # to remove numbers
Remove the typical list of Stop Words
tm_map <- tm_map (tm_map, removeWords, stopwords("english")) # remove stop words (like 'as', 'the', etc.)
Apply stemming to the documents
If needed you can also apply stemming to your data. I decided not to perform this as it seemed to truncate some of the words in the word cloud.
# tm_map <- tm_map (tm_map, stemDocument)
If you do want to perform stemming then just remove the # symbol.
Remove any additional words (you could add other words to this list)
tm_map <- tm_map (tm_map, removeWords, c("work", "use", "java", "new", "support"))
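To see what these transformations actually do, here is a base R sketch of the same cleaning on a single string (using gsub in place of the tm functions, so it runs without a corpus):

```r
txt <- "The  quick, brown   fox; 123 jumps over the LAZY dog!"
txt <- gsub("[[:punct:]]", "", txt)   # removePunctuation equivalent
txt <- gsub("[[:digit:]]", "", txt)   # removeNumbers equivalent
txt <- gsub("\\s+", " ", trimws(txt)) # stripWhitespace equivalent
txt # "The quick brown fox jumps over the LAZY dog"
```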
If you want to have a look at the output of each of the above commands you can use the inspect function. For example:
inspect (tm_map[1]) # view the first document after the transformations
4. Convert into a Term Document Matrix and Sort
Matrix <- TermDocumentMatrix(tm_map) # terms in rows
matrix_c <- as.matrix (Matrix)
freq <- sort (rowSums (matrix_c)) # frequency data
freq #to view the words and their frequencies
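The rowSums/sort step can be illustrated on a small hand-built matrix (the term names here are made up for the example):

```r
# toy term-document matrix: one row per term, one column per document
m <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("oracle", "data", "mining"), NULL))
freq <- sort(rowSums(m)) # total frequency of each term, ascending
freq # mining 2, oracle 3, data 4
```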
5. Generate the Word Cloud
tmdata <- data.frame (words=names(freq), freq)
wordcloud (tmdata$words, tmdata$freq, max.words=100, min.freq=3, scale=c(7,.5), random.order=FALSE, colors=brewer.pal(8, "Dark2"))
and the Word Cloud will look something like the following. Every time you generate the Word Cloud you will get a slightly different layout of the words.
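If you want a repeatable layout, set the random seed before calling wordcloud. A minimal self-contained sketch, assuming the wordcloud package is installed (the word list and frequencies here are made up for the example):

```r
library(wordcloud) # also attaches RColorBrewer, which provides brewer.pal

# toy frequency table standing in for the real tmdata
tmdata <- data.frame(words = c("oracle", "data", "mining", "analytics", "models"),
                     freq  = c(10, 8, 6, 4, 2))

set.seed(42) # fix the random placement so the same cloud is drawn each run
wordcloud(tmdata$words, tmdata$freq, min.freq = 1,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```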