In this post I share some resources for those who want to learn the essential steps of processing text for analysis in R.
To implement some common text mining techniques I used the tm package (Feinerer and Hornik, 2018).
```r
install.packages("tm")  # if not already installed
library(tm)

# Put the data into a corpus for text processing
text_corpus <- VectorSource(data)
text_corpus <- Corpus(text_corpus)
summary(text_corpus)

# See the text and examine the first documents of the corpus
text_corpus[[1]]$content
for (i in 1:5) print(text_corpus[[i]]$content)
```
What is preprocessing?
Text data contains characters, such as punctuation, stop words, etc., that carry no information and increase the complexity of the analysis. So, to simplify the data, we remove all this noise to obtain a clean, analyzable dataset.
“Preprocessing method plays a very important role in text mining techniques and applications. It is the first step in the text mining process.” (Vijayarani et al., 2015)
For example, English stop words like “of”, “an”, etc., do not give much information about context, sentiment, or relationships between entities. The goal is to isolate the important words of the text.
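The idea can be sketched in base R with a hand-made stop word list (a tiny illustrative subset, not the full list that the tm package ships with):

```r
# Drop common stop words from a token vector
# (this stop word list is a small illustrative subset)
tokens <- c("the", "cat", "sat", "on", "the", "mat")
stop_words <- c("the", "of", "an", "on", "a")
content_words <- tokens[!tokens %in% stop_words]
content_words  # "cat" "sat" "mat"
```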
Types of text preprocessing techniques
“Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The aim of the tokenization is the exploration of the words in a sentence. The list of tokens becomes input for further processing such as parsing or text mining.” (Gurusamy and Kannan, 2014)
```r
## Tokenization: split a text into single-word terms called "unigrams"
tokens <- Boost_tokenizer(content(text_corpus[[1]]))
head(tokens)
```
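For intuition, the same unigram split can be done in base R with a regular expression (a rough sketch; real tokenizers handle contractions, hyphens, etc. more carefully):

```r
# Split lowercased text on runs of non-word characters
text <- "Text mining, in R!"
tokens <- unlist(strsplit(tolower(text), "\\W+"))
tokens <- tokens[tokens != ""]  # drop empty strings from leading punctuation
tokens  # "text" "mining" "in" "r"
```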
Normalization is a process that includes:
- converting all letters to lower or upper case
- removing punctuation, numbers and white spaces
- removing stop words, sparse terms and particular words
Example in R, using the tm package:

```r
# Normalization: lowercase the words and remove punctuation and numbers
text_corpus_clean <- tm_map(text_corpus, content_transformer(tolower))
text_corpus_clean <- tm_map(text_corpus_clean, removePunctuation)
text_corpus_clean <- tm_map(text_corpus_clean, removeNumbers)
text_corpus_clean <- tm_map(text_corpus_clean, stripWhitespace)

## Remove stop words, plus custom stop words
myStopwords <- c(stopwords("english"), "a", "b")
## Keep selected words by dropping them from the stop word list
myStopwords <- setdiff(myStopwords, c("d", "e"))
text_corpus_clean <- tm_map(text_corpus_clean, removeWords, myStopwords)
```
“Stemming is the process of reducing inflection in words (e.g. troubled, troubles) to their root form (e.g. trouble). The “root” in this case may not be a real root word, but just a canonical form of the original word.”
For example, the words: “presentation”, “presented”, “presenting” could all be reduced to a common representation “present”.
“There are mainly two errors in stemming. Over stemming and under stemming. Over-stemming is when two words with different stems are stemmed to the same root. This is also known as a false positive. Under-stemming is when two words that should be stemmed to the same root are not. This is also known as a false negative.“(Gurusamy and Kannan, 2014)
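Both failure modes are easy to reproduce with a toy suffix-chopping rule (an illustration only, not a real stemmer such as the Porter algorithm):

```r
# A toy stemmer that just chops common suffixes
naive_stem <- function(w) sub("(ing|ed|s)$", "", w)

naive_stem("news")      # "new"     -- over-stemming: "news" is wrongly merged with "new"
naive_stem("troubled")  # "troubl"
naive_stem("troubles")  # "trouble" -- under-stemming: the two "trouble" forms never meet
```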
```r
# Stemming: reduce inflected words to their root form
text_corpus_clean <- tm_map(text_corpus_clean, stemDocument, language = "english")
writeLines(head(strwrap(content(text_corpus_clean[[1]])), 15))
```
“Lemmatization on the surface is very similar to stemming, where the goal is to remove inflections and map a word to its root form. The only difference is that, lemmatization tries to do it the proper way. It doesn’t just chop things off, it actually transforms words to the actual root. For example, the word “better” would map to “good”.”
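Lemmatization is usually dictionary-driven. Here is a minimal base-R sketch with a hand-made lemma table (real work would use a package such as textstem or udpipe, which ship full lemma dictionaries):

```r
# Map words to lemmas via lookup; unknown words pass through unchanged
lemmas <- c(better = "good", ran = "run", mice = "mouse")
lemmatize <- function(w) unname(ifelse(w %in% names(lemmas), lemmas[w], w))

lemmatize(c("better", "ran", "cats"))  # "good" "run" "cats"
```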
A document-term matrix is a sparse matrix in which each row is a document vector, with one column for every term in the entire corpus.
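To see what such a matrix holds, here is a hand-rolled version in base R on two toy documents (tm's DocumentTermMatrix does the same at scale, with sparse storage):

```r
docs <- list(c("text", "mining", "in", "r"), c("text", "analysis"))
terms <- sort(unique(unlist(docs)))  # the corpus vocabulary
# One row per document, one column per term, cells hold counts
dtm_toy <- t(sapply(docs, function(d) table(factor(d, levels = terms))))
dtm_toy
#      analysis in mining r text
# [1,]        0  1      1 1    1
# [2,]        1  0      0 0    1
```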
```r
tdm <- TermDocumentMatrix(text_corpus_clean)
# or a document-term matrix, here keeping only terms of at least 4 characters
dtm <- DocumentTermMatrix(text_corpus_clean, control = list(wordLengths = c(4, Inf)))

inspect(tdm)
# Inspect part of the term-document matrix
inspect(tdm[1:10, 1:50])
```
```r
# Frequent terms that occur between 30 and 50 times in the corpus
frequent_terms <- findFreqTerms(tdm, 30, 50)

# Word frequency
install.packages("knitr")  # if not already installed
library(knitr)
library(magrittr)  # for the pipe operator %>%

# In a term-document matrix the terms are the rows,
# so sum over each row to get a word's total frequency
words_frequency <- rowSums(as.matrix(tdm))
# Create a descending sort order
ord <- order(words_frequency, decreasing = TRUE)
# Get the top 20 words by frequency of appearance
words_frequency[head(ord, 20)] %>% kable()
```
If you want a visual representation of the most frequent terms, you can build a word cloud with the wordcloud package.
To find associations between terms you can use the findAssocs() function. As input, this function takes the DTM, a term, and a correlation limit (which varies between 0 and 1).
A correlation of 1 means ‘always together’, a correlation of 0.5 means ‘together for 50 percent of the time’.
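What findAssocs() reports is, essentially, the correlation between the count columns of two terms across documents, which you can check by hand on a toy matrix:

```r
# Rows are documents, columns are terms (a toy document-term matrix)
m <- matrix(c(1, 0, 1,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("text", "noise", "mining")))
cor(m[, "text"], m[, "mining"])  # 1: the two terms always appear together
cor(m[, "text"], m[, "noise"])   # -1: they never co-occur
```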
You can also compute dissimilarities between documents based on the DTM by using the package proxy.
```r
install.packages("proxy")  # if not already installed
library(proxy)

dis <- dissimilarity(tdm, method = "cosine")
# Note: dissimilarity() was removed from recent versions of tm; there,
# proxy::dist(as.matrix(tdm), method = "cosine") gives the same kind of result.

# Visualize the results by printing part of the big matrix
as.matrix(dis)[1:20, 1:20]
# Visualize the results as a heatmap
heatmap(as.matrix(dis)[1:20, 1:20])
```
To extract the frequency of each bigram and analyze the twenty most frequent ones you can follow the next steps.
Use Weka’s n-gram tokenizer to create a TDM that uses as terms the bigrams that appear in the corpus.
```r
library(wordcloud)
library(qdap)
library(RColorBrewer)
library(RWeka)
library(ggplot2)

# Tokenizer that produces bigrams (n = 2)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm.bigram <- TermDocumentMatrix(text_corpus_clean,
                                 control = list(tokenize = BigramTokenizer))

## Extract the frequency of each bigram and analyse the twenty most frequent ones
freq <- sort(rowSums(as.matrix(tdm.bigram)), decreasing = TRUE)
freq.df <- data.frame(word = names(freq), freq = freq)
head(freq.df, 20)

# Visualize the bigrams as a word cloud
wordcloud(freq.df$word, freq.df$freq, max.words = 100, random.order = FALSE)

# Visualize the top 15 bigrams
ggplot(head(freq.df, 15), aes(reorder(word, freq), freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("Bigrams") +
  ylab("Frequency") +
  ggtitle("Most frequent bigrams")
```
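Under the hood a bigram is just each token paired with its successor, which is easy to see in base R:

```r
tokens <- c("text", "mining", "in", "r")
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
bigrams  # "text mining" "mining in" "in r"
```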
- Bail, C. Basic Text Analysis in R
- Gurusamy, V. and Kannan, S. (2014), ‘Preprocessing Techniques for Text Mining’, Conference Paper, No. October 2014.
- Vijayarani, S., Ilamathi, M.J. and Nithya, M. (2015), ‘Preprocessing Techniques for Text Mining – An Overview’, International Journal of Computer Science & Communication Networks, Vol. 5 No. 1, pp. 7–16.
- Text mining in R
- An example in Python: Text Preprocessing in Python: Steps, Tools, and Examples
- Code tidbits for preprocessing texts
- Text Mining with R: A Tidy Approach
- Using quanteda for Text Processing