Text preprocessing for NLP and Machine Learning using R

In this post I share some resources for those who want to learn the essential steps for preparing text for analysis in R.

To implement some common text mining techniques I used the tm package (Feinerer and Hornik, 2018).

install.packages("tm") # if not already installed
   library(tm)
#put the data into a corpus for text processing
   text_corpus <- (VectorSource(data))
   text_corpus <- Corpus(text_corpus)
   summary(text_corpus)
#to see the text and examine the corpus
   text_corpus[[5]]$content
   for (i in 1:5) print (text_corpus[[i]]$content)

What is preprocessing?

Text data contains characters and tokens, such as punctuation and stop words, that do not add information and only increase the complexity of the analysis. So, to simplify our data, we remove this noise and obtain a clean, analyzable dataset.

“Preprocess means to bring your text into a form that is predictable and analyzable for your task. But text preprocessing is not directly transferable from task to task.”

“Preprocessing method plays a very important role in text mining techniques and applications. It is the first step in the text mining process.” (Vijayarani et al., 2015)

For example, English stop words like “of”, “an”, etc., do not give much information about context, sentiment, or relationships between entities. The goal is to isolate the important words of the text.

Figure: Text Mining Pre-Processing Techniques (Vijayarani et al., 2014)

Types of text preprocessing techniques

“Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The aim of the tokenization is the exploration of the words in a sentence. The list of tokens becomes input for further processing such as parsing or text mining.” (Gurusamy and Kannan, 2014)

##Tokenization: split a text into single word terms called "unigrams"
# Boost_tokenizer works on text, so pass it the content of a document
tokens <- Boost_tokenizer(text_corpus[[5]]$content)
head(tokens, 15)

Normalization is a process that includes:

  • converting all letters to lower or upper case
  • removing punctuation, numbers and white spaces
  • removing stop words, sparse terms and particular words

Example in R using the tm package:
#Normalization: lowercase the words and remove punctuation, numbers and extra whitespace
text_corpus_clean <- tm_map(text_corpus, content_transformer(tolower))
text_corpus_clean <- tm_map(text_corpus_clean, removePunctuation)
text_corpus_clean <- tm_map(text_corpus_clean, removeNumbers)
text_corpus_clean <- tm_map(text_corpus_clean, stripWhitespace)

##Remove stopwords and custom stopwords
# build a custom stopword list: standard English stopwords plus extra terms
myStopwords <- c(stopwords('english'), "a", "b")
# keep selected words by dropping them from the stopword list
myStopwords <- setdiff(myStopwords, c("d", "e"))
# remove the stopwords from the corpus
text_corpus_clean <- tm_map(text_corpus_clean, removeWords, myStopwords)

Stemming is the process of reducing inflection in words (e.g. troubled, troubles) to their root form (e.g. trouble). The “root” in this case may not be a real root word, but just a canonical form of the original word.

For example, the words: “presentation”, “presented”, “presenting” could all be reduced to a common representation “present”.

“There are mainly two errors in stemming: over-stemming and under-stemming. Over-stemming is when two words with different stems are stemmed to the same root. This is also known as a false positive. Under-stemming is when two words that should be stemmed to the same root are not. This is also known as a false negative.” (Gurusamy and Kannan, 2014)

#Stemming: stemDocument uses the SnowballC package under the hood
text_corpus_clean <- tm_map(text_corpus_clean, stemDocument, language = "english")
#inspect part of a stemmed document
writeLines(head(strwrap(text_corpus_clean[[2]]), 15))

“Lemmatization on the surface is very similar to stemming, where the goal is to remove inflections and map a word to its root form. The only difference is that lemmatization tries to do it the proper way. It doesn’t just chop things off; it actually transforms words to the actual root. For example, the word ‘better’ would map to ‘good’.” (Kavita Ganesan)
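The tm package itself does not include a lemmatizer. A minimal sketch, assuming the textstem package (my choice, not something used elsewhere in this post), could look like this:

install.packages("textstem")   # if not already installed
library(textstem)

# lemmatize a few standalone words
lemmatize_words(c("presented", "presenting", "better"))
# lemmatize the raw corpus as an alternative to stemming
text_corpus_lemma <- tm_map(text_corpus, content_transformer(lemmatize_strings))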

Text enrichment / augmentation involves augmenting your original text data with information that you did not previously have, such as a part-of-speech tag or a lemma for each token.
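As a rough sketch of one form of enrichment, part-of-speech tagging, the udpipe package (again my choice, not used in the rest of this post) can annotate every token with its lemma and part of speech:

install.packages("udpipe")
library(udpipe)

# download and load a pre-trained English model
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

# annotate the raw documents with tokens, lemmas and part-of-speech tags
annotations <- as.data.frame(udpipe_annotate(ud_model, x = sapply(text_corpus, as.character)))
head(annotations[, c("token", "lemma", "upos")])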

Document-term matrix

A document-term matrix (DTM) is a sparse matrix in which each row is a document vector, with one column for every term in the entire corpus. A term-document matrix (TDM) is simply its transpose, with one row per term and one column per document.

#term-document matrix: one row per term, one column per document
tdm <- TermDocumentMatrix(text_corpus_clean)
#or a document-term matrix, here keeping only terms with at least 4 characters
dtm <- DocumentTermMatrix(text_corpus_clean, control = list(wordLengths = c(4, Inf)))
inspect(tdm)
#inspect part of the term-document matrix
inspect(tdm[1:10, 1:50])
#Frequent terms that occur between 30 and 50 times in the corpus
frequent_terms <- findFreqTerms(tdm, 30, 50)

#Word Frequency
install.packages("knitr")
library(knitr)
library(magrittr)   # for the %>% pipe
# Sum across documents (the rows of the TDM) to get each term's frequency
words_frequency <- rowSums(as.matrix(tdm))
# create sort order (descending)
ord <- order(words_frequency, decreasing = TRUE)

# get the top 20 words by frequency of appearance
words_frequency[head(ord, 20)] %>%
  kable()

If you want a visual representation of the most frequent terms, you can build a word cloud with the wordcloud package.
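For example, a quick sketch that reuses the words_frequency vector computed above (the colour palette is an arbitrary choice):

library(wordcloud)
library(RColorBrewer)

set.seed(123)   # make the layout reproducible
wordcloud(names(words_frequency), words_frequency,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))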

To find associations between terms you can use the findAssocs() function. As input, this function takes the DTM (or TDM), a term, and a correlation limit (between 0 and 1).

findAssocs(dtm, "word", corlimit = 0.80)   # replace "word" with a term of interest

A correlation of 1 means the two terms always appear together; a correlation of 0.5 means, roughly, that they appear together about half the time.

You can also compute dissimilarities between documents based on the DTM by using the package proxy.

install.packages('proxy')
library(proxy)
# recent versions of tm no longer provide dissimilarity(), so use proxy::dist()
# cosine dissimilarity between documents (the rows of the DTM)
dis <- proxy::dist(as.matrix(dtm), method = "cosine")

#visualize the dissimilarity results by printing part of the big matrix
as.matrix(dis)[1:20, 1:20]
#visualize the dissimilarity results as a heatmap
heatmap(as.matrix(dis)[1:20, 1:20])

Extract bi-grams

To extract the frequency of each bigram and analyze the twenty most frequent ones, you can follow the steps below.

Use the n-gram tokenizer from the RWeka package to create a TDM whose terms are the bigrams that appear in the corpus.

library(wordcloud)
library(RColorBrewer)
library(RWeka)

# bigram tokenizer based on RWeka's NGramTokenizer
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# note: custom tokenizers are ignored by the default SimpleCorpus created by Corpus(),
# so build the corpus with VCorpus(VectorSource(data)) if this step has no effect
tdm.bigram <- TermDocumentMatrix(text_corpus_clean, control = list(tokenize = BigramTokenizer))

##Extract the frequency of each bigram and analyse the twenty most frequent ones
freq <- sort(rowSums(as.matrix(tdm.bigram)), decreasing = TRUE)
freq.df <- data.frame(word = names(freq), freq = freq)
head(freq.df, 20)

#visualize the bigrams as a wordcloud
wordcloud(freq.df$word, freq.df$freq, max.words = 100, random.order = FALSE)

#visualize the top 15 bigrams
library(ggplot2)
ggplot(head(freq.df, 15), aes(reorder(word, freq), freq)) +
  geom_bar(stat = "identity") + coord_flip() +
  xlab("Bigrams") + ylab("Frequency") +
  ggtitle("Most frequent bigrams")

