Association Rules Example with R

For this example we used data from the UCI Machine Learning Repository. The data contain transactions of a UK-based online retailer that were made between 01/12/2010 and 09/12/2011. More details can be found here.
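The snippets below assume the data set has already been loaded into a data frame called retail. A minimal sketch of that loading step, assuming the file was downloaded from the UCI repository as Online Retail.xlsx (the file name is an assumption):

```r
library(readxl)                               # read_excel() for .xlsx files

# Load the UCI Online Retail data (file name is an assumption)
retail <- read_excel("Online Retail.xlsx")
retail <- retail[complete.cases(retail), ]    # drop rows with missing values
retail$Description <- as.factor(retail$Description)
retail$Country <- as.factor(retail$Country)
attach(retail)   # later snippets reference columns such as InvoiceNo directly
```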

Attribute Information:

InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
UnitPrice: Unit price. Numeric, Product price per unit in sterling.
CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.

Before searching for rules, we did a descriptive analysis to understand the data under study.

We have 541,909 items sold, an average of 21 items per transaction (25,900 different invoices from 4,374 customers). The largest transaction contained 1,114 items and the smallest just 1; 50% of the transactions had fewer than 10 items.

# Number of items per invoice: summary statistics and a boxplot
itemsPerInvoice <- as.numeric(table(InvoiceNo))
summary(itemsPerInvoice)
boxplot(itemsPerInvoice)

The UK accounted for 91% of the items purchased, followed by Germany (2%).

[Figure: top 10 countries.png — top 10 countries by number of transactions]

countryT <- table(Country)
countryT <- countryT[order(countryT, decreasing = TRUE)]
barplot(countryT[1:10], main = "Countries Purchase Frequency",
  xlab = "Country", ylab = "Number of transactions", horiz = FALSE,
  font.axis = 1, col.axis = "blue", las = 1, col = terrain.colors(10),
  cex.lab = 1, cex.axis = 1, cex.names = 0.75)
countryT[1] / length(Country)   # share of the top country (UK)

In order to analyse the days of the week and the hours with the most and fewest transactions we used the package lubridate. The day of the week with the most transactions was Sunday, followed by Wednesday. The busiest period of the day was between 12h and 14h.

[Figure: weekday.png — purchases by weekday]

library(ggplot2)

retail$Date <- as.Date(retail$InvoiceDate)
week_day <- weekdays(retail$Date)
tw <- table(week_day)[order(table(week_day), decreasing = TRUE)]
tw <- as.data.frame(tw)
bp <- ggplot(tw, aes(x = "", y = Freq, fill = week_day)) +
  ggtitle("Purchases by weekday") +
  geom_bar(width = 2, stat = "identity") +
  geom_text(aes(label = Freq), size = 5, vjust = 3, hjust = 0.5, position = "stack") +
  scale_fill_brewer(direction = -1)
bp

[Figure: on-line purchases by hour]

library(dplyr)   # provides the %>% pipe
library(tidyr)   # separate()

retail2 <- retail %>% separate(InvoiceDate, c("date", "time"), sep = " ", convert = FALSE)
retail$hourT <- as.numeric(substr(retail2$time, 1, nchar(retail2$time) - 3))
hp <- qplot(retail$hourT, geom = "histogram", binwidth = 1,
  main = "Histogram for on-line purchases by hour",
  xlab = "hour", fill = ..count.., xlim = c(5, 23))
hp + scale_fill_gradient(low = "blue", high = "red")

On Sunday the number of transactions stays high until 17h. At 14h there is a sharp decrease that recovers at 15h.

[Figure: hourSunday.png — Sunday purchases by hour]
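The Sunday plot can be reproduced with a short sketch, assuming the Date and hourT columns created in the previous snippets:

```r
# Hourly distribution restricted to Sunday transactions
sunday <- retail[weekdays(retail$Date) == "Sunday", ]
hist(sunday$hourT, breaks = seq(0, 24, by = 1),
     main = "Sunday purchases by hour", xlab = "hour")
```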

What were the best-selling products?

[Figure: top15.png — top 15 most frequent products]

# top15: the 15 most frequent product descriptions
top15 <- sort(table(Description), decreasing = TRUE)[1:15]
par(mar = c(3, 15, 3, 5), mgp = c(14, 1, 0))
mp <- barplot(top15, main = "Top 15 most frequent products", ylab = "Products",
  xlab = "Number of transactions", horiz = TRUE, font.axis = 1, col.axis = "black",
  las = 1, col = topo.colors(15), cex.lab = 1, cex.axis = 1, cex.names = 0.75)

Transforming the data from data frame format into transaction format

We followed the example presented by Susan Li that you can find here.

library(plyr)
# One row per invoice, with the items collapsed into a comma-separated basket
itemList2 <- ddply(retail, c("InvoiceNo"),
  function(df1) paste(df1$Description, collapse = ","))
colnames(itemList2) <- c("InvoiceNo", "items")   # ddply names the result column V1
itemList2$InvoiceNo <- NULL
# row.names = FALSE: otherwise the row numbers would be read back as items
write.csv(itemList2, "market_basket2.csv", quote = FALSE, row.names = FALSE)
tr <- read.transactions('market_basket2.csv', format = 'basket', sep = ',',
                        skip = 1)   # skip the CSV header line
tr
summary(tr)

With the transactions file loaded, the summary shows that we have 25,901 transactions and 60,778 different items. The density of the sparse matrix (percentage of non-empty cells) is 0.03%.

[Figure: summary of the transactions file]
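A quick sanity check on the resulting transactions object (a sketch):

```r
inspect(tr[1:3])   # show the first three baskets
# Most frequent items, as relative support
head(sort(itemFrequency(tr), decreasing = TRUE))
```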

Searching for rules with arules:

  • arules: arules base package with data structures, mining algorithms (APRIORI and ECLAT), interest measures.
  • arulesViz: Visualization of association rules.
  • arulesCBA: Classification algorithms based on association rules (includes CBA).
  • arulesSequences: Mining frequent sequences (cSPADE)
[Figure: Apriori-Algorithm.jpg — general process of the Apriori algorithm]
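As a minimal self-contained illustration of that process, apriori can be run on a tiny hand-built basket list (the toy transactions below are invented for demonstration):

```r
library(arules)

# Toy baskets, invented for illustration
toy <- list(c("bread", "butter", "milk"),
            c("bread", "butter"),
            c("milk", "tea"),
            c("bread", "butter", "tea"))
toy_tr <- as(toy, "transactions")

# Frequent itemsets must appear in at least half the baskets
toy_rules <- apriori(toy_tr, parameter = list(supp = 0.5, conf = 0.8))
inspect(toy_rules)   # e.g. {butter} => {bread} with confidence 1
```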

itemFrequencyPlot(tr, topN = 20, type = 'absolute')
[Figure: absolute item frequency plot, top 20]
For a minimum support of 0.003 and confidence of 0.8, the rule-length distribution shows that rules of 4 items are the most common. We ran some experiments varying the support and confidence values.
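The effect of those parameter choices can be explored systematically, for example (a sketch, assuming the tr transactions object built above; the support grid is an assumption):

```r
# Count how many rules survive each support threshold at fixed confidence 0.8
for (s in c(0.003, 0.006, 0.009)) {
  r <- apriori(tr, parameter = list(supp = s, conf = 0.8),
               control = list(verbose = FALSE))
  cat("support =", s, "->", length(r), "rules\n")
}
```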

rules <- apriori(tr, parameter = list(supp = 0.003, conf = 0.8))
rules <- sort(rules, by = 'confidence', decreasing = TRUE)
summary(rules)
inspect(rules[1:10])      # the ten highest-confidence rules
topRules <- rules[1:10]   # subset used in the plots below
plot(topRules)
plot(topRules, method = "graph")

[Figure: arules3.png]

[Figure: graph of the top rules]

For a minimum support of 0.009 and confidence of 0.8 we obtained 20 rules; the rule-length distribution shows that rules of 2 items are the most common.
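A sketch of that stricter run:

```r
rules2 <- apriori(tr, parameter = list(supp = 0.009, conf = 0.8))
summary(rules2)                   # reports 20 rules, mostly of length 2
plot(rules2, method = "graph")    # graph visualization of the rule set
```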

[Figure: graph of the rules, visualized using the arulesViz R library]

"The useful rule contains high-quality, actionable information" in Michael J. A. Berry, Gordon S. Linoff, Data Mining Techniques For Marketing, Sales, and Customer Relationship Management

We selected a subset of rules containing the frequent item "PARTY BUNTING".

  • Clients who bought "PARTY BUNTING" also bought "DOORMAT FRIENDSHIP" and "RED RETROSPOT PEG BAG".

[Figure: subset.png — rules containing "PARTY BUNTING"]
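That subset can be extracted with arules' subset() (a sketch, assuming the rules object mined above):

```r
# Rules whose antecedent contains the frequent item "PARTY BUNTING"
bunting <- subset(rules, lhs %in% "PARTY BUNTING")
inspect(bunting)
```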

Using association rules insights for Marketing decision making

The results can be used to drive targeted marketing campaigns. For each user, we pick a handful of products, based on the products they have bought to date, which have both a high uplift and a high margin, and send them e.g. a personalized email or display ads.
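As a sketch of how such a selection could start from the mined rules (lift is used here as a stand-in for uplift, and the thresholds are assumptions):

```r
# Keep only strong, potentially actionable rules
actionable <- subset(rules, lift > 3 & confidence > 0.9)
inspect(head(sort(actionable, by = "lift"), 5))
```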

"How we use the analysis has significant implications for the analysis itself: if we are feeding the analysis into a machine-driven process for delivering recommendations, we are much more interested in generating an expansive set of rules. If, however, we are experimenting with targeted marketing for the first time, it makes much more sense to pick a handful of particularly high value rules, and action just them, before working out whether to invest in the effort of building out that capability to manage a much wider and more complicated rule set." in Market Basket Analysis: identifying products and content that go well together

“The rules so generated fall into three categories. Useful rules explain a relationship that was perhaps unexpected. Trivial rules explain relationships that are known (or should be known) to exist. And inexplicable rules simply do not make sense. Inexplicable rules often have weak support.” in Michael J. A. Berry, Gordon S. Linoff-Data Mining Techniques For Marketing, Sales, and Customer Relationship Management

