For this example we used data from the UCI Machine Learning Repository. The data contains transactions of a UK-based online retailer that were made between 01/12/2010 and 09/12/2011. More details can be found here.
Attribute Information:
- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.
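As a starting point, here is a minimal sketch of loading the data into R, assuming the UCI file was downloaded as "Online Retail.xlsx" into the working directory (the file name and clean-up steps are assumptions):

library(readxl)

# Read the spreadsheet and do a light clean-up
retail <- read_excel("Online Retail.xlsx")          # assumed file name
retail <- retail[complete.cases(retail), ]          # drop records with missing values
retail$Description <- as.factor(retail$Description)
retail$Country <- as.factor(retail$Country)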
Before starting to search for rules we did a descriptive analysis to understand the data under study.
We have 541,909 items sold, an average of 21 items per transaction (25,900 different invoices from 4,374 customers). The maximum number of items in a single transaction was 1,114 and the minimum was 1. 50% of the transactions had fewer than 10 items.
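These figures can be obtained with a few base R calls on the retail data frame (a sketch using the column names listed above):

nrow(retail)                       # total number of item records sold
length(unique(retail$InvoiceNo))   # number of different invoices
length(unique(retail$CustomerID))  # number of different customers

# Items per invoice: minimum, median, mean and maximum
items_per_invoice <- table(retail$InvoiceNo)
summary(as.numeric(items_per_invoice))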
The country with 91% of the items purchased was the UK, followed by Germany (2%).
countryT <- table(retail$Country)
countryT <- countryT[order(countryT, decreasing = TRUE)]

# Barplot of the 10 countries with the most purchases
barplot(countryT[1:10], main = "Countries Purchase Frequency",
        xlab = "Country", ylab = "Number of transactions", horiz = FALSE,
        font.axis = 1, col.axis = "blue", las = 1, col = terrain.colors(10),
        cex.lab = 1, cex.axis = 1, cex.names = 0.75)

# Proportion of purchased items per country
countryT / length(retail$Country)
To analyse which days of the week and which hours have more and fewer transactions we used the lubridate package. The day of the week with the most transactions was Sunday, followed by Wednesday. The period of the day with the most transactions was between 12:00 and 14:00.
library(ggplot2)

# Extract the date part of the invoice timestamp and the weekday name
retail$Date <- as.Date(retail$InvoiceDate)
week_day <- weekdays(retail$Date)

# Frequency table of purchases by weekday, in decreasing order
tw <- table(week_day)[order(table(week_day), decreasing = TRUE)]
tw <- as.data.frame(tw)

# Stacked bar chart of the number of purchases by weekday
bp <- ggplot(tw, aes(x = "", y = Freq, fill = week_day)) +
  ggtitle("Purchases by weekday") +
  geom_bar(width = 2, stat = "identity") +
  geom_text(aes(label = Freq), size = 5, vjust = 3, hjust = 0.5, position = "stack") +
  scale_fill_brewer(direction = -1)
bp
library(tidyr)

# Split InvoiceDate into separate date and time columns
retail2 <- retail %>%
  separate(InvoiceDate, c("date", "time"), sep = " ", convert = FALSE)

# Keep only the hour part of the time string (drop the last three characters)
retail$hourT <- as.numeric(substr(retail2$time, 1, nchar(retail2$time) - 3))

# Histogram of on-line purchases by hour of the day
hp <- qplot(retail$hourT, geom = "histogram", binwidth = 1,
            main = "Histogram for on-line purchases by hour",
            xlab = "hour", fill = ..count.., xlim = c(5, 23))
hp + scale_fill_gradient(low = "blue", high = "red")
On Sunday the number of transactions stays high until 17:00. At 14:00 there is a sharp decrease that recovers at 15:00.
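One way to check this kind of pattern is to break the hourly histogram down by weekday; a sketch using the Date and hourT variables created above (the faceted plot is an assumption, not one of the original figures):

library(ggplot2)

# Hourly distribution of purchases, one panel per weekday
retail$week_day <- weekdays(retail$Date)
ggplot(retail, aes(x = hourT)) +
  geom_histogram(binwidth = 1, fill = "steelblue") +
  facet_wrap(~ week_day) +
  labs(title = "Purchases by hour and weekday", x = "hour", y = "count")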
What are the most sold products?
# The 15 most frequently purchased products
top15 <- sort(table(retail$Description), decreasing = TRUE)[1:15]

par(mar = c(3, 15, 3, 5), mgp = c(14, 1, 0))
mp <- barplot(top15, main = "Top 15 most frequent products",
              ylab = "Products", xlab = "Number of transactions", horiz = TRUE,
              font.axis = 1, col.axis = "black", las = 1, col = topo.colors(15),
              cex.lab = 1, cex.axis = 1, cex.names = 0.75)
Transform the data from the data frame format into transactions format
We followed the example presented by Susan Li that you can find here.
library(plyr)
library(arules)

# Collapse all item descriptions of each invoice into one comma-separated basket
itemList2 <- ddply(retail, c("InvoiceNo"),
                   function(df1) paste(df1$Description, collapse = ","))
itemList2$InvoiceNo <- NULL
colnames(itemList2) <- c("items")

# Write the baskets to a file and read them back as a transactions object
write.csv(itemList2, "market_basket2.csv", quote = FALSE, row.names = TRUE)
tr <- read.transactions('market_basket2.csv', format = 'basket', sep = ',')
tr
summary(tr)
After creating the file with all the transactions, the summary shows that we have 25,901 transactions and 60,778 different items. The percentage of non-empty cells in the sparse matrix is 0.03%.
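As a side note, a similar transactions object can also be built directly in memory with arules, without the intermediate CSV file (a sketch of an alternative; the resulting item counts may differ slightly from the ones reported above):

# Split item descriptions by invoice and coerce the list to transactions
tr2 <- as(split(as.character(retail$Description), retail$InvoiceNo), "transactions")
summary(tr2)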
Searching for rules with arules:
- arules: arules base package with data structures, mining algorithms (APRIORI and ECLAT), interest measures.
- arulesViz: Visualization of association rules.
- arulesCBA: Classification algorithms based on association rules (includes CBA).
- arulesSequences: Mining frequent sequences (cSPADE).
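For the analysis below only the first two packages are needed. A minimal sketch of installing and loading them:

# Install once if needed, then load the rule mining and visualization packages
# install.packages(c("arules", "arulesViz"))
library(arules)
library(arulesViz)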
itemFrequencyPlot(tr, topN = 20, type = 'absolute')
For a minimum support of 0.003 and confidence of 0.8, the rule length distribution shows that rules with 4 items are the most common. We did some experiments varying the support and confidence values.
# Mine association rules with the Apriori algorithm
rules <- apriori(tr, parameter = list(supp = 0.003, conf = 0.8))
rules <- sort(rules, by = 'confidence', decreasing = TRUE)
summary(rules)

# Inspect the rules with the highest confidence
inspect(rules[1:10])
inspect(sort(rules, by = "confidence")[1:3])

# Plot the 10 strongest rules (scatter plot and graph, from arulesViz)
topRules <- rules[1:10]
plot(topRules)
plot(topRules, method = "graph")
For a minimum support of 0.009 and confidence of 0.8 we obtained 20 rules; the rule length distribution shows that rules with 2 items are the most common.
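A sketch of the kind of experiment referred to above, counting how many rules each support/confidence combination yields (the grid of values is illustrative):

# Count the rules obtained for a few support/confidence combinations
for (s in c(0.003, 0.006, 0.009)) {
  for (cf in c(0.6, 0.8)) {
    r <- apriori(tr, parameter = list(supp = s, conf = cf),
                 control = list(verbose = FALSE))
    cat("support =", s, "confidence =", cf, "->", length(r), "rules\n")
  }
}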
“The useful rule contains high-quality, actionable information” in Michael J. A. Berry and Gordon S. Linoff, Data Mining Techniques For Marketing, Sales, and Customer Relationship Management
We selected a subset of rules involving the frequent item “PARTY BUNTING”.
- Clients who bought “PARTY BUNTING” also bought “DOORMAT FRIENDSHIP” and “RED RETROSPOT PEG BAG”.
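A sketch of how such a subset can be extracted with arules (the %in% operator selects rules whose left-hand side contains the item):

# Rules whose antecedent (lhs) contains "PARTY BUNTING"
buntingRules <- subset(rules, subset = lhs %in% "PARTY BUNTING")
inspect(sort(buntingRules, by = "confidence"))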
Using association rule insights for marketing decision making
“The results can be used to drive targeted marketing campaigns. For each user, we pick a handful of products based on products they have bought to date which have both a high uplift and a high margin, and send them a e.g. personalized email or display ads etc.
How we use the analysis has significant implications for the analysis itself: if we are feeding the analysis into a machine-driven process for delivering recommendations, we are much more interested in generating an expansive set of rules. If, however, we are experimenting with targeted marketing for the first time, it makes much more sense to pick a handful of particularly high value rules, and action just them, before working out whether to invest in the effort of building out that capability to manage a much wider and more complicated rule set.” in Market Basket Analysis: identifying products and content that go well together
“The rules so generated fall into three categories. Useful rules explain a relationship that was perhaps unexpected. Trivial rules explain relationships that are known (or should be known) to exist. And inexplicable rules simply do not make sense. Inexplicable rules often have weak support.” in Michael J. A. Berry and Gordon S. Linoff, Data Mining Techniques For Marketing, Sales, and Customer Relationship Management
- Association Rules as a Decision Making Model in the Textile Industry
- Creating a Decision-Making Model Using Association Rules
- arules: Association Rule Mining with R — A Tutorial
- A Gentle Introduction on Market Basket Analysis — Association Rules
- Market Basket Analysis: understanding Customer Behaviour
- Association Mining (Market Basket Analysis)
- Example in Python: http://adataanalyst.com/machine-learning/apriori-algorithm-python-3-0/