Visualizing product relationships in a market Basket analysis
Last week had been very hectic. I had slogged more than 100 hours to come out with an awesome recommender based on market basket analysis.
“Now was the time to shine!” I thought, just before the meeting with stakeholders was about to start. I had prepared a good presentation and was feeling confident about the work. Thirty minutes into the presentation, I was trying my level best to explain lift, support and confidence in an imaginary 3d plane to the stakeholders.
Guess what – they were not impressed, they found the technique too complex. The meeting ended up with the key stakeholder saying “Can you create something simpler and more intuitive?”
This is when I went back to a drawing board and came out with this technique to visualize and explain market basket analysis in very simple visualization. This was the core thought behind this technique:
Algorithm used in Text mining can be leveraged to create relationship plots in a Market basket analysis.
Market basket is a widely used analytical tool in retail industry. However, retail industry use it extensively, this is no way an indication that the usage is limited to retail industry. Various X-sell strategies in different industries can be made using a market basket analysis. There is a good amount of content available in the web world on the theory behind market basket analysis but I have hardly seen any articles on how to visualize market basket analysis . In this article, I will leverage some algorithm of text mining to get such visual plots.
Some basic Definitions
Support : Support is simply the probability of an event to occur. If we have an event to buy a product A, Support(A) is simple the number of transactions which includes A divided by total number of transactions.
Confidence : Confidence is essentially the conditional probability of an event A happening given that B has happened.
For more detailed definition refer to our last article (last post).
Importing the dataset
The first part of any analysis is to bring in the dataset. I am using a dummy data to demonstrate this application. The data has details of 12k transactions. Each transaction has 3 products. Following is the code to import the transaction data stored in a CSV file.
[stextbox id=”grey”]txn_data<-read.csv("Retail_Data.csv")
summary(txn_data)
transaction_id Prod1 Prod2 Prod3
Min. : 100001 A:2983 E:3962 H:5907
1st Qu.: 103001 B:3024 F:4053 I:6093
Median :106001 C:3047 G:3985
Mean : 106001 D:2946
3rd Qu.: 109000
Max. : 112000
As you can observe, each transaction has all 3 products. Product 1 takes only A,B,C and D. Product 2 takes E,F and G. Product 3 takes H and I. All the three products are mutually exclusive.
Creating an “item-transaction” Matrix
This is a concept, I learned in text mining. But it very well fits into this application as well. We will first create a matrix with flags on each product. In total we have 9 products, hence we generate 9 vectors to capture these flags. Here is the code to generate the 9 vectors and joining them to form item document matrix.
[stextbox id=”grey”]#Initializing vectors A <- numeric(0) B <- numeric(0) C <- numeric(0) D <- numeric(0) E <- numeric(0) F <- numeric(0) G <- numeric(0) H <- numeric(0) I <- numeric(0) #Preparing the flag metrics for ( i in 1:nrow(txn_data)) { if (txn_data$Prod1[i] == "A") A[i] <- 1 else A[i]<-0 if (txn_data$Prod1[i] == "B") B[i] <- 1 else B[i]<-0 if (txn_data$Prod1[i] == "C") C[i] <- 1 else C[i]<-0 if (txn_data$Prod1[i] == "D") D[i] <- 1 else D[i]<-0 if (txn_data$Prod2[i] == "E") E[i] <- 1 else E[i]<-0 if (txn_data$Prod2[i] == "F") F[i] <- 1 else F[i]<-0 if (txn_data$Prod2[i] == "G") G[i] <- 1 else G[i]<-0 if (txn_data$Prod3[i] == "H") H[i] <- 1 else H[i]<-0 if (txn_data$Prod3[i] == "I") I[i] <- 1 else I[i]<-0 } final.mat <- rbind(A,B,C,D,E,F,G,H,I)[/stextbox]
Creating plots using igraph library
Once we have the transactions-item matrix, it is time to create an item-item correlation matrix. I have done this using a simple mathematical formulation. We multiple the transaction-item matrix with its own transpose to get item-item correlation matrix. In this matrix, the number on diagonal gives an indication of Support whereas all other numbers give the confidence. We use both these numbers to build a relationship plot. Following is the code to build the matrix and the plot.
[stextbox id=”grey”]#Creating the relationship matrix termMatrix <- final.mat %*% t(final.mat) #Creating the graphs library(igraph) # build a graph from the above matrix g <- graph.adjacency(termMatrix, weighted=T, mode = "undirected") # remove loops g <- simplify(g) # set labels and degrees of vertices V(g)$label <- V(g)$name V(g)$degree <- degree(g) # set seed to make the layout reproducible set.seed(3952) layout1 <- layout.fruchterman.reingold(g) plot(g, layout=layout1) plot(g, layout=layout.kamada.kawai) tkplot(g, layout=layout.kamada.kawai)[/stextbox]
As of now we have not incorporated the strength of confidence or the support to plot this graph. Something to observe in this plot is that products like A and B are not connected. This is simply because they never co-exist together in any transaction. This plot can be use to visualize the negative lift items. Such items should not be placed near each other. The next step is to incorporate the support of each product in the visual plot.
[stextbox id=”grey”]V(g)$label.cex <- 2.2 * V(g)$degree / max(V(g)$degree)+ .2 V(g)$label.color <- rgb(0, 0, .2, .8) V(g)$frame.color <- NA egam <- (log(E(g)$weight)+0.2) / max(log(E(g)$weight)+0.2)[/stextbox]
Here, we have incorporated the support of each product. As you can see H and I form the biggest letters and A,B,C and D the smallest. This is an indication of higher and lower support. You can validate these inferences from the initial frequency distribution. The next step is to incorporate the confidence as well in the relationship line width.
[stextbox id=”grey”]E(g)$color <- rgb(.5, .5, 0, egam) E(g)$width <- egam # plot the graph in layout1 plot(g, layout=layout1) tkplot(g, layout=layout.kamada.kawai)[/stextbox]
The final plot makes the entire story clear. We have already seen that H and I have the highest support. Now it is also clear that E-I , I-F and H-F have a high confidence as well. Hence, if a customer buys a product F there is a high propensity that he will also buy product H and I. Hence, following are the rules which we can infer from this analysis :
1. If a customer buys E, he has a high propensity to also buy I.
2.If a customer buys F, he has a high propensity to also buy I.
3. If a customer buys F, he has a high propensity to also buy H.
4. If a customer buys I, there is very small that he will also buy H.
The arrangement of items should flow from these rules in order to maximize the sales.
End Notes
Graphical representation of market basket analysis makes the interpretation of the entire puzzle of “probabilities/conditional probability/lift above random events” much simpler than a tabular format. This simplification can be more appreciated when we have a large number of transactions and product list. In case of large lists, we can simply find out using the dimension of product sign and width of the line connecting them to infer out simple rules which otherwise were buried in a matrix of complex probabilities.
Have you ever visualized relationships in a market basket analysis? If you did, what algorithm did you use? Did you find the article useful? Did this article solve any of your existing dilemma?
9 thoughts on "Visualizing product relationships in a market Basket analysis"
Apurv Rohiwal says: August 08, 2014 at 6:58 am
Which tool did you use to write that code ?Clarence Wong says: August 08, 2014 at 10:22 am
Hi Tavish, Excellent work. You could also try the arulesViz package which provides network visualizations Example subrules2 <- head(sort(rules, by="lift"), 10) plot(subrules2, method="graph",control=list(type="items")) there are more control parameters which you can read about and apply if required.Tavish Srivastava says: August 08, 2014 at 5:55 pm
I used R for this application.Tavish Srivastava says: August 08, 2014 at 5:56 pm
Thanks Clarence for sharing the additional code.Maurice Nsabimana says: August 09, 2014 at 11:54 am
Well done, thank you for sharing! Two minor corrections may be in order: -Refer to nine (9) products/vectors rather than eight (8)? -Rule four above is about I -> E, not I -> I? Thanks!Tavish Srivastava says: August 09, 2014 at 12:12 pm
Thanks Maurice, You are right. Both the changes have been made.Gurdal Ertek says: August 12, 2014 at 8:02 pm
In 2006, we had developed a graph-based visualization scheme for visualizing item associations. I suggest that you take a look at this visualization, which was implemented in an R package: http://ertekprojects.com/gurdal-ertek-publications/blog/2014/07/12/a-framework-for-visualizing-association-mining-results/Tavish Srivastava says: August 12, 2014 at 8:18 pm
Gurdal, Thanks for the link. This was really helpful. TavishVaishali says: January 03, 2015 at 12:02 pm
Hi Tavish, Thanks for sharing your blog. This is really informative. Market basket analysis relies on techniques like co-occurrence tables and Apriori algorithms for identifying patterns and determining statistically significant associations. Using market basket analysis, a retailer could discover any number of non-intuitive patterns in the data. It might learn, for example, that if a customer buys eggs, he'll also buy milk, that the correlation between Xbox purchases and Netflix subscriptions is high, or that the probability that a customer will upgrade to a smartphone after having bought a bank product is 0.7. Market basket analysis can provide an understanding of sales patterns, help a company manage inventory, or plan promotion and marketing spend on specific product lines. Retail giants like Target, as discussed in this Retail TouchPoints article, and Wal-Mart, are known to be leveraging market basket analysis to sell desired products lines to specific audience segments, thus enhancing the shopping experience for customers.