# Visualizing product relationships in a market basket analysis

Last week was very hectic. I had slogged for more than 100 hours to come up with an awesome recommender based on market basket analysis.

“Now is the time to shine!” I thought, just before the meeting with the stakeholders was about to start. I had prepared a good presentation and was feeling confident about the work. Thirty minutes into the presentation, I was trying my level best to explain lift, support and confidence in an imaginary 3D plane to the stakeholders.

Guess what: they were not impressed. They found the technique too complex. The meeting ended with the key stakeholder asking, “Can you create something simpler and more intuitive?” That is when I went back to the drawing board and came up with this technique to visualize and explain market basket analysis in a very simple visualization. This was the core thought behind this technique:

##### Algorithms used in text mining can be leveraged to create relationship plots in a market basket analysis.

Market basket analysis is a widely used analytical tool in the retail industry. However, its heavy use in retail in no way means that it is limited to retail: cross-sell (X-sell) strategies in many industries can be built on a market basket analysis. There is a good amount of content on the web about the theory behind market basket analysis, but I have hardly seen any articles on how to visualize it. In this article, I will borrow an algorithm from text mining to produce such visual plots.

#### Some basic definitions

Support: Support is simply the probability of an event occurring. If the event is buying product A, Support(A) is simply the number of transactions that include A divided by the total number of transactions.

Confidence: Confidence is essentially the conditional probability of an event A happening given that B has happened.

For more detailed definitions, refer to our last article.
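The two definitions can be made concrete with a few lines of R. This is only a minimal sketch on a made-up five-row data frame (`txn` and its values are hypothetical, not the article's dataset); it computes Support(A) and the confidence of the rule "A then E" as a conditional probability.

```r
# Toy data: five transactions, one per row (hypothetical values)
txn <- data.frame(
  Prod1 = c("A", "A", "B", "A", "B"),
  Prod2 = c("E", "F", "E", "E", "F")
)

n <- nrow(txn)

# Support(A): share of transactions that contain A
support_A <- sum(txn$Prod1 == "A") / n

# Support(A and E): share of transactions that contain both A and E
support_AE <- sum(txn$Prod1 == "A" & txn$Prod2 == "E") / n

# Confidence(A -> E) = P(E | A) = Support(A and E) / Support(A)
confidence_A_E <- support_AE / support_A

print(c(support_A = support_A,
        support_AE = support_AE,
        confidence_A_E = confidence_A_E))
```

Note that confidence is directional: the confidence of "E then A" would divide by Support(E) instead and will generally differ.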

#### Importing the dataset

The first part of any analysis is to bring in the dataset. I am using dummy data to demonstrate this application. The data has details of 12k transactions, and each transaction has 3 products. The following code imports the transaction data stored in a CSV file.

`txn_data<-read.csv("Retail_Data.csv")`
`summary(txn_data)`

| transaction_id | Prod1 | Prod2 | Prod3 |
|---|---|---|---|
| Min.: 100001 | A: 2983 | E: 3962 | H: 5907 |
| 1st Qu.: 103001 | B: 3024 | F: 4053 | I: 6093 |
| Median: 106001 | C: 3047 | G: 3985 | |
| Mean: 106001 | D: 2946 | | |
| 3rd Qu.: 109000 | | | |
| Max.: 112000 | | | |

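As you can observe, each transaction has all 3 products. Product 1 takes only the values A, B, C and D. Product 2 takes E, F and G. Product 3 takes H and I. The values within each product column are mutually exclusive, so a transaction holds exactly one item from each of the three groups.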

#### Creating an “item-transaction” Matrix

This is a concept I learned in text mining, but it fits this application very well. We first create a matrix with a flag for each product. In total we have 9 products, hence we generate 9 vectors to capture these flags. Here is the code to generate the 9 vectors and join them to form the item-transaction matrix (the counterpart of a term-document matrix in text mining).

```
# Initializing vectors
# (the name F masks R's shorthand for FALSE, so spell out TRUE/FALSE later on)
A <- numeric(0)
B <- numeric(0)
C <- numeric(0)
D <- numeric(0)
E <- numeric(0)
F <- numeric(0)
G <- numeric(0)
H <- numeric(0)
I <- numeric(0)
# Preparing the flag vectors: 1 if the transaction contains the item, else 0
for (i in 1:nrow(txn_data))
{
  if (txn_data$Prod1[i] == "A") A[i] <- 1 else A[i] <- 0
  if (txn_data$Prod1[i] == "B") B[i] <- 1 else B[i] <- 0
  if (txn_data$Prod1[i] == "C") C[i] <- 1 else C[i] <- 0
  if (txn_data$Prod1[i] == "D") D[i] <- 1 else D[i] <- 0
  if (txn_data$Prod2[i] == "E") E[i] <- 1 else E[i] <- 0
  if (txn_data$Prod2[i] == "F") F[i] <- 1 else F[i] <- 0
  if (txn_data$Prod2[i] == "G") G[i] <- 1 else G[i] <- 0
  if (txn_data$Prod3[i] == "H") H[i] <- 1 else H[i] <- 0
  if (txn_data$Prod3[i] == "I") I[i] <- 1 else I[i] <- 0
}
# Stack the 9 flag vectors: one row per item, one column per transaction
final.mat <- rbind(A,B,C,D,E,F,G,H,I)
```
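The loop above works, but growing nine vectors one element at a time gets slow on large data. One possible alternative, shown here only as a sketch, builds the same kind of matrix with vectorized comparisons. The small `txn_data` below is a made-up stand-in with the same three columns, and `item_column` is a helper name introduced purely for illustration.

```r
# Hypothetical stand-in for txn_data with the same three product columns
txn_data <- data.frame(
  Prod1 = c("A", "B", "C", "D", "A"),
  Prod2 = c("E", "F", "G", "E", "F"),
  Prod3 = c("H", "I", "H", "I", "I"),
  stringsAsFactors = FALSE
)

# Map each item to the column in which it can appear
item_column <- c(A = "Prod1", B = "Prod1", C = "Prod1", D = "Prod1",
                 E = "Prod2", F = "Prod2", G = "Prod2",
                 H = "Prod3", I = "Prod3")

# For each item, compare its whole column at once; sapply() returns one
# column per item, so transpose to get one row per item
final.mat <- t(sapply(names(item_column), function(item) {
  as.numeric(txn_data[[item_column[[item]]]] == item)
}))

print(final.mat)
```

The result has one named row per item and one column per transaction, which is the shape the matrix multiplication in the next section expects.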

#### Creating plots using the igraph library

Once we have the item-transaction matrix, it is time to create an item-item relationship matrix. I have done this using a simple mathematical formulation: we multiply the item-transaction matrix by its own transpose to get an item-item co-occurrence matrix. In this matrix, each number on the diagonal is the count of transactions containing that item, which indicates its support, whereas every other number is the count of transactions containing both items, which indicates confidence once it is compared with the diagonal. We use both these numbers to build a relationship plot. Following is the code to build the matrix and the plot.

```
# Creating the relationship matrix
termMatrix <- final.mat %*% t(final.mat)
# Creating the graphs
library(igraph)
# build a graph from the above matrix
# (graph_from_adjacency_matrix is the current name for graph.adjacency)
g <- graph_from_adjacency_matrix(termMatrix, weighted = TRUE, mode = "undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# set seed to make the layout reproducible
set.seed(3952)
# (layout_with_fr is the current name for layout.fruchterman.reingold)
layout1 <- layout_with_fr(g)
plot(g, layout=layout1)
```

As of now we have not incorporated the strength of the confidence or the support into this graph. Something to observe in this plot is that products like A and B are not connected. This is simply because they never co-exist in any transaction. This plot can be used to visualize negative-lift items. Such items should not be placed near each other. The next step is to incorporate the support of each product in the visual plot.
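To see why items such as A and B end up unconnected, it can help to look at the numbers behind the graph. The sketch below uses a tiny made-up item-transaction matrix (`item_txn` is hypothetical, not the article's `final.mat`) and applies the same matrix product. It then derives support from the diagonal and lift from the standard formula Lift(X, Y) = P(X and Y) / (P(X) P(Y)).

```r
# Hypothetical item-transaction matrix: 3 items (rows) x 5 transactions (columns).
# A and B never share a transaction, like two items from the same product column.
item_txn <- rbind(
  A = c(1, 1, 0, 0, 0),
  B = c(0, 0, 1, 1, 1),
  H = c(1, 0, 1, 1, 0)
)
n_txn <- ncol(item_txn)

# Same product as in the article: item-item co-occurrence counts
co_occurrence <- item_txn %*% t(item_txn)

# Diagonal = number of transactions per item; dividing by n gives support
item_counts <- diag(co_occurrence)
support <- item_counts / n_txn

# Lift(X, Y) = n * n_XY / (n_X * n_Y)
lift <- n_txn * co_occurrence / outer(item_counts, item_counts)
diag(lift) <- NA  # lift of an item with itself is not of interest

# Pairs with a zero count never co-occur, so they get no edge in the graph
never_together <- which(co_occurrence == 0 & upper.tri(co_occurrence),
                        arr.ind = TRUE)

print(co_occurrence)
print(round(lift, 2))
print(never_together)
```

A lift below 1 means the pair is bought together less often than independence would suggest; a lift of exactly 0, as for A and B here, means the pair never occurs together at all.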

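```
# scale each label by the vertex degree (number of connected items)
V(g)$label.cex <- 2.2 * V(g)$degree / max(V(g)$degree) + .2
V(g)$label.color <- rgb(0, 0, .2, .8)
V(g)$frame.color <- NA
# rescale the log edge weights so the largest equals 1
egam <- (log(E(g)$weight)+0.2) / max(log(E(g)$weight)+0.2)
```

Here, we have incorporated the support of each product through the size of its label. The code scales each label by the vertex's degree (its number of connections), which in this dataset orders the product groups the same way as their support. As you can see, H and I form the biggest letters and A, B, C and D the smallest. This is an indication of higher and lower support. You can validate these inferences from the initial frequency distribution. The next step is to incorporate the confidence as well, through the width of the relationship lines.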
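The snippet above sizes the labels by vertex degree. If you would rather tie label size to support directly, one option is to scale the item counts instead. This is only a sketch: the counts are copied from the `summary(txn_data)` output earlier in the article, and `label_cex` is a name introduced here for illustration.

```r
# Item counts taken from the summary output (these equal diag(termMatrix))
item_counts <- c(A = 2983, B = 3024, C = 3047, D = 2946,
                 E = 3962, F = 4053, G = 3985,
                 H = 5907, I = 6093)

# Same scaling as the degree-based formula, applied to counts instead
label_cex <- 2.2 * item_counts / max(item_counts) + 0.2
print(round(label_cex, 2))

# With the graph g built earlier, the sizes could then be applied as:
# V(g)$label.cex <- label_cex[V(g)$name]
```

With this scaling, the differences within a group (for example between D and C) also show up in the label sizes, which the degree-based version cannot do because all items in a group share the same degree.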

```
# egam comes from the previous snippet; use it for edge transparency and width
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
# plot the graph in layout1
plot(g, layout=layout1)
```

The final plot makes the entire story clear. We have already seen that H and I have the highest support. Now it is also clear that the E-I, I-F and H-F pairs have a high confidence as well. Hence, if a customer buys product F, there is a high propensity that he will also buy product H or I. The following are the rules we can infer from this analysis:

1. If a customer buys E, he has a high propensity to also buy I.

2. If a customer buys F, he has a high propensity to also buy I.

3. If a customer buys F, he has a high propensity to also buy H.

4. If a customer buys I, there is a very small chance that he will also buy H.

The arrangement of items should flow from these rules in order to maximize the sales.
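The rules above are read off the plot by eye. As a rough cross-check, the same kind of co-occurrence counts can be turned into one confidence value per directed pair and sorted. The sketch below uses made-up counts for three items (not the article's data); dividing each row by its diagonal entry gives the confidence of "row item then column item".

```r
# Hypothetical co-occurrence counts; the diagonal holds the item counts
co_occurrence <- matrix(
  c(30, 15, 20,
    15, 40, 30,
    20, 30, 60),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "E", "H"), c("A", "E", "H"))
)

# Row i is divided by the count of item i: confidence(i -> j) = n_ij / n_i
confidence <- co_occurrence / diag(co_occurrence)

# Long format: one row per candidate rule "antecedent -> consequent"
rules <- as.data.frame(as.table(confidence), stringsAsFactors = FALSE)
names(rules) <- c("antecedent", "consequent", "confidence")

# Drop rules from an item to itself, then sort from strongest to weakest
rules <- rules[rules$antecedent != rules$consequent, ]
rules <- rules[order(-rules$confidence), ]

print(rules)
```

A table like this complements the plot: the graph shows which pairs matter at a glance, and the sorted list gives the direction and size of each rule.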

#### End Notes

Graphical representation of market basket analysis makes the interpretation of the entire puzzle of “probabilities/conditional probability/lift above random events” much simpler than a tabular format. This simplification is appreciated even more when we have a large number of transactions and a long product list. With large lists, we can simply use the size of each product label and the width of the line connecting two products to infer simple rules that were otherwise buried in a matrix of complex probabilities.

Have you ever visualized relationships in a market basket analysis? If you did, what algorithm did you use? Did you find the article useful? Did this article solve any of your existing dilemmas?

• Apurv Rohiwal says:

Which tool did you use to write that code ?

• Tavish Srivastava says:

I used R for this application.

• Clarence Wong says:

Hi Tavish,

Excellent work.

You could also try the arulesViz package which provides network visualizations

Example

plot(subrules2, method="graph",control=list(type="items"))

there are more control parameters which you can read about and apply if required.

• Tavish Srivastava says:

Thanks Clarence for sharing the additional code.

• Maurice Nsabimana says:

Well done, thank you for sharing!

Two minor corrections may be in order:
-Refer to nine (9) products/vectors rather than eight (8)?
-Rule four above is about I -> E, not I -> I?

Thanks!

• Tavish Srivastava says:

Thanks Maurice,

You are right. Both the changes have been made.

• Gurdal Ertek says:

In 2006, we had developed a graph-based visualization scheme for visualizing item associations. I suggest that you take a look at this visualization, which was implemented in an R package:

http://ertekprojects.com/gurdal-ertek-publications/blog/2014/07/12/a-framework-for-visualizing-association-mining-results/

• Tavish Srivastava says:

Gurdal,

• Vaishali says: