Machine learning now supports the retail industry in many ways, from forecasting sales performance to identifying likely buyers. Market basket analysis is a data mining technique that retailers use to increase sales by better understanding customer purchasing patterns. Analyzing large datasets, such as purchase histories, reveals product groupings and products that are likely to be purchased together. In this article, we cover market basket analysis and its components, and then walk through how to perform it in Python on a real-world dataset.
This article was published as a part of the Data Science Blogathon.
Market basket analysis is a data mining technique used by retailers to increase sales by gaining a deeper understanding of customer purchasing patterns. The method examines large datasets, such as historical purchase records, to uncover product groupings and identify items that customers tend to buy together.
By recognizing these patterns of co-occurrence, retailers can make informed decisions to optimize inventory management, devise effective marketing strategies, employ cross-selling tactics, and even refine store layout for improved customer engagement.
For example, if customers are buying milk, how likely are they to also buy bread (and which kind of bread) on the same trip to the supermarket? This information can increase sales by helping retailers do selective marketing based on predictions, cross-sell, and plan their shelf space for optimal product placement.
Think of the universe as the set of items available at the store; each item then has a Boolean variable that represents the presence or absence of that item in a basket. We can represent each basket as a Boolean vector of values assigned to these variables. We can then analyze the Boolean vectors to identify purchase patterns that reflect items frequently associated or bought together, and express such patterns in the form of association rules.
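To make the Boolean-vector view concrete, here is a minimal sketch. The three baskets and their item names are made up for illustration and are not taken from the dataset used later in the article.

```python
# Toy baskets; the item names are illustrative only.
baskets = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]

# The "universe": every distinct item, in a fixed (sorted) order.
items = sorted(set().union(*baskets))

# One Boolean vector per basket: True where the item is present.
vectors = [[item in basket for item in items] for basket in baskets]

# Summing a column counts how many baskets contain that item.
counts = {item: sum(vec[i] for vec in vectors) for i, item in enumerate(items)}

for basket, vec in zip(baskets, vectors):
    print(sorted(basket), vec)
print(counts)
```

Each row of `vectors` is one basket, and each column is one item, so co-occurrence questions become questions about which columns are True together.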
Industry | Applications of Market Basket Analysis |
---|---|
Retail | Identify frequently purchased product combinations and create promotions or cross-selling strategies |
E-commerce | Suggest complementary products to customers and improve the customer experience |
Hospitality | Identify which menu items are often ordered together and create meal packages or menu recommendations |
Healthcare | Understand which medications are often prescribed together and identify patterns in patient behavior or treatment outcomes |
Banking/Finance | Identify which products or services are frequently used together by customers and create targeted marketing campaigns or bundle deals |
Telecommunications | Understand which products or services are often purchased together and create bundled service packages that increase revenue and improve the customer experience |
Let I = {I1, I2, …, Im} be the set of all items. Let D, the data, be a set of database transactions where each transaction T is a nonempty itemset such that T ⊆ I. Each transaction is associated with an identifier called a TID. Let A be a set of items (an itemset). A transaction T is said to contain A if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅. A is called the antecedent and B the consequent of the rule.
The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (that is, every item in A and every item in B). This is taken as the probability P(A ∪ B). The rule A ⇒ B has confidence c in D, where c is the percentage of transactions in D containing A that also contain B. This is taken as the conditional probability P(B|A). That is, support(A ⇒ B) = P(A ∪ B) and confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A).
Rules that meet both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
Generally, association rule mining can be viewed as a two-step process: first, find all frequent itemsets, i.e., itemsets that occur at least as often as the minimum support threshold requires; second, generate strong association rules from those frequent itemsets, i.e., rules that satisfy both minimum support and minimum confidence.
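To make these definitions concrete, here is a minimal sketch that computes support and confidence on a handful of toy transactions and checks whether a rule is strong. The transactions, the function names, and the thresholds are assumptions made for illustration. Lift, which is used as a threshold later in the article, is computed here with its usual definition, confidence(A ⇒ B) / support(B).

```python
# Toy transactions (illustrative only, not the article's dataset).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B together) / support(A), i.e. an estimate of P(B|A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """confidence(A => B) / support(B); values above 1 suggest association."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


def is_strong(antecedent, consequent, transactions, min_sup, min_conf):
    """A rule is strong if it meets both thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_sup
            and confidence(antecedent, consequent, transactions) >= min_conf)


a, b = {"milk"}, {"bread"}
print("support   :", support(a | b, transactions))
print("confidence:", confidence(a, b, transactions))
print("lift      :", lift(a, b, transactions))
print("strong    :", is_strong(a, b, transactions, min_sup=0.5, min_conf=0.7))
```

In this toy data, milk and bread appear together in 3 of 5 baskets and milk appears in 4 of 5, so the confidence of milk ⇒ bread is 3/4.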
Association Rule Mining is primarily used when you want to identify an association between different items in a set and then find frequent patterns in a transactional database or relational database.
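The second step of that two-step view, turning frequent itemsets into rules, can be sketched as follows. This is an illustration under stated assumptions, not the article's implementation: the function name `generate_rules` is hypothetical, and the supports in `frequent` are supplied by hand instead of being mined from data.

```python
from itertools import combinations


def generate_rules(frequent, min_conf):
    """Build rules A => B from frequent itemsets.

    `frequent` maps frozenset -> support and must also contain every
    subset of each itemset (true for any complete set of frequent itemsets).
    """
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a nonempty antecedent and consequent
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                conf = sup / frequent[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, consequent, sup, conf))
    return rules


# Hypothetical supports for a toy store.
frequent = {
    frozenset({"milk"}): 0.8,
    frozenset({"bread"}): 0.8,
    frozenset({"butter"}): 0.4,
    frozenset({"milk", "bread"}): 0.6,
    frozenset({"bread", "butter"}): 0.4,
}

for antecedent, consequent, sup, conf in generate_rules(frequent, min_conf=0.9):
    print(sorted(antecedent), "=>", sorted(consequent), sup, conf)
```

Only butter ⇒ bread survives the 0.9 confidence threshold here, because every basket with butter also contains bread in these hand-written supports.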
The best example of the association is as you can see in the following image.
There are multiple data mining techniques and algorithms used in market basket analysis. When predicting the probability that customers buy items together, accuracy is one of the important objectives.
The Apriori algorithm is widely used and well known for association rule mining, which makes it a popular choice for market basket analysis. It is generally considered more accurate than the AIS and SETM algorithms. It finds frequent itemsets in transactions and identifies association rules between these items. Its main limitation is frequent itemset generation: it needs to scan the database many times, which is computationally costly and slows it down on large datasets. It relies on the concepts of support and confidence.
The AIS algorithm makes multiple passes over the entire database of transactions, scanning every transaction in each pass. In the first pass, it counts the support of individual items and determines which of them are frequent in the database. In each later pass, it extends the large itemsets of the previous pass to generate candidate itemsets: while scanning a transaction, it determines which large itemsets of the previous pass are contained in that transaction and extends them with other items of the transaction. AIS was developed to generate all large itemsets in a transactional database and was the first published algorithm of its kind.
It focused on enhancing databases with the functionality needed to process decision-support queries. The technique is limited to rules with only one item in the consequent.
The SETM algorithm is quite similar to the AIS algorithm. It also makes multiple passes over the database. In the first pass, it counts the support of single items and determines which of them are frequent. It then generates candidate itemsets by extending the large itemsets of the previous pass. In addition, SETM stores the TIDs (transaction IDs) of the generating transactions together with the candidate itemsets.
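The level-by-level search behind Apriori, and the repeated database scans that make it costly, can be seen in a small sketch. This is a simplified illustration written for this article, not the apyori implementation; the function name and the toy transactions are assumptions.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # One full scan of the database per level: the costly step.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    singletons = {frozenset([item]) for t in transactions for item in t}
    level = frequent_among(singletons)
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent_among(candidates)
        frequent.update(level)
        k += 1
    return frequent


toy = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
for itemset, sup in sorted(apriori_frequent_itemsets(toy, 0.4).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what distinguishes Apriori: a candidate is never counted if one of its subsets is already known to be infrequent.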
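A simplified, in-memory sketch of the AIS idea is shown below: candidates are created on the fly while each transaction is scanned, by extending the large itemsets of the previous pass with later items of the same transaction. The function name, the use of sorted tuples for itemsets, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def ais_frequent_itemsets(transactions, min_count):
    """Return {itemset tuple: count} using AIS-style candidate generation."""
    transactions = [sorted(set(t)) for t in transactions]

    # Pass 1: count single items and keep the large (frequent) ones.
    item_counts = Counter(item for t in transactions for item in t)
    large = {(item,): c for item, c in item_counts.items() if c >= min_count}
    result = dict(large)

    # Later passes: extend the previous pass's large itemsets per transaction.
    while large:
        candidate_counts = Counter()
        for t in transactions:
            t_set = set(t)
            for itemset in large:
                if t_set.issuperset(itemset):
                    # Only items that sort after the last item are added,
                    # so each candidate is counted once per transaction.
                    for item in t:
                        if item > itemset[-1]:
                            candidate_counts[itemset + (item,)] += 1
        large = {c: n for c, n in candidate_counts.items() if n >= min_count}
        result.update(large)
    return result


toy = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
for itemset, count in sorted(ais_frequent_itemsets(toy, 2).items()):
    print(itemset, count)
```

Unlike Apriori, nothing is pruned before counting, which is one reason AIS tends to generate and count more candidates than necessary.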
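The distinguishing feature of SETM, keeping the TIDs next to each candidate, can be sketched with plain Python lists standing in for database relations. This is a simplified illustration, not the original set-oriented SQL formulation; the function name and toy data are assumptions.

```python
from collections import Counter, defaultdict


def setm_frequent_itemsets(transactions, min_count):
    """Return {itemset tuple: list of TIDs} for all frequent itemsets."""
    # Flatten the data into (tid, item) rows, like a sales relation.
    sales = [(tid, item)
             for tid, t in enumerate(transactions)
             for item in sorted(set(t))]
    item_counts = Counter(item for _, item in sales)

    # Rows for frequent single items: (tid, 1-itemset).
    rows = [(tid, (item,)) for tid, item in sales
            if item_counts[item] >= min_count]
    frequent_items_by_tid = defaultdict(list)
    for tid, (item,) in rows:
        frequent_items_by_tid[tid].append(item)

    tids = defaultdict(list)
    while rows:
        # Remember which transactions generated each frequent itemset.
        for tid, itemset in rows:
            tids[itemset].append(tid)
        # Join the previous rows with the items of the same transaction,
        # extending each itemset with items that sort after its last item.
        joined = [(tid, itemset + (item,))
                  for tid, itemset in rows
                  for item in frequent_items_by_tid[tid]
                  if item > itemset[-1]]
        counts = Counter(itemset for _, itemset in joined)
        rows = [(tid, s) for tid, s in joined if counts[s] >= min_count]
    return dict(tids)


toy = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
for itemset, tid_list in sorted(setm_frequent_itemsets(toy, 2).items()):
    print(itemset, tid_list)
```

Because every candidate carries its TIDs, the support count of an itemset is simply the length of its TID list, at the price of storing many more rows.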
FP Growth stands for Frequent Pattern Growth. The FP Growth algorithm represents the data in the form of an FP tree (Frequent Pattern tree) and is a method for mining frequent itemsets. It is an improvement on the Apriori algorithm: no candidate generation is needed to find frequent patterns. The frequent pattern tree structure maintains the association between the itemsets.
A Frequent Pattern Tree is a tree structure built from the itemsets of the initial data. The main purpose of the FP tree is to mine the most frequent patterns. Every node of the FP tree represents an item of an itemset. The root node represents null, whereas the lower nodes represent the items of the itemsets. While the tree is built, the links between nodes and their descendants preserve the association between items, and therefore between itemsets.
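A minimal sketch of how such a tree can be built is shown below. It covers only tree construction, not the mining step of FP Growth, and the class name `FPNode`, the helper names, and the toy transactions are assumptions made for illustration.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent=None):
        self.item = item        # None for the root (the "null" node)
        self.count = 0          # number of transactions passing through
        self.parent = parent
        self.children = {}      # item -> FPNode


def build_fp_tree(transactions, min_count):
    # Pass 1: count items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Pass 2: insert each transaction with its items ordered by descending
    # frequency (ties broken alphabetically) so common prefixes share nodes.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


def show(node, depth=0):
    """Print the tree, one 'item:count' per line, indented by depth."""
    for child in sorted(node.children.values(), key=lambda n: n.item):
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)


toy = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
root, frequent = build_fp_tree(toy, min_count=2)
show(root)
```

Transactions that share a prefix of frequent items share a path in the tree, which is how the structure compresses the database and why no candidate itemsets need to be generated.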
There are many advantages to implementing Market Basket Analysis in marketing. Market Basket Analysis (MBA) applies to customer data from point of sale (PoS) systems.
It helps retailers in the following ways:
Let us take an example of market basket analysis from Amazon, the world’s largest eCommerce platform. From a customer’s perspective, Market Basket Analysis in Data Mining is like shopping at a supermarket. Generally, it observes all items bought by customers together in a single purchase. Then it shows the most related products together that customers will tend to buy in one purchase.
Let us now implement market basket analysis in Python.
Here are the steps involved in using the apriori algorithm to implement MBA:
In this implementation, we use the Store Data dataset that is publicly available on Kaggle. It contains 7501 transaction records, where each record lists the items sold in a single transaction.
We first import the necessary libraries. The Apriori implementation used here comes from the apyori package, which has to be installed separately (for example, with pip install apyori).
```python
import pandas as pd
import numpy as np
from apyori import apriori
```
Next, we read the dataset we downloaded from Kaggle. The file has no header row, so its first row already holds the first transaction; we therefore pass header=None to pandas' read_csv and keep the result in the DataFrame st_df.
Once we have completely read the dataset, we must obtain the list of items in every transaction. So we are going to run two loops. One will be for the total number of transactions, and the other will be for the total number of columns in every transaction. The list will work as a training set from where we can generate the list of Association Rules.
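```python
# Assumption: the Kaggle file was saved as "store_data.csv"; adjust the path as needed
st_df = pd.read_csv("store_data.csv", header=None)
# converting dataframe into list of lists
l = []
for i in range(1, 7501):  # rows 1..7500; row 0 is skipped, so the supports below are fractions of 7500
    # str() turns the NaN padding of shorter baskets into the string 'nan'
    l.append([str(st_df.values[i, j]) for j in range(0, 20)])
```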
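As an alternative to the two loops, the baskets can also be read straight from the CSV file without pandas, which keeps every row and leaves out empty cells instead of turning them into 'nan' strings. The sketch below is a variant, not the code that produced the results in this article; the function name `load_baskets`, the file name in the comment, and the in-memory sample rows are assumptions.

```python
import csv
import io


def load_baskets(file_obj):
    """Read one basket per CSV row, skipping empty cells and empty rows."""
    baskets = []
    for row in csv.reader(file_obj):
        items = [cell.strip() for cell in row if cell.strip()]
        if items:
            baskets.append(items)
    return baskets


# Tiny in-memory stand-in for the downloaded file (made-up rows).
sample = io.StringIO(
    "light cream,chicken\n"
    "mushroom cream sauce,escalope,pasta\n"
    "milk,,\n"
)
baskets = load_baskets(sample)
print(baskets)

# With the real file (file name assumed):
# with open("store_data.csv", newline="") as f:
#     baskets = load_baskets(f)
```

Because the baskets keep their natural lengths, no placeholder item can end up inside the mined itemsets. Support values computed from such a list will differ slightly from the ones printed in this article, since all rows are included.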
With the list of transactions ready, we run the Apriori algorithm, which learns the association rules from this list. We set the minimum support to 0.0045, the minimum confidence to 0.2, and the minimum lift to 3. We set the minimum length to 2 because we want associations among at least two items.
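One way to read these thresholds is to translate them into counts. The sketch below is only arithmetic on the numbers quoted in this article (7501 transactions, support 0.0045, confidence 0.2, lift 3). The helper name `required_confidence` and the consequent supports passed to it are hypothetical, and lift is taken in its usual sense of confidence divided by the support of the consequent.

```python
import math

n_transactions = 7501   # dataset size stated in the article
min_support = 0.0045
min_confidence = 0.2
min_lift = 3

# Smallest number of baskets an itemset must appear in to pass min_support.
min_count = math.ceil(min_support * n_transactions)
print("minimum number of baskets:", min_count)


def required_confidence(consequent_support):
    """Confidence a rule needs to pass both min_confidence and min_lift.

    lift = confidence / support(consequent), so lift >= min_lift means
    confidence >= min_lift * support(consequent).
    """
    return max(min_confidence, min_lift * consequent_support)


# Hypothetical consequent supports of 5% and 10%.
for s in (0.05, 0.10):
    print(s, required_confidence(s))
```

Read this way, a lift threshold of 3 asks that buying the antecedent makes the consequent at least three times as likely as it is overall.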
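```python
# applying apriori algorithm
# (min_length is kept from the original call; apyori may ignore it, since the
#  length option it is known to accept is max_length)
association_rules = apriori(l, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)
```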
After running the above code, we have the list of association rules between the items. Let us print the itemset behind each rule.
```python
for i in range(0, len(association_results)):
    print(association_results[i][0])
```
frozenset({'light cream', 'chicken'})
frozenset({'mushroom cream sauce', 'escalope'})
frozenset({'pasta', 'escalope'})
frozenset({'herb & pepper', 'ground beef'})
frozenset({'tomato sauce', 'ground beef'})
frozenset({'whole wheat pasta', 'olive oil'})
frozenset({'shrimp', 'pasta'})
frozenset({'nan', 'light cream', 'chicken'})
frozenset({'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'spaghetti', 'cooking oil', 'ground beef'})
frozenset({'mushroom cream sauce', 'nan', 'escalope'})
frozenset({'nan', 'pasta', 'escalope'})
frozenset({'spaghetti', 'frozen vegetables', 'ground beef'})
frozenset({'olive oil', 'frozen vegetables', 'milk'})
frozenset({'shrimp', 'frozen vegetables', 'mineral water'})
frozenset({'spaghetti', 'olive oil', 'frozen vegetables'})
frozenset({'spaghetti', 'shrimp', 'frozen vegetables'})
frozenset({'spaghetti', 'frozen vegetables', 'tomatoes'})
frozenset({'spaghetti', 'grated cheese', 'ground beef'})
frozenset({'herb & pepper', 'mineral water', 'ground beef'})
frozenset({'nan', 'herb & pepper', 'ground beef'})
frozenset({'spaghetti', 'herb & pepper', 'ground beef'})
frozenset({'olive oil', 'milk', 'ground beef'})
frozenset({'nan', 'tomato sauce', 'ground beef'})
frozenset({'spaghetti', 'shrimp', 'ground beef'})
frozenset({'spaghetti', 'olive oil', 'milk'})
frozenset({'soup', 'olive oil', 'mineral water'})
frozenset({'whole wheat pasta', 'nan', 'olive oil'})
frozenset({'nan', 'shrimp', 'pasta'})
frozenset({'spaghetti', 'olive oil', 'pancakes'})
frozenset({'nan', 'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'spaghetti', 'nan', 'cooking oil', 'ground beef'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'ground beef'})
frozenset({'spaghetti', 'frozen vegetables', 'milk', 'mineral water'})
frozenset({'nan', 'frozen vegetables', 'milk', 'olive oil'})
frozenset({'nan', 'shrimp', 'frozen vegetables', 'mineral water'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'olive oil'})
frozenset({'spaghetti', 'nan', 'shrimp', 'frozen vegetables'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'tomatoes'})
frozenset({'spaghetti', 'nan', 'grated cheese', 'ground beef'})
frozenset({'nan', 'herb & pepper', 'mineral water', 'ground beef'})
frozenset({'spaghetti', 'nan', 'herb & pepper', 'ground beef'})
frozenset({'nan', 'milk', 'olive oil', 'ground beef'})
frozenset({'spaghetti', 'nan', 'shrimp', 'ground beef'})
frozenset({'spaghetti', 'nan', 'milk', 'olive oil'})
frozenset({'soup', 'nan', 'olive oil', 'mineral water'})
frozenset({'spaghetti', 'nan', 'olive oil', 'pancakes'})
frozenset({'spaghetti', 'milk', 'mineral water', 'nan', 'frozen vegetables'})
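Itemsets that contain 'nan' in the output above come from the empty cells of transactions with fewer than 20 items, which were converted to the string 'nan' when the list was built. Next, we use a for loop to display the rule, support, confidence, and lift for each of the association rules above.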
```python
for item in association_results:
    # first index of the inner list
    # contains base item and add item
    pair = item[0]
    items = [x for x in pair]
    # note: a frozenset has no fixed order, and only its first two items are printed
    print("Rule: " + items[0] + " -> " + items[1])
    # second index of the inner list
    print("Support: " + str(item[1]))
    # third index of the list located at 0th position
    # of the third index of the inner list
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("-----------------------------------------------------")
```
Rule: light cream -> chicken
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
-----------------------------------------------------
Rule: mushroom cream sauce -> escalope
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
-----------------------------------------------------
Rule: pasta -> escalope
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
-----------------------------------------------------
Rule: herb & pepper -> ground beef
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
-----------------------------------------------------
Rule: tomato sauce -> ground beef
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
-----------------------------------------------------
Rule: whole wheat pasta -> olive oil
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
-----------------------------------------------------
Rule: shrimp -> pasta
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
-----------------------------------------------------
Rule: nan -> light cream
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
-----------------------------------------------------
Rule: shrimp -> frozen vegetables
Support: 0.005333333333333333
Confidence: 0.23255813953488372
Lift: 3.260160834601174
-----------------------------------------------------
Rule: spaghetti -> cooking oil
Support: 0.0048
Confidence: 0.5714285714285714
Lift: 3.281557646029315
-----------------------------------------------------
Rule: mushroom cream sauce -> nan
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
-----------------------------------------------------
Rule: nan -> pasta
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.008666666666666666
Confidence: 0.3110047846889952
Lift: 3.164906221394116
-----------------------------------------------------
Rule: olive oil -> frozen vegetables
Support: 0.0048
Confidence: 0.20338983050847456
Lift: 3.094165778526489
-----------------------------------------------------
Rule: shrimp -> frozen vegetables
Support: 0.0072
Confidence: 0.3068181818181818
Lift: 3.2183725365543547
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.005733333333333333
Confidence: 0.20574162679425836
Lift: 3.1299436124887174
-----------------------------------------------------
Rule: spaghetti -> shrimp
Support: 0.006
Confidence: 0.21531100478468898
Lift: 3.0183785717479763
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.006666666666666667
Confidence: 0.23923444976076555
Lift: 3.497579674864993
-----------------------------------------------------
Rule: spaghetti -> grated cheese
Support: 0.005333333333333333
Confidence: 0.3225806451612903
Lift: 3.282706701098612
-----------------------------------------------------
Rule: herb & pepper -> mineral water
Support: 0.006666666666666667
Confidence: 0.390625
Lift: 3.975152645861601
-----------------------------------------------------
Rule: nan -> herb & pepper
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
-----------------------------------------------------
Rule: spaghetti -> herb & pepper
Support: 0.0064
Confidence: 0.3934426229508197
Lift: 4.003825878061259
-----------------------------------------------------
Rule: olive oil -> milk
Support: 0.004933333333333333
Confidence: 0.22424242424242424
Lift: 3.411395906324912
-----------------------------------------------------
Rule: nan -> tomato sauce
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
-----------------------------------------------------
Rule: spaghetti -> shrimp
Support: 0.006
Confidence: 0.5232558139534884
Lift: 3.004914704939635
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.0072
Confidence: 0.20300751879699247
Lift: 3.0883496774390333
-----------------------------------------------------
Rule: soup -> olive oil
Support: 0.0052
Confidence: 0.2254335260115607
Lift: 3.4295161157945335
-----------------------------------------------------
Rule: whole wheat pasta -> nan
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.005066666666666666
Confidence: 0.20105820105820105
Lift: 3.0586947422647217
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.005333333333333333
Confidence: 0.23255813953488372
Lift: 3.260160834601174
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0048
Confidence: 0.5714285714285714
Lift: 3.281557646029315
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.008666666666666666
Confidence: 0.3110047846889952
Lift: 3.164906221394116
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.004533333333333334
Confidence: 0.28813559322033905
Lift: 3.0224013274860737
-----------------------------------------------------
Rule: nan -> frozen vegetables
Support: 0.0048
Confidence: 0.20338983050847456
Lift: 3.094165778526489
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.0072
Confidence: 0.3068181818181818
Lift: 3.2183725365543547
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005733333333333333
Confidence: 0.20574162679425836
Lift: 3.1299436124887174
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006
Confidence: 0.21531100478468898
Lift: 3.0183785717479763
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006666666666666667
Confidence: 0.23923444976076555
Lift: 3.497579674864993
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005333333333333333
Confidence: 0.3225806451612903
Lift: 3.282706701098612
-----------------------------------------------------
Rule: nan -> herb & pepper
Support: 0.006666666666666667
Confidence: 0.390625
Lift: 3.975152645861601
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0064
Confidence: 0.3934426229508197
Lift: 4.003825878061259
-----------------------------------------------------
Rule: nan -> milk
Support: 0.004933333333333333
Confidence: 0.22424242424242424
Lift: 3.411395906324912
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006
Confidence: 0.5232558139534884
Lift: 3.004914704939635
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0072
Confidence: 0.20300751879699247
Lift: 3.0883496774390333
-----------------------------------------------------
Rule: soup -> nan
Support: 0.0052
Confidence: 0.2254335260115607
Lift: 3.4295161157945335
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005066666666666666
Confidence: 0.20105820105820105
Lift: 3.0586947422647217
-----------------------------------------------------
Rule: spaghetti -> milk
Support: 0.004533333333333334
Confidence: 0.28813559322033905
Lift: 3.0224013274860737
-----------------------------------------------------
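In the listing above, the two items shown after "Rule:" are simply the first two elements of each itemset, so they do not necessarily reflect the direction of the rule, and itemsets containing the 'nan' placeholder appear as separate rules. The sketch below shows one possible way to print the antecedent and consequent explicitly and to skip such itemsets. The namedtuples are stand-ins that mirror the positional layout the loop above relies on (item[0], item[1], item[2][0][2], item[2][0][3]); the field names, the assumption that positions 0 and 1 of each statistic hold the antecedent and the consequent, and the rule direction in the sample record should be checked against the apyori version in use.

```python
from collections import namedtuple

# Stand-ins for the result objects (layout assumed, see the text above).
OrderedStatistic = namedtuple(
    "OrderedStatistic", "items_base items_add confidence lift")
RelationRecord = namedtuple(
    "RelationRecord", "items support ordered_statistics")


def format_rules(results, ignore=frozenset({"nan"})):
    """Return one readable line per rule, skipping polluted itemsets."""
    lines = []
    for record in results:
        if ignore & set(record.items):
            continue  # itemset contains the NaN placeholder
        for stat in record.ordered_statistics:
            if not stat.items_base or not stat.items_add:
                continue  # a rule needs both sides
            base = ", ".join(sorted(stat.items_base))
            add = ", ".join(sorted(stat.items_add))
            lines.append(
                f"{base} -> {add} (support={record.support:.4f}, "
                f"confidence={stat.confidence:.4f}, lift={stat.lift:.4f})")
    return lines


# Sample records using the figures of the first rule printed above;
# the direction light cream -> chicken is assumed for illustration.
stat = OrderedStatistic(frozenset({"light cream"}), frozenset({"chicken"}),
                        0.2905982905982906, 4.843304843304844)
sample_results = [
    RelationRecord(frozenset({"light cream", "chicken"}),
                   0.004533333333333334, [stat]),
    RelationRecord(frozenset({"nan", "light cream", "chicken"}),
                   0.004533333333333334, [stat]),
]
for line in format_rules(sample_results):
    print(line)
```

With the real results, the same function could be called as format_rules(association_results), provided the records expose the fields assumed here.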
In this tutorial, we discussed market basket analysis and walked through the steps to implement it in Python using the Apriori algorithm from the apyori package. We also looked at its uses and advantages, and saw that other algorithms, such as FP Growth, AIS, and SETM, can also be used for market basket analysis.
A. The purpose of the market basket is to analyze consumer purchasing patterns and identify product associations.
A. To calculate a market basket, count the number of transactions containing a set of items and analyze co-occurrence.
A. Amazon uses market basket analysis to recommend products by identifying frequently bought together items and improving cross-selling strategies.
A. Market basket analysis for pricing helps determine optimal pricing strategies by understanding how product bundles influence purchasing decisions.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Hi, I liked your article. I have a question regarding the parameter that you choose for the apriori apriori algorithm. association_rules = apriori(l, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2) Could you please tell me how you choose these values? Thanks
Great post! I learned a lot from it.
This is a great guide for businesses. I would recommend it to anyone looking to improve their business.
Hi, Thanks for this great article. Please how did you deal with the large amount of missing values and how does it affect your results? Thank you