Mining frequent items bought together using Apriori Algorithm (with code in R)

Last Updated : 07 Nov, 2023

10 min read

Introduction:

We live in a fast changing digital world. In today’s age customers expect the sellers to tell what they might want to buy. I personally end up using Amazon’s recommendations almost in all my visits to their site.

This creates an interesting threat / opportunity situation for the retailers.

If you can tell the customers what they might want to buy – it not only improves your sales, but also the customer experience and ultimately life time value.

On the other hand, if you are unable to predict the next purchase, the customer might not come back to your store.

In this article, we will learn one such algorithm which enables us to predict the items bought together frequently. Once we know this, we can use it to our advantage in multiple ways.

Introduction:
1. The Approach(Apriori Algorithm)
- 1.1 Handling and Readying The Dataset
- 1.2 Structural Overview and Prerequisites
2. Implementing Apriori Algorithm and Key Terms and Usage
3. Interpretations and Analysis
How does the Apriori algorithm work?
FAQs
End Notes and Summary

1. The Approach(Apriori Algorithm)

When you go to a store, would you not want the aisles to be ordered in such a manner that reduces your efforts to buy things?

For example, I would want the toothbrush, the paste, the mouthwash & other dental products on a single aisle – because when I buy, I tend to buy them together. This is done by a way in which we find associations between items.

In order to understand the concept better, let’s take a simple dataset (let’s name it as Coffee dataset) consisting of a few hypothetical transactions. We will try to understand this in simple English.

The Coffee dataset consisting of items purchased from a retail store.

Coffee dataset:

The Association Rules:

For this dataset, we can write the following association rules: (Rules are just for illustrations and understanding of the concept. They might not represent the actuals).

Rule 1: If Milk is purchased, then Sugar is also purchased.

Rule 2: If Sugar is purchased, then Milk is also purchased.

Rule 3: If Milk and Sugar are purchased, Then Coffee powder is also purchased in 60% of the transactions.

Generally, association rules are written in “IF-THEN” format. We can also use the term “Antecedent” for IF (LHS) and “Consequent” for THEN (RHS).

From the above rules, we understand the following explicitly:

Whenever Milk is purchased, Sugar is also purchased or vice versa.
If Milk and Sugar are purchased then the coffee powder is also purchased. This is true in 3 out of the 5 transactions.

For example, if we see {Milk} as a set with one item and {Coffee} as another set with one item, we will use these to find sets with two items in the dataset such as {Milk,Coffee} and then later see which products are purchased with both of these in our basket.

Therefore now we will search for a suitable right hand side or Consequent. If someone buys Coffee with Milk, we will represent it as {Coffee} => {Milk} where Coffee becomes the LHS and Milk the RHS.

When we use these to explore more k-item sets, we might find that {Coffee,Milk} => {Tea}.

That means the people who buy Coffee and Milk have a possibility of buying Tea as well.

Let us see how the item sets are actually built using the Apriori.

LHS	RHS	Count
Milk	–	300
Coffee	–	200
Tea	–	200
Sugar	–	150
Milk	Coffee	100
Tea	Sugar	80
Milk, Coffee	Tea	40
Milk, Coffee, Tea	Sugar	10

Apriori envisions an iterative approach where it uses k-Item sets to search for (k+1)-Item sets. The first 1-Item sets are found by gathering the count of each item in the set. Then the 1-Item sets are used to find 2-Item sets and so on until no more k-Item sets can be explored; when all our items land up in one final observation as visible in our last row of the table above. One exploration takes one scan of the complete dataset.

An Item set is a mathematical set of products in the basket.

1.1 Handling and Readying The Dataset

The first part of any analysis is to bring in the dataset. We will be using an inbuilt dataset “Groceries” from the ‘arules’ package to simplify our analysis.

All stores and retailers store their information of transactions in a specific type of dataset called the “Transaction” type dataset.

The ‘pacman’ package is an assistor to help load and install the packages. we will be using pacman to load the arules package.

The p_load() function from “pacman” takes names of packages as arguments.

If your system has those packages, it will load them and if not, it will install and load them.

Example:

pacman::p_load(PACKAGE_NAME)

pacman::p_load(arules, arulesViz)

OR

Library(arules)

Library(arulesViz)

data(“Groceries")

1.2 Structural Overview and Prerequisites

Before we begin applying the “Apriori” algorithm on our dataset, we need to make sure that it is of the type “Transactions”.

str(Groceries)

The structure of our transaction type dataset shows us that it is internally divided into three slots: Data, itemInfo and itemsetInfo.

The slot “Data” contains the dimensions, dimension names and other numerical values of number of products sold by every transaction made.

These are the first 12 rows of the itemInfo list within the Groceries dataset. It gives specific names to our items under the column “labels”. The “level2” column segregates into an easier to understand term, while “level1” makes the complete generalisation of Meat.

The slot itemInfo contains a Data Frame that has three vectors which categorizes the food items in the first vector “Labels”.

The second & third vectors divide the food broadly into levels like “baby food”,”bags” etc.

The third slot itemsetInfo will be generated by us and will store all associations.

This is what the internal visual of any transaction dataset looks like and there is a dataframe containing products bought in each transaction in our first inspection. Then, we can group those products by TransactionID like we did in our second inspection to see how many times each is sold before we begin with associativity analysis.

The above datasets are just for a clearer visualisation on how to make a Transaction Dataset and can be reproduced using the following code:

data <- list(
c("a","b","c"),
c("a","b"),
c("a","b","d"),
c("b","e"),
c("b","c","e"),
c("a","d","e"),
c("a","c"),
c("a","b","d"),
c("c","e"),
c("a","b","d","e"),
c("a",'b','e','c')
)
data <- as(data, "transactions")

inspect(data)

#Convert transactions to transaction ID lists

tl <- as(data, "tidLists")
inspect(tl)

Let us check the most frequently purchased products using the summary function.

summary(Groceries)

The summary statistics show us the top 5 items sold in our transaction set as “Whole Milk”,”Other Vegetables”,”Rolls/Buns”,”Soda” and “Yogurt”. (Further explained in Section 3)

To parse to Transaction type, make sure your dataset has similar slots and then use the as() function in R.

2. Implementing Apriori Algorithm and Key Terms and Usage

rules <- apriori(Groceries,

parameter = list(supp = 0.001, conf = 0.80))

We will set minimum support parameter (minSup) to .001.

We can set minimum confidence (minConf) to anywhere between 0.75 and 0.85 for varied results.

I have used support and confidence in my parameter list. Let me try to explain it:

Support: Support is the basic probability of an event to occur. If we have an event to buy product A, Support(A) is the number of transactions which includes A divided by total number of transactions.

Confidence: The confidence of an event is the conditional probability of the occurrence; the chances of A happening given B has already happened.

Lift: This is the ratio of confidence to expected confidence.The probability of all of the items in a rule occurring together (otherwise known as the support) divided by the product of the probabilities of the items on the left and right side occurring as if there was no association between them.

The lift value tells us how much better a rule is at predicting something than randomly guessing. The higher the lift, the stronger the association.

Let’s find out the top 10 rules arranged by lift.

inspect(rules[1:10])

As we can see, these are the top 10 rules derived from our Groceries dataset by running the above code.

The first rule shows that if we buy Liquor and Red Wine, we are very likely to buy bottled beer. We can rank the rules based on top 10 from either lift, support or confidence.

Let’s plot all our rules in certain visualisations first to see what goes with what item in our shop.

3. Interpretations and Analysis

Let us first identify which products were sold how frequently in our dataset.

3.1 The Item Frequency Histogram

These histograms depict how many times an item has occurred in our dataset as compared to the others.

The relative frequency plot accounts for the fact that “Whole Milk” and “Other Vegetables” constitute around half of the transaction dataset; half the sales of the store are of these items.

arules::itemFrequencyPlot(Groceries,topN=20,col=brewer.pal(8,'Pastel2'),main='Relative Item Frequency Plot',type="relative",ylab="Item Frequency (Relative)")

This would mean that a lot of people are buying milk and vegetables!

What other objects can we place around the more frequently purchased objects to enhance those sales too?

For example, to boost sales of eggs I can place it beside my milk and vegetables.

3.2 Graphical Representation

Moving forward in the visualisation, we can use a graph to highlight the support and lifts of various items in our repository but mostly to see which product is associated with which one in the sales environment.

plot(rules[1:20],

method = "graph",

control = list(type = "items"))

This representation gives us a graph model of items in our dataset.

The size of graph nodes is based on support levels and the colour on lift ratios. The incoming lines show the Antecedants or the LHS and the RHS is represented by names of items.

The above graph shows us that most of our transactions were consolidated around “Whole Milk”.

We also see that all liquor and wine are very strongly associated so we must place these together.

Another association we see from this graph is that the people who buy tropical fruits and herbs also buy rolls and buns. We should place these in an aisle together.

3.3 Individual Rule Representation

The next plot offers us a parallel coordinate system of visualisation. It would help us clearly see that which products along with which ones, result in what kinds of sales.

As mentioned above, the RHS is the Consequent or the item we propose the customer will buy; the positions are in the LHS where 2 is the most recent addition to our basket and 1 is the item we previously had.

The topmost rule shows us that when I have whole milk and soups in my shopping cart, I am highly likely to buy other vegetables to go along with those as well.

plot(rules[1:20],

method = "paracoord",

control = list(reorder = TRUE))

If we want a matrix representation, an alternate code option would be:

plot(rules[1:20],

method = "matrix",

control = list(reorder = TRUE)

3.4 Interactive Scatterplot

These plots show us each and every rule visualised into a form of a scatterplot. The confidence levels are plotted on the Y axis and Support levels on the X axis for each rule. We can hover over them in our interactive plot to see the rule.

Plot: arulesViz::plotly_arules(rules)

The plot uses the arulesViz package and plotly to generate an interactive plot. We can hover over each rule and see the Support, Confidence and Lift.

As the interactive plot suggests, one rule that has a confidence of 1 is the one above. It has an exceptionally high lift as well, at 5.17.

How does the Apriori algorithm work?

Step 1 – Find frequent items:
- It starts by identifying individual items (like products in a store) that appear frequently in the dataset.
Step 2 – Generate candidate itemsets:
- Then, it combines these frequent items to create sets of two or more items. These sets are called “itemsets”.
Step 3 – Count support for candidate itemsets:
- Next, it counts how often each of these itemsets appears in the dataset.
Step 4 – Eliminate infrequent itemsets:
- It removes itemsets that don’t meet a certain threshold of frequency, known as the “support threshold”. This threshold is set by the user.
Repeat Steps 2-4:
- The process is repeated, creating larger and larger itemsets, until no more can be made.
Find associations:
- Finally, Apriori uses the frequent itemsets to find associations. For example, if “bread” and “milk” are often bought together, it will identify this as an association.

FAQs

Q1. How can we make the Apriori algorithm more efficient?

1. Use hash tables and transaction reduction to reduce scanning.
2. Partition the dataset and sample a subset to improve scalability.
3. Use dynamic counting, early termination, and parallelization to optimize computations.
4. Employ data compression and specialized hardware for further efficiency gains.

Q2. Which is better Apriori or FP growth?

FP-Growth is generally considered more efficient than Apriori for large datasets due to its ability to mine frequent itemsets with fewer scans of the data. Apriori may be a better choice for smaller datasets or when simplicity is a priority.

Q3. What are the factors affecting Apriori algorithm?

1.Larger datasets increase complexity and execution time.
2. More items increase candidate itemsets to evaluate.
3. Higher support reduces frequent itemsets but increases candidate itemsets.
4. Wider transactions increase itemsets and impact performance.
5.Sparser data makes it harder to identify frequent itemsets.

End Notes and Summary

By visualising these rules and plots, we can come up with a more detailed explanation of how to make business decisions in retail environments.

Now, we would place “Whole Milk” and “Vegetables” beside each other; “Wine” and “Bottled Beer” alongside too.

I can make some specific aisles now in my store to help customers pick products easily from one place and also boost the store sales simultaneously.

Aisles Proposed:

Groceries Aisle – Milk, Eggs and Vegetables
Liquor Aisle – Liquor, Red/Blush Wine, Bottled Beer, Soda
Eateries Aisle – Herbs, Tropical Fruits, Rolls/Buns, Fruit Juices, Jams
Breakfast Aisle – Cereals, Yogurt, Rice, Curd

This analysis would help us improve our store sales and make calculated business decisions for people both in a hurry and the ones leisurely shopping.

Happy Association Mining!

Algorithm Business Analytics Intermediate R Statistics

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Responses From Readers

Vignesh Prajapati

HI Shantanu Kumar, Thanks for the great post on Apriori. I want to know how does it work with big data set of transactions (eg. more than 4 GB in size of file.)

Show 1 reply

Shantanu Kumar

The Algorithm will first create all associativity rules from any Transactional Type Dataset, then statically access those using Breadth First Search. I believe 4GB will not create any issues in processing, as once the rules are created you only have to access them in O(1) if you know your LHS. If you want, you can send across the dataset and I'll look into it.

Rodrigo Esquivel

Good Day Shantanu Kumar: Thanks for your posting. I learned new possibilities to Association Rules. I have a technical question. I noticed that for some odd reason if I use the read,transactions function with a csv file the results will differ if I use it against a transaction set extracted from a Database table( using the package RODBC) in both cases is reading using the same structure. I do not know if you had that experience and could give some lights about it. Thanks in advance

Show 1 reply

Shantanu Kumar

Hi! Thanks a lot for the appreciation, glad it was of some use to you :) I've had a similar problem and there's usually the issue because of the functions that you use. The packages have described each function in their own way, so the internal processing for different package functions may be different. Try looking at the source code for those functions or show me your output datasets and I'll help you solve the issue.

Jarad

Are you familiar with steps that remove redundant rules? I've seen several different approaches to them but each kind of do a different thing and nobody can seem to answer this.

Show 1 reply

Shantanu Kumar

Hi, there's two ways to approach redundancy. One is through the raw code where you scan each item and check for redundancies. The other and easier method is to use the "is.redundant()" function on your rules to identify them one by one for the same. Hope this answers your doubt.

Zizie

Hi Shantanu Kumar, Thanks for the great tutorial

Show 1 reply

Shantanu Kumar

Thanks a lot for the appreciation! :)

Tejaswini

Hello..I wanted to ask if this same technique can be used in life insurance sector?to predict which combination of policies (endowment,term life etc) should be offered together so that it gets issued?

Show 1 reply

Shantanu Kumar

Hi Tejaswini, and this can absolutely be used in any form of association building. Depending on your data, all you will need to know is how to bring it to the type of "Transactions" and the rest can be explored!

Tejaswini

Thank you so much for the help.

Prince

Hi Shantanu, Do you have a sample data set to work with this article. Thanks

Show 1 reply

Shantanu Kumar

Hi Prince, The dataset that I have used in this article is in-built with the R Package arules . Once you import the package, simply type data("Groceries")

Hemant Kumar Sain

i want to do a market basket analysis and I’m trying to create a dataset for that i have two tables, one table contains daily transaction of products in which each row of table shows item purchased by the customer, The second table contains parent group under those products are fallen, for example under fruit category there are several fruits like mango, banana, apple etc. i want to create a third table in which parent group are mentioned as header which can be extracted from Table 2, and all the rows represent transaction of products with their names, and if there is no transaction for any parent category then the cell supposed to fill as NA. please help me with R or C/c++ code( R would be preferred) here I’m attaching you all three tables for better reference i have first two tables and i want to get a table like table 3 Table 1 : Item_1 Item_2 Item_3 Item_4 T1 1KG banana 300ML milk 1kg sugar NA T2 2 Large Corona Beer 2 pack Fries NA NA T3 2 Lux Soap 1kg sugar Na Na Table 2 : Toiletries Fruits Beverages Snacks Vegetables Clothings Dairy Products Soap Banana Corona Beer King Burger Pumpkin Adidas Sport Tshirt XL Milk Shampoo Mango Red Label Whisky Potato Fries Potato Nike Shorts Black L Butter Showergel Oranges grey Cocktail cheese pizza Tomato Puma Jersy red M Suger Lux Soap 2 Large corona Beer Cheese Toothpest Table 3 : toiletries Fruits Beverages Snacks Vegetables Clothings Dairy Products NA 1kg Banana NA NA or NA NA 300ml milk,1kg Sugar NA NA 2 Large corona Beer 2 pack Fries NA NA NA 2 Lux Soap NA

Eswar

Well detailed analysis! Thanks for sharing your knowledge.

Aaa

Thanks for the analysis

Show 1 reply

Aishwarya Singh

Hi, Glad you found it useful

sahir

Thank you so much Shanthanu kumar

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction to NLP

Text Pre-processing

NLP Libraries

Regular Expressions

String Similarity

Spelling Correction

Topic Modeling

Text Representation

Information Retrieval System

Word Vectors

Word Senses

Dependency Parsing

Language Modeling

Getting Started with RNN

Different Variants of RNN

Machine Translation and Attention

Self Attention and Transformers

Transfomers and Pretraining

Question Answering

Text Summarization

Named Entity Recognition

Coreference Resolution

Audio Data

ASR

Audio Separation

Chatbot

Auto NLP

Mining frequent items bought together using Apriori Algorithm (with code in R)

Introduction:

Table of contents

1. The Approach(Apriori Algorithm)

1.1 Handling and Readying The Dataset

1.2 Structural Overview and Prerequisites

2. Implementing Apriori Algorithm and Key Terms and Usage

3. Interpretations and Analysis

3.1 The Item Frequency Histogram

3.2 Graphical Representation

3.3 Individual Rule Representation

3.4 Interactive Scatterplot

How does the Apriori algorithm work?

FAQs

End Notes and Summary

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or