Shalaka Kulkarni — Published On May 29, 2022 and Last Modified On June 9th, 2022

This article was published as a part of the Data Science Blogathon.

Introduction on RFM Analysis

This article walks you through customer segmentation using RFM analysis and shows how it can be combined with machine learning. We first segment customers with RFM analysis and then compare those segments with the clusters formed by the K-means clustering algorithm. For more personalized recommendations, I apply Market Basket Analysis using the Apriori algorithm to each segment. I also use Item Based Collaborative Filtering and, at the end, compare the performance of Market Basket Analysis and Item Based Collaborative Filtering.

Let's take one concept at a time and start with RFM analysis. RFM stands for Recency, Frequency, Monetary, a technique from marketing analytics: the ‘R’ factor is how recently a customer last made a purchase, the ‘F’ factor is the number of purchases made in a given period, and the ‘M’ factor is the total amount of money the customer spent in that period.

I assume that you are already familiar with the K-means clustering algorithm, Market Basket Analysis using the Apriori algorithm, and Item Based Collaborative Filtering. In this blog, I take the segments formed with RFM analysis and apply Market Basket Analysis (Apriori) and Item Based Collaborative Filtering to each of them, so that product recommendations are tailored to the segment assigned to the customer.
I will be using Python for the implementation, which will give you hands-on experience. Let's jump in!

Importing Required Libraries

In this section we import pandas, numpy, matplotlib, seaborn, StandardScaler and KMeans. pandas and numpy are used for data processing, matplotlib and seaborn for visualization, and the last two for the machine learning steps.

# import library
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
#For Data  Visualization
import matplotlib.pyplot as plt
import seaborn as sns
#For Machine Learning Algorithm
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import os

Loading of Dataset and Finding Details About it

You need to load the dataset using the pandas library. Once the data is loaded, you can check the data types and see a summary of the dataframe with the .info() method. You can also use the .describe() method and the .shape attribute to get an overall understanding of the data.

df = pd.read_excel(r'D:ProjectOnline_Mart_Dataset.xlsx')
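The inspection calls mentioned above can be tried on a small stand-in frame. This is only a sketch: the `toy` dataframe and its column names are assumptions modeled on a typical online-retail export, not the article's actual file.

```python
import pandas as pd

# Toy stand-in for the transactions file; the column names are assumed.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366"],
    "Description": ["WHITE HANGING HEART T-LIGHT HOLDER", None, "HAND WARMER UNION JACK"],
    "Quantity": [6, 6, 2],
    "UnitPrice": [2.55, 3.39, 1.85],
    "CustomerID": [17850.0, 17850.0, None],
})

toy.info()                    # dtypes and non-null counts per column
summary = toy.describe()      # count, mean, std, min, quartiles, max of numeric columns
n_rows, n_cols = toy.shape    # shape is an attribute, so no parentheses
missing = toy.isna().sum()    # number of missing values per column
```

Checking `missing` at this point shows which columns will need attention during pre-processing.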

Data Preprocessing

Remove Null and Duplicate Values

As the data is transactional, pre-processing is essential to make it suitable for further analysis. A few customer IDs and descriptions were found to be missing. The missing descriptions could have been filled in by matching on the stock code, but the unit price for these rows was also missing, so the rows were deleted. Even after this deletion, some customer IDs were still missing; the code below drops every row without a CustomerID. Duplicate entries were then checked and deleted. This does not affect our market basket analysis.

df= df.dropna(subset=['CustomerID'])
df = df.drop_duplicates()

Exploratory Data Analysis

After pre-processing the data, the next step was to perform Exploratory Data Analysis (EDA). Using the wordcloud library, we plotted a word cloud of the most popular items bought first by the customers, and then found the most frequently sold items. The code for the word cloud follows.

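import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
plt.rcParams['figure.figsize'] = (10, 10)
# newsales is the cleaned transactions dataframe (its creation is not shown in this article)
# Join all descriptions into one string; str(Series) would only pass the truncated printout
text = ' '.join(newsales['Description'].astype(str))
wordcloud = WordCloud(background_color = 'white', width = 1200,  height = 1200, max_words = 20).generate(text)
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most Popular Items bought first by the Customers',fontsize = 40)
plt.show()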

The following code will give the most frequently sold items.

freqprod = newsales.groupby(["StockCode", "Description"])["Description"].count().sort_values(ascending=False)
# Keep the five best sellers, then sort ascending so the largest bar is drawn at the top
top5freq = freqprod.head(5).sort_values(ascending = True)
top5freq.plot(kind = "barh")
plt.xlabel('Quantities Sold')
plt.title('Most Frequently Sold Products')

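The following code plots the number of transactions per month, which shows the months in which the most transactions were carried out.

# m is assumed to be a Series of transaction counts indexed by month (its construction is not shown here)
m.plot(kind = "barh")
plt.xlabel("No. of transactions")
plt.title("Months with Most Transactions")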
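The plotting snippet above uses a Series `m` whose construction is not shown. One way it could be built is sketched here on toy data; the `InvoiceDate` and `InvoiceNo` column names are assumptions, and counting distinct invoices per month is just one reasonable definition of "transactions".

```python
import pandas as pd

# Toy transactions; the column names are assumed, not taken from the article's file.
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A4"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-01-20", "2011-02-11", "2011-11-30"]
    ),
})

# Count distinct invoices per calendar month and sort ascending, so that a
# horizontal bar chart draws the busiest month at the top.
m = (
    toy.groupby(toy["InvoiceDate"].dt.month_name())["InvoiceNo"]
    .nunique()
    .sort_values()
)
```

Calling `m.plot(kind="barh")` on the result gives the kind of chart described in the text.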

Similar kinds of insights can be found depending on the requirements, like what the peak hours were, the costliest products etc.

Model Building

RFM Analysis

RFM analysis was performed after the exploratory data analysis. RFM analysis supports the marketing proverb that “80% of business comes from 20% of the customers”. RFM scores were computed by assigning each customer to a quartile on each of the three factors and summing the quartile labels. Based on this score, the customers were grouped into three segments, viz. Gold, Silver and Bronze. This helps us understand the buying patterns of customers and decide which customers to target. Thus, RFM analysis gives a snapshot of the customer base and helps target customers better in order to make profits and maintain customer loyalty. The table below gives the segments of customers based on the calculated RFM score.

Segment    RFM Score Range
Gold       >9
Silver     >5 and <=9
Bronze     <=5
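The quartile code that follows expects an `rfm` table with one row per customer and the columns Recency, Frequency and MonetaryValue, but the article does not show how that table was built. A minimal sketch on toy data is given here; the transaction column names and the choice of snapshot date (one day after the last transaction) are assumptions.

```python
import pandas as pd

# Toy transactions; column names follow a common online-retail layout (assumed).
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "InvoiceNo": ["A1", "A2", "B1", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-11-01", "2011-12-01", "2011-06-01",
         "2011-06-01", "2011-07-15", "2011-12-08"]
    ),
    "Quantity": [2, 1, 10, 5, 1, 3],
    "UnitPrice": [5.0, 20.0, 1.0, 2.0, 30.0, 4.0],
})
tx["TotalPrice"] = tx["Quantity"] * tx["UnitPrice"]

# Reference date for recency: one day after the last transaction (an assumption).
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm_toy = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                            # number of distinct invoices
    MonetaryValue=("TotalPrice", "sum"),                           # total amount spent
)
```

A low Recency is good (the customer bought recently), while high Frequency and MonetaryValue are good, which is why the recency labels in the quartile step run in the opposite direction.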

The following code snippet shows how to build the RFM segments.

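#Building RFM segments
# Recency: lower is better, so the most recent quartile gets the highest label
r_labels = range(4, 0, -1)
# Frequency and MonetaryValue: higher is better
# (these labels are not defined in the original snippet; 1..4 is assumed)
f_labels = range(1, 5)
m_labels = range(1, 5)
r_quartiles = pd.qcut(rfm['Recency'], q=4, labels = r_labels)
f_quartiles = pd.qcut(rfm['Frequency'],q=4, labels = f_labels)
m_quartiles = pd.qcut(rfm['MonetaryValue'],q=4,labels = m_labels)
rfm = rfm.assign(R=r_quartiles,F=f_quartiles,M=m_quartiles)
# Build RFM Segment and RFM Score
def add_rfm(x) : return str(x['R']) + str(x['F']) + str(x['M'])
rfm['RFM_Segment'] = rfm.apply(add_rfm,axis=1 )
# qcut returns categorical columns, so cast to int before summing
rfm['RFM_Score'] = rfm[['R','F','M']].astype(int).sum(axis=1)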

Using the RFM score to group customers into Gold, Silver and Bronze segments:

def segments(df):
    if df['RFM_Score'] > 9 :
        return 'Gold'
    elif (df['RFM_Score'] > 5) and (df['RFM_Score'] <= 9 ):
        return 'Silver'
    else:
        return 'Bronze'
rfm['General_Segment'] = rfm.apply(segments,axis=1)
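As an aside, the same thresholds can be applied without a row-wise function. The sketch below uses `pd.cut` on a toy score Series; it is an alternative formulation, not the code used in this project, and the bin edges assume scores between 3 and 12 (three factors, each labeled 1 to 4).

```python
import pandas as pd

# Toy RFM scores covering the boundaries of each segment.
scores = pd.Series([3, 5, 6, 9, 10, 12], name="RFM_Score")

# Bins are right-inclusive: (0, 5] -> Bronze, (5, 9] -> Silver, (9, 12] -> Gold.
segment = pd.cut(scores, bins=[0, 5, 9, 12], labels=["Bronze", "Silver", "Gold"])
```

On large customer tables this vectorized form is usually faster than `apply` with `axis=1`.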

Next comes the main step: we merge the RFM segments with the main dataframe so that we can carry out further analysis. We merge them on CustomerID.
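The merge code itself is not shown in the article. A minimal sketch of such a merge is given here on toy frames; `df_toy` and `rfm_toy` are hypothetical stand-ins, and it is assumed that the RFM table is indexed by CustomerID. The result is named `MergedRFM` to match the name used in the next snippet.

```python
import pandas as pd

# Toy transactions and a toy RFM table indexed by CustomerID (both assumed).
df_toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "Description": ["MUG", "PLATE", "MUG", "BOWL"],
})
rfm_toy = pd.DataFrame(
    {"RFM_Score": [11, 7, 4], "General_Segment": ["Gold", "Silver", "Bronze"]},
    index=pd.Index([1, 2, 3], name="CustomerID"),
)

# Attach each customer's score and segment to every one of their transactions.
MergedRFM = df_toy.merge(
    rfm_toy[["RFM_Score", "General_Segment"]],
    left_on="CustomerID",
    right_index=True,
    how="inner",
)
```

An inner join drops transactions whose customer has no RFM row, which is consistent with having removed rows without a CustomerID earlier.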


We then created three separate dataframes, one each for the Gold, Silver and Bronze segments, so that market basket analysis can be run on each of them.

Bronze_seg = MergedRFM[MergedRFM.General_Segment == 'Bronze']
Silver_seg = MergedRFM[MergedRFM.General_Segment == 'Silver']
Gold_seg = MergedRFM[MergedRFM.General_Segment == 'Gold']

K-Means Clustering

The K-Means clustering algorithm is an unsupervised machine learning algorithm that partitions the dataset into K clusters. It identifies K centroids and then allocates each data point to the nearest centroid. The quality of a clustering can be summarized by two quantities: the Within Cluster Sum of Squares (WCSS), which measures how tightly points sit around their own centroid, and the Between Clusters Sum of Squares (BCSS), which measures how far apart the clusters are from one another.
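To make the two quantities concrete, here is a small sketch on synthetic data (the two blobs and all variable names are illustrative assumptions). scikit-learn exposes WCSS as `inertia_`; BCSS is computed both as the total sum of squares minus WCSS and directly from the centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated toy blobs in 2-D.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

overall_mean = X.mean(axis=0)
wcss = km.inertia_                         # squared distances of points to their own centroid
tss = ((X - overall_mean) ** 2).sum()      # total sum of squares around the overall mean
bcss = tss - wcss                          # the part of the spread explained by the clustering

# The same BCSS computed directly: cluster size times squared centroid offset.
counts = np.bincount(km.labels_)
bcss_direct = (counts[:, None] * (km.cluster_centers_ - overall_mean) ** 2).sum()
```

A good clustering has a small WCSS relative to BCSS; the elbow method used later tracks how WCSS falls as K grows.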

The data needs to be scaled because K-means uses distance as its measure of similarity, which makes scaling and normalizing a critical preprocessing step. The distribution of RFM values is right-skewed in this project, so both a transformation and standardization were necessary. A log transformation is applied to each RFM value first, and the StandardScaler class is then used for standardization. The elbow method is then used to find the right number of clusters, which comes out to be 3. The clusters are named 0, 1 and 2. This is consistent with the three RFM segments created earlier.

#Unskew the data with log transformation
rfm_rfm = rfm[['Recency', 'Frequency', 'MonetaryValue']]
# np.log needs strictly positive values; zeros would become -inf
rfm_log = np.log(rfm_rfm).round(3)
# plot the distribution of RFM values
plt.style.use('fivethirtyeight')
f, ax = plt.subplots(3, 1, figsize=(10, 12))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(rfm_log.Recency, kde=True, label = 'Recency', ax=ax[0])
sns.histplot(rfm_log.Frequency, kde=True, label = 'Frequency', ax=ax[1])
sns.histplot(rfm_log.MonetaryValue, kde=True, label = 'Monetary Value', ax=ax[2])
#Normalize the variables with StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
#Store it separately for clustering
# fit_transform learns the mean and standard deviation and then applies them
rfm_normalized = scaler.fit_transform(rfm_log)

K-means Implementation for Customer Segmentation

from sklearn.cluster import KMeans
#First : Get the Best KMeans 
ks = range(1,8)
inertias = []
for k in ks :
    # Create a KMeans clusters
    kc = KMeans(n_clusters=k,random_state=1)
    # Fit on the scaled RFM values and record the within-cluster sum of squares
    kc.fit(rfm_normalized)
    inertias.append(kc.inertia_)
# Plot ks vs inertias
f, ax = plt.subplots(figsize=(15, 8))
plt.plot(ks, inertias, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.title('What is the Best Number for KMeans ?')

Comparison of Quartile Analysis (for RFM) with K-Means Clustering

To compare the two segmentation techniques, a snake plot and a heat map are used. A snake plot is a line plot that is widely used in marketing analytics to compare different segments. For snake plots to work, the data must be normalized, because the plot shows each cluster’s average normalized value for every attribute. For effective plotting, the dataframe must be melted into long format so that the metric columns are collapsed into two columns: the first holds the name of the metric and the second holds its numeric value.
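The melting step described above can be seen on a two-row toy frame. This is a sketch with made-up normalized values; only the column names mirror those used in this article.

```python
import pandas as pd

# Wide format: one row per customer, one column per metric (values are made up).
wide = pd.DataFrame({
    "CustomerID": [1, 2],
    "K_Cluster": [0, 1],
    "Recency": [-0.5, 1.2],
    "Frequency": [0.8, -1.0],
    "MonetaryValue": [0.9, -0.7],
})

# Long format: one row per (customer, metric) pair.
long = pd.melt(
    wide,
    id_vars=["CustomerID", "K_Cluster"],
    value_vars=["Recency", "Frequency", "MonetaryValue"],
    var_name="Metric",
    value_name="Value",
)

# A snake plot draws one line per cluster through these per-metric averages.
snake = long.groupby(["K_Cluster", "Metric"])["Value"].mean()
```

With the data in this shape, `sns.lineplot(x="Metric", y="Value", hue="K_Cluster", data=long)` computes and draws the per-cluster averages itself.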

Heat maps are a graphical representation of data values using a color code. Higher values are shown in darker colors and lower values in lighter colors, which makes the differences between groups easy to see at a glance.

# clustering
kc = KMeans(n_clusters= 3, random_state=1)
kc.fit(rfm_normalized)
#Create a cluster label column in the original DataFrame
cluster_labels = kc.labels_
rfm_rfm_k3 = rfm_rfm.assign(K_Cluster = cluster_labels)
#Calculate average RFM values and sizes for each cluster:
rfm_rfm_k3.groupby('K_Cluster').agg({'Recency': 'mean','Frequency': 'mean',
                                         'MonetaryValue': ['mean', 'count'],}).round(0)
rfm_normalized = pd.DataFrame(rfm_normalized,index=rfm_rfm.index,columns=rfm_rfm.columns)
rfm_normalized['K_Cluster'] = kc.labels_
rfm_normalized['General_Segment'] = rfm['General_Segment']
rfm_normalized.reset_index(inplace = True)
#Melt the data into a long format so RFM values and metric names are stored in 1 column each
rfm_melt = pd.melt(rfm_normalized,id_vars=['CustomerID','General_Segment','K_Cluster'],value_vars=['Recency', 'Frequency', 'MonetaryValue'],
                   var_name='Metric',value_name='Value')
#Snake Plots and Heatmap
f, (ax1, ax2) = plt.subplots(1,2, figsize=(15, 8))
# a snake plot with RFM quantile segments
sns.lineplot(x = 'Metric', y = 'Value', hue = 'General_Segment', data = rfm_melt,ax=ax1)
# a snake plot with K-Means
sns.lineplot(x = 'Metric', y = 'Value', hue = 'K_Cluster', data = rfm_melt,ax=ax2)
plt.suptitle("Snake Plot of RFM",fontsize=24) #make title fontsize subtitle
# relative_imp and prop_rfm are not defined in the original snippet. Assumed definition:
# each group's mean RFM value divided by the overall mean, minus 1
total_avg = rfm_rfm.mean()
relative_imp = rfm_rfm_k3.groupby('K_Cluster').mean() / total_avg - 1
prop_rfm = rfm.groupby('General_Segment')[['Recency', 'Frequency', 'MonetaryValue']].mean() / total_avg - 1
# heatmap with K-Means
f, (ax1, ax2) = plt.subplots(1,2, figsize=(15, 5))
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='Blues',ax=ax1)
ax1.set(title = "Heatmap of K-Means")
# heatmap with RFM quantile segments
sns.heatmap(prop_rfm, cmap= 'Oranges', fmt= '.2f', annot = True,ax=ax2)
ax2.set(title = "Heatmap of RFM quantile")
plt.suptitle("Heat Map of RFM",fontsize=20) #make title fontsize subtitle

Market Basket Analysis using Apriori Algorithm

Market Basket Analysis is a data mining approach used to uncover the buying patterns of customers and, in turn, increase sales. It is an example of frequent itemset mining: it finds associations between the items that customers place in their shopping baskets or carts. This is known as association rule mining, and the Apriori algorithm is a commonly used association rule algorithm in market basket analysis. Because it enumerates every itemset that meets the minimum support, its results are exact for the chosen threshold. It relies on three measures: support (how often an itemset appears across all transactions), confidence (how often the consequent appears in transactions that contain the antecedent) and lift (confidence relative to how often the consequent appears on its own).
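Support, confidence and lift can be worked out by hand on a handful of baskets. The sketch below uses plain Python and made-up baskets purely to illustrate the three measures; it is not how mlxtend computes them internally.

```python
# Five made-up shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
    {"milk"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the consequent's own support; above 1 means a positive association."""
    return confidence(antecedent, consequent) / support(consequent)

sup = support({"bread", "butter"})        # 2 of 5 baskets
conf = confidence({"bread"}, {"butter"})  # 2 of the 3 bread baskets
lft = lift({"bread"}, {"butter"})         # (2/3) / (3/5)
```

Here the rule bread -> butter has a lift slightly above 1, so butter is a little more likely in baskets that already contain bread than in baskets overall.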

In this project, the Apriori algorithm was used to find the frequent item sets separately for each segment obtained from the RFM analysis. So there were three collections of frequent item sets for the three customer segments: one for Bronze, one for Silver and one for Gold. For each segment, frequent item sets were also calculated for three different values of minimum support, i.e. 0.01, 0.02 and 0.03. The results are discussed in the upcoming sections.

To find popular items that are bought together, a matrix of invoices (rows) against products (columns) was created. From it we build a co-occurrence matrix, whose largest values tell us which items are generally bought together.

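from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
# Jupyter magic; remove this line when running as a plain script
%matplotlib inline
# One row per invoice, one column per product, values = total quantity bought
# (the end of this expression is cut off in the original; sum/unstack/fillna is the usual completion)
basket_bronze = (Bronze_seg.groupby(['InvoiceNo', 'Description'])['Quantity']
                 .sum().unstack().fillna(0))
def encode_units(x):
    # 1 if the product appears on the invoice, 0 otherwise (returns and zero quantities count as absent)
    if x <= 0:
        return 0
    return 1
basket_bronze_sets = basket_bronze.copy().applymap(encode_units)
# Postage is a fee, not a product; errors='ignore' covers segments without such a column
basket_bronze_sets.drop('POSTAGE', inplace=True, axis=1, errors='ignore')
#Build frequent itemsets
frequent_itemsets_bronze=apriori(basket_bronze_sets.astype(bool), min_support=0.03, use_colnames=True)
frequent_itemsets_bronze['length'] = frequent_itemsets_bronze['itemsets'].apply(lambda x: len(x))
#Build association rules
rules_bronze = association_rules(frequent_itemsets_bronze, metric="lift", min_threshold=1)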

Co-occurrence Matrix

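# Invoice x product matrix of quantities
CID_PN_matrix = Bronze_seg.pivot_table(index = ["InvoiceNo"], columns = ["Description"],
                              values = "Quantity").fillna(0)
basket_bronze_set = CID_PN_matrix.applymap(encode_units)
basket_bronze_set_int = basket_bronze_set.astype(int)
# The right-hand side is missing in the original. A common construction, assumed here, is the
# transposed 0/1 matrix times itself, which counts the invoices that contain both items
coocM_Bronze = basket_bronze_set_int.T.dot(basket_bronze_set_int)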
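To show what a co-occurrence matrix contains, here is a sketch on three toy invoices. The product names and the step of zeroing the diagonal are illustrative assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

# 0/1 basket matrix: rows are invoices, columns are products.
basket = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1]],
    index=["inv1", "inv2", "inv3"],
    columns=["MUG", "PLATE", "BOWL"],
)

# Entry (i, j) of X^T X counts the invoices that contain both product i and product j.
counts = basket.to_numpy().T @ basket.to_numpy()
# The diagonal only counts each product with itself, so zero it before searching for maxima.
np.fill_diagonal(counts, 0)
cooc = pd.DataFrame(counts, index=basket.columns, columns=basket.columns)

# The pair of products with the highest co-occurrence count.
best_pair = cooc.stack().idxmax()
```

In this toy data MUG and PLATE appear together on two invoices, as do PLATE and BOWL, while MUG and BOWL share only one.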

Item Based Collaborative Filtering

Item Based Collaborative Filtering is a recommendation algorithm based on the similarity between items. Similarities are calculated from the ratings users have given to products. For this project, the purchased quantity plays the role of the rating. Item-based collaborative filtering matches each customer’s purchased and rated items to similar products and then combines those similar products into a list of recommendations.

In this project, a product (WHITE HANGING HEART T-LIGHT HOLDER) is selected to check which customers bought it and in what quantity. Its correlation with every other product was calculated to see which products are similar to it, and many such products were found. To narrow this down and make the recommendations more customer-specific, a customer is chosen, their purchases are checked, and the products closest to the ones they bought are found using the correlation matrix.
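The idea of "similar products" as correlated quantity columns can be illustrated on a tiny matrix. The quantities below are made up so that the correlations are easy to read; this is a sketch of the principle, not a result from the project data.

```python
import numpy as np
import pandas as pd

# Rows are baskets, columns are products, values are quantities (NaN = not bought).
qty = pd.DataFrame({
    "MUG":   [1, 2, 3, 4, np.nan],
    "PLATE": [2, 4, 6, 8, 1],
    "BOWL":  [4, 3, 2, 1, np.nan],
})

# Pearson correlation of every product's quantities with those of MUG,
# computed over the baskets where both products were bought.
sim_to_mug = qty.corrwith(qty["MUG"]).dropna().sort_values(ascending=False)
```

PLATE quantities rise with MUG quantities, so PLATE comes out as the most similar product, while BOWL moves in the opposite direction and ends up at the bottom.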

# Invoice x product matrix of quantities (NaN where the product was not bought)
matrix_Bronze = Bronze_seg.pivot_table(index = ["InvoiceNo"], columns = ["Description"],
                              values = "Quantity")
whiteHeart = matrix_Bronze["WHITE HANGING HEART T-LIGHT HOLDER"]
similarProductsW_Bronze = matrix_Bronze.corrwith(whiteHeart)
similarProductsW_Bronze = similarProductsW_Bronze.dropna()
df1 = pd.DataFrame(similarProductsW_Bronze)
corrMatrix_Bronze = matrix_Bronze.corr()
# Rows are indexed by InvoiceNo, so this picks the basket in the second row of the matrix
second_customer_Bronze = matrix_Bronze.iloc[1].dropna()
simProducts_Bronze = pd.Series(dtype=float)
#Go through every product bought by second customer
for i in range(0, len(second_customer_Bronze.index)):
    print("Adding sims for " + second_customer_Bronze.index[i] + "....")
    #Retrieve similar products to the ones bought by customer 2
    sims_Bronze = corrMatrix_Bronze[second_customer_Bronze.index[i]].dropna()
    #Scale to how many of the products were bought
    sims_Bronze = sims_Bronze.map(lambda x: x * second_customer_Bronze.iloc[i])
    # Add to the list of similar products (Series.append was removed in pandas 2.0)
    simProducts_Bronze = pd.concat([simProducts_Bronze, sims_Bronze])
# Most similar products first
simProducts_Bronze.sort_values(inplace = True, ascending = False)

Model Performance and Comparison

The total numbers of frequent patterns and association rules for all three segments (Gold, Silver, Bronze) at different support values are shown in the figures and tables below. Many frequent patterns are left out when the minimum support is raised, as they fail to satisfy the threshold. The algorithm generates all the frequent patterns for each of the three segments and derives the rules by finding correlations among the frequent item sets with at least 70% confidence.

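Our analysis shows that when the minimum support is lowered to 0.01 or below, the Apriori algorithm sometimes fails to generate the frequent patterns (see the NA entry for the Gold segment in the table below). The run does not finish in a practical amount of time; this is most likely because the number of candidate item sets grows very quickly at low support, rather than because of a literal infinite loop. As the minimum support increases, the number of frequent item sets generated decreases.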

The following graphs show the number of rules/recommendations at 0.01 level of support for the Gold, Silver and Bronze segments. Similar processing was done for 0.02 and 0.03 levels of minimum support.

Algorithm   Segment   Minimum Support   Frequent Item Sets
Apriori     Gold      0.01              NA
            Silver    0.01              705
            Bronze    0.01              253

Item Based Collaborative filtering using RFM Analysis

Algorithm   Segment   Recommendations
IBCF        Gold      4036
            Silver    651
            Bronze    789

In item-based collaborative filtering, a quantity matrix is used to find similarities between items. Based on these similarities, a customer's preference for any product they have not yet bought is estimated. The cumulative number of similar products (recommendations) identified for the Gold segment is larger than for the Silver and Bronze segments.

Comparison of Apriori Algorithm and IBCF w.r.t Execution Time

When generating recommendations for such heavy datasets, the two algorithms have different run times because of their different execution processes. In our analysis, Apriori was efficient in terms of run time, whereas IBCF took a long time even with 8 GB of RAM and a fast processor.

Conclusion on RFM Analysis

In this project, the RFM values (Recency, Frequency, Monetary) were calculated from transactional data, and customer segmentation was performed with two methods, i.e., RFM quantiles and K-Means clustering. With the help of RFM analysis, the customers could be divided into three groups, namely ‘Gold’, ‘Silver’ and ‘Bronze’, with Gold being the most profitable group. This enables us to identify which customers to focus on and who should be given special discounts, offers or promotions. It also helps build customer relations, as customers are made to feel seen and wanted. We can select the most suitable marketing channel for each segment and build new marketing strategies.

Based on this study of customer segmentation, we can say that, of the two approaches compared, the Apriori algorithm was the more effective and efficient in terms of identification of frequent patterns, generation of association rules and execution time.

This project differs from other available market basket analysis solutions because customers from different segments receive different recommendations even when they purchase the same triggering product. For example, a customer from the Gold segment who purchases bread may be recommended a premium product like cake, since we have found this customer to be recent, frequent and high-spending. A customer from the Bronze segment who purchases bread, on the other hand, may be recommended a less premium product like eggs or butter.
