Introductory Guide – Factorization Machines & their application on huge datasets (with codes in Python)

Ankit Last Updated : 05 Jun, 2020

11 min read

Introduction

I still remember my first encounter with a Click prediction problem. Before this, I had been learning data science and I was feeling good about my progress. I had started to build my confidence in ML hackathons and I was determined to do well in several challenges.

In order to do well, I had even procured a machine with 16 GB RAM and i7 processor. But the first look at the dataset gave me jitters. The data when unzipped was over 50 GB – I had no clue how to predict a click on such a dataset. Thankfully Factorization machines came to my rescue.

Anyone who has worked on a Click Prediction problem or Recommendation systems would have faced a similar situation. Since the datasets are huge, doing predictions for these datasets becomes challenging with limited computation resources.

However, in most cases these datasets are sparse (only a few variables for each training example are non zero) due to which there are several features which are not important for prediction, this is where factorization helps to extract the most important latent or hidden features from the existing raw ones.

Factorization helps in representing approximately the same relationship between the target and predictors using a lower dimension dense matrix. In this article, I discuss Factorization Machines(FM) and Field Aware Factorization Machines(FFM) which allows us to take advantage of factorization in a regression/classification problem with an implementation using python.

Intuition behind Factorization
How Factorization Machines trump Polynomial and linear models?
Field Aware Factorization Machines (FFMs)
Implementation using xLearn Library in Python

Intuition behind Factorization

To get an intuitive understanding of matrix factorization, Let us consider an example: Suppose we have a user-movie matrix of ratings(1-5) where each value of the matrix represents rating (1-5) given by the user to the movie.

	Star Wars I	Inception	Godfather	The Notebook
U1	5	3	–	1
U2	4	–	–	1
U3	1	1	–	5
U4	1	–	–	4
U5	–	1	5	4

We observe from the table above that some of the ratings are missing and we would like to devise a method to predict these missing ratings. The intuition behind using matrix factorization to solve this problem is that there should be some latent features that determines how a user rates a movie. For example – users A and B would rate an Al Pacino movie highly if both of them are fans of actor Al Pacino, here a preference towards a particular actor would be a hidden feature since we are not explicitly including it in the rating matrix.

Suppose we want to compute K hidden or latent features. Our task is to find out the matrices P(U x K) and Q(D x K) (U – Users, D – Movies) such that P x Q^T approximates R which is the rating matrix. Now, each row of P will represent strength of association between user and the feature while each row of Q represents the same strength w.r.t. the movie. To get the rating of a movie d_j rated by user u_i, we can calculate the dot product of 2 vectors corresponding to u_i and d_j All we need to do now is calculate P and Q matrices. We use gradient descent algorithm for doing this. The objective is to minimize the squared error between the actual rating and the one estimated by P and Q. The squared error is given by the following equation.

Now, we need to define an update rule for p_ik and q_kj. The update rule in gradient descent is defined by the gradient of the error to be minimized.

Having obtained the gradient, we can now formulate the update rules for both p_ik and q_kj

Here, α is the learning rate which can control the size of updates. Using the above update rules, we can then iteratively perform the operation until the error converges to its minimum. We can check the overall error as calculated using the following equation and determine when we should stop the process.

The above solution is simple and often leads to overfitting where the existing ratings are predicted accurately but it does not generalize well on unseen data. To tackle this we can introduce a regularization parameter β which will control the user-feature and movie-feature vectors in P and Q respectively and give a good approximation for the ratings.

For anyone interested in python implementation and exact details of the same may go to this link. Once we have calculated P and Q using the above methodology, we get the approximate rating matrix as:

	Star Wars I	Inception	Godfather	The Notebook
U1	4.97	2.98	2.18	0.98
U2	3.97	2.4	1.97	0.99
U3	1.02	0.93	5.32	4.93
U4	1.00	0.85	4.59	3.93
U5	1.36	1.07	4.89	4.12

Notice how we are able to regenerate the existing ratings, moreover we are now able to get a fair approximation to the unknown rating values.

How Factorization Machines trump Polynomial and linear models?

Let us consider a couple of training examples from a click prediction dataset. The dataset is click through related sports news website (publisher) and sports gear firms (advertiser).

Clicked	Publisher (P)	Advertiser (A)	Gender (G)
Yes	ESPN	Nike	Male
No	NBC	Adidas	Male

When we talk about FMs or FFMs, each column (Publisher, Advertiser…) in the dataset would be referred to as a field and each value (ESPN, Nike….) would be referred to as a feature.

A linear or a logistic modeling technique is great and does well in a variety of problems but the drawback is that the model only learns the effect of all variables or features individually rather than in combination.

$\phi _{LM} = w_{0} + w_{ESPN}x_{ESPN} + w_{Nike}x_{Nike} + w_{Adidas}x_{Adidas} + w_{NBC}x_{NBC} + w_{Male}x_{Male}$

Where w₀, w_ESPN etc. represent the parameters and x_ESPN, x_Nikerepresent the individual features in the dataset. By minimizing the log-loss for the above function we get logistic regression. One way to capture the feature interactions is a polynomial function that learns a separate parameter for the product of each pair of features treating each product as a separate variable.

$\phi _{Poly2} = w_{0} + w_{ESPN}x_{ESPN} + w_{Nike}x_{Nike} + w_{Adidas}x_{Adidas} + w_{NBC}x_{NBC} + w_{Male}x_{Male}$

$+ w_{ESPN,Nike}x_{ESPN}x_{Nike} + w_{ESPN,Male}x_{ESPN}x_{Male} + .............$

This can also be referred to as Poly2 model as we are only considering combination of 2 features for a term.

The problem with this is that even for a medium sized dataset, we have a huge model that has terrible implications for both the amount of memory needed to store the model and the time it takes to train the model.
Secondly, for a sparse dataset this technique will not do well to learn all the weights or parameters reliably i.e. we will not have enough training examples for each pair of features in order for each weight to be reliable.

FM to the rescue

FM solves the problem of considering pairwise feature interactions. It allows us to train, based on reliable information (latent features) from every pairwise combination of features in the model. FM also allows us to do this in an efficient way both in terms of time and space complexity. It models pairwise feature interactions as the dot product of low dimensional vectors(length = k). This is illustrated with the following equation for a degree = 2 factorization machine: Each parameter in FMs (k=3) can be described as follows: Here, for each term we have calculated the dot product of the 2 latent factors of size 3 corresponding to the 2 features.

From a modeling perspective, this is powerful because each feature ends up transformed to a space where similar features are embedded near one another. In simple words, the dot product basically represents similarity of the hidden features and it is higher when the features are in the neighborhood.

The cosine function is 1 (maximum) when theta is 0 and decreases to -1 when theta is 180 degrees. It is clear that the similarity is maximum when theta approaches 0.

Another big advantage of FMs is that we are able to compute the term that models all pairwise interactions in linear time complexity using a simple mathematical manipulation to the above equation. If you want to have a look at the exact steps required for this, please refer to the original Factorization Machines research paper at this link.

Example: Demonstration of how FM is better than POLY2

Consider the following artificial Click Through Rate (CTR) data:

This is a dataset comprising of sports websites as publishers and sports gear brands as publishers. The ad appears as a popup and the user has an option of clicking (clicks)the ad or closing it (unclicks).

There is only one negative training data for the pair (ESPN, Adidas). For Poly2, a very negative weight w_ESPN,Adidas might be learned for this pair. For FMs, because the prediction of (ESPN, Adidas) is determined by w_ESPN·w_Adidas, and because w_ESPN and w_Adidas are also learned from other pairs of features as well (e.g., (ESPN, Nike), (NBC, Adidas)), the prediction may be more accurate.
Another example is that there is no training data for the pair (NBC, Gucci). For Poly2, the prediction on this pair is 0, but for FMs, because w_NBC and w_Gucci can be learned from other pairs, it is still possible to do meaningful prediction.

Field-Aware Factorization Machines

Clicked	Publisher (P)	Advertisor (A)	Gender (G)
Yes	ESPN	Nike	Male

In order to understand FFMs, we need to realize the meaning of field. Field is typically the broader category which contains a particular feature. In the above training example, the fields are Publisher (P), Advertiser (A) and Gender(G).

In FMs, every feature has only one latent vector v to learn the latent effect with any other features. Take ESPN as an example, w_ESPN is used to learn the latent effect with Nike (w_ESPN·w_Nike) and Male (w_ESPN.w_Male).
However, because ESPN and Male belong to different fields, the latent effects of (ESPN, Nike) and (ESPN, Male) may be different. This is not captured by factorization machines as it will use the same parameters for dot product in both cases.
In FFMs, each feature has several latent vectors. For example, when we consider the interaction term for ESPN and Nike, the hidden feature for ESPN would have the notation w_ESPN,A where A(Advertiser) represents the field for the feature Nike. Similarly for Gender field a different parameter w_ESPN,G would be learnt.

FFMs have proved to be vital for winning the first prize of three CTR (Click through Rate) competitions hosted by Criteo, Avazu, Outbrain, it also won the third prize of RecSys Challenge 2015. Datasets for the CTR can be accessed from Kaggle.

Implementation using xLearn Library in Python

Some of the most popular libraries for its implementation in Python are as follows:

Package Name	Description
LibFM	Earliest library by the author himself for FMs
LibFFM	Library exclusively FFMs
xLearn	Scalable ML package for both FM & FFMs
tffm	Tensorflow implementation of arbitrary order FMs

For using FMs on datasets, it needs to be converted to a specific format called the libSVM format. The format of training and testing data file is:

<label> <feature1>:<value1> <feature2>:<value2> …

In case of a categorical field, the feature is uniquely encoded and a value of 1 is assigned to it. In the above figure ESPN is represented by code 1, Nike is represented by code 2 and so on. Each line contains an equivalent training example and is ended by a ‘\n’ or a new line character.

For classification(binary/multiclass), <label> is an integer indicating the class label.
For regression, <label> is the target value which can be any real number.
Labels in the test file are only used to calculate accuracy or errors. If they are unknown, you can just fill the first column with any number.

Similarly for FFMs, the data needs to be transformed to a libffm format. Here, we also need to encode the field since ffm requires the information of field for learning. The format for the same is:

<label> <field1>:<feature1>:<value1> <field2>:<feature2>:<value2> …..

Important note on numerical features

Numerical features either need to be discretized (transformed to categorical features by breaking the entire range of a particular numerical feature into smaller ranges and label encoding each range separately) and then converted to libffm format as described above.

Another possibility is to add a dummy field which is the same as feature value will be numeric feature for that particular row (For example a feature with value 45.3 can be transformed to 1:1:45.3). However, the dummy fields may not be informative because they are merely duplicates of features.

xLearn

Recently launched xLearn library provides a fast solution to implementing FM and FFM models on a variety of datasets. It is much faster than libfm and libffm libraries and provide a better functionality for model testing and tuning.

Here, we will illustrate with an example of FFM for the loan prediction dataset which can be accessed at the Loan Prediction practice problem.

The problem to identify the customers segments eligible for loan amount so that they can specifically target these customers based on some demographic and credit history variables.

First step is to import necessary libraries

import pandas as pd
import xlearn as xl
train = pd.read_csv('train_u6lujuX_CVtuZ9i.csv')
import warnings
warnings.filterwarnings('ignore')

For simplicity we will just take a few variables here:

cols = ['Education','ApplicantIncome','Loan_Status','Credit_History']
train_sub = train[cols]
train_sub['Credit_History'].fillna(0, inplace = True)
dict_ls = {'Y':1, 'N':0}
train_sub['Loan_Status'].replace(dict_ls, inplace = True)

Next we will create a test set for testing the ffm model

from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(train_sub, test_size = 0.3, random_state = 5)

Next, we need to convert the dataset to libffm format which is necessary for xLearn to fit the model. Following function does the job of converting dataset in standard dataframe format to libffm format.

df = Dataframe to be converted to ffm format

Type = ‘Train’ / ‘Test’/ ‘Valid’

Numerics = list of all numeric fields

Categories = list of all categorical fields

Features = list of all features except the Label and Id

def convert_to_ffm(df,type,numerics,categories,features):
    currentcode = len(numerics)
    catdict = {}
    catcodes = {}
    
    # Flagging categorical and numerical fields
    for x in numerics:
         catdict[x] = 0
    for x in categories:
         catdict[x] = 1
    
    nrows = df.shape[0]
    ncolumns = len(features)
    with open(str(type) + "_ffm.txt", "w") as text_file:
    
    # Looping over rows to convert each row to libffm format
        for n, r in enumerate(range(nrows)):
             datastring = ""
             datarow = df.iloc[r].to_dict()
             datastring += str(int(datarow['Loan_Status'])) # Set Target Variable here
             
            # For numerical fields, we are creating a dummy field here
             for i, x in enumerate(catdict.keys()):
                 if(catdict[x]==0):
                     datastring = datastring + " "+str(i)+":"+ str(i)+":"+ str(datarow[x])
                 else:
            
            # For a new field appearing in a training example
                     if(x not in catcodes):
                         catcodes[x] = {}
                         currentcode +=1
                         catcodes[x][datarow[x]] = currentcode #encoding the feature
             
            # For already encoded fields
                     elif(datarow[x] not in catcodes[x]):
                         currentcode +=1
                         catcodes[x][datarow[x]] = currentcode #encoding the feature
                     
                     code = catcodes[x][datarow[x]]
                     datastring = datastring + " "+str(i)+":"+ str(int(code))+":1"

             datastring += '\n'
             text_file.write(datastring)

xLearn can handle csv as well as libsvm format for implementation of FMs while we necessarily need to convert it to libffm format for using FFM.

Once we have the dataset in libffm format, we could train the model using the xLearn library.

xLearn can automatically performs early stopping using the validation/test logloss and we can also declare another metric and monitor on the validation set for each iteration of the stochastic gradient descent.

The following python script could be used for training and tuning hyperparameters of FFM model using xlearn on a dataset in ffm format. More options and complete documentation is given here.

ffm_model = xl.create_ffm()

ffm_model.setTrain("train_ffm.txt")

param = {'task':'binary', 
         'lr':0.2,
         'lambda':0.002, 
         'metric':'acc'}

# Start to train
# The trained model will be stored in model.out
ffm_model.fit(param, './model.out')

The library also allows us to use cross-validation using the cv() function:

ffm_model.cv(param)

Predictions can be done on the test set with the following code snippet:

# Prediction task
ffm_model.setTest("test_ffm.txt") # Test data
ffm_model.setSigmoid() # Convert output to 0-1

# Start to predict
# The output result will be stored in output.txt
ffm_model.predict("./model.out", "./output.txt")

End Notes

In this article we have demonstrated the usage of factorization for normal classification/Regression problems. Please let us know how this algorithm performed for your problem. Detailed documentation for xlearn is given at this link and has regular support by its contributors.

Ankit

Algorithm Classification Intermediate Machine Learning Python

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

George Stefanopoulos

Nice and informative article! Here is a paper on the theoretical foundations on Factorization Machines by Steffen Rendel from Osaka University for further reading, : https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf

Show 1 reply

Ankit Choudhary

Thanks George!

David Olmo

this was superb!! thank for the explanation, helped me a lot :)

Thanks David!

Dr Pavan Kumar

Good article to learn about PLS (Principles of Least squares) with the help of Matrices. As a pure statistician, I am not finding any difference from OLSR technique to this methodology except providing a way to have clear understanding by novances.

Thanks Pavan!

Sayak Paul

The article is really informative. I came to know about FMs and FFMs with this one. I wanted know what are author's motivation behind incorporating FMs and FFMs. More specifically, what are some other relevant techniques that can be equally incorporated?

Ashish Tripathy

Thanks for the complete explanation. I remember struggling with it during the AV click prediction competition. In the same lines, If i have around 300 variables in a sparse dataset of products each variable showing a product's profit earned for each user. Can i use first eg. to actually carry out a clustering over the data set?

Hi Ashish, Can you elaborate? I don't understand what you want to do?

jatinpal singh

Nice explanation at the start and informed about xlearn

Thanks Jatinpal!

Anshul

Nice read and good knowledge. According to statistics it is machine learning market will be close to $23 BN by 2020

Mayank

Should all the variables be categorical or every variable one-hot encoded? For e.g.: A variable, say id, should it be one hot encoded or just converted to type "category" and then feeded to the model?

Shivam

Good👍

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Introductory Guide – Factorization Machines & their application on huge datasets (with codes in Python)

Introduction

Table of Contents

Intuition behind Factorization

How Factorization Machines trump Polynomial and linear models?

FM to the rescue

Example: Demonstration of how FM is better than POLY2

Field-Aware Factorization Machines

Implementation using xLearn Library in Python

Important note on numerical features

xLearn

End Notes

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid