Save Machine Learning Model Using Pickle and Joblib

Purnendu Last Updated : 29 Nov, 2024


Suppose you are working on a practice problem about house rent, with many data points and input features. The usual workflow is to perform EDA, preprocess the data (possibly creating additional features), and feed the result to a model. In this scenario, even the simplest model, multiple linear regression, can become large because of the number of input features and parameters, and re-training it every time you need it is time-consuming.

So the simplest thing to do is to save the trained model and load it later for inference or prediction. The Keras API provides model.save() for this, but it is limited to deep learning models, and many beginners in ML find it confusing to save other kinds of models. Because estimators can have a huge number of parameters, saving them instead of re-training is advisable.

In this article, you will learn how to save a model to pickle using Python. We will explore the differences between joblib vs pickle for model serialization and provide a step-by-step guide on how to save a model in a pickle file effectively.

This article was published as a part of the Data Science Blogathon

Loading Dataset And Creating Our Model
- Creating Model Files
Saving Model
- Method 1 – Pickle – 2 Steps
Frequently Asked Questions

Loading Dataset And Creating Our Model

We are going to use a house price prediction dataset with a single feature, area (for demonstration purposes). Our job is to predict the price given the area. To keep things simple, we will have only 4-5 data points, and the model will be a Linear Regression model, which fits a straight line to the dataset by minimizing the sum of squared differences between the predicted and actual prices over all data points.*

*Squaring the differences in the cost function ensures that positive and negative errors do not cancel each other out.
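To make the footnote concrete, here is a minimal sketch of why the differences are squared. The prices and predictions below are made-up numbers for illustration only, not values from the article's dataset.

```python
# Hypothetical actual prices and predictions from some candidate line.
actual = [550_000, 565_000, 610_000]
predicted = [560_000, 555_000, 610_000]

# Difference between prediction and truth for every data point.
residuals = [p - a for p, a in zip(predicted, actual)]

# Summing raw differences lets +10,000 and -10,000 cancel out.
plain_sum = sum(residuals)

# Summing squared differences keeps every error visible.
squared_sum = sum(r ** 2 for r in residuals)

print(residuals, plain_sum, squared_sum)
```

The plain sum suggests a perfect fit even though two of the three predictions are off, which is why the squared version is the one that gets minimized.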

Creating Model Files

We are now quickly going to create our model file in 5 steps which we will be saving for later use.

1. We will start by loading all the required dependencies.

# loading dependencies
import pandas as pd
import numpy as np
from sklearn import linear_model

2. Now we will load our data into a pandas DataFrame (train_df) using the pd.read_csv() function and print the first 5 rows with the head() method.

# loading our data
train_df = pd.read_csv('train.csv')
# viewing the first few rows
print(train_df.head())

3. To create our model, we first create a model object, which is a LinearRegression estimator (a regressor, not a classifier), and then fit it with our training samples and training labels. The model's job is to find the best-fitting straight line.

# creating the model object
model = linear_model.LinearRegression() # y = mx+b

# fitting model with X_train - area, y_train - price
model.fit(train_df[['area']],train_df.price)

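After executing the above code, the output will look something like this on older scikit-learn versions (recent versions print only LinearRegression(), and the normalize parameter has since been removed):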

>> LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


4. A straight line is defined by a coefficient (slope) and an intercept, and sklearn exposes both as attributes of the fitted model. They can be checked as follows:

# checking coefficient - m
model.coef_

>> array([135.78767123])

# checking intercept - b
model.intercept_

>> 180616.43835616432

5. Finally, for completeness' sake, we can test the model by predicting the price of a house with an area of 5000 sq ft.

# predict model values - area = 5000
model.predict([[5000]])

>> array([859554.79452055])
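The contents of train.csv are not shown in the article, so the five steps cannot be run as-is. The sketch below assumes a hypothetical five-row dataset built in memory; these particular rows are an assumption, chosen because fitting them gives the same coefficient and intercept as printed above.

```python
import pandas as pd
from sklearn import linear_model

# Assumed stand-in for train.csv (not taken from the article).
train_df = pd.DataFrame(
    {
        "area": [2600, 3000, 3200, 3600, 4000],
        "price": [550000, 565000, 610000, 680000, 725000],
    }
)

# Create the estimator and fit it: price = m * area + b
model = linear_model.LinearRegression()
model.fit(train_df[["area"]], train_df["price"])

# Query with a DataFrame so the feature name matches the training data.
query = pd.DataFrame({"area": [5000]})
prediction = model.predict(query)

print(model.coef_, model.intercept_, prediction)
```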

Saving Model

It’s now time to save the model we created. We are going to look at 2 quick ways to save a model. As a bonus, I will also provide guidelines on when to use each method.

Method 1 – Pickle – 2 Steps

Many of you will be familiar with the pickle module. If not, it’s good to know that pickle saves an object using serialization, which simply means converting the object, together with its attributes (for example, the coef_ and intercept_ values we saw above), into a stream of bytes that can be written to a file. Deserialization is the reverse process.

To save a model using pickle, one needs to open a file in write-binary mode under some alias and dump the model into it. This can be achieved using the code below:

# loading library
import pickle

# open the file model_pkl in write-binary mode and dump the model into it
with open('model_pkl', 'wb') as files:
    pickle.dump(model, files)

After the above steps, one can see a file named model_pkl in the directory. It is a binary file, so opening it shows mostly unreadable characters.

Figure: the directory as shown in Google Colab, and the contents of the model_pkl file

One can load this file back into a model using the same logic. Here we use the lr variable to reference the loaded model and then use it to predict the price for 5000 sq ft:

# load saved model
with open('model_pkl' , 'rb') as f:
    lr = pickle.load(f)

# check prediction
lr.predict([[5000]]) # same result as before

>> array([859554.79452055])
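The save and load steps can be combined into one round-trip check. This is a minimal sketch, assuming a small hypothetical dataset in place of train.csv and a temporary directory so that no file is left behind.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data standing in for train.csv.
X = np.array([[2600], [3000], [3200], [3600], [4000]])
y = np.array([550000, 565000, 610000, 680000, 725000])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model_pkl"

    # Serialize the fitted model to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Deserialize it into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model carries the same learned parameters.
same_coef = np.allclose(model.coef_, restored.coef_)
same_prediction = np.allclose(
    model.predict([[5000]]), restored.predict([[5000]])
)
print(same_coef, same_prediction)
```

Comparing predictions after loading is a cheap sanity check that the file on disk really holds the model you meant to save.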

Benefits:

The pickle module keeps track of the objects it has already serialized, so later references to the same object are not serialized again, which avoids redundant work.
Saves a model in very little time.
Good for small models with few parameters, like the one we used.
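The first benefit can be observed directly with the standard library. In this sketch the list of numbers is a hypothetical stand-in for a model attribute that is referenced from two places.

```python
import pickle

# Hypothetical stand-in for data referenced twice inside one object.
shared_weights = [135.79, 180616.44]
container = {"first": shared_weights, "second": shared_weights}

# Serialize to bytes and restore.
payload = pickle.dumps(container)
restored = pickle.loads(payload)

# Both keys still point at one list object, because pickle stored the
# list once and wrote a back-reference for the second occurrence.
same_object = restored["first"] is restored["second"]

# For comparison: two equal but separate lists are each written in full.
separate = {"first": list(shared_weights), "second": list(shared_weights)}
smaller = len(payload) < len(pickle.dumps(separate))

print(same_object, smaller)
```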


Method 2 – Joblib – 2 Steps

Joblib is an alternative for saving models that works efficiently on objects carrying large NumPy arrays, such as estimators with many parameters. Older scikit-learn versions bundled it as sklearn.externals.joblib, but that import was deprecated and later removed, so we will use joblib as a standalone module (it is installed alongside scikit-learn).

-> First, we will import joblib

# loading dependency
import joblib

To save the model, we will use its dump function to write the model to the model_jlib file.

# saving our model: object - model, filename - model_jlib
joblib.dump(model, 'model_jlib')

After running the above code, a file named model_jlib will be created, and its contents will be similar to the pickle file.

Note: We didn’t open a file ourselves because joblib.dump accepts a filename and writes to disk directly. It also accepts file-like objects.

To load the model, we pass a file path or file object to the load function and store the result in the m_jlib variable, which we can later use for prediction.

# opening the file- model_jlib
m_jlib = joblib.load('model_jlib')
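Both ways of calling dump and load, with a filename and with an open file object, can be sketched side by side. The dataset and the temporary directory here are assumptions made so the example runs on its own.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data standing in for train.csv.
X = np.array([[2600], [3000], [3200], [3600], [4000]])
y = np.array([550000, 565000, 610000, 680000, 725000])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model_jlib"

    # Option 1: pass a filename and let joblib handle the file.
    joblib.dump(model, str(path))
    from_name = joblib.load(str(path))

    # Option 2: pass file objects opened in binary mode.
    with open(path, "wb") as fo:
        joblib.dump(model, fo)
    with open(path, "rb") as fo:
        from_file = joblib.load(fo)

# Both restored models predict the same price as the original.
expected = model.predict([[5000]])
matches = np.allclose(from_name.predict([[5000]]), expected) and np.allclose(
    from_file.predict([[5000]]), expected
)
print(matches)
```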

Finally, for prediction, we can call the predict method on m_jlib and pass it a 2D array containing the value 5000.

# check prediction
m_jlib.predict([[5000]]) # same result as before

>> array([859554.79452055])

Note: the predict method expects data in a 2D format, so we used [[5000]], which represents 5000 as a 2D array (one row, one column).
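The 2D requirement is easy to trip over when the inputs start out as a flat array. The sketch below, again using an assumed toy dataset, shows a 1D array being rejected and the same values accepted after reshaping into one column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data standing in for train.csv.
X = np.array([[2600], [3000], [3200], [3600], [4000]])
y = np.array([550000, 565000, 610000, 680000, 725000])
model = LinearRegression().fit(X, y)

areas = np.array([5000, 6000])  # 1D array, shape (2,)

# scikit-learn raises ValueError when given a 1D array of samples.
try:
    model.predict(areas)
    rejected_1d = False
except ValueError:
    rejected_1d = True

# reshape(-1, 1) turns the values into 2 rows with 1 feature each.
predictions = model.predict(areas.reshape(-1, 1))
print(rejected_1d, predictions.shape)
```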

Benefits:

Ideal for large models that have many parameters and carry large NumPy arrays internally.
Works similarly to pickle's `dump` and `load`.
Well suited to sklearn estimators.
Limitation: it saves to a file on disk and not to a string.

Conclusion

Because training large models takes a long time, saving them has become a crucial part of data science work. In this article, I introduced two quick ways to save a machine learning model, using Pickle and Joblib. Both work on the same concepts of serialization (converting an object into a byte stream) and deserialization (restoring the object from those bytes). Because deserialization can execute arbitrary code, only load pickle or joblib files that come from a trusted source.
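The trust warning follows from how pickle works: a pickle can instruct the loader to call a function. The sketch below uses a deliberately harmless call, os.getcwd, and a hypothetical class name, purely to show that loading runs code chosen by whoever created the file.

```python
import os
import pickle


class RunsCodeOnLoad:
    """Hypothetical object whose pickle asks the loader to call a function."""

    def __reduce__(self):
        # pickle stores "call os.getcwd()" instead of the object's state.
        return (os.getcwd, ())


payload = pickle.dumps(RunsCodeOnLoad())

# Loading calls os.getcwd() and returns its result,
# not a RunsCodeOnLoad instance.
result = pickle.loads(payload)
print(type(result).__name__)
```

A malicious file could name a far less friendly function in the same way, which is why model files from unknown sources should not be loaded.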

For simplicity, we have used a Linear Regression model, but you can apply the same approach to save different types of models like Logistic Regression, Decision Trees, SVMs, and many more.

Hope you have enjoyed reading the article and learned something in the process. Those who want to dive deeper can refer to the reference section and work along.

Frequently Asked Questions

Q1. What is sklearn save model?

A. Scikit-learn (sklearn) is a popular machine learning library for Python. To save a trained sklearn model, you can use the “joblib” library, which is installed as a dependency of scikit-learn.
The “joblib” module provides a simple way to save and load Python objects, including trained sklearn models. Saving the model enables you to reuse the model for making predictions on new data, without having to retrain the model from scratch.
To save a trained sklearn model using joblib, you can use the “dump” function, which takes two arguments: the trained model object and the filename for saving the model.

Q2. How do I save my machine learning model?

A. You can save your machine learning model using libraries like pickle or joblib in Python. This helps preserve the model for future use without retraining.

Q3. How to save a machine learning model using pickle?

A. To save a model using pickle, open a file in write-binary mode (wb) and use pickle.dump() to save.
import pickle
with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)

Q4. How do you save a model in Python?

A. In Python, you can save models using pickle or joblib. For example, with pickle, use:
import pickle
with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)

