Raghav Agrawal — June 27, 2021

This article was published as a part of the Data Science Blogathon

Recommendation systems are among the most widely used applications of data science. Almost every consumer Internet company, such as Netflix, YouTube, or a news feed service, relies on one. A recommendation system decides which items to show each user out of a huge range of possibilities.



Table of Contents

  • Introduction to a Recommendation system
  • Types of Recommendation system
  • Book Recommendation System
    • Content-Based Filtering
    • Collaborative-based filtering
    • Hybrid filtering
  • Hands-on Recommendation system
    • Dataset Description
    • Preprocess data
    • Perform EDA
    • Clustering
    • Predictions
  • End Notes

What Is a Recommendation System?

A recommendation engine is a class of machine learning applications that offers relevant suggestions to customers. Before recommendation systems, people mostly decided what to buy by asking friends for suggestions. Now Google knows what news you will read, and YouTube knows what type of videos you will watch, based on your search history, watch history, or purchase history.

A recommendation system helps an organization create loyal customers and build trust by offering them the products and services they came to the site for. Today’s recommendation systems are powerful enough to handle even a new customer who is visiting the site for the first time: they can recommend products that are currently trending or highly rated, as well as products that bring the most profit to the company.

Types of Recommendation Systems

A recommendation system is usually built using one of three techniques: content-based filtering, collaborative filtering, or a combination of both.

1) Content-Based Filtering

The algorithm recommends products similar to those the user has already watched or used. In simple words, it tries to find look-alike items. For example, a person who likes to watch Sachin Tendulkar shots may also like watching Ricky Ponting shots, because the two videos have similar tags and categories.

It only looks at the similarity between pieces of content and does not focus on who is watching. It recommends the products with the highest similarity score based on the user’s past preferences.
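The idea of finding look-alike items can be sketched with a simple tag-overlap score. This is only an illustration: the tag sets, the catalogue, and the helper names (`jaccard`, `similar_items`) are hypothetical, and Jaccard similarity is just one of several possible similarity measures.

```python
# Hypothetical catalogue: each video is described by a set of tags.
video_tags = {
    "Sachin Tendulkar shots": {"cricket", "batting", "highlights"},
    "Ricky Ponting shots": {"cricket", "batting", "highlights", "australia"},
    "Pasta recipe": {"cooking", "italian"},
}


def jaccard(a, b):
    """Share of tags two items have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)


def similar_items(liked, catalogue):
    """Rank every other item by tag overlap with the liked item."""
    scores = {
        name: jaccard(catalogue[liked], tags)
        for name, tags in catalogue.items()
        if name != liked
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = similar_items("Sachin Tendulkar shots", video_tags)
print(ranking)
```

The Ricky Ponting video ranks first because it shares three of the four tags involved, while the cooking video shares none.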

2) Collaborative-based Filtering

Collaborative filtering recommender systems are based on past interactions between users and target items. In simple words, we search for look-alike customers and offer products based on what a customer’s look-alike has chosen. Let us understand this with an example. X and Y are two similar users. User X has watched movies A, B, and C, and user Y has watched movies B, C, and D. We will then recommend movie A to user Y and movie D to user X.
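The X and Y example above can be written as a few lines of Python. This is a minimal sketch that assumes we already know X and Y are similar; the dictionary and the helper `recommend_from_lookalike` are hypothetical names used only for illustration.

```python
# Watch histories from the example above (users X and Y are assumed similar).
watched = {
    "X": {"A", "B", "C"},
    "Y": {"B", "C", "D"},
}


def recommend_from_lookalike(user, lookalike, history):
    """Suggest items the look-alike has seen but the user has not."""
    return sorted(history[lookalike] - history[user])


print(recommend_from_lookalike("Y", "X", watched))  # movies to suggest to Y
print(recommend_from_lookalike("X", "Y", watched))  # movies to suggest to X
```

A real system would first have to discover which users are similar, which is the harder part of the problem.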

YouTube has shifted its recommendation system from content-based to collaborative filtering. You may have noticed that it sometimes recommends videos that are not at all related to your history; it does so because other users similar to you have watched them.

3) Hybrid Filtering Method

It is basically a combination of both of the above methods. It is a more complex model that recommends products based on your own history as well as on users similar to you.

Some organizations use this method. Facebook, for example, shows news that is important to you and to others in your network, and LinkedIn does the same.
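One simple way to picture a hybrid model is a weighted blend of a content-based score and a collaborative score. The sketch below is an assumption made for illustration, not how any of the companies mentioned actually combine their signals; `hybrid_score`, `alpha`, and the candidate scores are all hypothetical.

```python
def hybrid_score(content_score, collaborative_score, alpha=0.5):
    """Blend two scores in [0, 1]; alpha is the weight on the content-based score."""
    return alpha * content_score + (1 - alpha) * collaborative_score


# Hypothetical scores for two candidate items: (content-based, collaborative)
candidates = {
    "item_1": (0.9, 0.2),
    "item_2": (0.4, 0.8),
}

# Rank the candidates with more weight on the collaborative signal (alpha=0.3).
ranked = sorted(
    candidates,
    key=lambda item: hybrid_score(*candidates[item], alpha=0.3),
    reverse=True,
)
print(ranked)
```

Changing `alpha` shifts the balance between your own history and what similar users have chosen.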

Book Recommendation System

A book recommendation system is a type of recommendation system that recommends similar books to a reader based on their interests. Book recommendation systems are used by online services that provide ebooks, such as Google Play Books, Open Library, and Goodreads.

In this article, we will use collaborative filtering to build a book recommender system. You can download the dataset from here

Practical Implementation of Recommendation System

Let’s get our hands dirty and implement a book recommendation system using collaborative filtering.

Dataset Description

We have three files in our dataset, which was extracted from book-selling websites.

  • Books – contains all the information related to books, such as author, title, and publication year.
  • Users – contains registered users’ information, such as user id and location.
  • Ratings – records which user gave what rating to which book.

Based on these three files we can build a powerful collaborative filtering model. Let’s get started.

Load Data

Let us start by importing the libraries and loading the datasets. While loading the files we run into a few problems:

  • The values in the CSV files are separated by semicolons, not commas.
  • Some lines are malformed, so pandas cannot parse them and raises an error.
  • The files are encoded in Latin-1.

So while loading the data we have to handle these cases. After running the code below you will get some warnings showing which lines had errors and were skipped while loading.

import numpy as np
import pandas as pd

# on_bad_lines='warn' (pandas >= 1.3) skips malformed lines and prints a warning;
# it replaces error_bad_lines=False, which was removed in pandas 2.0
books = pd.read_csv("BX-Books.csv", sep=';', encoding="latin-1", on_bad_lines='warn')
users = pd.read_csv("BX-Users.csv", sep=';', encoding="latin-1", on_bad_lines='warn')
ratings = pd.read_csv("BX-Book-Ratings.csv", sep=';', encoding="latin-1", on_bad_lines='warn')
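To see how the three loading problems are handled, here is a minimal sketch on a tiny in-memory file. The file content is made up for illustration. It assumes pandas 1.3 or later, where the `on_bad_lines` parameter controls what happens to malformed rows (`'skip'` drops them silently, `'warn'` drops them with a warning).

```python
import io

import pandas as pd

# Made-up semicolon-separated file, encoded in Latin-1, with one malformed row.
raw = (
    "ISBN;Book-Title;Book-Author\n"
    "111;Caf\xe9 Stories;A. Writer\n"
    "222;Broken; Row;B. Writer;extra\n"  # 5 fields instead of 3
    "333;Plain Title;C. Writer\n"
).encode("latin-1")

toy_books = pd.read_csv(
    io.BytesIO(raw),
    sep=";",               # values are separated by semicolons
    encoding="latin-1",    # decode the accented character correctly
    on_bad_lines="skip",   # drop the row with too many fields
)
print(toy_books)
```

Only the two well-formed rows survive, and the accented title is decoded correctly.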

Preprocessing Data

The books file has some extra columns that are not required for our task, such as image URLs, so we keep only the columns we need. We will also rename the columns of each file, because the original names contain hyphens and uppercase letters; cleaner names are easier to use.

books = books[['ISBN', 'Book-Title', 'Book-Author', 'Year-Of-Publication', 'Publisher']]
# rename() returns a new dataframe, which avoids SettingWithCopyWarning on the column subset
books = books.rename(columns={'Book-Title':'title', 'Book-Author':'author', 'Year-Of-Publication':'year', 'Publisher':'publisher'})
users = users.rename(columns={'User-ID':'user_id', 'Location':'location', 'Age':'age'})
ratings = ratings.rename(columns={'User-ID':'user_id', 'Book-Rating':'rating'})

Now if you look at the head of each dataframe, you will see something like this.

[Figure: top rows of the dataframes]

The dataset can be considered large. We have data on 271,360 books, approximately 278,000 registered users on the website, and about 11 lakh (1.1 million) ratings. Hence we can say that the dataset is big enough to build a reliable model on.

Approach to a problem statement

We do not simply want to measure the similarity between users or books. Suppose user A has read and liked books x and y, and user B has also liked these two books. If user A then reads and likes a book z that B has not read, we should recommend z to user B. This is what collaborative filtering is.

To achieve this we start from a rating matrix (the same matrix that matrix factorization methods decompose): the columns are users, the index is books, and the values are ratings. In pandas terms, we have to create a pivot table.
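The rating matrix for the A and B example can be sketched with a tiny pivot table. The ratings below are invented for illustration; only the structure (books as the index, users as columns) follows the article.

```python
import pandas as pd

# Invented ratings: user A has rated x, y and z; user B has rated only x and y.
toy_ratings = pd.DataFrame({
    "user_id": ["A", "A", "A", "B", "B"],
    "title":   ["x", "y", "z", "x", "y"],
    "rating":  [9, 8, 10, 9, 7],
})

# Books as rows, users as columns, ratings as values (NaN where there is no rating).
toy_pivot = toy_ratings.pivot_table(index="title", columns="user_id", values="rating")
print(toy_pivot)

# Books that A has rated but B has not are the candidates to recommend to B.
candidates_for_b = toy_pivot.index[toy_pivot["A"].notna() & toy_pivot["B"].isna()].tolist()
print(candidates_for_b)
```

The missing cell for user B and book z is exactly the gap a recommender tries to fill.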

A big flaw in the problem statement

If we take all the books and all the users for modeling, don’t you think it will create a problem? We have to reduce the number of users and books, because we cannot rely on a user who has only registered on the website or has read only one or two books; such a user gives us too little data to extract knowledge from when recommending books to others. So we limit the data: we keep only users who have rated more than 200 books, and only books that have received at least 50 ratings from users.

Exploratory Data Analysis

Let’s start the analysis and prepare the dataset for modeling as discussed. First, let us see how many users have given ratings and extract those who have given more than 200 ratings.


Step-1) Extract users with more than 200 ratings

Counting the ratings per user (for example with ratings['user_id'].value_counts()) shows that only 105,283 of the roughly 278,000 registered users have given a rating. Now we extract the ids of users who have given more than 200 ratings, and then keep only the ratings from those user ids in the ratings dataframe.

x = ratings['user_id'].value_counts() > 200
y = x[x].index  #user_ids
ratings = ratings[ratings['user_id'].isin(y)]

Step-2) Merge ratings with books

This leaves about 900 users who have given 5.2 lakh (520,000) ratings, which is what we want. Now we merge ratings with books on ISBN so that each rating row also carries the book’s title, author, and other details. Because merge performs an inner join by default, ratings whose ISBN does not appear in the books file are dropped.

rating_with_books = ratings.merge(books, on='ISBN')
[Figure: merged ratings and books dataframe]

Step-3) Extract books that have received at least 50 ratings

Now the dataframe has shrunk to about 4.8 lakh (480,000) rows, because the books file does not contain data for every ISBN in the ratings. Next we count the ratings of each book: we group the data by title and count the ratings in each group.

# count how many ratings each title has received
number_rating = rating_with_books.groupby('title')['rating'].count().reset_index()
number_rating.rename(columns= {'rating':'number_of_ratings'}, inplace=True)
final_rating = rating_with_books.merge(number_rating, on='title')
# keep only titles with at least 50 ratings
final_rating = final_rating[final_rating['number_of_ratings'] >= 50]
# reassign instead of inplace=True to avoid modifying a filtered slice
final_rating = final_rating.drop_duplicates(['user_id','title'])

We have to drop duplicates because otherwise a user who rated the same book several times would be counted more than once for that book. Finally, we have a dataset of users who have rated more than 200 books and books that have received at least 50 ratings. The final dataframe has 59,850 rows and 8 columns.
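The three filtering steps can be tried end to end on a toy dataset. This is a sketch with made-up data and deliberately tiny thresholds (`MIN_USER_RATINGS` and `MIN_BOOK_RATINGS` are hypothetical names); it also uses `groupby(...).transform("count")` as a compact alternative to the count-then-merge approach shown above.

```python
import pandas as pd

MIN_USER_RATINGS = 2  # keep users with MORE than this many ratings (the article uses 200)
MIN_BOOK_RATINGS = 3  # keep books with AT LEAST this many ratings (the article uses 50)

toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3],
    "ISBN":    ["a", "b", "c", "a", "b", "b", "a"],
    "rating":  [5, 7, 9, 6, 8, 8, 10],
})
toy_books = pd.DataFrame({"ISBN": ["a", "b"], "title": ["Book A", "Book B"]})

# Step 1: keep only active users (user 3 has a single rating and drops out).
counts = toy_ratings["user_id"].value_counts()
active_ids = counts[counts > MIN_USER_RATINGS].index
filtered = toy_ratings[toy_ratings["user_id"].isin(active_ids)]

# Step 2: inner join with the book metadata (ISBN "c" has no metadata and drops out).
merged = filtered.merge(toy_books, on="ISBN")

# Step 3: keep popular books, then drop repeated (user, title) pairs.
merged["number_of_ratings"] = merged.groupby("title")["rating"].transform("count")
popular = merged[merged["number_of_ratings"] >= MIN_BOOK_RATINGS]
final = popular.drop_duplicates(["user_id", "title"])
print(final)
```

Only Book B survives, with one row per remaining user, which mirrors how the real dataframe shrinks at each step.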

Step-4) Create Pivot Table

As discussed above, we create a pivot table where the columns are user ids, the index is the book title, and the values are ratings. A user who has not rated a given book gets NaN for it, so we impute those values with zero.

book_pivot = final_rating.pivot_table(columns='user_id', index='title', values="rating")
book_pivot.fillna(0, inplace=True)

We can see that more than 11 users have dropped out, because they only rated books that did not receive at least 50 ratings, so they are out of the picture.


We have prepared our dataset for modeling. We will use the nearest neighbors algorithm, the same neighbor search that underlies k-nearest neighbors; it is unsupervised and, by default in scikit-learn, ranks neighbors by Euclidean distance.

But the pivot table contains lots of zero values, and computing distances over all those zeros wastes computing power, so we convert the pivot table to a sparse matrix and then feed it to the model.

from scipy.sparse import csr_matrix
book_sparse = csr_matrix(book_pivot)
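A small sketch shows what the conversion buys us. The array below is made up; the point is that a CSR matrix keeps the same logical shape but stores only the non-zero entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up 3 x 4 rating matrix that is mostly zeros.
dense = np.array([
    [0, 0, 9, 0],
    [0, 7, 0, 0],
    [0, 0, 0, 0],
])

sparse = csr_matrix(dense)
print(sparse.shape)  # same logical shape as the dense array
print(sparse.nnz)    # number of values actually stored
```

Here 12 cells are represented by only 2 stored values; on the real pivot table, where most users have not rated most books, the saving is much larger.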

Now we will train the nearest neighbors algorithm. Here we specify the algorithm as brute, which means the distance from the query point to every other point is computed.

from sklearn.neighbors import NearestNeighbors
model = NearestNeighbors(algorithm='brute')
# the model must be fitted on the sparse matrix before it can be queried
model.fit(book_sparse)

Let’s make a prediction and see whether it suggests sensible books. We find the nearest neighbors of the input book and then print the 5 books closest to it; kneighbors returns both the distances and the row indices of the books at those distances. Note that the closest match is normally the input book itself. Let us pass Harry Potter, which is at index 237.

distances, suggestions = model.kneighbors(book_pivot.iloc[237, :].values.reshape(1, -1))

Let us print all the suggested books.

for i in range(len(suggestions)):
    print(book_pivot.index[suggestions[i]])
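The whole fit-and-query flow can also be tried on a toy matrix, which makes the output easy to check by hand. Everything below (titles, user ids, ratings, and the names `toy_pivot` and `toy_model`) is made up for illustration, and `n_neighbors=2` is chosen only because the toy matrix is small.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Made-up book x user rating matrix: two fantasy fans and two thriller fans.
toy_pivot = pd.DataFrame(
    [[9, 8, 0, 0],
     [8, 9, 0, 0],
     [0, 0, 7, 9],
     [0, 1, 9, 8]],
    index=["Fantasy 1", "Fantasy 2", "Thriller 1", "Thriller 2"],
    columns=[101, 102, 103, 104],
)

toy_model = NearestNeighbors(algorithm="brute")
toy_model.fit(csr_matrix(toy_pivot.values))

# Query with the ratings row of one book, reshaped to a single-row 2D array.
query = toy_pivot.loc["Fantasy 1"].values.reshape(1, -1)
distances, suggestions = toy_model.kneighbors(query, n_neighbors=2)

# suggestions holds row positions; map them back to titles through the index.
for title, dist in zip(toy_pivot.index[suggestions[0]], distances[0]):
    print(title, round(float(dist), 2))
```

The first neighbor is the query book itself at distance 0, and the second is the other fantasy title, which was rated highly by the same users.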

Hence, we have successfully built a book recommendation system.

End Notes

Hurray! We have built a reliable book recommendation system, and you can further modify it and convert it into an end-to-end project. This is a wonderful unsupervised learning project in which we have done lots of preprocessing. You can explore the dataset further, and if you find something interesting, please share it in the comment box.

I hope it was easy to follow each method in the article. If you have any queries, please post them in the comment section below; I will be happy to help.

About the Author

Raghav Agrawal

I am pursuing my bachelor’s in computer science. I am very fond of Data science and big data. I love to work with data and learn new technologies. Please feel free to connect with me on Linkedin.

If you like my article, please give my other articles a read too. link

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
