Aman Preet Gulati — Updated On June 23rd, 2022

This article was published as a part of the Data Science Blogathon.

Introduction to Linear Regression 

In this article, we will learn about linear regression using MLlib, and everything will be hands-on: we will build an end-to-end linear regression model that predicts a customer’s yearly spend on the company’s product. The dataset is entirely a dummy dataset, generated to help understand the concepts of model building for continuous data using MLlib.



Mandatory Steps for Linear Regression using MLlib

Before getting into the machine learning process and predicting the customer’s yearly spending, we need to initialize the Spark session and read our dummy e-commerce dataset, which has all the relevant features.

  1. Initializing the Spark Session
  2. Reading the dataset

Setting up the spark session

In this section, we will set up the Spark object, which creates the environment for performing the operations that Spark supports and manages.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('E-commerce').getOrCreate()

Inference: With the above two lines, we imported the SparkSession class from PySpark’s SQL package and created the session with the getOrCreate() function. Note that before creating it, we configured it through the builder attribute and gave it the name “E-commerce”.

Reading the dataset

In this section, we will be reading the dummy dataset which I’ve created to perform the ML operations along with Data Preprocessing using PySpark.

data = spark.read.csv("Ecommerce_Customers.csv",inferSchema=True,header=True)

Inference: In the above line of code, we read the e-commerce data with the inferSchema parameter set to True, so that Spark detects the actual data type of each column instead of treating everything as a string, and with header set to True, so that the first row of the file is used for the column names.
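To make the effect of the two parameters concrete, here is a minimal pure-Python sketch of the same idea on a tiny in-memory CSV. The sample rows and the infer_value helper are hypothetical, and the inference is deliberately simplified to "number or text"; Spark’s own inference distinguishes more types than this.

```python
import csv
import io

# Hypothetical two-row sample in the same shape as the e-commerce file.
raw_csv = (
    "Avatar,Time on App,Length of Membership\n"
    "Violet,12.66,4.08\n"
    "Teal,11.11,2.66\n"
)

def infer_value(text):
    """Simplified inference: use a float when the text parses as one."""
    try:
        return float(text)
    except ValueError:
        return text

reader = csv.reader(io.StringIO(raw_csv))
header = next(reader)  # header=True: the first row supplies the column names
rows = [
    dict(zip(header, (infer_value(cell) for cell in record)))
    for record in reader
]

print(rows[0])
```

Without the inference step, every cell would stay a string, which is why the numeric columns could not be used for modelling directly.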

Showing the Schema of our dataset

Here we show the schema of the dataset to see what kind of data each column holds, so that the analysis can be done with more precision.



data.printSchema()

Inference: We used the printSchema() function to show information about each column in our dataset; in the output, one can see the data type of each column.

Now we will go through the dataset in three different ways, so that one knows all the methods to investigate it.

  1. show() function
  2. head() function
  3. Iterating through each item

First, let’s look at the data using the show() function, which displays the top 20 rows of the data by default.

data.show()

Next comes the head() function. It is similar in spirit to the head function in pandas, but, as the output of the code below shows, calling it without arguments returns a single Row object that holds the first complete record.

data.head()

Row(Email='[email protected]', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005)

Now let’s look at a more readable view of the data: iterating over the Row object returned by head() with a for loop prints each field on its own line.

for item in data.head():
    print(item)

[email protected]
835 Frank TunnelWrightmouth, MI 82180-9605

Importing Linear Regression Library

As mentioned earlier, we are going to predict the customer’s yearly expenditure on products. Based on what we already know, this target is continuous, and linear regression is a natural model for such data.

For that reason, we will be importing the Linear Regression package from the ML library of PySpark.

from pyspark.ml.regression import LinearRegression

Data Preprocessing for Machine Learning

In this section, we perform the data preprocessing required to make the dataset ready for the ML pipeline, so that an efficient model can be built on it.

We import Vectors and VectorAssembler so that we can easily separate the feature columns from the label column, i.e. all the independent columns will be stacked together as a single features column, and the dependent column will serve as the label column.

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

Let’s have a look at which columns are present in our dataset.

data.columns

Inference: The above output lists all the columns as a Python list, but this does not give us enough information about which columns to select, so we will use the describe method.

data.describe()

DataFrame[summary: string, Email: string, Address: string, Avatar: string, Avg Session Length: string, Time on App: string, Time on Website: string, Length of Membership: string, Yearly Amount Spent: string]

Inference: The DataFrame returned by describe() reports every summary column as a string, so it does not tell us the original types. From the record shown earlier, however, Email, Address and Avatar hold text, and such string columns have no role in model development: machine learning relies on mathematical calculation, so only numeric (integer and double) columns are accepted.
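For a numeric column, describe() reports the count, mean, standard deviation, minimum and maximum. The following pure-Python sketch computes the same five statistics for a short list of hypothetical values; it assumes the reported stddev is the sample standard deviation, and the describe_column helper is made up for illustration.

```python
import statistics

def describe_column(values):
    """Return describe()-style summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

# Hypothetical values standing in for a column such as "Length of Membership".
membership_years = [2.0, 4.0, 4.0, 6.0]
summary = describe_column(membership_years)

for name, value in summary.items():
    print(name, value)
```

Statistics like these only make sense for numeric columns, which is another way of seeing why the text columns are left out of the model.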

Based on the above discussion the columns which are selected to be part of the machine learning pipeline are as follows:

  1. Average Session Length
  2. Time on App
  3. Time on Website
  4. Length of Membership

assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App",
               "Time on Website",'Length of Membership'],
    outputCol="features")

Inference: In the above code, we use the VectorAssembler transformer to stack all our feature columns together and return them as a single “features” column, named through the outputCol parameter.

output = assembler.transform(data)

Here, the transform function applies the assembler defined above to the real data, so that the new features column appears in the resulting DataFrame.

output.select("features").show()

Now with the select function, we have selected only the features column from the dataset and showed it in the form of DataFrame using the show() function.
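Conceptually, assembling turns the selected numeric columns of each record into one ordered vector. The sketch below imitates that on the single record shown earlier, using a plain dictionary and a list; the assemble helper is hypothetical and only illustrates the idea, it is not how VectorAssembler is implemented.

```python
# The first record of the dataset, as printed by head() earlier (numeric part).
record = {
    "Avatar": "Violet",
    "Avg Session Length": 34.49726772511229,
    "Time on App": 12.65565114916675,
    "Time on Website": 39.57766801952616,
    "Length of Membership": 4.0826206329529615,
    "Yearly Amount Spent": 587.9510539684005,
}

input_cols = [
    "Avg Session Length",
    "Time on App",
    "Time on Website",
    "Length of Membership",
]

def assemble(row, columns):
    """Collect the chosen columns, in order, into one feature list."""
    return [float(row[name]) for name in columns]

features = assemble(record, input_cols)
label = record["Yearly Amount Spent"]

print(features, label)
```

The order of the input columns matters, because the model’s coefficients are later reported in the same order.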

final_data = output.select("features",'Yearly Amount Spent')

In the above code, we select the assembled independent features (the features column) together with the dependent label column, Yearly Amount Spent, and name the result final_data. This DataFrame will be used in the rest of the process.

Train Test Split

In this step of model building, we divide our data into a training set and a testing set: the training data is what the model is built on, while the testing data is what we use to check how well the model performs.

In PySpark, to divide the data into training and testing sets we use the DataFrame’s randomSplit() function, which takes the split weights as a list.

train_data,test_data = final_data.randomSplit([0.7,0.3])

Inference: With the help of tuple unpacking, we have stored the training set (about 70% of the data) in train_data and the remaining roughly 30% in test_data. Note that the weights are passed to randomSplit() as a list.
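Because the split is random, the two parts come out close to, but not exactly, 70% and 30% of the rows. The pure-Python sketch below illustrates this by assigning each row to a part with an independent random draw; the random_split helper is a hypothetical stand-in that assumes this per-row behaviour, not Spark’s actual implementation. In PySpark itself, randomSplit() also accepts an optional seed argument if a reproducible split is needed.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one part with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            # The last part also catches any floating-point rounding gap.
            if draw < bound or index == len(bounds) - 1:
                parts[index].append(row)
                break
    return parts

rows = list(range(500))  # stand-in for 500 records
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=42)

print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one part, so the two sizes always add up to the size of the full dataset.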



train_data.describe().show()

test_data.describe().show()

Inference: The describe method is a convenient way to compare the training and testing data: we can see that the training set has 349 records, while the testing set has 151.

Model Development

Finally, we have reached the step where we build our linear regression model. For that we use the LinearRegression object imported earlier and pass the “Yearly Amount Spent” column, which is our dependent column, to the labelCol parameter.

lr = LinearRegression(labelCol='Yearly Amount Spent')

Now that we have created our LinearRegression object, we can train the model by passing the training data to the fit method.

lrModel = lr.fit(train_data)
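To give a feel for what fitting means, here is a minimal sketch of ordinary least squares for a single feature, written in plain Python with the closed-form formulas for slope and intercept. The fit_simple_ols helper and the toy numbers are hypothetical, and this is not a description of how Spark solves the problem internally; it only shows the kind of line that training is looking for.

```python
def fit_simple_ols(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_ols(xs, ys)
print(slope, intercept)
```

With four features instead of one, the model learns one coefficient per feature plus a single intercept.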

Now, let’s print the coefficient of each feature and the intercept of the model trained on the training dataset. The coefficients indicate how much the prediction changes per unit change in each independent variable.

print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))


Coefficients: [25.324513354618116,38.880247333555445,0.20347373150823037,61.82593066961652] Intercept: -1031.8607952442187
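As a worked example, the printed coefficients and intercept can be applied by hand to the first record of the dataset shown earlier: the prediction is the intercept plus the sum of each coefficient multiplied by its feature value. The sketch below uses only numbers that appear in the outputs above; the variable names are chosen for illustration.

```python
# Values copied from the model output above.
coefficients = [
    25.324513354618116,   # Avg Session Length
    38.880247333555445,   # Time on App
    0.20347373150823037,  # Time on Website
    61.82593066961652,    # Length of Membership
]
intercept = -1031.8607952442187

# Feature values and actual label of the first record returned by head().
features = [
    34.49726772511229,
    12.65565114916675,
    39.57766801952616,
    4.0826206329529615,
]
actual = 587.9510539684005

# Linear model: prediction = intercept + sum(coefficient * feature).
prediction = intercept + sum(c * x for c, x in zip(coefficients, features))
residual = actual - prediction

print(round(prediction, 1), round(residual, 1))
```

The prediction comes out at about 594.3 against an actual value of about 588.0, a difference of roughly 6.3. The tiny coefficient for Time on Website compared with the other three suggests that it contributes very little to the predicted spend.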

Model Evaluation

In this step, we evaluate our model, i.e. we analyze how well it performed. At this stage of model building, we decide whether to carry the existing model forward to the deployment stage or not.

For evaluation, we call the model’s evaluate function on the test data and store the result in the test_results variable, as we will use it for further analysis.

test_results = lrModel.evaluate(test_data)

Those who know the mathematical intuition behind linear regression will be aware that residual = original result - predicted result, i.e. the difference between the original value of the label column and the value predicted by the model.


test_results.residuals.show()
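Residuals are usually condensed into summary metrics such as the root mean squared error (RMSE) and R². The sketch below computes both by hand for a few hypothetical actual and predicted values; in PySpark, the summary object returned by evaluate() exposes comparable values through its rootMeanSquaredError and r2 attributes.

```python
import math

# Hypothetical actual and predicted yearly spend for four customers.
actual = [500.0, 600.0, 450.0, 550.0]
predicted = [510.0, 590.0, 455.0, 545.0]

# Residual = original result - predicted result.
residuals = [a - p for a, p in zip(actual, predicted)]

# RMSE: square root of the mean squared residual.
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

# R^2: share of the label's variance that the predictions explain.
mean_actual = sum(actual) / len(actual)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(residuals, rmse, r2)
```

A small RMSE relative to the scale of the label and an R² close to 1 both indicate that the predictions track the actual values closely.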

Now it’s time to make predictions with our model. For that, we first store the unlabelled data, i.e. only the features column, and then pass it to the model’s transform function to obtain the predictions.

unlabeled_data = test_data.select('features')

predictions = lrModel.transform(unlabeled_data)


predictions.show()

Inference: From the above output, we can see that it returned a DataFrame with two columns: one is the features column holding the complete stack of features, and the other is the prediction column.


Conclusion

In this section, we recap what we have learned in this article. In a nutshell, we have gone through a complete machine learning pipeline for the linear regression algorithm.

  1. We started the spark session and read the dataset on top of which everything was performed.
  2. Then we performed each data preprocessing step which was required to make the data ready for an ML algorithm to accept.
  3. After Data cleaning we moved towards dividing the data and later towards the model building where we built a Linear regression model.
  4. In the end, we evaluated the model using relevant functions and predicted the results.

Here’s the repo link to this article. I hope you liked my article on Introduction to Linear Regression using MLIB. If you have any opinions or questions, then comment below.

Connect with me on LinkedIn for further discussion.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 
