Difference Between fit(), transform(), fit_transform() methods in Scikit-Learn (with Python Code)
This article was published as a part of the Data Science Blogathon.
“Consumer data will be the biggest differentiator in the next two to three years. Whoever unlocks the reams of data and uses it strategically will win”
Before going ahead, consider the life cycle of a data science project. Most projects move through a few common steps:
- Exploratory Data Analysis (EDA), where we analyze the dataset and summarize its main characteristics.
- Feature Engineering, the process of extracting features from raw data using domain knowledge.
- Feature Selection, where we select the features that have the greatest impact on the model.
- Model Creation, where we build a machine learning model using a suitable algorithm.
- Deployment, where we deploy the ML model on the web.
The first three steps belong mostly to data preprocessing, while model creation belongs to model training. These are the two most important stages to get right before we deploy any machine learning application.
Transformer In Sklearn
Scikit-learn has a kind of object called a transformer. A transformer performs data preprocessing and feature transformation, whereas model training is handled by objects called models, such as linear regression or classifiers. Examples of transformers include StandardScaler, which rescales each feature to have mean 0 and standard deviation 1, as well as PCA, SimpleImputer (named Imputer in older scikit-learn versions), and MinMaxScaler. What all of these techniques have in common is that they preprocess the input data and change its format, and the transformed data is then used for model training.
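As a rough illustration of the distinction between a transformer and a model, the sketch below uses a tiny made-up dataset (the numbers and variable names are assumptions chosen for illustration, not taken from a real project). The transformer exposes transform(), while the model exposes predict().

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up toy data, for illustration only: y is exactly 2 * X
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# A transformer: learns statistics from X, then changes X
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# A model: learns a mapping from the (preprocessed) features to y
model = LinearRegression()
model.fit(X_scaled, y)
pred = model.predict(X_scaled)

# The two kinds of object expose different methods
print(hasattr(scaler, "transform"), hasattr(scaler, "predict"))
print(hasattr(model, "transform"), hasattr(model, "predict"))
```

Both objects have a fit() method, which is why the difference between fit(), transform(), and fit_transform() matters in practice.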
Suppose we have four features, f1, f2, f3 and f4, where f1, f2 and f3 are the independent features and f4 is the dependent feature. We apply standardization, which takes a feature F and converts it into F’ using the formula F’ = (F − μ) / σ, where μ is the mean and σ is the standard deviation of F. Notice that at this stage we take one input feature F and convert it into another input feature F’. In this situation there are three different operations we can perform:
- fit()
- transform()
- fit_transform()
Now we will discuss how these operations differ from each other.
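Before looking at each operation, the standardization step itself, F’ = (F − μ) / σ, can be sketched by hand with NumPy. The feature values below are made up for illustration, and the sketch assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np

# Made-up values for a single feature F (illustrative only)
F = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

mu = F.mean()      # mean of F
sigma = F.std()    # population standard deviation (ddof=0)

# Standardization: subtract the mean, divide by the standard deviation
F_prime = (F - mu) / sigma

print(mu, sigma)
print(F_prime)
print(F_prime.mean(), F_prime.std())
```

The standardized feature has mean 0 and standard deviation 1 (up to floating-point error), which is the behaviour described for StandardScaler above.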
Why they differ from each other
In the fit() method, we apply the required formula to the feature values of the input data and store the result of this calculation in the transformer. To apply it, we call .fit() on the transformer object.
Suppose we initialize a StandardScaler object O and call O.fit(). It takes the feature F and simply computes the mean (μ) and standard deviation (σ) of F. That is all that happens in the fit method; the data itself is not changed.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# x (features) and y (target) are assumed to be already loaded

# split training and testing data
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y,
    test_size=0.3,
    random_state=42
)

# creating object
stand = StandardScaler()

# fit data
Fit = stand.fit(xtrain)
```
Note that we first split the dataset into training and testing subsets, and only then fit the transformer on the training data, so that no information from the test set leaks into the computed statistics.
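To see what fit() actually stores, the sketch below fits a StandardScaler on a small made-up training array (the array `xtrain_demo` is an assumption for illustration) and inspects the learned attributes `mean_` and `scale_`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data: 4 samples, 2 features (illustrative only)
xtrain_demo = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
])

scaler = StandardScaler()
fitted = scaler.fit(xtrain_demo)

# fit() returns the scaler itself, now holding the learned statistics
print(fitted is scaler)

# Per-feature mean and standard deviation computed from the training data
print(scaler.mean_)    # [ 2.5 25. ]
print(scaler.scale_)   # approximately [ 1.118 11.180]
```

Nothing has been transformed at this point; fit() has only recorded the statistics that a later transform() call will use.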
The next step is transform, the second operation on the transformer.
To actually change the data we use the transform() method, which applies the values calculated in fit() to every data point in feature F. We call .transform() on a fitted object, because it reuses the statistics computed by fit().
Continuing the example from the section above, we take the fitted object and call .transform() on it. The transform method uses the stored calculations to rescale the data points, and for StandardScaler the output is by default a NumPy array (or a sparse matrix if the input was sparse).
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# x (features) and y (target) are assumed to be already loaded

# split training and testing data
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y,
    test_size=0.3,
    random_state=42
)

# creating object
stand = StandardScaler()

# fit data
Fit = stand.fit(xtrain)

# transform data
x_scaled = Fit.transform(xtrain)
```
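The output of transform is an array in which each feature has been rescaled to have mean 0 and standard deviation 1 on the training data. Note that the values are not restricted to the range 0 to 1; scaling to that range is what MinMaxScaler does.
Note: transform() is only needed when we actually want to apply a transformation to the input data.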
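The sketch below, using made-up numbers, shows how transform() reuses the statistics learned from the training data, including when it is applied to unseen test data. The arrays `train` and `test` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data (illustrative only): training mean is 5, standard deviation is 5
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [20.0]])

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(train)

# Apply the same training statistics to both subsets
train_scaled = scaler.transform(train)   # (0 - 5) / 5 = -1, (10 - 5) / 5 = 1
test_scaled = scaler.transform(test)     # (5 - 5) / 5 = 0, (20 - 5) / 5 = 3

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test values are scaled with the training mean and standard deviation, and the results fall outside the range 0 to 1, as expected for standardization.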
The fit_transform() method is a combination of the fit method and the transform method; it is equivalent to fit().transform(). It fits the transformer and transforms the input data in a single call. When we need both steps on the same data, calling fit and transform separately is more verbose and, for some transformers, less efficient, so we use fit_transform() to do both at once. It should be used on the training data only; the test data is then passed through transform() alone.
Suppose we create a StandardScaler object and call .fit_transform(). It calculates the mean (μ) and standard deviation (σ) of the feature F and, in the same call, transforms the data points of F.
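```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# x (features) and y (target) are assumed to be already loaded

# split training and testing data
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y,
    test_size=0.3,
    random_state=42
)

# creating object
stand = StandardScaler()

# fit and transform the training data in one call
Fit_Transform = stand.fit_transform(xtrain)
Fit_Transform
```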
The output of this method is the same as the output we obtain after applying fit() and transform() separately.
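This equivalence can be checked with a short sketch. The random data below is generated purely for illustration (the seed, shape, and variable names are assumptions), and the two results are compared with np.allclose.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: 20 samples, 3 features (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Two separate steps: fit, then transform
two_step = StandardScaler().fit(X_demo).transform(X_demo)

# One combined step
one_step = StandardScaler().fit_transform(X_demo)

# Both approaches should give the same scaled array
same = np.allclose(two_step, one_step)
print(same)
```

In a train/test workflow, a reasonable pattern is fit_transform() on the training data followed by transform() on the test data, using the same scaler object.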