I came across this strategic virtue from Sun Tzu recently:
What does this have to do with a data science blog? This is the essence of how you win competitions and hackathons. You come into the competition better prepared than the competitors, execute quickly, learn, and iterate to bring out your best.
Last week, we published “Perfect Way to Build a Predictive Model in Less than 10 minutes using R”. The natural follow-up is a Python version. Given the rise of Python in the last few years and its simplicity, it makes sense to have this toolkit ready for the Pythonistas in the data science world. I will follow a similar structure to the previous article, with my additional inputs at different stages of model building. These two articles will help you build your first predictive model faster and with better power. Most top data scientists and Kagglers build and submit their first effective model quickly. This helps them get a head start on the leaderboard and provides a benchmark solution to beat.
Overview:
I always focus on investing quality time during the initial phase of the model building, like hypothesis generation/brainstorming session(s) / discussion(s) or understanding the domain. All these activities helped me relate to the problem, eventually leading me to design more robust business solutions. There are good reasons why you should spend this time up front:
This stage will need quality time, so I am not mentioning the timeline here; I recommend you make this a standard practice. It will help you to build better predictive models and result in less iteration of work at later stages. Let’s look at the remaining stages in the first model build with timelines:
P.S. This is the split of time spent only for the first model build
Let’s go through the process step by step (with estimates of time spent in each step):
In my initial days as a data scientist, data exploration used to take a lot of time. With time, I have automated many operations on the data. The benefits of automation are obvious, given that data prep takes up 50% of the work in building a first model. You can look at “7 Steps of Data Exploration” for the most common data exploration operations.
Tavish mentioned in his article that the time taken to perform this task has been significantly reduced by advanced machine-learning tools. Since this is our first benchmark model, we do away with feature engineering. Hence, the descriptive analysis is restricted to missing values and big features that are directly visible. In my methodology, you will need 2 minutes to complete this step (assuming 100,000 observations in the data set).
The operations I perform for my first model include:
There are various ways to deal with it. For our first model, we will focus on the smart and quick techniques to build your first effective model (Tavish already discusses these in his article; I am adding a few methods)
With such simple data treatment methods, the time to treat data can be reduced to 3-4 minutes.
I recommend using either GBM or Random Forest, depending on the business problem. These two techniques are extremely effective in creating a benchmark solution. I have seen data scientists often use these two methods as their first model; in some cases, they also act as the final model. This step takes the most time (~4-5 minutes).
There are various methods to validate your model’s performance. I suggest you divide your train data set into Train and Validate (ideally 70:30) and build a model on the 70% Train portion. Then validate it on the remaining 30% and measure performance with an evaluation metric. This takes 1-2 minutes to execute and document.
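As a minimal sketch of the 70:30 validation idea above, here is how the split and scoring could look with scikit-learn. The data is synthetic (generated with `make_classification`), and the choice of AUC as the metric and of `train_test_split` with `stratify` are my assumptions for illustration, not part of the challenge setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled training data (an assumption, not the challenge data)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 30% of the labelled rows for validation; stratify keeps the class ratio similar
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score the held-out rows with AUC, using the predicted probability of class 1
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(len(X_train), len(X_val), round(val_auc, 3))
```

The validation score is only a benchmark to beat in later iterations; the model never sees the held-out 30% during fitting.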
The intent of this article is not to win the competition but to establish a benchmark for ourselves. Let’s look at the Python code to perform the above steps and build your first model with a higher impact.
I assume you have done all the hypothesis generation first and are good with basic data science using Python. I am illustrating this with an example of a data science challenge. Let’s look at the structure:
Import the required libraries, read the train and test data sets, and append both.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc #needed for the validation step
train=pd.read_csv('C:/Users/Analytics Vidhya/Desktop/challenge/Train.csv')
test=pd.read_csv('C:/Users/Analytics Vidhya/Desktop/challenge/Test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
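To see what the append step does, here is a small sketch on two hand-made frames. The frames and the status values ("Gold", "Silver") are made up for illustration; only the column names `REF_NO`, `Account.Status`, and `Type` come from the code above.

```python
import pandas as pd

# Tiny stand-ins for Train.csv and Test.csv; the status values are made up
train = pd.DataFrame({"REF_NO": [1, 2], "Account.Status": ["Gold", "Silver"]})
test = pd.DataFrame({"REF_NO": [3]})  # test data has no target column

train["Type"] = "Train"
test["Type"] = "Test"

# ignore_index=True gives the combined frame a clean 0..n-1 index
full = pd.concat([train, test], axis=0, ignore_index=True)
print(full)

# The flag lets us recover the original partitions after preprocessing
train_back = full[full["Type"] == "Train"].copy()
test_back = full[full["Type"] == "Test"].copy()
```

Note that the target is missing (NaN) for the test rows in the combined frame, which is worth keeping in mind when the target is encoded later.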
The next step of the framework is not required in Python.
The next step is to view the data set’s column names and summary.
fullData.columns # This will show all the column names
fullData.head(10) # Show first 10 records of dataframe
fullData.describe() #You can look at summary of numerical fields by using describe() function
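If you run these lines in a script rather than a notebook or interactive shell, nothing is displayed unless you print the result. A small sketch on a toy frame (the `income` column is hypothetical; `age_band` is borrowed from the challenge columns):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for fullData
df = pd.DataFrame({
    "age_band": ["18-25", "26-35", None, "26-35"],
    "income": [21000.0, np.nan, 35000.0, 42000.0],
})

# In a script the result is only shown if it is printed
print(df.dtypes)
print(df.describe())        # summarises numeric columns only by default

# Missing-value count per column
missing_counts = df.isnull().sum()
print(missing_counts)
```

`describe()` reports the count of non-missing values, so comparing it against the number of rows is a quick way to spot columns that need treatment.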
Identify the a) ID variables, b) Target variables, c) Categorical Variables, d) Numerical Variables, e) Other Variables
ID_col = ['REF_NO']
target_col = ["Account.Status"]
cat_cols = ['children','age_band','status','occupation','occupation_partner','home_status','family_income','self_employed', 'self_employed_partner','year_last_moved','TVarea','post_code','post_area','gender','region']
other_col=['Type'] #Test and Train Data set identifier
#Everything that is not categorical, ID, target or the Train/Test flag is treated as numerical
num_cols= list(set(list(fullData.columns))-set(cat_cols)-set(ID_col)-set(target_col)-set(other_col))
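Listing the columns by hand works for a known data set. As a rough first pass on an unfamiliar one, the dtypes can suggest the split. This is only a sketch under my own assumptions: the `balance` column is hypothetical, and numerically coded categoricals (such as a column of post codes) would still need to be moved by hand.

```python
import pandas as pd

df = pd.DataFrame({
    "REF_NO": [1, 2, 3],
    "gender": ["F", "M", "F"],
    "balance": [10.5, 20.0, 7.25],   # hypothetical numeric column
    "Type": ["Train", "Train", "Test"],
})

excluded = {"REF_NO", "Type"}  # the ID and the train/test flag are not features

# Columns without a numeric dtype are treated as categorical candidates
auto_cat = [c for c in df.select_dtypes(exclude="number").columns if c not in excluded]
# Columns with a numeric dtype are treated as numerical candidates
auto_num = [c for c in df.select_dtypes(include="number").columns if c not in excluded]

print(auto_cat, auto_num)
```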
Identify the variables with missing values and create a flag for those
fullData.isnull().any() #Returns True for each column that has missing values, else False
num_cat_cols = num_cols+cat_cols # Combined numerical and Categorical variables
#Create a new variable for each variable having missing value with VariableName_NA
# and flag missing value with 1 and other with 0
for var in num_cat_cols:
    if fullData[var].isnull().any():
        fullData[var+'_NA']=fullData[var].isnull()*1
The next step is to impute missing values.
#Impute numerical missing values with mean
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean()) #assign the result; inplace=True would return None
#Impute categorical missing values with -9999
fullData[cat_cols] = fullData[cat_cols].fillna(value = -9999)
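The flag-then-impute pattern is easier to follow on a three-row example. This is a sketch on made-up data (the `income` column is hypothetical), and I use the string "-9999" as the sentinel here so it works whether pandas stores the column as objects or as strings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [100.0, np.nan, 300.0],   # hypothetical numeric column
    "gender": ["F", None, "M"],
})
num_cols, cat_cols = ["income"], ["gender"]

# 1) Flag missing values before imputing, otherwise that information is lost
for var in num_cols + cat_cols:
    if df[var].isnull().any():
        df[var + "_NA"] = df[var].isnull().astype(int)

# 2) Numeric: fill with the column mean and assign the result back
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# 3) Categorical: fill with a sentinel that cannot clash with real categories
df[cat_cols] = df[cat_cols].fillna("-9999")
print(df)
```

The order matters: once the values are filled in, `isnull()` can no longer tell you which rows were missing.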
Create label encoders for the categorical variables, split the data set back into train and test, and further split the train data set into Train and Validate.
#create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))
#Target variable is also a categorical so convert it
fullData["Account.Status"] = number.fit_transform(fullData["Account.Status"].astype('str'))
train=fullData[fullData['Type']=='Train'].copy() #copy() avoids SettingWithCopyWarning when adding columns
test=fullData[fullData['Type']=='Test'].copy()
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75 #roughly 75% of rows go to Train
Train, Validate = train[train['is_train']], train[~train['is_train']]
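For readers wondering what the label encoder actually does, here is a small sketch on a toy column. The region values are made up; the point is that scikit-learn needs numbers rather than strings, and `LabelEncoder` simply maps each distinct string to an integer in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column, including the sentinel used for missing values
df = pd.DataFrame({"region": ["North", "South", "-9999", "North", "East"]})

encoder = LabelEncoder()
df["region_code"] = encoder.fit_transform(df["region"].astype(str))

# classes_ holds the original labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original label from a code
original = encoder.inverse_transform([2])[0]
print(original)
```

The codes carry no order that means anything to the business problem. Tree-based models such as Random Forest and GBM usually cope with that, which is one reason this shortcut is acceptable for a first benchmark.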
Pass the imputed and dummy (missing values flags) variables into the modelling process. I am using the random forest to predict the class.
features=list(set(list(fullData.columns))-set(ID_col)-set(target_col)-set(other_col))
x_train = Train[list(features)].values
y_train = Train["Account.Status"].values
x_validate = Validate[list(features)].values
y_validate = Validate["Account.Status"].values
x_test=test[list(features)].values
rf = RandomForestClassifier(n_estimators=1000, random_state=100) #random_state fixes the seed; random.seed() has no effect on scikit-learn
rf.fit(x_train, y_train)
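The imports above include `GradientBoostingClassifier`, although the walkthrough only fits a random forest. As a sketch of how the two could be compared on the same validation split, here is an example on synthetic data. The data, the `n_estimators` values, and the use of AUC are my assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for x_train / x_validate (not the challenge data)
X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=100),
    "gbm": GradientBoostingClassifier(n_estimators=200, random_state=100),
}

# Fit each candidate on the same rows and score it on the same held-out rows
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

print(scores)
```

Both classifiers share the same `fit` / `predict_proba` interface, so swapping one for the other in the code above is a one-line change.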
The next step is to check performance and make predictions.
status = rf.predict_proba(x_validate)
fpr, tpr, _ = roc_curve(y_validate, status[:,1])
roc_auc = auc(fpr, tpr)
print(roc_auc)
final_status = rf.predict_proba(x_test)
test["Account.Status"]=final_status[:,1]
test.to_csv('C:/Users/Analytics Vidhya/Desktop/model_output.csv',columns=['REF_NO','Account.Status'])
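`roc_curve` and `auc` both live in `sklearn.metrics`. As a quick sanity check on a four-row toy example (the labels and scores are made up), the two-step route used above gives the same number as the one-call `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1

# Two-step route: build the ROC curve, then integrate the area under it
fpr, tpr, _ = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)

# One-call route
area_direct = roc_auc_score(y_true, y_score)

print(area_from_curve, area_direct)
```

Both values are 0.75 here: of the four positive-negative pairs, three are ranked correctly. The two-step route is useful when you also want to plot the curve.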
And Submit!
In conclusion, building a predictive model effectively hinges on a strategic approach that prioritizes data exploration, smart data treatment, and rapid model development. Focusing on an initial benchmark model sets a foundation for swift iteration and improvement in future modelling stages. This guide underscores the importance of breaking down complex modelling tasks into manageable steps, ultimately enabling you to build robust, accurate models more efficiently. With the step-by-step Python code provided, you’re well-equipped to implement these methods in your predictive modelling projects, gaining a competitive edge in data science competitions and real-world applications.
Happy modelling, and may your models be both insightful and impactful!
A. A prediction model in Python is a mathematical or statistical algorithm used to make predictions or forecasts based on input data. It utilizes machine learning or statistical techniques to analyze historical data and learn patterns, which can then be used to predict future outcomes or trends. Python provides various libraries and frameworks for building and deploying prediction models efficiently.
A. The three types of prediction are:
1. Classification: Predicting the class or category that an input belongs to based on training data with labelled examples.
2. Regression: Predicting a continuous numerical value as an output to find a relationship between input and target variables.
3. Time series forecasting: Predicting future values based on the patterns and trends observed in historical time series data.
A. A predictive model in Python is a statistical or machine learning algorithm designed to forecast outcomes based on data input. Using libraries like scikit-learn or TensorFlow, predictive models analyze historical data to learn patterns and make future predictions, supporting decisions across various domains.
A. A common example of a predictive model is a customer churn model. Here, the model predicts whether a customer will likely leave based on historical data, including usage, demographics, and past interactions, helping businesses proactively retain customers.
A. The best algorithm depends on the problem type. For structured data, Random Forest or Gradient Boosting works well. For text or image data, deep learning models like neural networks excel. Model selection should consider data size, problem complexity, and interpretability.
Hello I'm completely new, and I'm a bit lost. 1. By example, where I can find the train.csv and test.csv ? 2. This instruction "fullData.describe() #You can look at summary of numerical fields by using describe() function" ought to show me a resume of dataset but I can't see nothing. 3. When I try the code I get an error in line num_cols= list(set(list(fullData.columns))-set(cat_cols)-set(ID_col)-set(target_col)-set(data_col)) because the data_col is not defined. It's an error ? Sorry for my silly question.
Beauuuuuuuutiful! Just what I was looking for; practical application. I can't wait to give it a try. Thank you
Hi Sunil, Thanks for the neat workflow, which I am sure will be helpful to many. But I couldnt get the logic behind encoding the target variable with LabelEncoder as well. How does it help in better prediction? Can you explain the same please? Thanks.
hi,sunil.. Can you tell me where i can download the 'challenge\train.csv' and 'challenge\test.csv' datasets? Thank you.
The line roc_auc = auc(fpr, tpr) is giving me an error when I tried to run the code. I do not see any auc() function being defined. Is it an inbuild function? Can you please explain?