Perfect way to build a Predictive Model in less than 10 minutes

Tavish Srivastava 26 Jun, 2020

Overview

  • Hackathons involve building predictive models in a short time span
  • Data preprocessing takes up the largest share of time when building a model
  • Other steps involve descriptive analysis, data modelling and evaluating the model’s performance

 

 

Introduction

In the last few months, we have started conducting data science hackathons. These hackathons are contests with a well-defined data problem, which has to be solved in a short time frame. They typically last anywhere between 2 and 7 days.

If month-long competitions on Kaggle are like marathons, then these hackathons are the shorter format of the game: a 100-metre sprint. They are high-energy events where data scientists bring in a lot of energy, the leaderboard changes almost every hour, and the speed of solving the data science problem matters a lot more than in Kaggle competitions.


One of the best tips I can give to data scientists participating in these hackathons (or even in longer competitions) is to quickly build the first solution and submit it. The first few submissions should be really quick. I have created modules in Python and R which take in tabular data and the name of the target variable and BOOM! I have my first model in less than 10 minutes (assuming the data has more than 100,000 observations). For smaller data sets, this can be even faster. The reason for submitting this super-fast solution is to create a benchmark for yourself, which you then need to improve on. I will talk about my methodology in this article.

 

Breaking Down the process of Predictive Modeling

To understand the strategic areas, let’s first break down the process of predictive analysis into its essential components. Broadly, it can be divided into 4 parts, each of which demands a certain share of the total time. Let’s evaluate these parts (with the time taken):

  1. Descriptive analysis on the Data – 50% time
  2. Data treatment (Missing value and outlier fixing) – 40% time
  3. Data Modelling – 4% time
  4. Estimation of performance – 6% time

Note: The percentages are based on a sample of 40 competitions in which I have participated in the past (rounded off).

Now we know where we need to cut down time. Let’s go step by step through the process (with time estimates):

1. Descriptive Analysis : When I started my career in analytics, we used to primarily build models based on Logistic Regression and Decision Trees. Most of the algorithms we used were greedy algorithms, which can subset the number of features I need to focus on.

With advanced machine learning tools joining the race, the time taken to perform this task can be significantly reduced. For your initial analysis, you probably need not do any kind of feature engineering. Hence, the time you need for descriptive analysis is restricted to finding missing values and the big features that are directly visible. In my methodology, you will need 2 minutes to complete this step (I assume data with 100,000 observations).
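A minimal sketch of such a quick check in base R, assuming a small made-up data frame named `df` (its columns and values are illustrative, not the competition data): count the missing values per column and glance at the summary.

```r
# Toy data frame standing in for the competition data (hypothetical values)
df <- data.frame(
  Monthly_Income = c(1000, NA, 3000, 2500, NA),
  Gender = factor(c("M", "F", NA, "F", "M")),
  Disbursed = c(0, 1, 0, 0, 1)
)

# Number and share of missing values in each column
missing_count <- colSums(is.na(df))
missing_share <- round(missing_count / nrow(df), 2)
print(data.frame(missing_count, missing_share))

# Quick look at ranges and factor levels
print(summary(df))
```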

2. Data Treatment : Since this is considered to be the most time-consuming step, we need to find smart techniques to get through this phase quickly. Here are two simple tricks which you can implement:

  • Create dummy flags for missing value(s): In general, I have found that missing values in a variable sometimes carry a good amount of information. For instance, if you are analyzing clickstream data, you probably won’t have a lot of values in specific variables corresponding to mobile usage.
  • Impute missing values with the mean or any other simple method: I have found that ‘mean’ works just fine for the first iteration. Only in cases where there is an obvious trend coming from the descriptive analysis do you probably need a more intelligent method.

With such simple methods of data treatment, you can reduce the time to treat data to 3-4 minutes.
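The two tricks can be combined in one pass over the data. The sketch below is an illustration under stated assumptions, not the author's module: `flag_and_impute` is a hypothetical helper name, and the toy data frame is made up.

```r
# Add a 0/1 missing flag and mean-impute every numeric column that has NAs.
# `flag_and_impute` is a hypothetical helper name, not the author's function.
flag_and_impute <- function(data) {
  for (col in names(data)) {
    x <- data[[col]]
    if (is.numeric(x) && anyNA(x)) {
      # Flag column, e.g. income_missing, is 1 where the value was missing
      data[[paste0(col, "_missing")]] <- as.integer(is.na(x))
      # Replace missing values with the mean of the observed values
      x[is.na(x)] <- mean(x, na.rm = TRUE)
      data[[col]] <- x
    }
  }
  data
}

toy <- data.frame(income = c(10, NA, 30), age = c(25, 40, 31))
treated <- flag_and_impute(toy)
print(treated)
```

In practice you would probably apply this only to predictor columns, leaving out the ID and the target, since the target is missing for test rows by design.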

3. Data Modelling : I have found GBM to be extremely effective for cases with 100,000 observations. For bigger data, you can consider running a Random Forest. This step will take the maximum amount of time (~4-5 minutes).

4. Estimation of Performance : I find k-fold cross-validation with k=7 highly effective for taking my initial bet. This finally takes 1-2 minutes to execute and document.

The reason to build this model is not to win the competition, but to establish a benchmark for ourselves. Let me take a deeper dive into my algorithm. I have also included a few snippets of my code in this article.

 

Let’s start putting this into action

I will not include my entire function, to give you space to innovate. Here is a skeleton of my algorithm (in R):

Step 1 : Append the train and test data sets together
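One possible way to do Step 1 in base R is sketched below. The data frames `train_df` and `test_df` and their values are made up for illustration. The `train` marker column mirrors the one that appears in the column names of the combined file.

```r
# Hypothetical train and test sets; the test set has no target column
train_df <- data.frame(ID = 1:3,
                       Monthly_Income = c(1000, 2000, NA),
                       Disbursed = c(0, 1, 0))
test_df <- data.frame(ID = 4:5,
                      Monthly_Income = c(1500, NA))

train_df$train <- 1        # marks rows that came from the training file
test_df$train <- 0
test_df$Disbursed <- NA    # target is unknown for test rows

# Align the column order, then stack the two sets
combined <- rbind(train_df, test_df[, names(train_df)])
print(combined)
```

Stacking the two sets means every treatment step (flags, imputation, recoding) is applied identically to train and test rows.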

Step 2 : Read the data set into memory

setwd("C:\\Users\\Tavish\\Desktop\\Kagg\\AV")
complete <- read.csv("complete_data.csv", stringsAsFactors = TRUE)

Step 3: View the column names/summary of the dataset

colnames(complete)
[1] "ID" "Gender" "City"  "Monthly_Income" "Disbursed" "train"

Step 4: Identify the a) Numeric variables b) ID variables c) Factor variables d) Target variable
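Step 4 can be done roughly as follows. This is a sketch on a toy data frame that reuses the column names shown in Step 3 with made-up values. It assumes that `ID` and `train` are the identifier columns and `Disbursed` is the target, all named by hand.

```r
# Toy data with the column names from Step 3 (values are made up)
complete_toy <- data.frame(
  ID = 1:4,
  Gender = factor(c("M", "F", "F", NA)),
  City = factor(c("Delhi", "Mumbai", NA, "Delhi")),
  Monthly_Income = c(1000, NA, 3000, 2500),
  Disbursed = c(0, 1, 0, NA),
  train = c(1, 1, 1, 0)
)

id_vars <- c("ID", "train")    # identifiers and the train/test marker
target_var <- "Disbursed"      # target column
other_vars <- setdiff(names(complete_toy), c(id_vars, target_var))

# Split the remaining columns by type
numeric_vars <- other_vars[sapply(complete_toy[other_vars], is.numeric)]
factor_vars <- other_vars[sapply(complete_toy[other_vars], is.factor)]

print(numeric_vars)
print(factor_vars)
```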

Step 5 : Create flags for missing values

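missing_val_var <- function(data, variable, new_var_name) {
  # variable and new_var_name are column names passed as strings;
  # the flag is 1 where the column is missing and 0 otherwise
  data[[new_var_name]] <- ifelse(is.na(data[[variable]]), 1, 0)
  return(data[[new_var_name]])
}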

Step 6 : Impute Numeric Missing values

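numeric_impute <- function(data, variable) {
  # variable is a column name passed as a string
  # Mean of the observed (non-missing) values
  mean1 <- mean(data[[variable]], na.rm = TRUE)
  data[[variable]] <- ifelse(is.na(data[[variable]]), mean1, data[[variable]])
  # Return the imputed column
  return(data[[variable]])
}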

Similarly, impute categorical variables so that all missing values are coded as a single value, say “Null”.
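A minimal sketch of the categorical case, assuming the column is a factor or character vector. `categorical_impute` is a hypothetical helper name used only for illustration.

```r
# Recode missing values of a factor (or character) column as the level "Null".
# `categorical_impute` is a hypothetical helper name used for illustration.
categorical_impute <- function(data, variable) {
  x <- as.character(data[[variable]])
  x[is.na(x)] <- "Null"
  data[[variable]] <- factor(x)
  data
}

toy_city <- data.frame(City = factor(c("Delhi", NA, "Mumbai", NA)))
toy_city <- categorical_impute(toy_city, "City")
print(table(toy_city$City))
```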

Step 7 : Pass the imputed variable into the modelling process

#Challenge: Try to Integrate a K-fold methodology in this step

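create_model <- function(trainData, target) {
  set.seed(120)
  # target is the target column's name as a string, e.g. "Disbursed" gives Disbursed ~ .
  model_formula <- as.formula(paste(target, "~ ."))
  # Logistic regression on all remaining columns of trainData
  myglm <- glm(model_formula, data = trainData, family = "binomial")
  return(myglm)
}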
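One way to take up the k-fold challenge is sketched below. It is an illustration on simulated data, not the author's code. The helper names `auc_rank` and `kfold_auc` are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) formula so that the sketch needs no extra packages.

```r
# Rank-based (Mann-Whitney) AUC; actual must be coded 0/1
auc_rank <- function(actual, predicted) {
  r <- rank(predicted)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# k-fold cross-validated AUC for a logistic regression
kfold_auc <- function(data, target, k = 7, seed = 120) {
  set.seed(seed)
  fold <- sample(rep(1:k, length.out = nrow(data)))  # random fold labels
  form <- as.formula(paste(target, "~ ."))
  sapply(1:k, function(i) {
    # Fit on all folds except i, then score the held-out fold i
    fit <- glm(form, data = data[fold != i, ], family = "binomial")
    pred <- predict(fit, newdata = data[fold == i, ], type = "response")
    auc_rank(data[[target]][fold == i], pred)
  })
}

# Simulated data (made up): one informative predictor, one noise predictor
set.seed(1)
n <- 700
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(1.5 * sim$x1))

fold_auc <- kfold_auc(sim, "y", k = 7)
print(round(fold_auc, 3))
print(mean(fold_auc))
```

The mean of the fold AUCs gives a less optimistic benchmark than an AUC computed on the same rows the model was trained on.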

Step 8 : Make predictions

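# Assumes trainData/testData are the train == 1 / train == 0 rows of complete
myglm <- create_model(trainData, "Disbursed")
score <- predict(myglm, newdata = testData, type = "response")
score_train <- predict(myglm, newdata = trainData, type = "response")

Step 9 : Check performance

library(pROC)  # assumed source of auc(); the article does not name the package
auc(trainData$Disbursed, score_train)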

And Submit!

 

End Notes

Hopefully, this article has given you enough motivation to write your own 10-minute scoring code. Most of the masters on Kaggle and the best data scientists in our hackathons have such code ready and fire their first submission before making a detailed analysis. Once they have a benchmark estimate, they start improving on it. Share your complete code in the comment box below.

Did you find this article helpful? Please share your opinions / thoughts in the comments section below.

If you like what you just read and want to continue your analytics learning, subscribe to our emails, follow us on Twitter or like our Facebook page.


Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate and a passionate data-science professional with 8+ years of diverse experience in markets including the US, India and Singapore, domains including Digital Acquisitions, Customer Servicing and Customer Management, and industry including Retail Banking, Credit Cards and Insurance. He is fascinated by the idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even movie related to this idea.


Responses From Readers


amr 18 Sep, 2015

That's a very well put list :-) Thanks Tavish.

deepakntural 18 Sep, 2015

This is a fantastic way to kick-off the model building. You are right that we should have such model handy while doing model building. If from the very beginning we start thinking about optimization of the model then this will take lots of time to develop the model. A step by step approach always helps to break the problem and get a reliable and quick outcome. I visit Analytics Vidhya almost daily and really like the articles published in the forum. You guys are doing fantastic work. My best wishes with the forum. Keep Learning, Keep Growing. Regards Deepak Sharma

Shivi 18 Sep, 2015

Hi Tavish, Nice article. But here you did not mention who did you treat multicollinearity & non normally distributed data. Thanks, Shivi

Yogesh T 21 Sep, 2015

Hi Tavish, I am currently working in IT, I am thinking to shift my career into analytics,Is it right decision. I am thinking to start with SAS and R training. What's your suggestion on this field. Regards Yogesh T

Hossein 21 Sep, 2015

Dear Tavish, Like all of the time it was perfect

skappal7 22 Sep, 2015

Thanks Tavish for sharing this awesomely crafted walk-through, I just wanted to seek your opinion about using Rattle to do modelling as I personally find it very easy and flexible and it doesn't require me to be a coder. Your thoughts will be highly appreciated.

Eric 10 Oct, 2015

This is a great article and I would really like to step through it using the same data that you use. Can you make the file complete_data.cv available or tell me where I can find it?

FrankSauvage 12 Oct, 2015

Thanks a lot for this inspiring article!

Didier 24 Nov, 2015

Hi, I am new to data analysis and would like to run your complete code to see the final result. Can you please share it with me, as well as the dataset? Cheers

ravi.adannavar@gmail.com 04 Feb, 2016

Hi Tavish, Could you please share the data file/atleast edited data file to practice on this. Thank you, Ravi A

Dinesh 10 Mar, 2016

Dear all, Am the beginner of creating modeling in a company, can anyone please help me with the complete process for creating a modeling for any data. Please explain about, 1. Data cleaning 2. SAS Codes 3. Model preparation 4. Algorithm used for model preparation. Please send the details to my Email, Thanks in advance. Dinesh

Siva 04 Jul, 2016

Hi Tavish, Its very nice article .Thank you . seems we need to do dimensionality reduction process before applying the Model.Could you please explain the Dimensionality reduction techniques( For Variable selection). Thanks in advance. Siva Vulli

namrata 24 Aug, 2016

can u suggest me how to handle missing values in different cases.

Kafi Khan 16 May, 2017

Hi I am new in R programming & Data Science, could you please tell me how to do step 4 ? "Step 4: Identify the a) Numeric variable b) ID variables c) Factor Variables d) Target variables" Thanks Advance Kafi Khan

annucoolz@gmail.com 20 May, 2017

Hello Tavish, I am new to data analystics, I want to know if my dependent variable has missing values, then will it be good to impute them (by mean or by other value) or should I delete them (entire record)....???

Nipun 14 Jun, 2017

Hi, very interesting article...but thing is missing is the kind of data you have collected... Sometimes, K-fold and Logistic Regression will not work (in some specific data set and research objective). Thanks Regards Nipun Nikunj

Nekhita Sharma 26 Feb, 2018

Hi Tavish, Thanks for sharing such useful information with us. I also want to lean about data modeling, can you help me on that.

Satish M 10 Apr, 2018

Hi Tavish, Really nice article , I appreciate your efforts . It will be good if you add something more on optimization side. Means after checking the performance what should be the focus area for example add/removing variable to reduce noise/error component and what should be the approach to improve performance. Thanks Satish Mishra