# Framework to build logistic regression model in a rare event population

*Only 531 customers out of a population of 50,431 closed their savings accounts in a year, but the dollar value lost to those closures was more than $5 million.*

The best way to arrest this attrition was to predict each customer's propensity to attrite and then pitch retention offers to the customers identified as high risk. This is a typical case of modeling in a rare event population. Problems of this kind are also very common in health care analytics.

Such an analysis poses two challenges:

- Accurate prediction is difficult because of small-sample bias.
- The prediction needs to be extremely accurate to support an implementable strategy, because a high number of false positives unnecessarily burdens the retention budget.

A number of statistical papers address this specific problem. This article collects the best practices and lays out a step-by-step process for building a logistic regression model in a rare event population.

## Why not simply make a logistic regression model on the population?

The basic problem is that maximum likelihood estimation of the logistic model is well known to suffer from small-sample bias, and the degree of bias depends strongly on the number of cases in the less frequent of the two categories. Try ranking the following samples by the degree of bias you expect in each:

A. 20 events in a sample size of 1000 (Response rate : 2%)

B. 180 events in a sample size of 10000 (Response rate : 1.8%)

C. 990 events in a sample size of 1000 (Response rate : 99%)

Try not to look at the answer below before you have your own ready.

The correct answer is C > A > B. C suffers most from small-sample bias. Confused? Did we not say that this problem arises when events are too few? The problem is not specifically the rarity of events, but rather a small number of cases in the rarer of the two outcomes. In C the rarer outcome is the non-event, with only 10 cases (1000 - 990). Why A > B? Simply because of the sample size. Even though the response rate in B is lower than in A, B's larger sample leaves it with 180 cases in the rarer outcome against only 20 in A, so A struggles with the problem more than B. Hence, for a similar response rate, the smaller the sample, the higher the risk of small-sample bias.
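The ranking can be checked with a few lines of Python. This is only a sketch of the reasoning above: it counts the cases in the rarer outcome for each sample and sorts by that count, treating a smaller count as a proxy for more bias. The helper name `rarer_count` is made up for illustration.

```python
# Each sample is (events, sample_size), taken from options A, B and C above.
samples = {"A": (20, 1000), "B": (180, 10000), "C": (990, 1000)}


def rarer_count(events, size):
    """Number of cases in the less frequent of the two outcomes."""
    return min(events, size - events)


rarer = {name: rarer_count(e, n) for name, (e, n) in samples.items()}

# Fewer cases in the rarer category means more small-sample bias,
# so sorting ascending by that count puts the most affected sample first.
ranking = sorted(rarer, key=rarer.get)

print(rarer)    # {'A': 20, 'B': 180, 'C': 10}
print(ranking)  # ['C', 'A', 'B']
```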

## What is the solution in such problems?

The solution takes a few more steps than a normal logistic regression model. We first draw a biased sample that increases the proportion of events, then run logistic regression on that sample. Once we have the final logit equation, we transform it to fit the entire population.

## Case study

Let's consider the case at hand and walk through the process step by step. We have a population of 50,431 customers, of which 531 attrite in 12 months. We need to predict the probability of attrition while minimizing false positives.

**Step 1: Select a biased sample**

The total number of non-attritors in the population is 49,900. We plan to take a sample of 1,000 customers. As a rule of thumb, we make responders 25% of the sample. Hence, we select 250 of the 531 attriting customers, and the remaining 750 come from the 49,900 non-attritors. This sample of 1,000 customers is the biased sample we will use for our analysis.
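A minimal sketch of this sampling step in Python, using only the standard library. The population here is a stand-in list of `(customer_id, attrited)` pairs built from the counts above, and the function name `draw_biased_sample` is hypothetical. The original data set is not available, so only the counts are real.

```python
import random

N_EVENTS, N_NON_EVENTS = 531, 49_900   # counts from the case study
SAMPLE_SIZE, EVENT_SHARE = 1000, 0.25  # 25% rule of thumb from the text

# Hypothetical population: (customer_id, attrited) pairs, 1 = closed the account.
population = [(i, 1) for i in range(N_EVENTS)] + [
    (i, 0) for i in range(N_EVENTS, N_EVENTS + N_NON_EVENTS)
]


def draw_biased_sample(rows, sample_size, event_share, seed=0):
    """Sample events and non-events separately, without replacement."""
    rng = random.Random(seed)
    events = [r for r in rows if r[1] == 1]
    non_events = [r for r in rows if r[1] == 0]
    n_events = round(sample_size * event_share)
    picked = rng.sample(events, n_events)
    picked += rng.sample(non_events, sample_size - n_events)
    rng.shuffle(picked)
    return picked


biased = draw_biased_sample(population, SAMPLE_SIZE, EVENT_SHARE)
n_attritors = sum(flag for _, flag in biased)
print(len(biased), n_attritors)  # 1000 250
```

Sampling the two groups separately is what fixes the event share at exactly 25%. A single random draw from the whole population would reproduce the roughly 1% attrition rate instead.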

**Step 2: Develop the regression model**

We now build a logistic regression model on the biased sample. We make sure that all the assumptions of logistic regression are met and that the model gives a comfortable lift, because the lift tends to decrease after the transformations applied in the later steps.
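The article does not give the fitted equation or its variables, so the following is only a sketch of what fitting on the biased sample involves. It assumes a single made-up predictor and synthetic data with 250 attritors and 750 non-attritors, and fits the model by Newton-Raphson using the standard library alone. In practice a statistics package would handle many predictors and the usual diagnostics.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, n_iter=25):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by Newton-Raphson on the log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to b0
            g1 += (y - p) * x      # gradient with respect to b1
            h00 += w               # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


rng = random.Random(7)
# Hypothetical biased sample: one invented predictor, higher on average for attritors.
xs = [rng.gauss(1.0, 1.0) for _ in range(250)] + [rng.gauss(0.0, 1.0) for _ in range(750)]
ys = [1] * 250 + [0] * 750

b0, b1 = fit_logistic(xs, ys)
fitted = [sigmoid(b0 + b1 * x) for x in xs]

# With an intercept in the model, the fitted probabilities sum to the number
# of events in the sample, so the model reproduces the biased 25% rate.
print(b0, b1, sum(fitted))
```

That last property is the reason the equation cannot be applied to the population as it stands: it is calibrated to a 25% attrition rate, not to the roughly 1% rate of the full population.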

**Step 3: Overlay the equation on the population**

Using the equation found in step 2, score the overall population, sort it by predicted attrition, split it into deciles, and count the attritors in each decile. In the table below, -Log odds (Predicted) comes directly from the regression equation. Using this function, one can find the predicted attrition for each decile.
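A sketch of how such a decile table could be built, assuming every customer already has a predicted probability from the Step 2 equation. The scored population below is synthetic and the function names are hypothetical. Two choices here are assumptions rather than the article's stated method: natural logs are used, and the predicted log odds of a decile is computed from its mean predicted probability.

```python
import math
import random


def neg_log_odds(p):
    """-log(p / (1 - p)) with natural logs; undefined for p = 0 or p = 1."""
    return -math.log(p / (1.0 - p))


def decile_table(predicted, actual, n_bins=10):
    """Sort by predicted probability (highest first) and summarise each decile."""
    rows = sorted(zip(predicted, actual), key=lambda r: r[0], reverse=True)
    n = len(rows)
    table = []
    for d in range(n_bins):
        chunk = rows[d * n // n_bins:(d + 1) * n // n_bins]
        events = sum(a for _, a in chunk)
        table.append({
            "decile": d + 1,
            "customers": len(chunk),
            "attritors": events,
            "actual_rate": events / len(chunk),
            "predicted_rate": sum(p for p, _ in chunk) / len(chunk),
        })
    return table


rng = random.Random(42)
# Hypothetical scored population in which the sample-based model overstates risk.
scores = [rng.gauss(0.0, 1.0) for _ in range(50_431)]
predicted = [1 / (1 + math.exp(-(-1.1 + s))) for s in scores]
actual = [int(rng.random() < 1 / (1 + math.exp(-(-5.0 + s)))) for s in scores]

table = decile_table(predicted, actual)
for row in table:
    pred_nlo = neg_log_odds(row["predicted_rate"])
    # A decile with no attritors has no finite log odds, so report None for it.
    act_nlo = neg_log_odds(row["actual_rate"]) if 0 < row["actual_rate"] < 1 else None
    print(row["decile"], row["customers"], row["attritors"], round(pred_nlo, 2), act_nlo)
```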

**Step 4: Solve for intercept and slope transformation**

Using the actual and the predicted decile values of the log odds, we find the slope and the intercept required to transform the equation of the sample into the equation of the population. This equation is given by:

{- Log odds (actual)} = slope * {-Log odds(predicted)} + Intercept

Find the slope and intercept using the 10 data points, one for each decile.

In this case the slope is 0.63 and the intercept is 1.66.
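The fit and the resulting transformation can be sketched as follows. The decile values are hypothetical, since the article's table is not reproduced here: the "actual" values are constructed to lie exactly on the reported line, purely so that the least-squares fit can be checked. Natural logs are assumed by default, and the `base` parameter (an illustrative addition) allows another base as long as the same one was used to build the decile table.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def transformed_probability(pred_neg_log_odds, slope, intercept, base=math.e):
    """Apply the adjustment, then invert -log odds back to a probability."""
    t = slope * pred_neg_log_odds + intercept
    return 1.0 / (1.0 + base ** t)


# Hypothetical -log odds (predicted) for the 10 deciles, highest risk first.
pred = [0.2, 0.6, 0.9, 1.1, 1.3, 1.5, 1.8, 2.1, 2.5, 3.0]
# Hypothetical -log odds (actual), built to lie exactly on the reported line.
act = [0.63 * x + 1.66 for x in pred]

slope, intercept = fit_line(pred, act)
print(round(slope, 2), round(intercept, 2))  # 0.63 1.66

# Transformed attrition probability for each decile, on the population scale.
adjusted = [transformed_probability(x, slope, intercept) for x in pred]
```

The same two numbers are then applied to every customer: compute -log odds from the Step 2 equation, multiply by the slope, add the intercept, and convert back to a probability.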

As seen from the above figure, the transformed logit curve lies much closer to the actual curve in each decile than the predicted curve does.

**Step 5: Validate the equation on an out-of-time sample**

Once we reach a final equation for the logit function, we validate it on out-of-time samples. For the case at hand, we take a different cohort and compile the lift chart. If the model holds out of time as well, we are good to go.
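One way to compile the numbers behind a lift chart is a cumulative lift per decile, sketched below. The tiny out-of-time cohort is invented for illustration, and `cumulative_lift` is a hypothetical helper, not something from the article.

```python
def cumulative_lift(predicted, actual, n_bins=10):
    """Attrition rate in the top d deciles divided by the overall rate, for d = 1..n_bins."""
    rows = sorted(zip(predicted, actual), key=lambda r: r[0], reverse=True)
    n = len(rows)
    overall_rate = sum(actual) / n
    lifts = []
    for d in range(1, n_bins + 1):
        top = rows[: d * n // n_bins]
        rate = sum(a for _, a in top) / len(top)
        lifts.append(rate / overall_rate)
    return lifts


# Hypothetical out-of-time cohort of 20 customers, already scored by the final equation.
predicted = [(20 - i) / 21 for i in range(20)]   # descending scores
actual = [1, 1, 0, 1] + [0] * 16                  # 3 attritors in total

lifts = cumulative_lift(predicted, actual)
print([round(x, 2) for x in lifts])
```

The cumulative lift always ends at 1.0 in the last decile. The model holds out of time if the lift in the top deciles stays close to what was seen on the development data.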

## End Notes

Did you find the article useful? Share with us any other techniques you incorporate while solving a rare event problem. Do let us know your thoughts about this article in the box below.

## 21 thoughts on "Framework to build logistic regression model in a rare event population"

## dimalvov says: January 28, 2014 at 8:10 am

Hello Tavish, it is a nice post. I usually prefer to oversample the cases with a probability that equalizes the proportions of cases and controls, though.

## saras says: March 02, 2014 at 7:33 pm

Hello, you say "Using the equation found in step 2," but there is no equation given. Can you give this equation please and explain how you get the values given in the table below part 3? I am working on a real-life problem related to rare events in a children's hospital. I would like to use your method to see if I can predict readmission. Thanks. saras

## Tavish Srivastava says: March 02, 2014 at 9:03 pm

Hi Saras, the equation found in step 2 is the equation you get after running the regression model on the biased sample. The table in step 3 can be generated directly, much like the one you make when finding the KS of a model. Do let us know if this is still not clear. Tavish

## Mark says: March 07, 2014 at 8:04 pm

Very good post. Thanks! I am not sure how you calculated the transformed value and the slope/intercept. Could you explain more? Mark

## Tavish says: March 07, 2014 at 8:13 pm

Mark, what we are trying to do here is transform the fitted logistic equation using the actual unbiased sample. The first thing you need to do is sort your data points in order of predicted attrition. Now decile your population and find the logit for actual attrition and predicted attrition. Then, using the equation "{- Log odds (actual)} = slope * {-Log odds(predicted)} + Intercept", find the slope and intercept adjustment factors. Hope this helps. Tavish

## Vijay says: March 14, 2014 at 8:04 am

Hi Tavish, I would appreciate it if you could also post the data so that we can practice on the same data and match the output mentioned in the post. That way we can be sure that we are learning in the right direction. Thanks, Vijay

## Tavish says: March 17, 2014 at 4:19 am

Vijay, thanks for the suggestion. We will soon be releasing some relevant training which will have all the data sets. This particular data set is confidential, however, but we will share a dummy data set with the training. Do let me know in case you have any questions on the methodology. Tavish

## Tavish says: March 17, 2014 at 4:29 am

Oversampling is a must. It would be really helpful if you could share how you translate the equation found on the oversample to the actual population. Tavish

## Sumeet Kapoor says: May 02, 2014 at 1:46 pm

Hi Tavish, my question is related to concordance in logistic regression. Can it be 100%? If yes, what does it signify?

## Dataminer says: May 02, 2014 at 10:57 pm

I am not clear as to what we mean by out-of-time samples. Does this just mean test data from a subsequent time period? Also, do you get good results using this technique compared to using, say, random forests or an ensemble? Also consider the example of employee churn, where the actual churn data is scarce. Does it make sense to take churn data from previous years, which may not fall in the period under study? That is, if I want to study the problem using 2 years of data and the data is insufficient, can I take only the churned cases from a period before those 2 years, so as to avoid the class imbalance problem in logistic regression?

## Eleni says: June 10, 2014 at 1:48 pm

Hi Tavish, very useful post, indeed. It is mentioned that 'As a thumb rule, we select 25% of the sample size as the responders'. Would you mind explaining why you selected a 25-75 split rather than a 50-50 split? So far I have always created a 50-50 split (using all the responders and taking a random sample of non-responders of equal size), and would appreciate your thoughts on this. Thanks, Eleni

## Vishal says: July 28, 2014 at 1:00 am

I would have built the model using *all* 531 events. My rule of thumb, when over-sampling due to a rare event, is to do a 50/50 split. In your example, 531 events + 531 non-events would have been my choice for the modeling dataset. When the event is so rare, it becomes imperative that we use as many events as possible (i.e., all).

## Alexey says: July 30, 2014 at 4:23 pm

Nice post, concise and to the point. Nevertheless, it would be nice to show the original problem on your example. Could you add a plot of predictions when using normal logistic regression? Thanks!

## rakshit.gadgi says: September 17, 2014 at 7:40 am

Hi, I am working on a propensity scoring where I have taken a sample of 140k HHs from my population of around 60 mil.

1) On this sample I have a 2% response rate, and I have taken a biased sample dataset with a 50:50 ratio (that is, around 7000 HHs), which I used as my training dataset.
2) Now, for validation on the unbiased sample of 140k, should I use the intercept transformation formula to get the performance of the model on the unbiased sample?
3) And how do I convert my results from the sample of 140k to the entire population?

It would be nice if someone could help me out with this.

## Neelima says: October 30, 2014 at 4:39 am

Hi All, excellent post. We followed almost exactly the same process in one of our projects for modeling a rare event with a response rate of 0.0025%! I have one small suggestion here. In step 2, since the logistic model is built only on a sample of the non-events, it is very important to ensure beta stability so that the betas can be generalized to the population. After we build the model on one over-sampled dataset, we can use the same selected variables and build models on multiple over-sampled datasets (by choosing different random sub-samples of non-events, 100 or so such sub-samples). Then we can check whether the sign of each beta estimate is the same in all 100 samples, and also look at the deviation in the magnitude of the beta estimate. Ideally, the range of these beta estimates across all the 100-odd samples should be very small. Once we confirm this, we can continue with steps 4 and 5 mentioned above (correcting for the intercept after scoring the entire population). We can choose different sets of non-events for development and validation so that we don't over-fit our models. Any thoughts? Thanks, Neelima

## Chandra says: January 31, 2015 at 1:54 am

Hi Tavish, excellent article on strategies to process rare event data. Can you provide some insight into how to select the 750 'non-event' records, assuming a sample size of 1000? Is there a robust criterion for selecting the non-events from the huge population? I plan to use all the 'event' data. Appreciate your response.

## Balaji Balagangadharan says: August 07, 2015 at 9:38 am

Hi Tavish, I have a similar problem where I had to deal with a rare event scenario for logistic regression. Your post was really useful and informative. Can you please explain the step where you find the adjustment for slope and intercept? A sample calculation would really help me. Thanks, Balaji.B

## Saeed says: October 09, 2015 at 8:36 am

Hi Tavish, thanks for your nice guide. However, it would be more useful if you shared a data set (even a dummy one) and the code for each step. Best, Saeed

## vidhi says: November 10, 2015 at 11:09 pm

I don't understand how we calculate the actual log odds. I tried this: log((21/5043)/(1-(21/5043))). But this doesn't seem to work. Can someone please help?

## Hessian says: December 16, 2015 at 12:52 am

@vidhi, he's using log_10 instead of log_e to get the actual logit.

## Harneet says: October 26, 2016 at 12:42 pm

Hi Tavish, I have understood everything up to the point of calculating the slope and intercept, but I am unable to understand how to calculate Transformed and where to use it. Can you please elaborate? Regards, Harneet.