
Top Highlights from AV’s Record-Breaking Weekend Online Hackathon

Introduction

What an eventful and intense hackathon weekend! Our recently concluded McKinsey Analytics Online Hackathon saw an overwhelming number of people participate and compete to win a NIPS conference ticket + $5,000, as well as interview opportunities with McKinsey Analytics. The entries kept rolling in as the weekend progressed, leading to a thrilling finish at the deadline.

The hackathon saw over 5,000 data scientists and aspiring data science enthusiasts vying for a spot at the top of the leaderboard. As parameters were tweaked, and models were finalized, the number of submissions soared to over 11,000! A huge shout out to all who participated and made this a wonderfully competitive hackathon.

Check out a few of the comments from the participants:

“Thanks a lot Analytics Vidhya team for this interesting hackathon. I learnt a lot during these 3 days. I am really impressed by the promptness of your whole team. They were responding to our queries even during off hours and weekend, unlike some other platforms where queries are not answered even during working hours! This level of engagement has motivated me to prioritize AnalyticsVidhya’s Hackathons over any other.” – Deepak Rawat
“Thanks @kunal and Complete AV team for such amazing competition. I appreciate your efforts and dedication. Hats off to you all.” – MohitK
“Thanks AV team for organizing and driving another great hackathon!” – b.e.s (top solution)

You can check out the leaderboard standings here.

Those who were not able to participate missed out on one of the most exciting hackathons Analytics Vidhya has ever hosted. But don’t worry, we have plenty in store for you ahead. Head over to our DataHack platform and check out upcoming events and practice problems.

 

Problem Statement & Evaluation Criteria

The problem statement posed by McKinsey Analytics was a fascinating one. The participants were tasked with building a model for an insurance company to predict the propensity to pay the renewal premium, and with designing an incentive plan for its agents that maximizes the net revenue (i.e. renewals minus the incentives given to collect those renewals) from the policies post their issuance.

The participants were provided information about past transactions from the policy holders along with their demographics. Given this information, the challenge was to predict the propensity of renewal collection and create an incentive plan for agents (at the policy level) to maximize the net revenue from these policies. A small worked example of the net revenue calculation follows the evaluation criteria below.

The final solutions were evaluated on two criteria:

  1. The base probability of receiving a premium on a policy without considering any incentive
  2. The monthly incentives participants have provided on each policy to maximize the net revenue
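
For intuition, here is a minimal sketch of how the net revenue on a single policy can be computed. The exact incentive-effect curve below is an assumption inferred from the top solutions shared later in this post, not a quote from the official problem statement.

import numpy as np

def net_revenue(prob, premium, incentive):
    # Per-policy net revenue: boosted renewal probability times premium, minus the incentive.
    # The effect term (up to roughly a 20% relative lift in renewal probability) mirrors the
    # formula used in the top solutions below and is assumed, not official.
    effect = (1 - np.exp(-2 * (1 - np.exp(-incentive / 400)))) / 5
    return prob * (1 + effect) * premium - incentive

# Example: a 70% benchmark renewal probability, a premium of 10,000 and a 500 incentive
print(net_revenue(0.7, 10000, 500))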

 

A Few Top Solutions for Cracking this Hackathon

There were some amazing approaches that went into this hackathon. We have shared a few of them in this section to help you understand how to structure and compete in these hackathons.

 

AV username – b.e.s

He managed to secure the first position on the public leaderboard. In a break from the usual hackathon frameworks, there was almost no feature engineering involved in his solution. The features he found useful were (a rough pandas sketch follows the list):
  • Total number of late payments
  • Overall number of payments: late payments + no_of_premiums_paid
  • Weighted number of payments: late_3_6 * 3 + late_6_12 * 6 + late_12 * 12 + no_of_premiums_paid
  • Age in years
  • Target encoding for sourcing channel
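
As an illustration, these features could be built with pandas along the following lines. Only the late_3_6, late_6_12, late_12 and no_of_premiums_paid names come from the write-up above; the file name and the age_in_days, sourcing_channel and renewal columns are assumptions.

import pandas as pd

df = pd.read_csv("train.csv")   # hypothetical training file

# Total and weighted payment counts, as listed above
df["total_late"] = df["late_3_6"] + df["late_6_12"] + df["late_12"]
df["total_payments"] = df["total_late"] + df["no_of_premiums_paid"]
df["weighted_payments"] = (df["late_3_6"] * 3 + df["late_6_12"] * 6
                           + df["late_12"] * 12 + df["no_of_premiums_paid"])

# Age in years (assuming the raw age is given in days)
df["age_years"] = df["age_in_days"] / 365

# Simple (non-cross-validated) target encoding of the sourcing channel
channel_means = df.groupby("sourcing_channel")["renewal"].mean()
df["sourcing_channel_te"] = df["sourcing_channel"].map(channel_means)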
He went with a stack of six models (4 LightGBM + 2 XGBoost) trained on different feature sets and with different parameters, optimizing either ROC-AUC or binary log loss. The former was chosen because ROC-AUC was part of the final score, the latter to get predictions closer to the benchmark probabilities for the optimization step.
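A self-contained sketch of what such a blend could look like is given below, using placeholder data and only two of the six models; the parameters and data split are illustrative, not his actual settings.

import numpy as np
import lightgbm as lgb
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the engineered feature sets
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# One LightGBM model tracked on AUC, one XGBoost model tracked on log loss
lgb_model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
xgb_model = xgb.XGBClassifier(n_estimators=300, learning_rate=0.05, eval_metric="logloss")

lgb_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], eval_metric="auc")
xgb_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])

# Average the predicted renewal probabilities across the stack
pred = np.mean([lgb_model.predict_proba(X_valid)[:, 1],
                xgb_model.predict_proba(X_valid)[:, 1]], axis=0)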
As for the optimization part, he used a simple argmax over an integer grid of incentive values for each observation:

import numpy as np

# 'row' holds a single policy's predicted renewal probability and premium
inc = np.arange(10000)   # candidate incentive values (integer grid)
rev = row['prob'] * (1 + (1 - np.exp(-2 * (1 - np.exp(-inc / 400)))) / 5) * row['premium'] - inc
best_inc = np.argmax(rev)   # incentive that maximizes the expected net revenue
According to him, the trickier part was that the probabilities were a bit overestimated, so he had to multiply each incentive by a factor of 0.75.

AV username – Konstantin

Konstantin divided his approach into two parts.
Part 1
  • Extra features: the share of 3-6, 6-12 and 12+ month late payments among all late payments, plus some “magic” features such as age divided by the number of payments
  • Models: 0.9 * LightGBM (an average of 10 models with different seeds; some tuning, where a high min_data turned out to be a great choice) + 0.1 * MLP (from the nnet library)
  • Cross-validation: repeated 5-fold (10 repeats); a minimal sketch follows this list
  • CV AUC ~0.845
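A minimal, self-contained sketch of this kind of repeated cross-validation is shown below; the data and parameters are placeholders rather than Konstantin's actual settings (min_child_samples is LightGBM's alias for min_data).

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data standing in for the engineered training set
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Repeated 5-fold cross-validation (10 repeats), scored on ROC-AUC
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
model = lgb.LGBMClassifier(min_child_samples=200)   # a deliberately high min_data value
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())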
Part 2
It was a very simple approach:
  • If we assume the benchmark probabilities are close to our own predictions, the resulting leaderboard score is really low.
  • This raises the suspicion that the benchmark probabilities were obtained from extended data, with features more relevant to the target.
  • If we instead set the benchmark probabilities to 0.5, the leaderboard score is significantly higher compared with the first step.
  • So can we still use our predictions to improve the score? The answer is yes: scale the predictions to the interval from 0.4 to 0.6 and the score becomes a little higher (one possible rescaling is sketched below).
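
The write-up does not spell out exactly how the predictions were mapped into [0.4, 0.6]; a simple min-max rescaling, as one plausible reading, would look like this:

import numpy as np

p = np.array([0.05, 0.40, 0.75, 0.98])   # example predicted probabilities
p_scaled = 0.4 + 0.2 * (p - p.min()) / (p.max() - p.min())
print(p_scaled)   # all values now lie between 0.4 and 0.6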

AV username – Naveenkb

Naveen was also among the top finishers in this hackathon. His was also a two-part approach.
Part 1
What didn’t work for him:
  • None of the feature engineering variables worked
  • Parameter tuning was of little to no help
  • KNN, Random Forest, Logistic Regression and ANN all gave worse results than LightGBM
  • Ensembling the above models using Logistic Regression, Random Forest or a simple GBM did not work
  • Simple oversampling and undersampling
What worked:
  • Based on the best available fit, he removed some records that were classified as zero with high confidence (and vice versa)
  • Multiple LightGBM models with different parameters and seeds, ensembled using a power average of 2 (a short sketch follows below)
  • Manually changing the hyperparameters worked better than Bayesian optimization
This got him a CV score of 0.8453.
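"Power average of 2" presumably means averaging the squared probabilities from each run and taking the square root; the exact formulation is not spelled out in the write-up, so the snippet below is one reading, with made-up prediction values.

import numpy as np

preds = np.array([
    [0.62, 0.10, 0.91],   # LightGBM run 1 (illustrative values)
    [0.58, 0.14, 0.88],   # LightGBM run 2
    [0.65, 0.09, 0.93],   # LightGBM run 3
])
ensemble = np.sqrt(np.mean(preds ** 2, axis=0))   # power mean with exponent 2
print(ensemble)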
Part 2
  • Optimized the incentive for each insurance policy using scipy optimize (a sketch follows the list)
  • Since the LightGBM output is not a calibrated probability but simply a real value between 0 and 1, he reduced each predicted probability by 35 percent
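
A self-contained sketch of this per-policy optimization is given below; it reuses the assumed incentive-effect curve from earlier, and the probability and premium values are made up for illustration.

import numpy as np
from scipy.optimize import minimize_scalar

def neg_net_revenue(incentive, prob, premium):
    # Negative net revenue for one policy, using the same assumed effect curve as above
    effect = (1 - np.exp(-2 * (1 - np.exp(-incentive / 400)))) / 5
    return -(prob * (1 + effect) * premium - incentive)

prob = 0.9 * 0.65       # raw prediction of 0.9, reduced by 35 percent as described above
premium = 10000         # illustrative premium
res = minimize_scalar(neg_net_revenue, bounds=(0, 10000), args=(prob, premium),
                      method="bounded")
print(res.x)            # incentive that maximizes the net revenue for this policy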

End Notes

As I mentioned, this was a thrilling hackathon that required data scientists to dig deep and come up with some innovative solutions. We have shared the above approaches with the aim of helping you understand how the top finishers think about and approach these competitions.

Make sure you take part in all our future hackathons on the DataHack platform. You can also polish your skills with the plethora of practice problems available on the platform.

