What an eventful and intense hackathon weekend! Our recently concluded McKinsey Analytics Online Hackathon saw an overwhelming number of people participate and compete to win a NIPS conference ticket + $5,000, as well as interview opportunities with McKinsey Analytics. The entries kept on rolling as the weekend progressed, leading to a thrilling deadline finish.
The hackathon saw over 5,000 data scientists and aspiring data science enthusiasts vying for a spot at the top of the leaderboard. As parameters were tweaked, and models were finalized, the number of submissions soared to over 11,000! A huge shout out to all who participated and made this a wonderfully competitive hackathon.
Check out a few of the comments from the participants:
“Thanks a lot Analytics Vidhya team for this interesting hackathon. I learnt a lot during these 3 days. I am really impressed by the promptness of your whole team. They were responding to our queries even during off hours and weekend, unlike some other platforms where queries are not answered even during working hours! This level of engagement has motivated me to prioritize AnalyticsVidhya’s Hackathons over any other.” – Deepak Rawat
“Thanks @kunal and Complete AV team for such amazing competition. I appreciate your efforts and dedication. Hats off to you all.” – MohitK
“Thanks AV team for organizing and driving another great hackathon!” – b.e.s (top solution)
You can check out the leaderboard standings here.
Those who were not able to participate missed out on one of the most exciting hackathons Analytics Vidhya has ever hosted. But don’t worry, we have plenty in store for you ahead. Head over to our DataHack platform and check out upcoming events and practice problems.
Problem Statement & Evaluation Criteria
The problem statement posed by McKinsey Analytics was a fascinating one. Participants were tasked with building a model for an insurance company to predict policyholders' propensity to pay their renewal premium, and with designing an incentive plan for the company's agents to maximize the net revenue (i.e. renewals minus the incentives paid to collect those renewals) from the policies post issuance.
The participants were provided information about past transactions from the policyholders along with their demographics. Given this information, the challenge was to predict the propensity of renewal collection and create an incentive plan for agents (at the policy level) to maximize the net revenue from these policies.
The final solutions were evaluated on two criteria:
- The base probability of receiving a premium on a policy without considering any incentive
- The monthly incentives participants have provided on each policy to maximize the net revenue
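To make the objective concrete, here is a minimal sketch of the net revenue computation described above (expected renewals collected minus incentives paid out). The column names and numbers are illustrative assumptions, not the competition's exact schema:

```python
import pandas as pd

# Hypothetical policy-level data: a model's predicted renewal probability,
# the policy premium, and the incentive assigned to each policy's agent.
policies = pd.DataFrame({
    "renewal_prob": [0.85, 0.60, 0.95],
    "premium": [1200.0, 800.0, 5000.0],
    "incentive": [100.0, 50.0, 400.0],
})

# Net revenue = expected premiums collected on renewal - incentives paid.
net_revenue = (policies["renewal_prob"] * policies["premium"]
               - policies["incentive"]).sum()
```

In the actual competition, the incentive also lifts the renewal probability, so the two criteria interact; the top solutions below exploit exactly that trade-off.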
A Few Top Solutions for Cracking this Hackathon
There were some amazing approaches that went into this hackathon. We have shared a few of them in this section to help you understand how to structure and compete in these hackathons.
AV username – b.e.s
- Total number of late payments
- Overall number of payments: late payments + no_of_premiums_paid
- Weighted number of payments: late_3_6 * 3 + late_6_12 * 6 + late_12 * 12 + no_of_premiums_paid
- Age in years
- Target encoding for sourcing channel
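The feature list above can be sketched in pandas as follows. The raw column names (`Count_3-6_months_late`, `age_in_days`, etc.) are assumptions based on the bullet points, and the target encoding shown is the naive in-sample version; in practice it should be computed out-of-fold to avoid leakage:

```python
import pandas as pd

# Toy frame standing in for the competition data (schema assumed).
df = pd.DataFrame({
    "Count_3-6_months_late": [1, 0, 2],
    "Count_6-12_months_late": [0, 1, 0],
    "Count_more_than_12_months_late": [0, 0, 1],
    "no_of_premiums_paid": [10, 8, 15],
    "age_in_days": [12000, 9000, 20000],
    "sourcing_channel": ["A", "B", "A"],
    "renewal": [1, 0, 1],
})

late_cols = ["Count_3-6_months_late", "Count_6-12_months_late",
             "Count_more_than_12_months_late"]

# Total number of late payments
df["total_late"] = df[late_cols].sum(axis=1)
# Overall number of payments: late payments + premiums paid on time
df["total_payments"] = df["total_late"] + df["no_of_premiums_paid"]
# Weighted number of payments, as in the bullet above
df["weighted_payments"] = (df["Count_3-6_months_late"] * 3
                           + df["Count_6-12_months_late"] * 6
                           + df["Count_more_than_12_months_late"] * 12
                           + df["no_of_premiums_paid"])
# Age in years
df["age_years"] = df["age_in_days"] // 365
# Naive target encoding: mean renewal rate per sourcing channel
df["channel_te"] = df.groupby("sourcing_channel")["renewal"].transform("mean")
```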
```python
import numpy as np

# For one policy: grid-search over candidate incentives 0..9999.
# Expected revenue = renewal probability, lifted by the incentive-effort
# curve, times the premium, minus the incentive paid out.
inc = np.arange(10000)
rev = row['prob'] * (1 + (1 - np.exp(-2 * (1 - np.exp(-inc / 400)))) / 5) \
      * row['premium'] - inc
best_inc = np.argmax(rev)
```
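This search can be run end-to-end on a single made-up policy; the `prob` and `premium` values below are illustrative, not taken from the competition data:

```python
import numpy as np

# Hypothetical policy: 80% base renewal probability, 3000 premium.
row = {"prob": 0.8, "premium": 3000.0}

inc = np.arange(10000)  # candidate incentives: 0, 1, ..., 9999
# Expected revenue at each candidate incentive level.
rev = row["prob"] * (1 + (1 - np.exp(-2 * (1 - np.exp(-inc / 400)))) / 5) \
      * row["premium"] - inc
best_inc = int(np.argmax(rev))
```

Because the lift curve saturates while the incentive cost grows linearly, the revenue curve has an interior maximum, so the argmax picks a finite, non-zero incentive for profitable policies.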
AV username – Konstantin
- Extra features: share of 3-6, 6-12 and 12+ month late payments among all late payments, plus some “magic” features such as age / no. of payments
- Models: 0.9*LGB (an average of 10 models with different seeds, some tuning: a high min_data was a great choice) + 0.1*MLP (from the nnet library)
- Cross-validation: Repeated 5-Folds (10 times)
- CV AUC ~0.845
It was a very simple approach:
- If we assume the benchmark probabilities are close to our predictions, then the score on the leaderboard will be really low
- There is a suspicion that the benchmark probabilities were obtained from extended data, with features that are more relevant to the target
- If we instead set the benchmarks = 0.5, the leaderboard score is significantly higher compared with step 1
- In conclusion, can we use our predictions to improve our score? The answer is YES. Just scale our predictions to the interval from 0.4 to 0.6 and the score will be a little bit higher
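The rescaling trick in the last step is a one-liner: squeeze the predicted probabilities into [0.4, 0.6] while preserving their ranking. A minimal min-max version (the exact transform Konstantin used is not specified):

```python
import numpy as np

def scale_to_interval(p, lo=0.4, hi=0.6):
    """Linearly rescale predictions so min maps to lo and max to hi."""
    p = np.asarray(p, dtype=float)
    return lo + (hi - lo) * (p - p.min()) / (p.max() - p.min())

preds = np.array([0.05, 0.50, 0.99])
scaled = scale_to_interval(preds)  # smallest -> 0.4, largest -> 0.6
```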
AV username – Naveenkb
- None of the feature engineering variables worked
- Parameter tuning was of little to no help
- LightGBM gave a better result than KNN, Random Forest, Logistic Regression, ANN
- Ensemble of the above models using Logistic Regression or Random Forest or simple GBM did not work
- Simple over-sampling and under-sampling
- Based on the best available fit, he removed some variables which were classified as zero with high confidence and vice versa
- Multiple LightGBM models with different parameters and seeds, ensembled using a power average of order 2
- Manually changing the hyperparameter worked better than Bayesian optimization
- Optimized for each insurance premium using scipy optimize
- Since LightGBM's output is not a true probability but rather a real value between 0 and 1, he reduced each predicted probability by 35 percent
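The power-average ensemble and the 35-percent shrink from the bullets above can be sketched as follows (the prediction values are toy numbers; the real inputs would be the LightGBM models' outputs):

```python
import numpy as np

def power_average(preds, p=2):
    """Power mean of order p across models (rows = models, cols = policies)."""
    preds = np.asarray(preds, dtype=float)
    return np.mean(preds ** p, axis=0) ** (1.0 / p)

model_preds = np.array([
    [0.90, 0.40, 0.70],   # LightGBM model 1
    [0.80, 0.50, 0.60],   # LightGBM model 2
])
ensembled = power_average(model_preds, p=2)
shrunk = ensembled * (1 - 0.35)   # reduce each probability by 35 percent
```

A power mean of order 2 weights confident (high) predictions slightly more than a plain arithmetic mean, which is why it is a popular blending choice for probability outputs.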
As I mentioned, this was a thrilling hackathon that required data scientists to dig deep and come up with some innovative solutions. We have shared the above approaches with the aim of helping you understand how the top finishers think about and approach these competitions.
Make sure you take part in all our future hackathons on the DataHack platform. You can also polish your skills with the plethora of practice problems available on the platform.