February started on a high for us. “Last Man Standing” saw more than 1600 Data Scientists compete from all over the world making more than 5000 submissions in 3 days. Depending on the metric you look at, this was 40 – 60% higher participation and engagement compared to our previous hackathon – Black Friday.
Guess what – we are loving this high and are in no mood to come off the peak! We love the weekend action so much that we feel a void without it. I know some of you feel the same way about hackathons and will love what I am going to announce next!
Launching Mini DataHack:
There are 2 aspects of our hackathons that people love:
- They come with tight deadlines. The action and the adrenaline rush are unparalleled.
- The tightly knit community ensures focus on learning and a very high engagement.
So, this time we are going to take both these aspects one notch up!
I am pleased to launch Mini DataHack, our shortest form of hackathon ever. The idea is very simple – each Mini DataHack will focus on a single type of problem. For example, the first Mini DataHack will feature a Time Series problem. By focusing on a single aspect and shortening the duration, we hope that the learning will grow tremendously!
We hope that this experience will give you more learning than you can imagine in 3 hours. So, make sure you register for our first ever Mini DataHack and go through the concepts of Time Series beforehand.
Last man standing:
One piece of feedback we heard from people during our last hackathons was that feature engineering wasn’t playing an important role in improving the score. Most of the time, the improvements came from algorithms alone. We wanted to change that this time. Hence, we designed a problem where the key to winning would lie in feature engineering.
The Toxic Pesticides
Though many of us don’t appreciate it much, a farmer’s job is a real test of endurance and determination. Once the seeds are sown, he works day and night to make sure that he gets a good harvest at the end of the season. A good harvest depends on several factors such as availability of water, soil fertility, protecting crops from rodents, timely use of pesticides & other useful chemicals, and nature. While a lot of these factors are difficult to control, the amount and frequency of pesticides is something the farmer can control.
Pesticides are also special because they protect the crop at the right dosage but, if you add more than required, they may spoil the entire harvest. A high level of pesticide can leave the crop dead or unsuitable for consumption, among other outcomes. This data is based on crops harvested by various farmers at the end of the harvest season. To simplify the problem, you can assume that all other factors like variations in farming techniques have been controlled for.
You need to determine the outcome of the harvest season, i.e. whether the crop would be healthy (alive), damaged by pesticides or damaged by other reasons.
The evaluation metric for this challenge was Confusion_Matrix. A confusion matrix is an N x N matrix, where N is the number of classes being predicted. Here are a few definitions you need to remember for a confusion matrix:
- Accuracy : the proportion of the total number of predictions that were correct.
- Positive Predictive Value or Precision : the proportion of predicted positive cases that are actually positive.
- Negative Predictive Value : the proportion of predicted negative cases that are actually negative.
- Sensitivity or Recall : the proportion of actual positive cases which are correctly identified.
- Specificity : the proportion of actual negative cases which are correctly identified.
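To make these definitions concrete, here is a minimal Python sketch that builds a confusion matrix and derives the five quantities for one class treated as "positive" against the rest. The integer class codes and the function names are assumptions made for illustration; this is not the scoring code used for the competition.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Return an N x N matrix; rows are actual classes, columns are predictions."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def one_vs_rest_metrics(matrix, positive):
    """Compute the listed metrics, treating `positive` as the positive class."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[positive][positive]
    fn = sum(matrix[positive]) - tp                # actual positive, predicted other
    fp = sum(row[positive] for row in matrix) - tp  # actual other, predicted positive
    tn = total - tp - fn - fp

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        # Accuracy is computed over all classes (diagonal over total).
        "accuracy": ratio(sum(matrix[i][i] for i in range(n)), total),
        "precision": ratio(tp, tp + fp),    # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
        "recall": ratio(tp, tp + fn),       # sensitivity
        "specificity": ratio(tn, tn + fp),
    }
```

For a three-class problem like this one, calling `one_vs_rest_metrics` once per class gives a per-class view of where a model goes wrong.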
Winning strategies from the competition:
Rank 3: Mark Landry
I went for a vectorized way to calculate the difference between the current value of any feature and the one that was N before it or after it. I used 1-6 on both sides (+ and -) for the insect and dose features (for 24 total). Then I also added a few of those together (diffOfThisandThis+1 + diffOfThisandThis+2). Those started to overfit a tiny bit.
I really didn’t see what the models were doing to solve the problem, but once these features were calculated, the H2O GBMs I was using started getting much more accurate. I noticed it because if you look at the insect feature, you can see a pattern of a similar value repeated several times, and sometimes it varies backwards just a tiny bit. Again, I can’t claim to have seen how that helps solve the problem–that part really was ML for my model. My job was to get the features in there.
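The neighbour-difference features Mark describes can be sketched roughly as follows. This is an illustration, not his code: it uses plain Python lists rather than the vectorized form he mentions, the function and feature names are hypothetical, and it assumes the rows are already in their original order.

```python
def lag_differences(values, max_lag=6):
    """For each row, the difference between its value and the value
    k rows before it and k rows after it, for k = 1..max_lag.

    Positions without a neighbour at that distance get None.
    """
    n = len(values)
    features = {}
    for k in range(1, max_lag + 1):
        features[f"diff_prev_{k}"] = [
            values[i] - values[i - k] if i - k >= 0 else None for i in range(n)
        ]
        features[f"diff_next_{k}"] = [
            values[i] - values[i + k] if i + k < n else None for i in range(n)
        ]
    return features
```

With `max_lag=6` this yields 12 columns per input feature, so applying it to both the insect and dose features gives the 24 he mentions.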
Rank 2: Vopani
I particularly enjoyed creating the features and seeing them steadily improve the CV and LB scores. I found the time-series pattern pretty much straight away and from there, it was only uphill.
You can go through the code and get some ideas.
This model scored 0.9604 on the public LB and was ranked 2nd.
Certain tips I followed:
- I chose the parameters by hand-tuning them, which gave me the least CV-score.
- Order of observations is important across train and test. Hence, I’m forming the groups by using them together and ordering by ‘ID’ to create the features.
- Imputing missing values with -1 is very common. It essentially informs the model that the value is missing and the model handles it differently. I’ve found this works better for tree-based models.
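The last two tips can be sketched in a few lines of Python. This is a minimal illustration rather than Vopani's code: rows are represented as dictionaries, and the `is_train` flag and function names are hypothetical.

```python
def combine_and_order(train_rows, test_rows, id_key="ID"):
    """Stack train and test rows and sort them by ID, so that
    order-based features see the neighbours from both sets."""
    combined = [dict(row, is_train=True) for row in train_rows]
    combined += [dict(row, is_train=False) for row in test_rows]
    combined.sort(key=lambda row: row[id_key])
    return combined


def impute_missing(rows, columns, sentinel=-1):
    """Replace missing (None or absent) values with a sentinel such as -1,
    which a tree-based model can split off as its own group."""
    for row in rows:
        for column in columns:
            if row.get(column) is None:
                row[column] = sentinel
    return rows
```

The sentinel only works as a "missing" signal when -1 cannot occur as a genuine value in that column.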
Rank 1: Bishwarup Bhattacharjee
This was my first competition on AV and I thoroughly enjoyed working on it. The data was quite standard and didn’t demand much preprocessing. I started with an xgboost model with a few engineered features:
- Combination of Season and Soil_Type
- Combination of Season and Crop_Type
- Combination of Crop_Type and Soil_Type
- Total_dosage = Weekly dose frequency * total weeks used
- Insect count by dose frequency = ratio of Estimated_Insect_Count/Number_Doses_Week
- Weeks_since_pesticide_use = Number of weeks used + Number of weeks quit
- Ratio of number of weeks used to number of weeks quit – this was of course only present in case of pesticide use category = 2
The first three didn’t offer any improvement, but the remaining four features helped my model to some extent, and at that point my LB score was around 0.849.
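As a rough sketch, the features listed above could be computed per row like this. Only Estimated_Insect_Count, Number_Doses_Week, Season, Soil_Type and Crop_Type are named in the write-up; the column names Number_Weeks_Used and Number_Weeks_Quit, the output feature names, and the reading of the last two features as involving "weeks quit" are assumptions made for illustration.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    if not denominator:
        return None
    return numerator / denominator


def add_features(row):
    """Return a copy of `row` (a dict) with the engineered features added."""
    out = dict(row)
    # Pairwise combinations of the categorical columns.
    out["Season_Soil"] = f"{row['Season']}_{row['Soil_Type']}"
    out["Season_Crop"] = f"{row['Season']}_{row['Crop_Type']}"
    out["Crop_Soil"] = f"{row['Crop_Type']}_{row['Soil_Type']}"
    # Numeric features.
    out["Total_Dosage"] = row["Number_Doses_Week"] * row["Number_Weeks_Used"]
    out["Insects_Per_Dose"] = safe_ratio(
        row["Estimated_Insect_Count"], row["Number_Doses_Week"]
    )
    out["Weeks_Since_Pesticide_Use"] = (
        row["Number_Weeks_Used"] + row["Number_Weeks_Quit"]
    )
    # Undefined (None) whenever weeks quit is zero.
    out["Used_To_Quit_Ratio"] = safe_ratio(
        row["Number_Weeks_Used"], row["Number_Weeks_Quit"]
    )
    return out
```

The undefined ratios come out as `None` here, so they would still need imputing before being fed to a model.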
I tried tuning the hyperparameters with a 5-fold stratified CV, but that also didn’t help much in my case. One more thing that didn’t work out was one-hot encoding the nominal features like Season and Pesticide_Use_Category.
I also tried Keras for my NN model, which gave me a CV of 0.8414 and LB around 0.842.
Being stuck at this point, I went back to exploring the data and started searching for potential signals for engineering more features. While scanning the data carefully, I noticed that the data consists of multiple batches of similar estimated insect counts and that the response values within each batch never decrease. The fixed patterns I found, and how I used them, are as follows:
- In the data (after combining the train and test sets), there are contiguous blocks of Estimated Insect Count with a max difference of 1.
- Inside each such contiguous block there are sub-blocks, first with Crop_Type = 0 and then Crop_Type = 1.
- Inside each sub-block there are mini-blocks, first with Soil_Type = 0 and then Soil_Type = 1.
- Inside each mini-block there is a monotonically non-decreasing sequence of response values, which holds for 100% of the data.
- Since the pattern is deterministic and not probabilistic, I chose to write my own algorithm to modify my model outputs with it instead of engineering features to feed to my model. This alone took me to 94.2% on the public LB.
- Another critical pattern is that inside each mini-block and response value combination there is a steady non-decreasing pattern of Number_Doses_Week, which holds for 99.32% of the data, so this can also be treated as more or less deterministic. This helped me correct the transition entries from 0 to 1, as the pattern only holds for the response value 0 and not for 1 and 2.
- After I post-processed the outputs of the first correction (the response-value pattern) with this second algorithm (which I also wrote separately), my score rose to 96.4%.
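Bishwarup's own post-processing algorithm is not shown here, so the following is only one possible sketch of how a non-decreasing response pattern inside a mini-block could be used to correct model outputs. It assumes the rows of one mini-block are already in order, that training rows carry a known integer label and test rows carry None, and that forcing a running maximum on the corrected predictions is acceptable; the function name and label range are hypothetical.

```python
def clamp_block(known, predicted, min_label=0, max_label=2):
    """Correct predictions inside one ordered mini-block.

    known:     labels in block order, None for rows without a label.
    predicted: model predictions for every row in the block.
    Each unlabeled row is clamped between the nearest known label before it
    and the nearest known label after it, then a running maximum enforces a
    non-decreasing sequence.
    """
    n = len(known)
    lower, upper = [min_label] * n, [max_label] * n

    running = min_label
    for i in range(n):                    # nearest known label at or before i
        if known[i] is not None:
            running = known[i]
        lower[i] = running

    running = max_label
    for i in range(n - 1, -1, -1):        # nearest known label at or after i
        if known[i] is not None:
            running = known[i]
        upper[i] = running

    corrected = []
    for i in range(n):
        if known[i] is not None:
            value = known[i]
        else:
            value = min(max(predicted[i], lower[i]), upper[i])
        if corrected:
            value = max(value, corrected[-1])  # keep the sequence non-decreasing
        corrected.append(value)
    return corrected
```

When the known labels on both sides of a test row are equal, the row's label is fully determined by the pattern and the model prediction is ignored.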
Takeaways from Last Man Standing
Here are a few key lessons I would emphasize from the competition:
- Try understanding the problem by visualizing the data set and building the features. The best data scientists I know of (even outside the competition) focus on building the best features.
- As SRK said during the competition, even simple scatter plots and line plots would help.
- For problems with imbalanced classes, it is usually easy to get a basic solution – in this case, predicting all crops will survive. But improving by fractions needs a lot more work.
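The "basic solution" in the last point amounts to a majority-class baseline. A minimal sketch, using made-up toy labels rather than competition data and assuming 0 stands for a healthy (alive) crop:

```python
from collections import Counter


def majority_class(train_labels):
    """Return the most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(actual, predicted):
    """Proportion of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Toy labels for illustration only: 0 = alive, 1 and 2 = damaged.
toy_labels = [0, 0, 0, 0, 1, 2]
baseline = majority_class(toy_labels)
baseline_accuracy = accuracy(toy_labels, [baseline] * len(toy_labels))
```

Any real model has to beat this number to be worth its complexity, which is why the last few fractions of a percent are the hard part.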
Here are a few visualizations which were shared on the slack channel during the competition
Highlights of the hackathon:
While the entire experience was enriching, here are a couple of highlights for those who missed the action:
- Bishwarup taking a last-minute stride – Vopani looked set to dominate this competition from the start. He crossed a score of 0.9 within 24 hours and held on to the lead. The competition looked like a natural fit for the sudoku lover! But Bishwarup had other ideas: he surpassed Vopani in the last 2 hours, taking away the title and the prize!
- Vopani & SRK dropping hints – This was one of the competitions where SRK was trying to catch up with Vopani. But once he crossed a score of 0.9, both of them started dropping hints to help fellow members break the code. Ultimately Nalin, Mark and SumanthPrabhu joined the club as well.
Updated user rankings:
If you haven’t noticed, the ranks and badges have been updated. Here is how the points of top 5 users stack up pre and post Last Man Standing (LMS):
Aayushmnit & SRK are now neck and neck in the overall ranking. Also, Vopani has jumped from rank 15 to rank 6. You can see the updated rankings here.
I hope you learnt a lot from this hackathon. I want to thank all the participants and the community members for the success of this hackathon, and I will see you at our next innovation tomorrow – Mini DataHack.
The next Hackathons are LIVE now for registrations. Participate Now!
Short Hackathon: Mini DataHack
Signature Hackathon: Date Your Data