- Hackathons are a wonderful opportunity to gauge your data science knowledge and compete to win lucrative prizes and job opportunities
- Here are the top 3 approaches from the Innoplexus Sentiment Analysis Hackathon – a superb NLP challenge
I’m a big fan of hackathons. I’ve learned so much about data science by participating in them over the past few years, and that knowledge has, in turn, accelerated my professional career.
This comes with a caveat – winning a data science hackathon is really hard. Just think about the number of obstacles in your way:
- A brand new problem statement we haven’t worked on before
- A plethora of top data scientists competing to rise up the leaderboard
- Time crunch! We have to understand the problem statement, put together a framework, clean the data, explore it, and build the model in a matter of a few hours
- And then repeat the process!
A single decimal point could be the difference between the top 10 and the top 50. Isn’t this why we love hackathons in the first place? The thrill of seeing our hard work pay off with a rise in the leaderboard rankings is unparalleled.
So, we’re thrilled to bring to you the top 3 winning approaches from the Innoplexus Sentiment Analysis hackathon! You are going to be awestruck by how these three top data scientists thought through their solutions and came up with their own unique framework.
There is a LOT to learn from these approaches. Trust me, take the time to go through the steps and understand where they came from. And then think if you would have done anything differently. And then – go ahead and take part in these hackathons yourself on our DataHack platform!
So let’s begin, shall we?
About the Innoplexus Sentiment Analysis Hackathon
It’s always an exciting prospect, hosting hackathons with our partner Innoplexus. Each time, they come up with problem statements based on Natural Language Processing (NLP), an immensely popular field right now. We have seen huge developments in NLP thanks to transfer learning models such as BERT, XLNet, GPT-2, etc.
And sentiment analysis is one of the most common NLP projects data scientists tend to work on. This Innoplexus hackathon was a 5-day contest with more than 3,200 data scientists from across the globe competing for job opportunities and exciting prizes offered by Innoplexus.
It was a hard-fought contest with a total of 8000+ submissions and a variety of approaches employed by the best in the business to occupy the top spots.
For those of you who could not make it to the top, or could not find the time to work on the problem, we have collated the winners’ approaches and solutions to help you appreciate and learn from them. So here goes.
Problem Statement for the Innoplexus Sentiment Analysis Hackathon
There are a lot of components that go into building the narrative of a brand. It isn’t just built and controlled by the company that owns the brand. Think about any big brand you are familiar with and you’ll instantly understand what I’m talking about.
For this reason, companies constantly monitor platforms such as blogs, forums, and social media to check the sentiment around their own products as well as competitor products, and to learn how their brand resonates in the market. This analysis helps them in various aspects of their post-launch market research.
This is relevant for a lot of industries, including pharma and their drugs.
But this comes with several challenges. Primarily, the language used in this type of content is rarely grammatically correct. People often use sarcasm, cover several topics with different sentiments in a single post, or express their sentiment only indirectly through comments on the topic.
Broadly speaking, sentiment can be clubbed into 3 major buckets – Positive, Negative and Neutral Sentiments.
In the Innoplexus Sentiment Analysis Hackathon, the participants were provided with data containing samples of text. This text could potentially contain one or more drug mentions. Each row contained a unique combination of the text and the drug mention. Note that the same text could have different sentiments for different drugs.
Given the text and drug name, the task was to predict the sentiment for texts contained in the test dataset. Given below is an example of text from the dataset:
Stelara is still fairly new to Crohn’s treatment. This is why you might not get a lot of replies. I’ve done some research, but most of the “time to work” answers are from Psoriasis boards. For Psoriasis, it seems to be about 4-12 weeks to reach a strong therapeutic level. The good news is, Stelara seems to be getting rave reviews from Crohn’s patients. It seems to be the best med to come along since Remicade. I hope you have good success with it. My daughter was diagnosed Feb. 19/07, (13 yrs. old at the time of diagnosis), with Crohn’s of the Terminal Illium. Has used Prednisone and Pentasa. Started Imuran (02/09), had an abdominal abscess (12/08). 2cm of Stricture. Started Remicade in Feb. 2014, along with 100mgs. of Imuran.
The above text is positive for Stelara and negative for Remicade. Now that we have a solid understanding of what the problem at hand was, let’s dive into the winning approaches!
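The layout described above can be sketched with a small toy snippet. Note that the field names (`text`, `drug`, `sentiment`) are assumptions for illustration, not the contest's actual column names:

```python
# Illustrative sketch of the dataset layout: each row pairs a text with one
# drug mention, so the same text can appear twice with different labels.
# Field names here are assumptions, not the contest's actual schema.
text = ("Stelara seems to be getting rave reviews from Crohn's patients. "
        "It seems to be the best med to come along since Remicade.")

rows = [
    {"text": text, "drug": "stelara", "sentiment": "positive"},
    {"text": text, "drug": "remicade", "sentiment": "negative"},
]

# The same passage carries opposite sentiments depending on the drug.
for row in rows:
    print(row["drug"], "->", row["sentiment"])
```

This is exactly why the task is harder than plain document-level sentiment classification: the model must condition its prediction on the drug, not just the text.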
Winners of the Innoplexus Sentiment Analysis Hackathon
As I mentioned earlier, winning a hackathon is extremely difficult. I loved going through these top solutions and approaches provided by our winners. First, let’s look at who won and congratulate them:
Here are the final rankings of all the participants on the Leaderboard.
The top 3 winners have shared their detailed approach from the competition. I am sure you are eager to know their secrets so let’s begin.
Rank 3: Mohsin Hasan Khan (ML Engineer @HealthifyMe)
Here’s what Mohsin shared with us:
“My final solution is an ensemble of BERT and XLNet runs.”
- My first impression was that the data contained many labels that looked wrong relative to my own reading of negative and positive sentiment. So, I felt it would be really difficult to handcraft features, and decided it would be best to stick to state-of-the-art NLP models that can learn from noisy data
- I started with a simple Tf-idf plus logistic regression model, which gave me a cross-validation (CV) score of 0.5. After looking at the text data, I realized many rows contained a lot of lines unrelated to the drug
- Hence, I decided to use only the sentences in which the drug name occurred. Tf-idf plus logistic regression on these drug-mention-only sentences gave me a CV score of 0.54
- At this point, I decided to use BERT. Without any fine-tuning, I only got a CV score of 0.45. But once I fine-tuned BERT on the training data, it gave a validation score of 0.60 and a leaderboard score of 0.59. Then, I added the sentences occurring immediately before and after each drug sentence – this increased the CV score slightly. Next, I fine-tuned BERT-large, which gave me a CV score of 0.65 and a leaderboard score of 0.61. Similarly, I fine-tuned the XLNet base model, which gave a CV score of 0.64 and a leaderboard score of 0.58. My final solution is an ensemble of the BERT and XLNet runs
- Note: I used 5-fold stratified K-fold cross-validation since the class distribution was imbalanced
- Check out the code of Mohsin here.
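Mohsin's baseline – keep only the sentences that mention the drug, then run Tf-idf plus logistic regression with stratified 5-fold CV – can be sketched roughly as follows. The toy data, the naive period-based sentence splitter, and the helper name `drug_sentences` are illustrative assumptions, not his actual code:

```python
# Sketch of a drug-sentence-filtering Tf-idf baseline (assumed details:
# toy data, naive sentence splitting, macro-F1 as the CV metric).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

def drug_sentences(text, drug):
    """Keep only the sentences that mention the drug (case-insensitive)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    kept = [s for s in sentences if drug.lower() in s.lower()]
    return ". ".join(kept) if kept else text  # fall back to the full text

# Tiny illustrative dataset: 5 positive and 5 negative mentions of a fake drug.
texts = [
    "DrugX worked wonders for my symptoms", "I love DrugX, huge improvement",
    "DrugX gave me great relief quickly", "Thanks to DrugX I feel much better",
    "DrugX is amazing and very effective", "DrugX caused terrible side effects",
    "I stopped DrugX, it made me worse", "DrugX did nothing and I feel awful",
    "Horrible experience with DrugX sadly", "DrugX was useless and painful",
]
drugs = ["DrugX"] * 10
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

filtered = [drug_sentences(t, d) for t, d in zip(texts, drugs)]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
# Stratified folds, since the class distribution was imbalanced.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, filtered, labels, cv=cv, scoring="f1_macro")
print("CV scores:", scores)
```

The filtering step is the key idea: long posts mention many unrelated things, so discarding sentences without the drug name concentrates the signal the classifier sees.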
Rank 2: Harini Vengala (Statistical Analyst @WalmartLabs India)
Here’s what Harini shared with us:
“My final model was an ensemble of 3 BERT and 1 AEN.”
- My first baseline approach was using a count vectorizer. This got me a public leaderboard score of 0.42. I removed digits, emojis, URLs, punctuations, and converted the text to lowercase
- I tried BERT and tuned it for the given dataset. Next, I removed stop words from the text, as BERT requires more memory when run on the entire passage. I took a sequence length of 150, but realized most of the important information was being ignored in this approach. I couldn’t cross a score of 0.50 on the public leaderboard
- So, what else could I try? I took only the sentences in which the given drug was present and used BERT again to classify sentiment. This gave me a score of 0.60 on the public leaderboard. I also implemented the Attentional Encoder Network (AEN) for Targeted Sentiment Classification, which resulted in a score of 0.56
- My final model was an ensemble of 3 BERT and 1 AEN. The loss function I used was CrossEntropyLoss with class weights = 1/number of observations in each corresponding class
- My key takeaway – try different things and check what works for the data. Spend some time listing all the things you could try during the hackathon
- Check out the code of Harini here.
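Harini's loss setup – cross-entropy with class weights equal to the inverse of each class's observation count – can be made concrete with a small sketch. She used PyTorch's `CrossEntropyLoss(weight=...)`; the NumPy version below makes the same arithmetic explicit, and the toy labels are illustrative:

```python
# Sketch of inverse-frequency class weighting for cross-entropy loss
# (toy labels; mirrors the behavior of PyTorch's CrossEntropyLoss(weight=w)).
import numpy as np

labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])  # imbalanced 3-class toy data
counts = np.bincount(labels)                        # observations per class
weights = 1.0 / counts                              # weight = 1 / class count

def weighted_cross_entropy(logits, targets, class_weights):
    """Class-weighted mean cross-entropy over a batch of logits."""
    # log-softmax with the usual max-shift for numerical stability
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    w = class_weights[targets]
    nll = -log_probs[np.arange(len(targets)), targets]
    # PyTorch's 'mean' reduction normalizes by the sum of the sample weights.
    return (w * nll).sum() / w.sum()

loss = weighted_cross_entropy(np.zeros((10, 3)), labels, weights)
```

Up-weighting rare classes this way stops the model from scoring well simply by predicting the majority sentiment, which matters because the contest data was imbalanced.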
Rank 1: Melwin Babu (Data Scientist @nference)
Here’s what Melwin shared with us:
“I noticed pretty early that increasing the max sequence length increased the score significantly. This observation more or less dictated my approach. I used a basic XLNet model with hardly any feature engineering.”
- I lowercased all the sentences and masked the relevant drug in each sentence. Then, I took the first 1380 tokens after SentencePiece tokenization
- I chose to fill the GPU RAM with as much max sequence length as possible and refrained from using extra features. I tried to add variations to the data but made implementation mistakes and ran out of time
- In the final model, I averaged the predictions of the XLNet base cased model across 6 seeds. Time didn’t permit any other hyperparameter tuning
- I was really surprised that, in a competition where deep learning was likely the best solution, I could be competitive with a machine with just 8 GB of GPU RAM
- Identifying the differences between the train and test distributions can be crucial. Most other things are the same as in other hackathons
- Check out the code of Melwin here.
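Melwin's seed averaging – running the same model with several random seeds and averaging the predicted class probabilities before taking the argmax – can be sketched as below. The `train_and_predict` function is a hypothetical stand-in for one of his XLNet fine-tuning runs, not his actual code:

```python
# Sketch of seed averaging: average class probabilities over several runs,
# then argmax. `train_and_predict` is a hypothetical stand-in for one
# XLNet fine-tuning run that returns (n_samples, n_classes) probabilities.
import numpy as np

def train_and_predict(seed):
    """Placeholder for one seeded training run on the test set."""
    rng = np.random.default_rng(seed)
    probs = rng.random((4, 3))                  # 4 test samples, 3 classes
    return probs / probs.sum(axis=1, keepdims=True)

seeds = [0, 1, 2, 3, 4, 5]                      # he averaged 6 seeds
avg_probs = np.mean([train_and_predict(s) for s in seeds], axis=0)
predictions = avg_probs.argmax(axis=1)          # final ensembled labels
```

Averaging over seeds is a cheap ensemble: each run converges to a slightly different model, and averaging their probabilities smooths out the variance of any single run.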
It was great fun interacting with these winners and getting to know their approaches during the competition. This was a tightly contested hackathon and, as you have seen, the winning approaches were truly impressive.
I encourage you to head over to the DataHack platform TODAY and participate in the ongoing and upcoming hackathons. It will be an invaluable learning experience!