Samsung’s ConZNet Algorithm Just Won Two Popular NLP Challenges (Dataset Links Inside)
- ConZNet is a deep reinforcement learning algorithm developed by Samsung’s AI research arm
- It was used to comprehensively win 2 extremely popular NLP competitions – Microsoft’s MS MARCO and University of Washington’s TriviaQA
- Links to download the two datasets have been provided below
2018 has been the year machine learning algorithms took control of NLP challenges. At the start of the year, Alibaba’s neural network model outscored the human benchmark on the Stanford Question Answering Dataset (SQuAD), and the trend has continued.
This week, Samsung’s AI research arm used its reinforcement learning algorithm, ConZNet, to win two tough yet popular Natural Language Processing (NLP) competitions. One, TriviaQA, is organized by the University of Washington; the other, MS MARCO (MAchine Reading COmprehension), is hosted by none other than Microsoft.
ConZNet is a deep reinforcement learning algorithm. This means it learns by trial and error: it is “rewarded” for good outputs and adjusts its behavior to earn more reward over time. In case you are unfamiliar with this concept, I recommend going through this beginner-friendly article.
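To make the reward idea concrete, here is a minimal sketch of reward-driven learning. It is not ConZNet, whose details are not covered here; it is a toy epsilon-greedy agent choosing between three actions, and the function name `run_bandit`, the reward probabilities, and the other parameters are all made up for illustration.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Toy epsilon-greedy agent; returns its estimated value for each action.

    reward_probs is a hypothetical list: the chance each action pays a reward of 1.
    """
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # the agent's current belief per action
    counts = [0] * len(reward_probs)       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action currently believed to be best.
            action = max(range(len(reward_probs)), key=lambda a: estimates[a])
        # The environment hands back a reward of 1 or 0.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # Nudge the estimate toward the observed reward (running average).
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates

estimates = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("Preferred action:", best_action)
```

The agent is never told which action is best. It only sees rewards, and over many trials its estimates drift toward the action that pays off most often. Deep reinforcement learning replaces the small table of estimates with a neural network, but the feedback loop is the same in spirit.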
Let’s take a quick look at the two datasets that Samsung applied their algorithm on.
MS MARCO is Microsoft’s reading comprehension challenge dataset. It has been neatly divided into training, validation and test sets for you to dig into straightaway. It contains well over 1 million search queries from Bing and over 180,000 well-formed answers.
In this competition, an AI algorithm is presented with ten web documents to answer a given query. To increase the complexity, Microsoft insists that the contestants use its Bing search engine. Queries are randomly selected, and answers are evaluated statistically by estimating how close they are to human answers.
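One common way to estimate how close a generated answer is to a human answer is a word-overlap score such as ROUGE-L, which is based on the longest common subsequence of words. The sketch below is an illustration only, not the official MS MARCO evaluation script; the function names, the whitespace tokenization, and the `beta=1.0` default are assumptions made for this example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L style F-score between a model answer and a human answer.

    Tokenization is a plain lowercase whitespace split (an assumption here);
    real evaluation scripts usually normalize text more carefully.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the model answer that matches
    recall = lcs / len(ref)      # share of the human answer that is covered
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

human = "the capital of france is paris"
model = "paris is the capital of france"
print(round(rouge_l(model, human), 3))
```

A score of 1.0 means the model answer matches the human answer word for word, and 0.0 means they share no words. Averaging such scores over many randomly selected queries gives a single number that can be compared across submissions.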
You can download the MS MARCO dataset here.
TriviaQA is a massive reading comprehension dataset that contains over 650,000 question-answer-evidence triples built from 95,000 question-answer pairs. The evidence documents included in the dataset provide distant supervision for answering the questions. The file size is 7.2GB, so ensure you’re on a solid internet connection before you start the download!
You can download the dataset from here.
Our take on this
I am a big NLP enthusiast, so news like this definitely gets me excited! What I like about these two competitions, and the Stanford Question Answering Dataset, is that they are open source. Anyone can download the data and work on it. I would love to see folks from Analytics Vidhya’s community develop algorithms that compete in these challenges. It’s the perfect opportunity to apply all those NLP concepts you’ve been learning and measure your progress against data scientists from around the world.
Coming to Samsung, this is quite a win for them because it bolsters Bixby, their virtual assistant. I expect we’ll see ConZNet applied within Bixby when the next update comes out.