Samsung’s ConZNet Algorithm just won Two Popular NLP Challenges (Dataset Links Inside)

Pranav Dar Last Updated : 10 Jul, 2018


Overview

ConZNet is a deep reinforcement learning algorithm developed by Samsung’s AI research arm
It was used to comprehensively win 2 extremely popular NLP competitions – Microsoft’s MS MARCO and University of Washington’s TriviaQA
Links to download the two datasets have been provided below

Introduction

2018 is shaping up to be the year machine learning models took the lead in NLP challenges. At the start of the year, Alibaba’s neural network outscored the human benchmark on the Stanford Question Answering Dataset (SQuAD), a reading comprehension test, and the trend has continued.

This week, Samsung’s AI research arm used its reinforcement learning algorithm, called ConZNet, to win two extremely tough yet popular Natural Language Processing (NLP) competitions. One of them, TriviaQA, was organized by the University of Washington, and the other, MS MARCO (MAchine Reading COmprehension), was hosted by none other than Microsoft.

Source: Betanews

ConZNet is a deep reinforcement learning algorithm. This means the model learns by trial and error: it receives a “reward” signal based on how good its output is, and it adjusts itself to earn higher rewards over time. In case you are unfamiliar with this concept, I recommend going through this beginner-friendly article.
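To make the reward idea concrete, here is a minimal sketch of reward-driven learning using an epsilon-greedy multi-armed bandit. This is not ConZNet, whose details are not covered in this article; the function name `run_bandit`, the reward probabilities and the parameter values are all assumptions chosen for illustration. The agent tries actions, observes a reward of 0 or 1, and gradually favours the action with the highest average reward.

```python
import random


def run_bandit(true_rewards, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent (illustrative only, not ConZNet).

    true_rewards: hypothetical probability that each action pays a reward of 1.
    Returns the agent's reward estimates and how often it chose each action.
    """
    rng = random.Random(seed)
    n_actions = len(true_rewards)
    estimates = [0.0] * n_actions  # running average reward per action
    counts = [0] * n_actions       # number of times each action was chosen

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action
            action = rng.randrange(n_actions)
        else:
            # Exploit: pick the action with the best estimate so far
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The environment hands back a reward of 1 or 0
        reward = 1.0 if rng.random() < true_rewards[action] else 0.0

        # Incremental update of the running average for this action
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts


estimates, counts = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(counts)), key=lambda a: counts[a])
print("Most chosen action:", best_action)
print("Reward estimates:", [round(e, 2) for e in estimates])
```

A deep reinforcement learning system replaces the small table of estimates with a neural network, but the loop of acting, receiving a reward and updating is the same basic idea.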

Let’s take a quick look at the two datasets that Samsung applied their algorithm on.

MS MARCO

MS MARCO is Microsoft’s reading comprehension challenge dataset. It is neatly divided into training, validation and test sets, so you can dig in straightaway. It contains well over 1 million search queries from Bing and over 180,000 well-formed answers.

In this competition, an AI algorithm is presented with ten web documents to answer a given query. To increase the complexity, Microsoft insists that the contestants use its Bing search engine. Queries are randomly selected, and answers are evaluated statistically by estimating how close they are to human answers.
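One common way to estimate how close a generated answer is to a human answer is a word-overlap score in the style of ROUGE-L, which is based on the longest common subsequence of the two texts. The sketch below is a simplified illustration of that idea, not Microsoft’s official evaluation script; the function names and the plain whitespace tokenization are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L style F1 between a model answer and a human answer."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of the model answer that matches
    recall = lcs / len(ref)      # share of the human answer that is covered
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers, made up for illustration
human = "Paris is the capital of France"
model = "The capital of France is Paris"
print(round(rouge_l_f1(model, human), 3))
```

A score of 1.0 means the two answers share the same words in the same order, and 0.0 means they have no words in common.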

You can download the MS MARCO dataset here.

TriviaQA

TriviaQA is a massive reading comprehension dataset which contains over 650,000 question-answer-evidence triples, built from 95,000 question-answer pairs. The evidence documents included with the dataset provide supervision for answering the questions. The file size is 7.2GB, so ensure you’re on a solid internet connection before you start the download!

You can download the dataset from here.

Our take on this

I am a big NLP enthusiast, so news like this definitely gets me excited! What I like about these two competitions, and the Stanford Question Answering Dataset, is that they are open. Anyone can download the datasets and work on them. I would love to see folks from Analytics Vidhya’s community develop algorithms that compete in these challenges. It’s the perfect opportunity to apply all those NLP concepts you’ve been learning and measure your progress against data scientists from around the world.

Coming to Samsung, this is quite a win for them because it strengthens Bixby, their virtual assistant. I expect we’ll see ConZNet being applied within Bixby when the next update comes out.


Pranav Dar

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.
