Exploring the AI Nexus with Matthew Honnibal

By Nitika Sharma. Last updated: 13 Jun, 2024


In the latest episode of Leading with Data, we had the pleasure of hosting Matthew Honnibal, the founder of Explosion AI and creator of the widely-used spaCy NLP library. Matthew’s mission is to democratize the development of language technologies, making it accessible beyond those with advanced degrees in the field. With a prolific background in both theoretical and practical aspects of natural language processing (NLP), Matthew has significantly contributed to the advancement of the domain. His work includes over 20 peer-reviewed publications, breakthrough contributions in parsing conversational speech, and impactful projects that bridge the gap between research and real-world applications.

You can listen to this episode of Leading with Data on popular platforms like Spotify, Google Podcasts, and Apple. Pick your favorite to enjoy the insightful content!

Key Insights from our Conversation with Matthew Honnibal

The evolution of NLP has been significantly influenced by deep learning and pre-trained transformers, which have changed the way models are trained and utilized.
Large language models (LLMs) like GPT-3 and GPT-4 have introduced new capabilities, but there’s still a place for task-specific trained models, especially in niche domains.
Explosion has focused on staying true to the original purpose of spaCy while adapting to new developments in NLP, such as the introduction of spaCy LLM for prototyping.
The belief in custom models and transparent tools is rooted in the idea that long-term project success depends on the ability to improve consistently over time, which is facilitated by open source software.
The future of NLP models will likely involve a mix of smaller, task-specific models for machine-facing tasks and the use of LLMs to improve the process of creating these classifiers.
Multimodality is becoming more feasible and important in NLP, particularly in understanding and processing formatted documents, which is a significant business need.

Join our upcoming Leading with Data sessions for insightful discussions with AI and Data Science leaders!

Let’s look into the details of our conversation with Dr. Matthew Honnibal!

How has the domain of NLP evolved since 2019, and what has been the impact on spaCy?

Over the last few years, NLP has advanced significantly, particularly with the advent of deep learning and pre-trained transformers like BERT. These models have revolutionized the field by making effective use of unlabeled data, so task-specific models can be trained with fewer labeled examples. This shift has been a game-changer: a model starts with some knowledge of language before it is applied to a task, rather than learning everything from scratch.

spaCy has continued to serve the needs it was designed for, despite the emergence of new technologies like large language models (LLMs). The library has remained relevant, and its use cases have only grown as more people delve into NLP. We’ve stayed true to our roots, focusing on solving real NLP problems and ensuring that spaCy evolves alongside the field without deviating from its original purpose.

What are your thoughts on the current LLMs like ChatGPT and GPT-4?

Initially, I was skeptical about the potential of LLMs, but their success has been undeniable. However, it’s still unclear what direction things will take. While in-context learning has its advantages, especially as a prototyping tool, there’s still a significant technical benefit to training models for classification problems. The more niche a domain, the better the outcome of a trained model over in-context learning. It’s not just about the domain but also the task. For instance, in-context learning may not be as effective for tasks with many labels or nonarbitrary tasks.
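One way to see why in-context learning gets harder as the label set grows is to look at the prompt itself. The sketch below is only an illustration under stated assumptions: `build_few_shot_prompt` and `prompt_size_in_words` are hypothetical helper names, the prompt layout is one of many possible formats, and no real LLM is called. It shows that a few-shot prompt has to carry examples for every label, so its length grows with the number of labels, whereas a trained classifier absorbs those examples into its weights.

```python
from typing import Dict, List


def build_few_shot_prompt(examples_by_label: Dict[str, List[str]], text: str) -> str:
    """Assemble a plain-text classification prompt with examples for every label."""
    labels = sorted(examples_by_label)
    lines = ["Classify the text into one of: " + ", ".join(labels), ""]
    for label in labels:
        for example in examples_by_label[label]:
            lines.append(f"Text: {example}")
            lines.append(f"Label: {label}")
            lines.append("")
    # The text to classify goes last, with the label left open for the model.
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)


def prompt_size_in_words(num_labels: int, examples_per_label: int = 2) -> int:
    """Rough prompt length (whitespace-separated words) for a synthetic label set."""
    examples = {
        f"label_{i}": [
            f"sample sentence number {j} for label {i}"
            for j in range(examples_per_label)
        ]
        for i in range(num_labels)
    }
    prompt = build_few_shot_prompt(examples, "a new text to classify")
    return len(prompt.split())


if __name__ == "__main__":
    for n in (2, 20, 200):
        print(n, "labels ->", prompt_size_in_words(n), "words in the prompt")
```

Every request pays for the full prompt again, which is part of why a trained model can be the better fit for tasks with many labels.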

How has Explosion evolved during this period, and what are the key focus areas?

Explosion has seen a lot of changes, including the pandemic and the growth of AI technologies. We’ve maintained our commitment to using the tools we develop and solving real NLP problems. Consulting has been an integral part of our business, allowing us to stay in touch with real-world applications and test new methods. spaCy LLM, our latest initiative, encapsulates the process of prompting an LLM, annotating a spaCy Doc object, and allowing users to replace the LLM-powered module with a trained model if desired. It’s particularly useful for prototyping and working alongside rule-based classifiers.
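The workflow described above (prompt an LLM, write the result onto a document object, and later swap the LLM-backed step for a trained model) can be sketched without any library. What follows is a minimal, library-free illustration and not the spaCy or spaCy LLM API: `Doc`, `Pipeline`, `fake_llm`, `llm_classifier`, and `trained_classifier` are hypothetical names, the LLM call is a stub, and the "trained" model is a placeholder with hand-set weights.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Doc:
    """Toy stand-in for a document object that pipeline components annotate."""
    text: str
    cats: Dict[str, float] = field(default_factory=dict)


Component = Callable[[Doc], Doc]


def fake_llm(prompt: str) -> str:
    """Hypothetical LLM call, stubbed with a keyword check so the sketch runs offline."""
    return "COMPLAINT" if "refund" in prompt.lower() else "OTHER"


def llm_classifier(doc: Doc) -> Doc:
    """Prompt the stubbed LLM, parse its reply, and write the label onto the doc."""
    reply = fake_llm(f"Label this message as COMPLAINT or OTHER:\n{doc.text}")
    is_complaint = reply.strip().upper() == "COMPLAINT"
    doc.cats = {"COMPLAINT": float(is_complaint), "OTHER": float(not is_complaint)}
    return doc


def trained_classifier(doc: Doc) -> Doc:
    """Placeholder for a trained model; fixed keyword weights stand in for learned ones."""
    weights = {"refund": 0.6, "broken": 0.3, "late": 0.2}
    text = doc.text.lower()
    score = min(1.0, sum(w for word, w in weights.items() if word in text))
    doc.cats = {"COMPLAINT": score, "OTHER": 1.0 - score}
    return doc


class Pipeline:
    """Ordered, named components; any one can be replaced without touching the rest."""

    def __init__(self) -> None:
        self.components: Dict[str, Component] = {}

    def add(self, name: str, component: Component) -> None:
        self.components[name] = component

    def replace(self, name: str, component: Component) -> None:
        if name not in self.components:
            raise KeyError(f"no component named {name!r}")
        self.components[name] = component  # an existing key keeps its position

    def __call__(self, text: str) -> Doc:
        doc = Doc(text)
        for component in self.components.values():
            doc = component(doc)
        return doc


if __name__ == "__main__":
    nlp = Pipeline()
    nlp.add("classifier", llm_classifier)
    message = "I want a refund, the box arrived broken"
    print(nlp(message).cats)

    # Later, once labeled data exists, swap in the trained component.
    nlp.replace("classifier", trained_classifier)
    print(nlp(message).cats)
```

The design point is that both components write to the same field on the document, so downstream code does not need to know whether the annotation came from a prompt or from a trained model.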

Can you elaborate on the belief that the best AI products require custom models and transparent tools?

The belief that developers need custom models and transparent tools stems from the idea that ease of starting isn’t the only factor that matters in AI development. What’s crucial is the ability to invest more time and effort into a project to consistently improve it. Open source software has been successful because it offers predictability and the ability to build a mental model of what you’re developing against, as opposed to vendor solutions that may hit walls as you progress.

What does the future hold for NLP models in terms of size and use cases?

I believe that smaller, task-specific models will continue to be important, especially for machine-facing tasks. The feasibility of running all classifiers at the scale of GPT-4 is doubtful due to resource constraints. However, LLMs will play a significant role in improving the efficiency of creating classifiers, especially in data annotation and understanding training issues. We’ll also see more applications that connect machine-facing outputs to human-facing outputs in rich and interesting ways.
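The idea of using an LLM to speed up the creation of a small classifier can be sketched as a two-step loop: let the LLM propose labels for unlabeled text, then train a compact model on the result. The example below is a toy under loud assumptions: `llm_suggest_label` is a hypothetical stub standing in for a real LLM call, `annotate` and `TinyNaiveBayes` are illustrative names, the human review step is omitted, and the classifier is a plain multinomial Naive Bayes with add-one smoothing rather than anything discussed in the interview.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def llm_suggest_label(text: str) -> str:
    """Hypothetical LLM annotator, stubbed with a keyword rule so the sketch runs offline."""
    billing_words = ("invoice", "charge", "payment")
    return "BILLING" if any(w in text.lower() for w in billing_words) else "TECH"


def annotate(texts: List[str]) -> List[Tuple[str, str]]:
    """Pair each text with the suggested label; a person would normally review these."""
    return [(text, llm_suggest_label(text)) for text in texts]


class TinyNaiveBayes:
    """Small multinomial Naive Bayes classifier with add-one smoothing."""

    def fit(self, data: List[Tuple[str, str]]) -> "TinyNaiveBayes":
        self.label_counts = Counter(label for _, label in data)
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        for text, label in data:
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text: str) -> str:
        total = sum(self.label_counts.values())
        # Words never seen in training carry no evidence, so they are skipped.
        words = [w for w in text.lower().split() if w in self.vocab]
        scores = {}
        for label, count in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


if __name__ == "__main__":
    unlabeled = [
        "the invoice shows a double charge",
        "my payment failed twice",
        "please resend the invoice",
        "the app crashes on login",
        "login page shows an error",
        "the app will not sync",
    ]
    model = TinyNaiveBayes().fit(annotate(unlabeled))
    print(model.predict("the app shows an error"))
    print(model.predict("a charge on my invoice"))
```

Once trained, the small model runs without any LLM call, which matches the point about resource constraints for machine-facing tasks.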

How do you see multimodality influencing NLP?

Multimodal tasks are becoming increasingly feasible with larger-scale models. While truly multimodal tasks combining text and image are rarer in business, understanding formatted documents, including tables and figures, is a significant part of the business need for NLP. Better capabilities in this area are crucial, and I expect continued improvement in handling formatted text and numbers.

Summing-up

Matthew Honnibal’s insights in this episode underscore the dynamic evolution of NLP, highlighting the profound impact of deep learning and pre-trained transformers. His balanced view on the coexistence of large language models and task-specific models emphasizes the nuanced approach needed for different NLP applications. Explosion AI’s continued innovation, particularly with the introduction of spaCy LLM, showcases their commitment to practical solutions and real-world impact. As we look to the future, Matthew’s belief in the importance of custom models and transparent tools serves as a guiding principle for sustainable AI development, ensuring adaptability and continuous improvement in the field of NLP.

For more engaging sessions on AI, data science, and GenAI, stay tuned with us on Leading with Data.





