Unleashing the Potential of Domain-Specific LLMs

Deepanjan Kundu 12 Sep, 2023 • 9 min read

Introduction

Large Language Models (LLMs) have changed the entire world. Especially in the AI community, this is a giant leap forward. Building a system that can understand and reply to any text was unthinkable a few years ago. However, these capabilities come at the cost of depth. Generalist LLMs are jacks of all trades but masters of none. For domains that require depth and precision, flaws like hallucinations can be costly. Does that mean domains like medicine, finance, engineering, and law will never reap the benefits of LLMs? Experts have already started building dedicated domain-specific LLMs for such areas, which leverage the same underlying techniques, such as self-supervised learning and RLHF. This article explores domain-specific LLMs and their capability to yield better results.

Learning Objectives

Before we dive into the technical details, let us outline the learning objectives of this article:

  • Learn the concept of large language models (LLMs) and understand their strengths and benefits.
  • Understand the limitations of popular generalist LLMs.
  • Find out what domain-specific LLMs are and how they can help overcome the limitations of generalist LLMs.
  • Explore different techniques for building domain-specific language models, with examples showing their performance benefits in fields such as law, code completion, finance, and biomedicine.

This article was published as a part of the Data Science Blogathon.

What are LLMs?

A large language model, or LLM, is an artificial intelligence system that contains hundreds of millions to billions of parameters and is built to understand and generate text. Training involves exposing the model to vast amounts of internet text, including books, articles, websites, and other written materials, and teaching it to predict masked words or the following words in sentences. By doing so, the model learns the statistical patterns and linguistic relationships in the text it has been trained on. LLMs can be used for various tasks, including language translation, text summarization, question answering, content generation, and more. Since the invention of transformers, countless LLMs have been built and published. Some examples of recently popular LLMs are ChatGPT, GPT-4, LLaMA, and Stanford Alpaca, which have achieved groundbreaking performance.
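
To make the masked-word prediction objective concrete, here is a minimal sketch using the Hugging Face transformers library. The model name and example sentence are illustrative assumptions, not part of the original article.

```python
# A minimal sketch of masked-word prediction, the self-supervised objective
# described above. The model name and example sentence are illustrative choices.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts the most likely tokens for the [MASK] position based on
# the statistical patterns it learned during pre-training.
predictions = fill_mask("Large language models are trained on vast amounts of [MASK] data.")
for p in predictions:
    print(f"{p['token_str']}: {p['score']:.3f}")
```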

Strength of LLMs

LLMs have become the go-to solution for language understanding, entity recognition, language generation problems, and more. Stellar performance on standardized evaluation benchmarks like GLUE, SuperGLUE, SQuAD, and BIG-bench reflects this achievement. When released, BERT, T5, GPT-3, PaLM, and GPT-4 all delivered state-of-the-art results on these standardized tests. GPT-4 scored higher on the bar exam and the SAT than the average human. The chart below (Figure 1) shows the significant improvement on the GLUE benchmark since the advent of large language models.

[Figure 1: Improvement in GLUE benchmark scores since the advent of large language models]

Another major advantage of large language models is their improved multilingual capability. For example, the multilingual BERT model, trained on 104 languages, has shown strong zero-shot and few-shot results across different languages. Moreover, the cost of leveraging LLMs has become relatively low: low-cost methods like prompt design and prompt tuning allow engineers to leverage existing LLMs with minimal effort and data. Hence, large language models have become the default option for language-based tasks, including language understanding, entity recognition, translation, and more.
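
As a small illustration of prompt design, the sketch below assembles a few-shot prompt that can be sent to any instruction-following LLM; the classification task, examples, and labels are illustrative assumptions.

```python
# A minimal sketch of few-shot prompt design: task instructions plus a handful
# of labeled examples are packed into the prompt, so no model weights are updated.
# The task, examples, and labels here are illustrative assumptions.
FEW_SHOT_EXAMPLES = [
    ("The product arrived broken and support never replied.", "negative"),
    ("Setup took two minutes and it works flawlessly.", "positive"),
]

def build_prompt(review: str) -> str:
    lines = ["Classify the sentiment of each customer review as positive or negative.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {review}\nSentiment:")
    return "\n".join(lines)

prompt = build_prompt("The battery dies within an hour of use.")
print(prompt)  # Send this string to any instruction-following LLM API.
```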

Limitations of Generalist LLMs

Most popular LLMs, like the ones mentioned above, are trained on varied text sources from the web, books, Wikipedia, and more, and are called generalist LLMs. These LLMs have many applications, ranging from search assistants (Bing Chat using GPT-4, Bard using PaLM), to content generation tasks like writing marketing emails, marketing content, and sales pitches, to question-answering applications like personal chatbots and customer service chatbots.

Although generalist models have shown great skill in understanding and generating text across a wide range of topics, they often lack the depth and nuance required for specialized areas. For example, “bonds” are a form of borrowing in the finance industry, but a general language model may not understand this usage and may confuse it with chemical bonds or the bond between two people. Domain-specific LLMs, on the other hand, have a specialized understanding of the terminology of their use case and can interpret industry-specific ideas properly.

Moreover, generalist LLMs pose multiple privacy challenges. In the case of medical LLMs, for example, patient data is highly sensitive, and exposing such confidential data to generic LLMs could violate privacy agreements, especially when user inputs are fed back into training through techniques like RLHF. Domain-specific LLMs, on the other hand, can be trained and served within a closed framework, preventing data leaks.

Similarly, generalist LLMs are prone to significant hallucination, as they are often tuned heavily toward creative writing. Domain-specific LLMs are more precise and perform significantly better on their field-specific benchmarks, as the use cases below show.

Domain-specific LLMs

LLMs that are trained on domain-specific data are called domain-specific LLMs. The term “domain” covers anything from a specific field, like medicine or finance, to a specific product, like YouTube comments. A domain-specific LLM aims to perform best on domain-specific benchmarks; generic benchmarks become less important. There are multiple ways to build such dedicated language models. The most popular approach is fine-tuning an existing LLM on domain-specific data, but pre-training is the way to go for use cases striving for state-of-the-art performance in a niche domain.

Fine-Tuning vs. Pre-training

Fine-tuning an existing LLM on a particular domain can greatly simplify the development of a language model for that domain. In fine-tuning, the model reuses the knowledge encoded during pre-training and tweaks its parameters based on domain-specific data. Fine-tuning requires less training time and less labeled data, and because of its low cost, it has been the popular approach for building domain-specific LLMs. However, fine-tuning can have severe performance limitations, especially in niche domains. Consider a simple example of BERT models built for legal language understanding (paper). Two pre-trained models are used: BERT-base and Custom Legal-BERT. As shown in the image below, a Custom Legal-BERT model fine-tuned on legal tasks significantly outperforms a BERT-base model fine-tuned on the same tasks.

"

The above example clearly exhibits the power of domain-specific pre-training over fine-tuning alone in niche areas like law. Fine-tuning generic language models is helpful for more generalized language problems, but niche problem areas do much better with domain-specific pre-trained LLMs. For reference, the sketch below shows what the fine-tuning path typically looks like in practice; the following sections then explain different pre-training approaches, with an example of each approach and its results.
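
This is a minimal sketch of the fine-tuning path, assuming the Hugging Face transformers and datasets libraries; the checkpoint name, the hypothetical legal_train.csv and legal_test.csv files with "text" and "label" columns, and the hyperparameters are illustrative assumptions, not the setup of the paper cited above.

```python
# A minimal sketch of fine-tuning: start from a pre-trained checkpoint and train
# a classification head on labeled, domain-specific data. Checkpoint, data files,
# and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Hypothetical labeled legal-classification dataset with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "legal_train.csv", "test": "legal_test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-bert-finetuned", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```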

Domain-Specific Pre-training

Pre-training a language model on a large dataset carefully selected or created to align with a specific field is called domain-specific pre-training. By training on domain-specific data, models learn the terminology, concepts, and subtleties unique to that field. This helps the model capture the field's unique requirements, language, and context, producing predictions and replies that are more accurate and contextually appropriate, and it improves the precision of the model's generative capabilities. There are multiple ways to use domain-specific data for pre-training LLMs. Here are a few of them:

Approach 1

Use only domain-specific data, instead of general data, to pre-train the model on self-supervised language modeling tasks. This way, the model learns domain-specific knowledge. The resulting domain-specific LLM can then be fine-tuned for the required task to build the task-specific model. This is the simplest way to pre-train a domain-specific LLM. The figure below shows the flow of using only domain-specific data for self-supervised learning to build the domain-specific LLM.

[Figure: Approach 1 – self-supervised pre-training on domain-specific data only, followed by task-specific fine-tuning]
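
Here is a minimal sketch of Approach 1, assuming the Hugging Face transformers and datasets libraries: a randomly initialized model is pre-trained with masked language modeling on a domain-only corpus. The domain_corpus.txt file, the reuse of the BERT tokenizer, and the hyperparameters are illustrative assumptions.

```python
# A minimal sketch of Approach 1: pre-train a model from scratch on a domain-only
# corpus with a self-supervised masked-language-modeling objective.
# Tokenizer choice, file path, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, BertConfig, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # or a domain-trained tokenizer

# Randomly initialized weights: no general-corpus pre-training is reused.
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# Hypothetical corpus of raw domain text, one document per line.
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})
corpus = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-lm", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=corpus["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()  # The resulting checkpoint can then be fine-tuned for the downstream task.
```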

Example: StarCoderBase

StarCoderBase is a large language model for code (Code LLM) trained on permissively licensed data from GitHub, spanning 80+ programming languages, Git commits, and Jupyter notebooks. It is a 15B-parameter model trained on 1 trillion tokens. StarCoderBase beat much larger models, including PaLM, LaMDA, and LLaMA, while being substantially smaller, illustrating the usefulness of domain-specialized LLMs.

"

Approach 2

Combine domain-specific data with general data to pre-train the model on self-supervised language modeling tasks. This way, the model learns domain-specific knowledge while using the general-language pre-training to improve its language understanding. The figure below shows the flow of using both domain-specific data and general corpora for self-supervised learning to build the domain-specific LLM, which can then be fine-tuned for domain-specific tasks.

[Figure: Approach 2 – self-supervised pre-training on a mix of domain-specific and general corpora, followed by task-specific fine-tuning]
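
Here is a minimal sketch of the data-mixing step in Approach 2, assuming the Hugging Face datasets library. The file names and the 50/50 sampling ratio are illustrative assumptions (BloombergGPT's corpus is roughly half financial and half public data, but its actual pipeline differs).

```python
# A minimal sketch of Approach 2: mix domain-specific and general corpora for
# self-supervised pre-training. File paths and mixing ratio are illustrative assumptions.
from datasets import interleave_datasets, load_dataset

domain = load_dataset("text", data_files={"train": "financial_corpus.txt"}, split="train")
general = load_dataset("text", data_files={"train": "general_corpus.txt"}, split="train")

# Sample from both sources according to the chosen mixing probabilities.
mixed = interleave_datasets([domain, general], probabilities=[0.5, 0.5], seed=42)

# `mixed` can then be tokenized and fed to the same self-supervised
# language-modeling training loop shown in the Approach 1 sketch.
print(mixed[0])
```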

Example: BloombergGPT

BloombergGPT is a finance-domain LLM trained on an extensive archive of financial data, including a 363-billion-token dataset of English financial documents. This data was supplemented with a 345-billion-token public dataset to create a massive training corpus of over 700 billion tokens. The researchers built a 50-billion-parameter decoder-only causal language model using a subset of this training corpus. Notably, BloombergGPT surpassed existing open models of a similar scale by a wide margin on finance-specific NLP benchmarks. The chart below shows BloombergGPT's performance on finance-specific NLP tasks. Source: Bloomberg.

"

Approach 3

Build or use a pre-trained generic LLM and warm-start from its parameters. Then continue the self-supervised language modeling tasks on domain-specific data on top of this generic LLM to build the domain-specific LLM, which can then be fine-tuned for the required task to build the task-specific model. This leverages transfer learning from the generic LLM by initializing from its weights. The figure below shows the flow of step-by-step self-supervised learning, first on general and then on domain-specific corpora, to build the domain-specific LLM.

[Figure: Approach 3 – continued self-supervised pre-training of a generic LLM on domain-specific corpora, followed by task-specific fine-tuning]
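
Here is a minimal sketch of Approach 3, assuming the Hugging Face transformers and datasets libraries: a generic pre-trained checkpoint is loaded and pre-training continues on a domain corpus, in the spirit of BioBERT below. The checkpoint name, the biomedical_abstracts.txt file, and the hyperparameters are illustrative assumptions.

```python
# A minimal sketch of Approach 3: warm-start from a generic pre-trained checkpoint
# and continue self-supervised pre-training on a domain corpus.
# Checkpoint name, file path, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # reuse general-language weights

# Hypothetical biomedical corpus, e.g. abstracts, one per line.
corpus = load_dataset("text", data_files={"train": "biomedical_abstracts.txt"})
corpus = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bio-lm", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=corpus["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()  # Fine-tune the result on NER, relation extraction, or QA as needed.
```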

Example: BioBERT

BioBERT (Lee et al., 2019) builds on the BERT-base model (Devlin et al., 2019) with additional biomedical domain pre-training. The model was trained for 200K steps on PubMed and 270K steps on PMC, followed by 1M steps on the PubMed dataset. When pre-trained on biomedical corpora, BioBERT beats BERT and earlier state-of-the-art models on biomedical text-based tasks while keeping almost the same architecture across tasks. BioBERT outperforms BERT on three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement), and biomedical question answering (12.24% MRR improvement).

Advantages of Domain-Specific Pre-trained LLMs

The examples above illustrate the power of pre-training a language model on a specific domain. The techniques listed can significantly improve performance on tasks in that domain, and the advantages go beyond raw performance: domain-specific LLMs ultimately deliver better user experiences. Another important advantage of domain-specific LLMs is reduced hallucination. A big problem with large models is the possibility of hallucination, i.e., generating inaccurate information. By restricting the spectrum of use cases, domain-specific LLMs can prioritize precision in their replies and reduce hallucinations. Another major benefit of domain-specific LLMs is the protection of sensitive or private information, a major concern for today's businesses.

Conclusion

As more use cases adopt LLMs for better performance and multilingual capabilities, it is worthwhile to start approaching new problems through the lens of LLMs. Moreover, the performance data in the sections above suggests that migrating existing solutions to LLMs is a worthwhile investment. Running experiments with the approaches described in this article will improve your chances of achieving your targets with domain-specific pre-training.

Key Takeaways

  • LLMs are powerful due to their strong zero-shot and few-shot learning performance, multilingual capabilities, adaptability to various use cases, and ease of use with little data.
  • However, generalist LLMs have limitations such as hallucination and low precision, a lack of niche-domain understanding, and potential privacy violations.
  • Domain-specific LLMs address these limitations. For the best results, pre-training a custom large language model is better than fine-tuning one; custom LLMs built for a particular domain perform much better and with higher precision.
  • Domain-specific LLMs in niche fields such as law, code generation, finance, and biomedicine have demonstrated that niche foundational models outperform generalist models on their respective fields' NLP benchmarks.

Frequently Asked Questions

Q1. What are Large language models (LLMs)?

A. A large language model (LLM) is characterized by its size. LLMs are built with artificial neural networks, typically the transformer architecture, containing tens of millions to billions of weights, and are pre-trained on vast amounts of text data, mostly scraped from the Internet, using self-supervised and semi-supervised learning. AI accelerators make training at this scale feasible.

Q2. What are domain-specific LLMs?

A. Domain-specific LLMs are customized for a field of interest, such as law, medicine, or finance. They outperform generic LLMs on field-specific benchmarks, although they may perform poorly on general language tasks.

Q3. How to build a domain-specific LLM?

A. One can build a domain-specific LLM from scratch by pre-training it on self-supervised tasks using domain-specific corpora, either on their own, combined with generic corpora, or applied sequentially after generic pre-training. Alternatively, you can enhance the performance of a generalist LLM in a specific domain by fine-tuning it on domain-specific data. Despite its convenience, fine-tuning can have severe performance limitations, and pre-training a domain-specific model significantly outperforms fine-tuning for most niche use cases.

Q4. What are the benefits of domain-specific LLMs over generalist LLMs?

A. Key benefits of domain-specific LLMs are better performance on the target domain, fewer hallucinations, and better privacy protections.

Q5. What are some example use cases for domain-specific LLMs?

A. Some example applications of domain-specific LLMs covered in this article are BioBERT for biomedicine, Custom Legal-BERT for law, BloombergGPT for finance, and StarCoder for code completion.

References

[1] Jinhyuk Lee and others, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics, Volume 36, Issue 4, February 2020

[2] Shijie Wu and others, BloombergGPT: A Large Language Model for Finance, 2023

[3] Raymond Li and others, StarCoder: May the source be with you!, 2023

[4] Jingqing Zhang and others, PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization, 2019

[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL-HLT 2019

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
