Microsoft’s 1-bit LLMs Explained

NISHANT TIWARI 15 Mar, 2024 • 5 min read

Introduction

In recent years, Large Language Models (LLMs) have expanded tremendously in both size and capability. These gains, however, have created deployment challenges and raised concerns about the environmental footprint and economic cost of their high energy consumption. This is where Microsoft’s research on 1-bit LLMs becomes significant. 1-bit model architectures such as BitNet promise to rein in the costs associated with LLMs while sustaining their effectiveness. Notably, one variant, BitNet b1.58, has demonstrated great promise: it matches full-precision performance at a fraction of the cost, marking a new era of efficiency in computing and potentially enabling hardware designed specifically for 1-bit LLMs. This article discusses the advancements and implications of 1-bit LLMs, especially the BitNet b1.58 model.


The Era of 1-bit LLMs

In recent years, the field of AI has seen a significant surge in the size and capabilities of LLMs. These models have demonstrated exceptional performance across a wide range of natural language processing (NLP) tasks. However, their increasing size has posed challenges for deployment and raised concerns about their environmental and economic impact due to high energy consumption. To address these challenges, recent research has paved the way for a new era of 1-bit LLMs.

One approach to reducing the cost of LLMs while maintaining their performance is post-training quantization, which converts the weights of an already-trained model to a lower bit-width for inference. However, accuracy degrades as the bit-width shrinks, making this technique sub-optimal and motivating architectures that are trained in low precision from the start, such as BitNet.
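To make the idea concrete, here is a minimal sketch of symmetric absmax post-training quantization in PyTorch. The function names are illustrative rather than taken from any library, and the reconstruction error printed at the end is exactly the loss that grows as the bit-width shrinks:

```python
# A minimal sketch of post-training quantization (PTQ): symmetric absmax
# quantization of already-trained weights to int8. Names are illustrative.
import torch

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0               # largest magnitude maps to 127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale                    # approximate reconstruction

w = torch.randn(4, 4)                           # stand-in for trained weights
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())   # rounding error; grows at lower bit-widths
```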

BitNet b1.58, a significant 1-bit LLM variant, introduces a new scaling law and the recipe for training new generations of LLMs that are both high-performance and cost-effective. It matches the full-precision Transformer LLM in terms of both perplexity and end-task performance while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. This marks a pivotal shift in the landscape of language models, offering a promising direction for the future of LLMs.

BitNet b1.58

BitNet b1.58 is a significant 1-bit LLM variant that introduces a new computation paradigm and opens the door to hardware designed specifically for 1-bit LLMs. It is based on the BitNet architecture, a Transformer that replaces nn.Linear with BitLinear, and is trained from scratch with 1.58-bit weights and 8-bit activations. Each weight takes one of only three values, -1, 0, or +1; since log2(3) ≈ 1.58, every weight carries at most 1.58 bits of information, hence the name.
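Below is a minimal, inference-style sketch of a BitLinear layer in PyTorch, following the quantization scheme described in the BitNet b1.58 paper: absmean ternarization for weights and per-token absmax 8-bit quantization for activations. It is an illustration of the idea rather than the official implementation, and it emulates the low-bit arithmetic in floating point:

```python
import torch
import torch.nn as nn

def weight_ternarize(w: torch.Tensor):
    gamma = w.abs().mean().clamp(min=1e-5)        # absmean scale
    wq = (w / gamma).round().clamp(-1, 1)         # ternary: {-1, 0, +1}
    return wq, gamma

def activation_quant(x: torch.Tensor):
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    xq = (x * scale).round().clamp(-127, 127)     # 8-bit, per-token absmax
    return xq, scale

class BitLinear(nn.Module):
    """Drop-in replacement for nn.Linear with 1.58-bit weights (sketch)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        wq, gamma = weight_ternarize(self.weight)
        xq, scale = activation_quant(x)
        # On dedicated hardware this matmul needs no multiplications at all,
        # since every weight is -1, 0, or +1; here we emulate it in float.
        y = xq @ wq.t()
        return y * gamma / scale                  # undo both scale factors

layer = BitLinear(8, 4)
print(layer(torch.randn(2, 8)).shape)             # torch.Size([2, 4])
```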

BitNet b1.58 preserves the advantages of its predecessor, the original 1-bit BitNet, including its innovative computation paradigm, which replaces almost all multiplications in matrix multiplication with additions and subtractions, enabling highly optimized execution. It maintains the same energy efficiency as the original model while significantly reducing memory consumption and latency, and improving throughput, compared to FP16 LLM baselines.

Additionally, BitNet b1.58 introduces two notable enhancements. First, its modeling capability is bolstered by explicit support for feature filtering, achieved by including 0 in the model weights: a zero weight simply drops the corresponding input feature, which significantly improves the performance of 1-bit LLMs. Second, experimental results show that BitNet b1.58 can match full-precision (FP16) baselines in both perplexity and end-task performance, starting at a 3B size under identical configurations. This demonstrates the potential of BitNet b1.58 to redefine the scaling law and offer a Pareto improvement over existing LLMs.
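A toy example with made-up numbers illustrates both properties at once: the dot product against a ternary weight row needs only additions and subtractions, and a zero weight filters its input feature out of the computation entirely:

```python
x = [0.9, -1.2, 0.4, 2.0]       # one input token's features (made-up values)
w = [1, 0, -1, 1]               # one ternary weight row, as in BitNet b1.58

acc = 0.0
for xi, wi in zip(x, w):
    if wi == 1:
        acc += xi               # multiplication replaced by addition
    elif wi == -1:
        acc -= xi               # ... or by subtraction
    # wi == 0: the feature is filtered out entirely

print(acc)                      # 0.9 - 0.4 + 2.0 = 2.5
```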

Breaking the Length Barrier

In the era of LLMs, the ability to handle long sequences has become a critical demand. BitNet b1.58 represents a significant step towards native support for long sequences. It reduces the activations from 16 bits to 8 bits, effectively doubling the context length with the same resources. This advancement is crucial as it addresses the challenge of memory consumption introduced by the KV caches during long sequence inference.
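Some back-of-the-envelope arithmetic makes this concrete. The layer and head counts below are illustrative, not taken from the paper; the point is that halving the bytes per cached activation doubles the sequence length that fits in the same memory budget:

```python
# Illustrative KV-cache sizing for a hypothetical decoder-only model.
def kv_cache_bytes(seq_len, n_layers=26, n_heads=32, head_dim=100,
                   bytes_per_val=2):
    # K and V tensors per layer, each of shape [seq_len, n_heads, head_dim]
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_val

budget = kv_cache_bytes(4096, bytes_per_val=2)    # 16-bit activations
same = kv_cache_bytes(8192, bytes_per_val=1)      # 8-bit activations
print(budget == same)                             # True: context length doubled
```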

Furthermore, the authors suggest the activations could be further losslessly compressed to 4 bits or even lower, which is left as future work. By enabling native support for long sequences, BitNet b1.58 offers a solution to one of the major challenges of the LLM era and paves the way for more efficient and effective processing of extended sequences in NLP tasks.

LLMs on the Edge & Mobile Devices

The deployment of 1.58-bit LLMs can significantly enhance the performance of language models on edge and mobile devices. These devices are often constrained by limited memory and computational power, which restricts the performance and scale of the LLMs they can run. The reduced memory and energy consumption of 1.58-bit LLMs makes them suitable for deployment on such devices, unlocking a wide range of applications that were previously not feasible.

Figure: long sequences, new hardware, and mobile devices as application areas for 1-bit LLMs.

This advancement enhances the capabilities of edge and mobile devices and enables the development of new and innovative LLM applications in these contexts. Additionally, 1.58-bit LLMs are more compatible with CPUs, the primary processors in edge and mobile devices, which allows BitNet b1.58 to be executed efficiently on them, further improving their performance and capabilities in NLP tasks.
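To see where the memory savings come from, here is a small sketch using an illustrative packing scheme, not taken from the paper: five ternary weights fit in a single byte because 3^5 = 243 ≤ 256, i.e. 1.6 bits per weight versus 16 bits for FP16, roughly a 10x reduction in weight storage.

```python
# Illustrative packing of 5 ternary weights into one byte (0..242).
def pack5(trits):
    """trits: list of 5 values from {-1, 0, +1} -> one byte."""
    code = 0
    for t in reversed(trits):
        code = code * 3 + (t + 1)          # base-3 digit in {0, 1, 2}
    return code

def unpack5(code):
    trits = []
    for _ in range(5):
        code, digit = divmod(code, 3)
        trits.append(digit - 1)
    return trits

w = [-1, 0, 1, 1, -1]
assert unpack5(pack5(w)) == w              # lossless round-trip
fp16_bits, packed_bits = 16 * len(w), 8    # 80 bits vs 8 bits for 5 weights
print(fp16_bits / packed_bits)             # 10.0x smaller than FP16 storage
```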

New Hardware for 1-bit LLMs

Let’s now discuss the need for hardware optimized specifically for 1-bit LLMs. The emergence of 1-bit LLMs, exemplified by BitNet b1.58, presents a new computation paradigm that demands specialized hardware to fully leverage its potential. Recent work such as Groq has demonstrated promising results in building specific hardware, such as LPUs, for LLMs. Given the new computation paradigm enabled by BitNet b1.58, the authors envision and advocate for the design of new hardware and systems optimized specifically for 1-bit LLMs.

The need for new hardware arises from the unique characteristics of 1-bit LLMs, including their reduced memory and energy consumption, as well as their potential to significantly improve the performance of language models on edge and mobile devices. These devices, often constrained by limited memory and computational power, can benefit from the deployment of 1-bit LLMs. Furthermore, the compatibility of 1-bit LLMs with CPU devices, commonly used in edge and mobile devices, ensures efficient execution, further enhancing their performance and capabilities in processing natural language tasks.

Training with 2 Trillion Tokens

The number of training tokens is a crucial factor for LLMs. To assess the scalability of BitNet b1.58 in terms of tokens, a model was trained on 2 trillion tokens, following the data recipe of StableLM-3B, a state-of-the-art open-source 3B model. Both models were evaluated on a benchmark consisting of Winogrande, PIQA, SciQ, LAMBADA, and ARC-easy.

Figure: zero-shot accuracy of BitNet b1.58 and StableLM-3B on the benchmark tasks.

The zero-shot accuracy results (shown above) indicate that BitNet b1.58 outperforms StableLM-3B at 2 trillion tokens across all end tasks. This demonstrates the superior performance and strong generalization capabilities of 1.58-bit LLMs. It highlights their potential for handling extensive training data and achieving high accuracy across diverse NLP tasks.

Discussion and Future Work

Let’s now look at potential applications and future directions for 1-bit LLMs. Mixture-of-Experts (MoE) is a cost-effective approach that significantly reduces computation FLOPs, but it suffers from high memory consumption and inter-chip communication overhead. The reduced footprint of 1-bit LLMs makes 1-bit MoE models a natural way to address both problems.

The reduced memory footprint of 1.58-bit LLMs is one of the key factors here: it can reduce the number of devices required for deployment and minimize the overhead of transferring activations across networks. Additionally, the new computation paradigm enabled by BitNet b1.58 calls for hardware and systems tailored specifically to 1-bit LLMs, highlighting the potential for significant advances in performance, efficiency, and applicability across computing platforms. This positions 1-bit LLMs as a pivotal innovation in the field of large language models.

Conclusion

The rise of 1-bit LLMs, exemplified by BitNet b1.58, represents a significant leap forward in AI. These models offer efficient solutions to deployment challenges and environmental concerns, matching, and in some benchmarks outperforming, full-precision models while sharply reducing memory usage and computational cost.

BitNet b1.58’s emergence paves the way for specialized hardware tailored to 1-bit LLMs, especially beneficial for edge and mobile devices. Its scalability, demonstrated by training with 2 trillion tokens, underscores its versatility across diverse language tasks. It signifies a transformative shift in NLP, promising enhanced efficiency and effectiveness with ongoing advancements in hardware and model design.

