Jamba 1.5: Featuring the Hybrid Mamba-Transformer Architecture

Mounish V Last Updated : 04 Nov, 2024

Jamba 1.5 is an instruction-tuned large language model developed by AI21 Labs. It comes in two versions: Jamba 1.5 Large with 94 billion active parameters and Jamba 1.5 Mini with 12 billion active parameters. It combines the Mamba Structured State Space Model (SSM) with the traditional Transformer architecture and has a 256K-token effective context window, which AI21 reported as the longest among openly available models at the time of release.

Overview

Jamba 1.5 is a hybrid Mamba-Transformer model for efficient NLP, capable of processing context windows of up to 256K tokens.
Its two versions, with 94B and 12B active parameters, handle diverse language tasks while optimizing memory and speed through ExpertsInt8 quantization.
AI21’s Jamba 1.5 combines scalability and accessibility, supporting tasks from summarization to question answering across nine languages.
Its architecture allows for long-context handling and high efficiency, making it well suited to memory-heavy NLP applications.
Its hybrid architecture and high-throughput design offer versatile NLP capabilities, available through API access and on Hugging Face.

What are Jamba 1.5 Models?

The Jamba 1.5 models, in Mini and Large variants, are designed to handle natural language processing (NLP) tasks such as question answering, summarization, text generation, and classification. Trained on an extensive corpus, they support nine languages: English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew. With its joint SSM-Transformer structure, Jamba 1.5 addresses two major limitations of conventional Transformer models: high memory requirements for long context windows and slower processing.

The Architecture of Jamba 1.5

Aspect	Details
Base Architecture	Hybrid Transformer-Mamba architecture with a Mixture-of-Experts (MoE) module
Model Variants	Jamba-1.5-Large (94B active parameters, 398B total) and Jamba-1.5-Mini (12B active parameters, 52B total)
Layer Composition	Jamba-1.5-Large: 9 blocks, each with 8 layers (72 layers in total); 1:7 ratio of Transformer attention layers to Mamba layers
Mixture of Experts (MoE)	16 experts, selecting the top 2 per token for dynamic specialization
Hidden Dimensions	Jamba-1.5-Large: hidden state size of 8192
Attention Heads	Jamba-1.5-Large: 64 query heads, 8 key-value heads
Context Length	Supports up to 256K tokens, optimized for memory with significantly reduced KV cache memory
Quantization Technique	ExpertsInt8 for MoE and MLP layers, allowing efficient use of INT8 while maintaining high throughput
Activation Function	Integration of Transformer and Mamba activations, with an auxiliary loss to stabilize activation magnitudes
Efficiency	Designed for high throughput and low latency, optimized to run on 8x80GB GPUs with 256K context support
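
To make the "reduced KV cache memory" row concrete, here is a rough back-of-the-envelope sketch. It takes the layer, head, and hidden-size figures from the table as describing a single model, and it makes three assumptions that are not stated in the article: each head has dimension 8192 / 64 = 128, cached values are stored in 16-bit precision (2 bytes each), and "256K" means 262,144 tokens. The helper name kv_cache_bytes is made up for this illustration.

```python
def kv_cache_bytes(attention_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values per attention layer
    return 2 * attention_layers * kv_heads * head_dim * seq_len * bytes_per_value


GIB = 1024 ** 3
SEQ_LEN = 256 * 1024                   # assumption: "256K" read as 262,144 tokens
HEAD_DIM = 8192 // 64                  # assumption: hidden size / query heads = 128
TOTAL_LAYERS = 9 * 8                   # 9 blocks x 8 layers, from the table
ATTENTION_LAYERS = TOTAL_LAYERS // 8   # 1:7 ratio -> 1 attention layer in every 8

# Hybrid model: only the attention layers keep a KV cache
hybrid = kv_cache_bytes(ATTENTION_LAYERS, 8, HEAD_DIM, SEQ_LEN)
# Hypothetical comparison: the same stack with attention in every layer
all_attention = kv_cache_bytes(TOTAL_LAYERS, 8, HEAD_DIM, SEQ_LEN)

print(f"hybrid: {hybrid / GIB:.0f} GiB")
print(f"all-attention: {all_attention / GIB:.0f} GiB")
```

Under these assumptions the hybrid layout needs about 9 GiB of KV cache for a full 256K-token sequence, against about 72 GiB for the hypothetical all-attention stack. The 8x saving follows directly from the 1:7 attention-to-Mamba ratio, since Mamba layers do not store per-token keys and values.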

Explanation

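KV cache memory is the memory allocated for storing the keys and values of previous tokens so that they are not recomputed at every generation step. It speeds up generation but grows with sequence length, which is why it matters for long contexts.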
ExpertsInt8 quantization is a compression method using INT8 precision in MoE and MLP layers to save memory and improve processing speed.
Attention heads are separate mechanisms within the attention layer that focus on different parts of the input sequence, improving model understanding.
Mixture-of-Experts (MoE) is a modular approach where only selected expert sub-models process each input, boosting efficiency and specialization.
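
The top-2 routing idea behind a Mixture-of-Experts layer can be sketched in a few lines of plain Python. This is a toy illustration, not AI21's implementation: the experts here are simple scaling functions, the router logits are hard-coded, and normalizing the weights with a softmax over only the selected experts is an assumption made for the example. The names top_k_route and moe_forward are hypothetical.

```python
import math


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, so the k weights sum to 1
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# 16 toy experts: expert i multiplies its input by (i + 1)
experts = [lambda x, s=i + 1: s * x for i in range(16)]

# Hard-coded router scores for one token: experts 3 and 10 score highest
logits = [0.0] * 16
logits[3] = 2.0
logits[10] = 2.0

routes = top_k_route(logits)
output = moe_forward(1.0, experts, logits)
print(routes, output)
```

Only 2 of the 16 experts run for this token, which is why a model's active parameter count can be much smaller than its total parameter count.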

Intended Use and Accessibility

Jamba 1.5 is designed for a range of applications, including sentiment analysis, summarization, and paraphrasing. It is accessible via AI21’s Studio API, Hugging Face, or cloud partners, which makes it deployable in various environments. The model weights can be downloaded from Hugging Face, and the model can be fine-tuned on domain-specific data for better results.

Jamba 1.5

One way to access the Jamba 1.5 models is through AI21’s chat interface:

Chat Interface

Here’s the link: Chat Interface

This is just a small sample of the model’s question-answering capabilities.

Jamba 1.5 using Python

You can send requests to Jamba 1.5 and receive responses in Python using an API key.

To get your API key, open Settings in the left sidebar of the AI21 Studio homepage, then select API Key.

Note: You’ll get $10 free credits, and you can track the credits you use by clicking on ‘Usage’ in the settings.

Installation

!pip install ai21

Python Code

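from ai21 import AI21Client
from ai21.models.chat import ChatMessage
# The prompt is sent as a list of chat messages
messages = [ChatMessage(content="What's a tokenizer in 2-3 lines?", role="user")]
# Replace the placeholder with your own API key
client = AI21Client(api_key="YOUR_API_KEY")
# Request a streamed completion from Jamba 1.5 Mini
response = client.chat.completions.create(
  messages=messages,
  model="jamba-1.5-mini",
  stream=True
)
# Print each chunk as it arrives; chunks without text are printed as nothing
for chunk in response:
  print(chunk.choices[0].delta.content or "", end="")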

A tokenizer is a tool that breaks down text into smaller units called tokens, words, subwords, or characters. It is essential for natural language processing tasks, as it prepares text for analysis by models.

It’s straightforward: We send the message to our desired model and get the response using our API key.

Note: You can also use the jamba-1.5-large model instead of jamba-1.5-mini by changing the model argument.
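
If you want the full reply as a single string rather than printing it piece by piece, the streamed chunks can be collected with a small helper. The sketch below is an illustration under assumptions: collect_stream and fake_chunk are hypothetical names, and the chunk shape (choices[0].delta.content, possibly empty or None on some chunks) is inferred from the attributes used in the code above rather than taken from the SDK documentation. It runs against stand-in chunks, so no API key or network call is needed.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text pieces of a streamed chat response into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices at all
        piece = chunk.choices[0].delta.content
        if piece:     # skip chunks whose content is None or empty
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk, used only for this demonstration."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_stream = [fake_chunk("A tokenizer "), fake_chunk("splits text."), fake_chunk(None)]
text = collect_stream(demo_stream)
print(text)
```

With a real client, the same helper could be called as collect_stream(response) in place of the printing loop, assuming the response chunks have the shape described above.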

Conclusion

Jamba 1.5 blends the strengths of the Mamba and Transformer architectures. With its scalable design, high throughput, and long-context handling, it is well suited to applications ranging from summarization to sentiment analysis. Its accessible integration options and optimized efficiency make it practical to deploy across various environments, and it can be fine-tuned on domain-specific data for better results.

Frequently Asked Questions

Q1. What is Jamba 1.5?

Ans. Jamba 1.5 is a family of large language models designed with a hybrid architecture combining Transformer and Mamba elements. It includes two versions, Jamba-1.5-Large (94B active parameters) and Jamba-1.5-Mini (12B active parameters), optimized for instruction-following and conversational tasks.

Q2. What makes Jamba 1.5 efficient for long-context processing?

Ans. Jamba 1.5 models support an effective context length of 256K tokens. The hybrid architecture keeps the KV cache small because only a fraction of the layers are attention layers, and the ExpertsInt8 quantization technique reduces the memory taken by the model weights. Together, these let the models handle long-context data with reduced memory usage.

Q3. What is the ExpertsInt8 quantization technique in Jamba 1.5?

Ans. ExpertsInt8 is a custom quantization method that compresses model weights in the MoE and MLP layers to INT8 format. This technique reduces memory usage while maintaining model quality and is compatible with A100 GPUs, enhancing serving efficiency.
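
The general idea of storing weights in INT8 can be illustrated with a generic symmetric quantization sketch. This is not the ExpertsInt8 kernel itself: the per-row scale, the rounding scheme, and the function names quantize_int8 and dequantize are assumptions chosen for a simple, self-contained example.

```python
def quantize_int8(weights):
    """Map a row of float weights to integers in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clip to the INT8 range used here
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights before they are used in computation."""
    return [q * scale for q in quantized]


row = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(row)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))

print(q)          # integers that fit in one byte each
print(max_error)  # rounding error, at most half a quantization step
```

Each weight is stored in one byte instead of two (for 16-bit formats), roughly halving the memory of the quantized layers, at the cost of a rounding error bounded by half a quantization step.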

Q4. Is Jamba 1.5 available for public use?

Ans. Yes, both Large and Mini are publicly available under the Jamba Open Model License. The models can be accessed on Hugging Face.
