Semantics matters in NLP because the field is fundamentally concerned with the relationships between words. One of the simplest yet highly effective techniques for capturing these relationships is the Continuous Bag of Words (CBOW) model, which maps words to meaningful vectors known as word embeddings. CBOW is part of the Word2Vec framework and predicts a word from the words adjacent to it, capturing both the semantic and the syntactic structure of language. In this article, you will learn how the CBOW model works and how to use it in practice.
Continuous Bag of Words (CBOW) is a neural-network model for learning word embeddings and is one of the two Word2Vec architectures introduced by Tomas Mikolov. CBOW tries to predict a target word from the context words surrounding it in a given sentence. In doing so it captures semantic relationships, so that words with similar meanings end up close to each other in a high-dimensional vector space.
For example, in the sentence “The cat sat on the mat”, if the context window size is 2, the context words for “sat” are [“The”, “cat”, “on”, “the”], and the model’s task is to predict the word “sat”.
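As a quick illustration, here is a minimal sketch of how such a context window can be extracted (the helper name extract_context and the window size are just for this example, not part of Word2Vec itself):

# Minimal sketch: extracting context words for a target position
def extract_context(tokens, target_index, window=2):
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right

tokens = "The cat sat on the mat".split()
print(extract_context(tokens, tokens.index("sat")))  # ['The', 'cat', 'on', 'the']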
CBOW operates by aggregating the context words (e.g., averaging their embeddings) and using this aggregate representation to predict the target word. The model’s architecture involves an input layer for the context words, a hidden layer for embedding generation, and an output layer to predict the target word using a probability distribution.
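In the standard Word2Vec formulation, if v_{c_i} denotes the embedding of the i-th of C context words, u_w denotes the output vector for word w, and V is the vocabulary, the model computes:

h = \frac{1}{C}\sum_{i=1}^{C} v_{c_i}, \qquad
P(w_t \mid c_1, \dots, c_C) = \frac{\exp\left(u_{w_t}^{\top} h\right)}{\sum_{w \in V} \exp\left(u_{w}^{\top} h\right)}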
It is a fast and efficient model suitable for handling frequent words, making it ideal for tasks requiring semantic understanding, such as text classification, recommendation systems, and sentiment analysis.
CBOW is one of the simplest yet most efficient context-based techniques for word embedding, in which every word in the vocabulary is mapped to a vector. This section describes how CBOW operates at its most basic level, discusses the key ideas that underpin the method, and offers a guide to its architecture.
CBOW relies on two key concepts: context words and the target word.
By analyzing the relationship between context and target words across large corpora, CBOW generates embeddings that capture semantic relationships between words.
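"Closeness" between the resulting embeddings is usually measured with cosine similarity. Here is a minimal sketch with two made-up vectors (the numbers are illustrative, not learned values):

import numpy as np

# Cosine similarity between two word vectors (values here are made up for illustration)
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vec_king = np.array([0.5, 0.1, 0.8])
vec_queen = np.array([0.45, 0.2, 0.75])
print(cosine_similarity(vec_king, vec_queen))  # close to 1 for semantically similar words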
Here’s a breakdown of how CBOW works, step-by-step:
Convert the context words and target word into one-hot vectors based on the vocabulary size. For a vocabulary of size 5, the one-hot representation of the word “love” might look like [0, 1, 0, 0, 0].
Pass the one-hot encoded context words through an embedding layer. This layer maps each word to a dense vector representation, typically of a lower dimension than the vocabulary size.
Aggregate the embeddings of all context words (e.g., by averaging or summing them) to form a single context vector.
Pass the context vector through the output layer, apply softmax to obtain a probability distribution over the vocabulary, and update the weights so that the probability of the true target word increases. Repeat this process for all context-target pairs in the corpus until the model converges (typically measured by a falling cross-entropy loss, sketched below).
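The cross-entropy loss compares the predicted distribution with the one-hot target; a minimal sketch with made-up probabilities (the full implementation appears later in the article):

import numpy as np

# Cross-entropy loss for one prediction (probabilities here are made up for illustration)
predicted_probs = np.array([0.1, 0.6, 0.1, 0.1, 0.1])  # softmax output over a 5-word vocabulary
target_one_hot = np.array([0, 1, 0, 0, 0])              # true target word
loss = -np.sum(target_one_hot * np.log(predicted_probs))
print(loss)  # about 0.51; a lower loss means a better prediction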
The Continuous Bag of Words (CBOW) model’s architecture is designed to predict a target word based on its surrounding context words. It is a shallow neural network with a straightforward yet effective structure. The CBOW architecture consists of the following components:
Input:
Sentence: “I love machine learning”, target word: “machine”, context words: [“I”, “love”, “learning”].
One-Hot Encoding:
Vocabulary: [“I”, “love”, “machine”, “learning”, “AI”]; each context word is converted into a one-hot vector of length 5 (for example, “I” becomes [1, 0, 0, 0, 0]).
Embedding Layer:
Each one-hot vector is mapped to a dense, lower-dimensional embedding by the embedding (input weight) matrix.
Aggregation:
The embeddings of “I”, “love”, and “learning” are averaged into a single context vector.
Output Layer:
The context vector is passed through a fully connected layer followed by a softmax over the vocabulary, and the word with the highest probability should be “machine”.
Input Layer: ["I", "love", "learning"]
--> One-hot encoding
--> Embedding Layer
--> Dense embeddings
--> Aggregated context vector
--> Fully connected layer + Softmax
Output: Predicted word "machine"
We’ll now walk through implementing the CBOW model from scratch in Python.
The first step is to tokenize the text and generate context-target pairs, where the context consists of the words surrounding the target word within a fixed window.
corpus = "The quick brown fox jumps over the lazy dog"
corpus = corpus.lower().split() # Tokenization and lowercase conversion
# Define context window size
C = 2
context_target_pairs = []
# Generate context-target pairs
for i in range(C, len(corpus) - C):
    context = corpus[i - C:i] + corpus[i + 1:i + C + 1]
    target = corpus[i]
    context_target_pairs.append((context, target))
print("Context-Target Pairs:", context_target_pairs)
Output:
Context-Target Pairs: [(['the', 'quick', 'fox', 'jumps'], 'brown'), (['quick', 'brown', 'jumps', 'over'], 'fox'), (['brown', 'fox', 'over', 'the'], 'jumps'), (['fox', 'jumps', 'the', 'lazy'], 'over'), (['jumps', 'over', 'lazy', 'dog'], 'the')]
We build a vocabulary (a unique set of words), then map each word to a unique index and vice versa for efficient lookups during training.
# Create vocabulary and map each word to an index
vocab = set(corpus)
word_to_index = {word: idx for idx, word in enumerate(vocab)}
index_to_word = {idx: word for word, idx in word_to_index.items()}
print("Word to Index Dictionary:", word_to_index)
Output:
Word to Index Dictionary: {'brown': 0, 'dog': 1, 'quick': 2, 'jumps': 3, 'fox': 4, 'over': 5, 'the': 6, 'lazy': 7}
One-hot encoding represents each word in the vocabulary as a vector in which the position corresponding to that word is 1 and every other position is 0. These sparse vectors are the raw input that the embedding layer will later turn into dense representations.
import numpy as np

def one_hot_encode(word, word_to_index):
    one_hot = np.zeros(len(word_to_index))
    one_hot[word_to_index[word]] = 1
    return one_hot
# Example usage for a word "quick"
context_one_hot = [one_hot_encode(word, word_to_index) for word in ['the', 'quick']]
print("One-Hot Encoding for 'quick':", context_one_hot[1])
Output:
One-Hot Encoding for 'quick': [0. 0. 1. 0. 0. 0. 0. 0.]
In this step, we create a basic neural network with two layers: one for word embeddings and another to compute the output based on context words, averaging the context and passing it through the network.
class CBOW:
    def __init__(self, vocab_size, embedding_dim):
        # Randomly initialize weights for the embedding and output layers
        self.W1 = np.random.randn(vocab_size, embedding_dim)
        self.W2 = np.random.randn(embedding_dim, vocab_size)

    def forward(self, context_words):
        # context_words: one-hot vectors of shape (num_context_words, vocab_size)
        # Hidden layer: average of the context word embeddings
        h = np.mean(np.dot(context_words, self.W1), axis=0)
        # Output layer: softmax probabilities over the vocabulary
        scores = np.dot(h, self.W2)
        exp_scores = np.exp(scores - np.max(scores))
        return exp_scores / np.sum(exp_scores)

    def backward(self, context_words, target_word, learning_rate=0.01):
        # Forward pass (hidden vector and predicted probabilities)
        h = np.mean(np.dot(context_words, self.W1), axis=0)
        probs = self.forward(context_words)
        # Gradient of the cross-entropy loss with respect to the output scores
        error = probs - target_word
        # Gradients for the output weights, the hidden vector, and the embedding weights
        dW2 = np.outer(h, error)
        dh = np.dot(self.W2, error)
        dW1 = np.outer(np.mean(context_words, axis=0), dh)
        # Gradient descent update
        self.W2 -= learning_rate * dW2
        self.W1 -= learning_rate * dW1
# Example of creating a CBOW object
vocab_size = len(word_to_index)
embedding_dim = 5 # Let's assume 5-dimensional embeddings
cbow_model = CBOW(vocab_size, embedding_dim)
# Using random context words and target (as an example)
context_words = [one_hot_encode(word, word_to_index) for word in ['the', 'quick', 'fox', 'jumps']]
context_words = np.array(context_words)  # shape: (num_context_words, vocab_size); averaging happens inside forward()
target_word = one_hot_encode('brown', word_to_index)
# Forward pass through the CBOW model
output = cbow_model.forward(context_words)
print("Output of CBOW forward pass:", output)
Output:
Output of CBOW forward pass: an 8-element vector of softmax probabilities over the vocabulary (one value per word, summing to 1); the exact numbers vary with the random weight initialization.
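The backward method defined above can then be applied repeatedly over the context-target pairs generated earlier to actually learn the embeddings. Here is a minimal training-loop sketch; the number of epochs and the learning rate are arbitrary choices for this toy corpus:

# Minimal training loop over the toy corpus (epoch count and learning rate are arbitrary)
for epoch in range(100):
    for context, target in context_target_pairs:
        context_vecs = np.array([one_hot_encode(w, word_to_index) for w in context])
        target_vec = one_hot_encode(target, word_to_index)
        cbow_model.backward(context_vecs, target_vec, learning_rate=0.05)

# After training, each row of W1 is the learned embedding for one word
print("Embedding for 'fox':", cbow_model.W1[word_to_index['fox']])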
TensorFlow simplifies the process: we define a neural network with an embedding layer that learns the word representations and a dense output layer that uses the context words to predict the target word.
import tensorflow as tf
# Define a simple CBOW model using TensorFlow
class CBOWModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOWModel, self).__init__()
        self.embeddings = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)
        self.output_layer = tf.keras.layers.Dense(vocab_size, activation='softmax')

    def call(self, context_words):
        embedded_context = self.embeddings(context_words)
        context_avg = tf.reduce_mean(embedded_context, axis=1)
        output = self.output_layer(context_avg)
        return output
# Example usage
model = CBOWModel(vocab_size=8, embedding_dim=5)
context_input = np.random.randint(0, 8, size=(1, 4)) # Random context input
context_input = tf.convert_to_tensor(context_input, dtype=tf.int32)
# Forward pass
output = model(context_input)
print("Output of TensorFlow CBOW model:", output.numpy())
Output:
Output of TensorFlow CBOW model: [[0.12362909 0.12616573 0.12758036 0.12601459 0.12477358 0.1237749
0.12319998 0.12486169]]
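To train this model rather than just run a forward pass, you could compile it with a cross-entropy loss and fit it on integer-encoded context/target pairs. A minimal sketch reusing the context_target_pairs and word_to_index built earlier (the epoch count is an arbitrary choice for this toy corpus):

# Convert the earlier context-target pairs into integer arrays for Keras
X = np.array([[word_to_index[w] for w in context] for context, _ in context_target_pairs])
y = np.array([word_to_index[target] for _, target in context_target_pairs])

# Sparse categorical cross-entropy matches the softmax output over the vocabulary
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X, y, epochs=50, verbose=0)  # epoch count is arbitrary for this toy corpus

# The learned embeddings live in the Embedding layer's weight matrix
embeddings = model.embeddings.get_weights()[0]
print("Embedding for index 0:", embeddings[0])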
Gensim provides a ready-made implementation of CBOW through its Word2Vec class, so there is no need to implement the training yourself: Gensim learns the word embeddings directly from a corpus of text (setting sg=0 selects CBOW).
import gensim
from gensim.models import Word2Vec
# Prepare data (list of lists of words)
corpus = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]]
# Train the Word2Vec model using CBOW (sg=0 selects CBOW; sg=1 would select Skip-Gram)
model = Word2Vec(corpus, vector_size=5, window=2, min_count=1, sg=0)
# Get the vector representation of a word
vector = model.wv['fox']
print("Vector representation of 'fox':", vector)
Output:
Vector representation of 'fox': [-0.06810732 -0.01892803 0.11537147 -0.15043275 -0.07872207]
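Once trained, the same model can also be queried for nearest neighbours; on a corpus this small the neighbours are essentially random, but the call illustrates the API:

# Find the words most similar to 'fox' by cosine similarity of their embeddings
print(model.wv.most_similar('fox', topn=3))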
We will now explore the advantages of Continuous Bag of Words:
It is fast to train and computationally efficient, since the context words are collapsed into a single averaged vector.
It works well for frequent words and produces stable embeddings for them.
It captures both semantic and syntactic relationships between words.
Its architecture is simple, making it easy to implement and a good entry point into word embeddings.
Let us now discuss the limitations of CBOW:
Because the context embeddings are averaged, word order within the window is ignored.
It tends to perform worse than Skip-Gram on rare words, which appear in few context-target pairs.
Each word receives a single vector, so different senses of the same word are not distinguished.
The context window is fixed, so long-range dependencies in a sentence are not captured.
The Continuous Bag of Words (CBOW) model has proven to be an efficient and intuitive approach for generating word embeddings by leveraging surrounding context. Through its simple yet effective architecture, CBOW bridges the gap between raw text and meaningful vector representations, enabling a wide range of NLP applications. By understanding CBOW’s working mechanism, its strengths, and limitations, we gain deeper insights into the evolution of NLP techniques. With its foundational role in embedding generation, CBOW continues to be a stepping stone for exploring advanced language models.
Q: How does CBOW differ from Skip-Gram?
A: CBOW predicts a target word using context words, while Skip-Gram predicts context words using the target word.
Q: Why is CBOW faster to train than Skip-Gram?
A: CBOW processes multiple context words simultaneously, while Skip-Gram evaluates each context word independently.
Q: Is CBOW better than Skip-Gram for rare words?
A: No, Skip-Gram is generally better at learning representations for rare words.
Q: What is the role of the embedding layer in CBOW?
A: The embedding layer transforms sparse one-hot vectors into dense representations, capturing word semantics.
Q: Is CBOW still relevant today?
A: Yes, while newer models like BERT exist, CBOW remains a foundational concept in word embeddings.