Imagine you’re tasked with crafting the perfect subject line for a crucial email campaign, but standing out in a crowded inbox seems daunting. This article offers a solution with a step-by-step guide to Smart Subject Email Line Generation with Word2Vec. Discover how to harness the power of Word2Vec embeddings to create compelling and contextually relevant subject lines that captivate and engage your audience. Follow along to transform your approach and elevate your email marketing strategy.
This article was published as a part of the Data Science Blogathon.
Word embeddings represent words efficiently in a dense numerical format, where similar words have similar encodings. Rather than being set manually, embeddings are trainable parameters: floating-point values learned by the model during training, much as weights are learned in a dense layer. Embedding dimensionality typically ranges from 8 for small datasets up to 1024 for very large ones. Higher-dimensional embeddings can encode more fine-grained semantic relationships between words, but they need more data to learn.
In a word embedding diagram, a 4-dimensional vector of floating-point values represents each word. Think of embeddings as a “lookup table” that stores each word’s dense vector after training, allowing you to quickly encode and retrieve words based on their vector representations.
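The lookup-table view can be made concrete with a small sketch. The 4-dimensional vectors below are invented for illustration rather than trained, and the names `embedding_table` and `encode` are assumptions made for this example only.

```python
# Toy embedding "lookup table": each word maps to a dense 4-dimensional vector.
# The numbers are made up; real values would be learned during training.
embedding_table = {
    "cat": [0.9, 0.1, 0.3, 0.0],
    "dog": [0.8, 0.2, 0.4, 0.1],
    "car": [0.0, 0.9, 0.1, 0.7],
}

def encode(sentence, table):
    """Return the stored vector for every known word in the sentence, in order."""
    return [table[word] for word in sentence.lower().split() if word in table]

vectors = encode("Cat chases car", embedding_table)
print(len(vectors))   # how many words were found in the table
print(vectors[0])     # the dense vector stored for "cat"
```

Words missing from the table ("chases" here) are simply skipped, which is also how the document-embedding function later in this article treats out-of-vocabulary words.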
Semantic similarity is the measure of how closely two pieces of text convey the same meaning. It allows systems to understand the different ways ideas can be expressed in language without needing to explicitly define each variation.
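When texts are represented as vectors, a common way to put a number on semantic similarity is cosine similarity, which is also what the implementation later in this article relies on. Here is a minimal pure-Python sketch; the helper name `cosine` and the zero-vector convention are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # vectors pointing the same way
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # unrelated directions
print(same_direction, orthogonal)
```

A score close to 1 means the vectors point in nearly the same direction, and a score near 0 means they are unrelated. Vector length does not affect the result.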
Word2Vec is a popular natural language processing technique for converting words into numerical vector representations.
Word2Vec generates word embeddings, which are continuous vector representations of words. Unlike traditional one-hot encoding, which represents words as sparse vectors, Word2Vec maps each word to a dense vector of fixed size. These vectors capture semantic relationships between words, so similar words have similar vectors.
Word2Vec employs two main training approaches:
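Continuous Bag of Words (CBOW): This method predicts a target word based on its surrounding context words. For example, if a word is missing from a sentence, CBOW tries to infer the missing word from the context provided by the other words in the sentence.
Skip-gram: This method works in the opposite direction, predicting the surrounding context words from a given target word.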
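To make the CBOW idea concrete, the sketch below builds the (context, target) pairs that such a model would learn from. It is a simplification under stated assumptions: the helper `cbow_pairs` is hypothetical, the window is fixed, and details of real implementations such as subsampling and negative sampling are left out.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style objective."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        pairs.append((left + right, target))
    return pairs

tokens = "please review the attached documents".split()
pairs = cbow_pairs(tokens, window=2)
for context, target in pairs:
    print(context, "->", target)
```

Swapping the roles in each pair, so that the target word is used to predict its context words, gives the training examples for the Skip-gram variant.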
During training, Word2Vec refines the word vectors by analyzing how frequently words appear together within a defined context window. Words that appear in similar contexts end up with similar vectors. This method captures relationships such as synonyms and analogies well (for example, the relationship between “king” and “queen” shows up in the analogy “king” – “man” + “woman” ≈ “queen”).
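The analogy arithmetic can be illustrated with hand-made vectors. The 2-dimensional values below are an assumption chosen so that one axis loosely stands for "royalty" and the other for "gender". They are not output from a trained model, and the helper names are hypothetical.

```python
import math

# Hand-made 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only).
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest(target, vectors, exclude):
    """Word whose vector is most similar to `target`, ignoring the query words."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# king - man + woman
target = [k - m + w for k, m, w in
          zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])]
answer = nearest(target, toy_vectors, exclude={"king", "man", "woman"})
print(answer)
```

With a trained gensim model, the equivalent query is usually written as `model.wv.most_similar(positive=["king", "woman"], negative=["man"])`. How well it works depends heavily on the size and content of the training corpus.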
Unlock the secrets to crafting compelling email subject lines with our step-by-step guide, leveraging Word2Vec embeddings for smarter, more relevant results.
Import essential libraries for data manipulation, natural language processing, word embeddings, and similarity calculations.
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
Download the NLTK tokenizer data required for tokenizing text.
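# Download NLTK tokenizer data (only needed once)
nltk.download('punkt')
# Newer NLTK releases may look for 'punkt_tab' instead; download it as well
# if word_tokenize raises a LookupError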
Load the email dataset from a CSV file and handle any potential parsing errors.
# Read the CSV file
try:
    df = pd.read_csv('emails.csv', quotechar='"', escapechar='\\', engine='python', on_bad_lines='skip')
except pd.errors.ParserError as e:
    print(f"Error reading the CSV file: {e}")
Tokenize the email bodies into words and convert them to lowercase for uniformity.
# Preprocess: Tokenize email bodies
tokenized_bodies = [word_tokenize(body.lower()) for body in df['email_body']]
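One practical caveat: empty CSV cells are loaded as NaN (a float), and calling `.lower()` on them raises an error. A minimal sketch of a guard follows; the helper `keep_text_bodies` is a hypothetical name introduced here, not part of the original code.

```python
def keep_text_bodies(bodies):
    """Drop entries that are not non-empty strings (e.g. NaN from empty CSV cells)."""
    return [body for body in bodies if isinstance(body, str) and body.strip()]

raw_bodies = ["Please send the report.", float("nan"), "   ", "Meeting at noon?"]
clean_bodies = keep_text_bodies(raw_bodies)
print(clean_bodies)
```

Filtering a plain list like this breaks the alignment with the subject lines, so in the DataFrame setting it is probably safer to drop whole rows first, for example with `df.dropna(subset=['email_body'])`, before tokenizing.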
Train a Word2Vec model on the tokenized email bodies to create word embeddings.
# Train Word2Vec model on the email bodies
word2vec_model = Word2Vec(sentences=tokenized_bodies, vector_size=100, window=5, min_count=1, workers=4)
Create a function that computes the embedding of an email body by averaging the embeddings of its words.
# Function to compute document embedding by averaging word embeddings
def get_document_embedding(doc, model):
    words = word_tokenize(doc.lower())
    word_embeddings = [model.wv[word] for word in words if word in model.wv]
    if word_embeddings:
        return np.mean(word_embeddings, axis=0)
    else:
        return np.zeros(model.vector_size)
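Two properties of this averaging approach are worth keeping in mind: word order is ignored, and a text with no known words collapses to the zero vector. The sketch below reproduces the same logic in pure Python on a made-up 2-dimensional table; `average_vectors` and `toy_table` are hypothetical names used only for illustration.

```python
def average_vectors(words, table, size):
    """Mean of the vectors of known words; zero vector if none are known."""
    known = [table[w] for w in words if w in table]
    if not known:
        return [0.0] * size
    return [sum(column) / len(known) for column in zip(*known)]

toy_table = {
    "send":   [1.0, 0.0],
    "report": [0.0, 1.0],
    "sales":  [1.0, 1.0],
}

a = average_vectors(["send", "sales", "report"], toy_table, size=2)
b = average_vectors(["report", "sales", "send"], toy_table, size=2)
unknown = average_vectors(["xyzzy"], toy_table, size=2)
print(a == b)     # same words in a different order give the same embedding
print(unknown)    # out-of-vocabulary text falls back to zeros
```

Because of this, two emails that use the same words to say different things can look identical to the search step.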
Calculate the document embeddings for all email bodies in the dataset.
# Compute embeddings for all email bodies
body_embeddings = np.array([get_document_embedding(body, word2vec_model) for body in df['email_body']])
Create a function that finds the most similar email body in the dataset to a given query using cosine similarity.
# Function to perform semantic search based on the email body
def semantic_search(query, model, body_embeddings, texts):
    query_embedding = get_document_embedding(query, model)
    similarities = cosine_similarity([query_embedding], body_embeddings)
    best_match_idx = np.argmax(similarities)
    return texts[best_match_idx], similarities[0, best_match_idx]
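The function above returns only the single best match. A possible variation, sketched here in pure Python with made-up vectors, returns the top k matches as (index, score) pairs so that several candidate subject lines can be compared. The name `top_k_matches` and the toy data are assumptions for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_matches(query_vec, doc_vecs, k=2):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up document vectors and the subject lines stored alongside them
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
subjects = ["Sales report", "Document feedback", "Project meeting"]

matches = top_k_matches([0.9, 0.1], doc_vecs, k=2)
candidate_subjects = [subjects[i] for i, _ in matches]
print(candidate_subjects)
```

Returning the index rather than the matched text also lets the subject line be looked up by position, which avoids ambiguity when two emails in the dataset have identical bodies.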
Define a new email body for which to generate a subject line.
# Example email body for which to generate a subject line
new_email_body = "Please review the attached documents and provide feedback by end of day"
Use the semantic search function to find the most similar email body in the dataset to the new email body.
# Perform semantic search for the new email body to find the most similar existing email
matched_text, similarity_score = semantic_search(new_email_body, word2vec_model, body_embeddings, df['email_body'])
Retrieve and print the subject line corresponding to the matched email body, along with the matched email body and similarity score.
# Find the corresponding subject line for the matched email body
matched_subject = df.loc[df['email_body'] == matched_text, 'subject_line'].values[0]
print("Generated Subject Line:", matched_subject)
print("Matched Email Body:", matched_text)
print("Similarity Score:", similarity_score)
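Evaluating a model is crucial to understanding how it performs on unseen data. In this step, we use the function evaluate_accuracy (defined in the full example further down) together with a test dataset (test_df) and precomputed embeddings (train_body_embeddings). Note that the score it reports is the mean cosine similarity between each test email and its closest training email, which is a proxy for retrieval quality rather than a classification accuracy.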
# Evaluate accuracy on the test set
accuracy = evaluate_accuracy(test_df, word2vec_model, train_body_embeddings, train_df['email_body'])
print("Mean Cosine Similarity for Test Set:", accuracy)
I used the Document dataset for the code implementation, which can be found here.
A sneak peek into the dataset:
Let’s walk through a real example to illustrate this step.
Assume we have a test set (test_df) with the following email bodies and subject lines:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Download NLTK data (only needed once)
nltk.download('punkt')
# Example training dataset
train_data = {
    'email_body': [
        "Please send me the latest sales report.",
        "Can you provide feedback on the attached document?",
        "Let's schedule a meeting to discuss the new project.",
        "Review the quarterly financials and get back to me."
    ],
    'subject_line': [
        "Request for Sales Report",
        "Feedback on Document",
        "Meeting for New Project",
        "Quarterly Financial Review"
    ]
}
train_df = pd.DataFrame(train_data)
# Example test dataset
test_data = {
    'email_body': [
        "Can you provide the latest sales figures?",
        "Please review the attached documents and provide feedback.",
        "Schedule a meeting to discuss the new project proposal."
    ],
    'subject_line': [
        "Request for Latest Sales Figures",
        "Feedback on Attached Documents",
        "Meeting for Project Proposal"
    ]
}
test_df = pd.DataFrame(test_data)
# Preprocess: Tokenize email bodies
tokenized_bodies = [word_tokenize(body.lower()) for body in train_df['email_body']]
# Train Word2Vec model on the email bodies
word2vec_model = Word2Vec(sentences=tokenized_bodies, vector_size=100, window=5, min_count=1, workers=4)
# Function to compute document embedding by averaging word embeddings
def get_document_embedding(doc, model):
    words = word_tokenize(doc.lower())
    word_embeddings = [model.wv[word] for word in words if word in model.wv]
    if word_embeddings:
        return np.mean(word_embeddings, axis=0)
    else:
        return np.zeros(model.vector_size)
# Compute embeddings for all email bodies in the training set
train_body_embeddings = np.array([get_document_embedding(body, word2vec_model) for body in train_df['email_body']])
# Function to evaluate the accuracy of the model on the test set
def evaluate_accuracy(test_df, model, train_body_embeddings, train_texts):
    similarities = []
    for index, row in test_df.iterrows():
        # Compute the embedding for the current email body in the test set
        test_embedding = get_document_embedding(row['email_body'], model)
        # Compute cosine similarities between the test embedding and all training email body embeddings
        cos_sim = cosine_similarity([test_embedding], train_body_embeddings)
        # Get the highest similarity score
        best_match_idx = np.argmax(cos_sim)
        highest_similarity = cos_sim[0, best_match_idx]
        similarities.append(highest_similarity)
    # Return the mean cosine similarity
    return np.mean(similarities)
# Evaluate accuracy on the test set
accuracy = evaluate_accuracy(test_df, word2vec_model, train_body_embeddings, train_df['email_body'])
print("Mean Cosine Similarity for Test Set:", accuracy)
Output:
Mean Cosine Similarity for Test Set: 0.86
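Mean cosine similarity shows how close the nearest training email is, but not whether the retrieved subject line resembles the reference subject. One possible complement is a simple word-overlap (Jaccard) score between the two subject lines. In the sketch below, the helper `subject_overlap` is hypothetical, and the retrieved/reference pairings are assumed for illustration rather than taken from an actual run.

```python
def subject_overlap(predicted, reference):
    """Jaccard overlap between the lowercase word sets of two subject lines."""
    pred_tokens = set(predicted.lower().split())
    ref_tokens = set(reference.lower().split())
    if not pred_tokens and not ref_tokens:
        return 1.0  # two empty subjects are treated as identical
    return len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)

# (retrieved subject, reference subject); pairings assumed for illustration
pairs = [
    ("Request for Sales Report", "Request for Latest Sales Figures"),
    ("Feedback on Document", "Feedback on Attached Documents"),
]
scores = [subject_overlap(pred, ref) for pred, ref in pairs]
mean_overlap = sum(scores) / len(scores)
print(scores, mean_overlap)
```

A score of 1 means both subject lines use exactly the same words. Exact word matching is strict ("document" and "documents" count as different), so stemming or an embedding-based comparison may be a better fit in practice.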
The project shows how Word2Vec embeddings make it easier to generate smart email subject lines. The procedure consists of preprocessing the email data, training a Word2Vec model to produce vector embeddings of email bodies, and retrieving the subject line of the most similar existing email. Further enhancements include incorporating more sophisticated models and tuning the methodology for better results. Possible applications include a company that wants to improve the open rates of its email marketing campaigns with more engaging and relevant subject lines, or a news website that wants to send personalized newsletters to subscribers based on their reading preferences.
A. Word2Vec is a technique that converts words into numerical vectors to capture their meanings. This project uses it to construct email body embeddings, which facilitate the generation of relevant subject lines based on semantic similarity.
A. Data preparation entails fixing erroneous rows, eliminating superfluous characters, and making sure the formatting is uniform throughout the dataset. To train the model effectively, text handling and tokenization must be done correctly.
A. Typical difficulties include ensuring high-quality embeddings, managing context ambiguity, and working with very large datasets. Careful data preparation is crucial for the best performance.
A. Because the model is trained on existing email bodies, it may struggle with entirely new or unusual email bodies that differ from the training data.