Part 2: Topic Modeling and Latent Dirichlet Allocation (LDA) using Gensim and Sklearn

Neha Last Updated : 26 Aug, 2021

10 min read

This article was published as a part of the Data Science Blogathon

Introduction

In the previous article, we had started with understanding the basic terminologies of text in Natural Language Processing(NLP), what is topic modeling, its applications, the types of models, and the different topic modeling techniques available.

Let’s continue from there, explore Latent Dirichlet Allocation (LDA), working of LDA, and its similarity to another very popular dimensionality reduction technique called Principal Component Analysis (PCA).

A Little Background about LDA
Latent Dirichlet Allocation (LDA) and its Process
- How does LDA work and how will it derive the particular distributions?
- Vector Space of LDA
- How will LDA optimize the distributions?
- LDA is an Iterative Process
The Similarity between LDA and PCA

A Little Background about LDA

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique to extract topics from a given corpus. The term latent conveys something that exists but is not yet developed. In other words, latent means hidden or concealed.

Now, the topics that we want to extract from the data are also “hidden topics”. It is yet to be discovered. Hence, the term “latent” in LDA. The Dirichlet allocation is after the Dirichlet distribution and process.

Named after the German mathematician, Peter Gustav Lejeune Dirichlet, Dirichlet processes in probability theory are “a family of stochastic processes whose realizations are probability distributions.”

This process is a distribution over distributions, meaning that each draw from a Dirichlet process is itself a distribution. What this implies is that a Dirichlet process is a probability distribution wherein the range of this distribution is itself a set of probability distributions!

Okay, that’s great! But how is the Dirichlet process helpful for us in retrieving the topics from the documents? Remember from part 1 of the blog, topics or themes are a group of statistically significant words within a corpus.

So, without going into the technicalities of the process, in our context, the Dirichlet model describes the pattern of the words that are repeating together, occurring frequently, and these words are similar to each other.

And this stochastic process uses Bayesian inferences for explaining “the prior knowledge about the distribution of random variables”. In the case of topic modeling, the process helps in estimating what are the chances of the words, which are spread over the document, will occur again? This enables the model to build data points, estimate probabilities, that’s why LDA is a breed of generative probabilistic model.

LDA generates probabilities for the words using which the topics are formed and eventually the topics are classified into documents.

Latent Dirichlet Allocation (LDA) and its Process

A tool and technique for Topic Modeling, Latent Dirichlet Allocation (LDA) classifies or categorizes the text into a document and the words per topic, these are modeled based on the Dirichlet distributions and processes.

The LDA makes two key assumptions:

Documents are a mixture of topics, and
Topics are a mixture of tokens (or words)

And, these topics using the probability distribution generate the words. In statistical language, the documents are known as the probability density (or distribution) of topics and the topics are the probability density (or distribution) of words.

Now, how does LDA work and how will it derive the particular distributions?

Firstly, LDA applies the above two important assumptions to the given corpus. Let’s say we have the corpus with the following five documents:

Document 1: I want to watch a movie this weekend.
Document 2: I went shopping yesterday. New Zealand won the World Test Championship by beating India by eight wickets at Southampton.
Document 3: I don’t watch cricket. Netflix and Amazon Prime have very good movies to watch.
Document 4: Movies are a nice way to chill however, this time I would like to paint and read some good books. It’s been so long!
Document 5: This blueberry milkshake is so good! Try reading Dr. Joe Dispenza’s books. His work is such a game-changer! His books helped to learn so much about how our thoughts impact our biology and how we can all rewire our brains.

Any corpus, which is the collection of documents, can be represented as a document-word (or document term matrix) also known as DTM.

We know the first step with the text data is to clean, preprocess and tokenize the text to words. After preprocessing the documents, we get the following document word matrix where:

D1, D2, D3, D4, and D5 are the five documents, and
the words are represented by the Ws, say there are 8 unique words from W1, to W8.

Hence, the shape of the matrix is 5 * 8 (five rows and eight columns):

So, now the corpus is mainly the above-preprocessed document-word matrix, in which every row is a document and every column is the tokens or the words.

LDA converts this document-word matrix into two other matrices: Document Term matrix and Topic Word matrix as shown below:

These matrices:

The Document-Topic matrix already contains the possible topics (represented by K above) that the documents can contain. Here, suppose we have 5 topics and had 5 documents so the matrix is of dimension 5*6
The Topic-Word matrix has the words (or terms) that those topics can contain. We have 5 topics and 8 unique tokens in the vocabulary hence the matrix had a shape of 6*8.

Vector Space of LDA

The entire LDA space and its dataset are represented by the diagram below:

vector space of LDA

Source: researchgate.net

The yellow box refers to all the documents in the corpus (represented by M). In our case, M = 5 as we have 5 documents.
Next, the peach color box is the number of words in a document, given by N
Inside this peach box, there can be many words. One of those words is w, which is in the blue color circle.

According to LDA, every word is associated (or related) with a latent (or hidden) topic, which here is stated by Z. Now, this assignment of Z to a topic word in these documents gives a topic word distribution present in the corpus that is represented by theta (𝛳).

The LDA model has two parameters that control the distributions:

Alpha (ɑ) controls per-document topic distribution, and
Beta (ꞵ) controls per topic word distribution

To summarize:

M: is the total documents in the corpus
N: is the number of words in the document
w: is the Word in a document
z: is the latent topic assigned to a word
theta (𝛳): is the topic distribution
LDA model s parameters: Alpha (ɑ) and Beta (ꞵ)

How will LDA optimize the distributions?

The end goal of LDA is to find the most optimal representation of the Document-Topic matrix and the Topic-Word matrix to find the most optimized Document-Topic distribution and Topic-Word distribution.

As LDA assumes that documents are a mixture of topics and topics are a mixture of words so LDA backtracks from the document level to identify which topics would have generated these documents and which words would have generated those topics.

Now, our corpus that had 5 documents (D1 to D5) and with their respective number of words:

D1 = (w1, w2, w3, w4, w5, w6, w7, w8)
D2 = (w`1, w`2, w`3, w`4, w`5, w`6, w`7, w`8, w`9, w`10)
D3 = (w“1, w“2, w“3, w“4, w“5, w“6, w“7, w“8, w“9, w“10, w“11, w“12, w“13, w“14 w“15)
D4 = (w“`1, w“`2, w“`3, w“`4, w“`5, w“`6, w“`7, w“`8, w“`9, w“`10, w“`11, w“`12)
D5 = (w““1, w““2, w““3, w““4, w““5, w““6, w““7, w““8, w““9, w““10,…, w““32, w““33, w““34)

LDA is an iterative process

The first iteration of LDA:

In the first iteration, it randomly assigns the topics to each word in the document. The topics are represented by the letter k. So, in our corpus, the words in the documents will be associated with some random topics like below:

D1 = (w1 (k5), w2 (k3), w3 (k1), w4 (k2), w5 (k5), w6 (k4), w7 (k7), w8(k1))
D2 = (w`1(k2), w`2 (k4), w`3 (k2), w`4 (k1), w`5 (k2), w`6 (k1), w`7 (k5), w`8(k3), w`9 (k7), w`10(k1))
D3 = (w“1(k3), w“2 (k1), w“3 (k5), w“4 (k3), w“5 (k4), w“6(k1),…, w“13 (k1), w“14(k3), w“15 (k2))
D4 = (w“`1(k4), w“`2 (k5), w“`3 (k3), w“`4 (k6), w“`5 (k5), w“`6 (k3) …, w“`10 (k3), w“`11 (k7), w“`12 (k1))
D5 = (w““1 (k1), w““2 (k7), w““3 (k2), w““4 (k8), w““5 (k1), w““6(k8) …, w““32(k3), w““33(k6), w““34 (k5))

This gives the output as Documents with the composition of Topics and Topics composing of words:

The documents are the mixture of the topics:

D1 = k5 + k3 + k1 + k2 + k5 + k4 + k7+ k1
D2 = k2 + k4 + k2 + k1 + k5 + k2 + k1+ k5 + k3 + k7 + k1
D3 = k4 + k5 + k3 + k6 + k5 + k3 + … + k3+ k7 + k1
D3 = k1 + k7 + k2 + k8 + k1 + k8 + … + k3+ k6 + k5

The topics are the mixture of the words:

K1 = w3 + w8 + w`4 + w`6 + w’10 + w“2 + w“6 + … + w“13 + w“`12 + w““1 + w““5
K2 = w4 + w`1 + w`3 + w“15 + …. + w““3 + …
K3 = w2 + w’8 + w“1 + w“4 + w“14 + w“`3 + w“`6 + … + w“`10 + w““32 + …

Similarly, LDA will give the word combinations for other topics.

Post the first iteration of LDA:

After the first iteration, LDA does provide the initial document- topic and topic-word matrices. The task at hand is to optimize these obtained results which LDA does by iterating over all the documents and all the words.

LDA makes another assumption that all the topics that have been assigned are correct except the current word. So, based on those already-correct topic-word assignments, LDA tries to correct and adjust the topic assignment of the current word with a new assignment for which:

LDA will iterate over: each document ‘D’ and each word ‘w’

How would it do that? It does so by computing two probabilities: p1 and p2 for every topic (k) where:

P1: proportion of words in the document (D) that are currently assigned to the topic (k)
P2: is the proportion of assignments to the topic(k) over all documents that come from this word w. In other words, p2 is the proportion of those documents in which the word (w) is also assigned to the topic (k)

The formula for p1 and p2 is:

P1 = proportion (topic k / document D) ,and
P2 = proportion(word w / topic k)

Now, using these probabilities p1 and p2, LDA estimates a new probability, which is the product of (p1*p2), and through this product probability, LDA identifies the new topic, which is the most relevant topic for the current word.

Reassignment of word ‘w’ of the document ‘D’ to a new topic ‘k’ via the product probability of p1 * p2

Now, the LDA is performed for a large number of iterations for the step of choosing the new topic ‘k’ until a steady-state is obtained. The convergence point of LDA is obtained where it gives the most optimized representation of the document-term matrix and topic-word matrix.

This completes the working and the process of Latent Dirichlet Allocation.

Now, before moving on to the implementation of LDA, here’s one last thing …

The Similarity between LDA and PCA

Topic Modeling is similar to Principal Component Analysis (PCA). You may be wondering how is that? Allow me to explain.

So, PCA is a dimensionality reduction technique, right? And, it is used for data having numerical values. It is the linear combination of variables from which components are derived that are used to build the model. PCA works by breaking or decomposing a larger value (i.e. a singular value) into smaller values to reduce the dimensions.

LDA operates in the same way as PCA does. LDA is applied to the text data. It works by decomposing the corpus document word matrix (the larger matrix) into two parts (smaller matrices): the Document Topic Matrix and the Topic Word. Therefore, LDA like PCA is a matrix factorization technique.

Let’s say we have the following corpus of documents having the tokenized words freekick, dunk, rebound, foul, shoot, NBA, Liverpool as below:

bag of words

Source: miro.medium.com

In case, we do not apply LDA or any of the topic modeling techniques then, we will have the

tokenized words (after the text preprocessing steps) which are on the left panel of the above image. These words will then be passed to build a model and may have more features (more than present in the image).

Now, instead of using the tokenized words obtained after applying the Bag of Words vectorizer, if we transform the initial document-word matrix into two parts: one for the word per topic and the other for the topic per document.

See, in the above image, rather than using the words: freekick, dunk, rebound, foul, shoot, NBA, Liverpool, we have combined similar words into two groups: Soccer and Basketball.

This gave us two different topics: one on football where Soccer consists of the words freekick, shoot, and Liverpool, and the other topic Basketball consisting of the words dunk, rebound, foul, and NBA.

This breaking of the initial larger word features matrix not only enabled us to classify text into documents but it also essentially reduced the features used to build the model.

LDA breaks the corpus document word into lower-dimensional matrices. Therefore, Topic modeling and its techniques are also used for dimensionality reduction.

Endnotes

Latent Dirichlet Allocation (LDA) does two tasks: it finds the topics from the corpus, and at the same time, assigns these topics to the document present within the same corpus. The below schematic diagram summarizes the process of LDA well:

Source: researchgate.net

We will apply all theory that we learned and execute in Python using the gensim and sklearn packages in the next and last part of the Topic Modeling series.

References:

Thank You for reading and Happy Learning! 🙂

About me

Hi there! I am Neha Seth, a technical writer for AnalytixLabs. I hold a Postgraduate Program in Data Science & Engineering from the Great Lakes Institute of Management and a Bachelors in Statistics. I have been featured as Top 10 Most Popular Guest Authors in 2020 on Analytics Vidhya (AV).

My area of interest lies in NLP and Deep Learning. I have also passed the CFA Program. You can reach out to me on LinkedIn and can read my other blogs for ALabs and AV.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Neha

Hi there! I am Neha Seth. I work as a Data Scientist in Larsen & Toubro Infotech (LTI). I hold a Postgraduate Program in Data Science & Engineering from the Great Lakes Institute of Management and a Bachelors in Statistics. I have been featured as Top 10 Most Popular Guest Authors in 2020 on Analytics Vidhya (AV).

My area of interest lies in NLP and Deep Learning. I have also passed the CFA Program. You can reach out to me on LinkedIn and can read my other blogs for AV.

Advanced Data Science Libraries NLP Python

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Prio Adi Ramadhani

Hey Neha Seth, thank you very much for the explanation about LDA. Very helpfull! It gives clearer understanding about how LDA works. note: there are some typo in matrices explanation, where the topics should be written '6', but you write it '5'. Regards, Prio

Charan 143

super hard word done by this natural language process

Reading list

Introduction to NLP

Text Pre-processing

NLP Libraries

Regular Expressions

String Similarity

Spelling Correction

Topic Modeling

Text Representation

Information Retrieval System

Word Vectors

Word Senses

Dependency Parsing

Language Modeling

Getting Started with RNN

Different Variants of RNN

Machine Translation and Attention

Self Attention and Transformers

Transfomers and Pretraining

Question Answering

Text Summarization

Named Entity Recognition

Coreference Resolution

Audio Data

ASR

Audio Separation

Chatbot

Auto NLP

Part 2: Topic Modeling and Latent Dirichlet Allocation (LDA) using Gensim and Sklearn

Introduction

Table of Contents

A Little Background about LDA

Latent Dirichlet Allocation (LDA) and its Process

Now, how does LDA work and how will it derive the particular distributions?

Vector Space of LDA

How will LDA optimize the distributions?

LDA is an iterative process

The first iteration of LDA:

Post the first iteration of LDA:

The Similarity between LDA and PCA

Endnotes

About me

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or