Part 2: Topic Modeling and Latent Dirichlet Allocation (LDA) using Gensim and Sklearn
This article was published as a part of the Data Science Blogathon
In the previous article, we had started with understanding the basic terminologies of text in Natural Language Processing(NLP), what is topic modeling, its applications, the types of models, and the different topic modeling techniques available.
Let’s continue from there, explore Latent Dirichlet Allocation (LDA), working of LDA, and its similarity to another very popular dimensionality reduction technique called Principal Component Analysis (PCA).
Table of Contents
- A Little Background about LDA
- Latent Dirichlet Allocation (LDA) and its Process
- How does LDA work and how will it derive the particular distributions?
- Vector Space of LDA
- How will LDA optimize the distributions?
- LDA is an Iterative Process
- The Similarity between LDA and PCA
A Little Background about LDA
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique to extract topics from a given corpus. The term latent conveys something that exists but is not yet developed. In other words, latent means hidden or concealed.
Now, the topics that we want to extract from the data are also “hidden topics”. It is yet to be discovered. Hence, the term “latent” in LDA. The Dirichlet allocation is after the Dirichlet distribution and process.
Named after the German mathematician, Peter Gustav Lejeune Dirichlet, Dirichlet processes in probability theory are “a family of stochastic processes whose realizations are probability distributions.”
This process is a distribution over distributions, meaning that each draw from a Dirichlet process is itself a distribution. What this implies is that a Dirichlet process is a probability distribution wherein the range of this distribution is itself a set of probability distributions!
Okay, that’s great! But how is the Dirichlet process helpful for us in retrieving the topics from the documents? Remember from part 1 of the blog, topics or themes are a group of statistically significant words within a corpus.
So, without going into the technicalities of the process, in our context, the Dirichlet model describes the pattern of the words that are repeating together, occurring frequently, and these words are similar to each other.
And this stochastic process uses Bayesian inferences for explaining “the prior knowledge about the distribution of random variables”. In the case of topic modeling, the process helps in estimating what are the chances of the words, which are spread over the document, will occur again? This enables the model to build data points, estimate probabilities, that’s why LDA is a breed of generative probabilistic model.
LDA generates probabilities for the words using which the topics are formed and eventually the topics are classified into documents.
Latent Dirichlet Allocation (LDA) and its Process
A tool and technique for Topic Modeling, Latent Dirichlet Allocation (LDA) classifies or categorizes the text into a document and the words per topic, these are modeled based on the Dirichlet distributions and processes.
The LDA makes two key assumptions:
- Documents are a mixture of topics, and
- Topics are a mixture of tokens (or words)
And, these topics using the probability distribution generate the words. In statistical language, the documents are known as the probability density (or distribution) of topics and the topics are the probability density (or distribution) of words.
Now, how does LDA work and how will it derive the particular distributions?
Firstly, LDA applies the above two important assumptions to the given corpus. Let’s say we have the corpus with the following five documents:
- Document 1: I want to watch a movie this weekend.
- Document 2: I went shopping yesterday. New Zealand won the World Test Championship by beating India by eight wickets at Southampton.
- Document 3: I don’t watch cricket. Netflix and Amazon Prime have very good movies to watch.
- Document 4: Movies are a nice way to chill however, this time I would like to paint and read some good books. It’s been so long!
- Document 5: This blueberry milkshake is so good! Try reading Dr. Joe Dispenza’s books. His work is such a game-changer! His books helped to learn so much about how our thoughts impact our biology and how we can all rewire our brains.
Any corpus, which is the collection of documents, can be represented as a document-word (or document term matrix) also known as DTM.
We know the first step with the text data is to clean, preprocess and tokenize the text to words. After preprocessing the documents, we get the following document word matrix where:
- D1, D2, D3, D4, and D5 are the five documents, and
- the words are represented by the Ws, say there are 8 unique words from W1, to W8.
Hence, the shape of the matrix is 5 * 8 (five rows and eight columns):
So, now the corpus is mainly the above-preprocessed document-word matrix, in which every row is a document and every column is the tokens or the words.
LDA converts this document-word matrix into two other matrices: Document Term matrix and Topic Word matrix as shown below:
The Document-Topic matrix already contains the possible topics (represented by K above) that the documents can contain. Here, suppose we have 5 topics and had 5 documents so the matrix is of dimension 5*6
The Topic-Word matrix has the words (or terms) that those topics can contain. We have 5 topics and 8 unique tokens in the vocabulary hence the matrix had a shape of 6*8.
Vector Space of LDA
The entire LDA space and its dataset are represented by the diagram below:
- The yellow box refers to all the documents in the corpus (represented by M). In our case, M = 5 as we have 5 documents.
- Next, the peach color box is the number of words in a document, given by N
- Inside this peach box, there can be many words. One of those words is w, which is in the blue color circle.
According to LDA, every word is associated (or related) with a latent (or hidden) topic, which here is stated by Z. Now, this assignment of Z to a topic word in these documents gives a topic word distribution present in the corpus that is represented by theta (𝛳).
The LDA model has two parameters that control the distributions:
- Alpha (ɑ) controls per-document topic distribution, and
- Beta (ꞵ) controls per topic word distribution
- M: is the total documents in the corpus
- N: is the number of words in the document
- w: is the Word in a document
- z: is the latent topic assigned to a word
- theta (𝛳): is the topic distribution
- LDA model s parameters: Alpha (ɑ) and Beta (ꞵ)
How will LDA optimize the distributions?
The end goal of LDA is to find the most optimal representation of the Document-Topic matrix and the Topic-Word matrix to find the most optimized Document-Topic distribution and Topic-Word distribution.
As LDA assumes that documents are a mixture of topics and topics are a mixture of words so LDA backtracks from the document level to identify which topics would have generated these documents and which words would have generated those topics.
Now, our corpus that had 5 documents (D1 to D5) and with their respective number of words:
- D1 = (w1, w2, w3, w4, w5, w6, w7, w8)
- D2 = (w`1, w`2, w`3, w`4, w`5, w`6, w`7, w`8, w`9, w`10)
- D3 = (w“1, w“2, w“3, w“4, w“5, w“6, w“7, w“8, w“9, w“10, w“11, w“12, w“13, w“14 w“15)
- D4 = (w“`1, w“`2, w“`3, w“`4, w“`5, w“`6, w“`7, w“`8, w“`9, w“`10, w“`11, w“`12)
- D5 = (w““1, w““2, w““3, w““4, w““5, w““6, w““7, w““8, w““9, w““10,…, w““32, w““33, w““34)
LDA is an iterative process
The first iteration of LDA:
In the first iteration, it randomly assigns the topics to each word in the document. The topics are represented by the letter k. So, in our corpus, the words in the documents will be associated with some random topics like below:
- D1 = (w1 (k5), w2 (k3), w3 (k1), w4 (k2), w5 (k5), w6 (k4), w7 (k7), w8(k1))
- D2 = (w`1(k2), w`2 (k4), w`3 (k2), w`4 (k1), w`5 (k2), w`6 (k1), w`7 (k5), w`8(k3), w`9 (k7), w`10(k1))
- D3 = (w“1(k3), w“2 (k1), w“3 (k5), w“4 (k3), w“5 (k4), w“6(k1),…, w“13 (k1), w“14(k3), w“15 (k2))
- D4 = (w“`1(k4), w“`2 (k5), w“`3 (k3), w“`4 (k6), w“`5 (k5), w“`6 (k3) …, w“`10 (k3), w“`11 (k7), w“`12 (k1))
- D5 = (w““1 (k1), w““2 (k7), w““3 (k2), w““4 (k8), w““5 (k1), w““6(k8) …, w““32(k3), w““33(k6), w““34 (k5))
This gives the output as Documents with the composition of Topics and Topics composing of words:
The documents are the mixture of the topics:
- D1 = k5 + k3 + k1 + k2 + k5 + k4 + k7+ k1
- D2 = k2 + k4 + k2 + k1 + k5 + k2 + k1+ k5 + k3 + k7 + k1
- D3 = k4 + k5 + k3 + k6 + k5 + k3 + … + k3+ k7 + k1
- D3 = k1 + k7 + k2 + k8 + k1 + k8 + … + k3+ k6 + k5
The topics are the mixture of the words:
- K1 = w3 + w8 + w`4 + w`6 + w’10 + w“2 + w“6 + … + w“13 + w“`12 + w““1 + w““5
- K2 = w4 + w`1 + w`3 + w“15 + …. + w““3 + …
- K3 = w2 + w’8 + w“1 + w“4 + w“14 + w“`3 + w“`6 + … + w“`10 + w““32 + …
Similarly, LDA will give the word combinations for other topics.
Post the first iteration of LDA:
After the first iteration, LDA does provide the initial document- topic and topic-word matrices. The task at hand is to optimize these obtained results which LDA does by iterating over all the documents and all the words.
LDA makes another assumption that all the topics that have been assigned are correct except the current word. So, based on those already-correct topic-word assignments, LDA tries to correct and adjust the topic assignment of the current word with a new assignment for which:
LDA will iterate over: each document ‘D’ and each word ‘w’
How would it do that? It does so by computing two probabilities: p1 and p2 for every topic (k) where:
P1: proportion of words in the document (D) that are currently assigned to the topic (k)
P2: is the proportion of assignments to the topic(k) over all documents that come from this word w. In other words, p2 is the proportion of those documents in which the word (w) is also assigned to the topic (k)
The formula for p1 and p2 is:
- P1 = proportion (topic k / document D) ,and
- P2 = proportion(word w / topic k)
Now, using these probabilities p1 and p2, LDA estimates a new probability, which is the product of (p1*p2), and through this product probability, LDA identifies the new topic, which is the most relevant topic for the current word.
Reassignment of word ‘w’ of the document ‘D’ to a new topic ‘k’ via the product probability of p1 * p2
Now, the LDA is performed for a large number of iterations for the step of choosing the new topic ‘k’ until a steady-state is obtained. The convergence point of LDA is obtained where it gives the most optimized representation of the document-term matrix and topic-word matrix.
This completes the working and the process of Latent Dirichlet Allocation.
Now, before moving on to the implementation of LDA, here’s one last thing …
The Similarity between LDA and PCA
Topic Modeling is similar to Principal Component Analysis (PCA). You may be wondering how is that? Allow me to explain.
So, PCA is a dimensionality reduction technique, right? And, it is used for data having numerical values. It is the linear combination of variables from which components are derived that are used to build the model. PCA works by breaking or decomposing a larger value (i.e. a singular value) into smaller values to reduce the dimensions.
LDA operates in the same way as PCA does. LDA is applied to the text data. It works by decomposing the corpus document word matrix (the larger matrix) into two parts (smaller matrices): the Document Topic Matrix and the Topic Word. Therefore, LDA like PCA is a matrix factorization technique.
Let’s say we have the following corpus of documents having the tokenized words freekick, dunk, rebound, foul, shoot, NBA, Liverpool as below:
In case, we do not apply LDA or any of the topic modeling techniques then, we will have the
tokenized words (after the text preprocessing steps) which are on the left panel of the above image. These words will then be passed to build a model and may have more features (more than present in the image).
Now, instead of using the tokenized words obtained after applying the Bag of Words vectorizer, if we transform the initial document-word matrix into two parts: one for the word per topic and the other for the topic per document.
See, in the above image, rather than using the words: freekick, dunk, rebound, foul, shoot, NBA, Liverpool, we have combined similar words into two groups: Soccer and Basketball.
This gave us two different topics: one on football where Soccer consists of the words freekick, shoot, and Liverpool, and the other topic Basketball consisting of the words dunk, rebound, foul, and NBA.
This breaking of the initial larger word features matrix not only enabled us to classify text into documents but it also essentially reduced the features used to build the model.
LDA breaks the corpus document word into lower-dimensional matrices. Therefore, Topic modeling and its techniques are also used for dimensionality reduction.
Latent Dirichlet Allocation (LDA) does two tasks: it finds the topics from the corpus, and at the same time, assigns these topics to the document present within the same corpus. The below schematic diagram summarizes the process of LDA well:
We will apply all theory that we learned and execute in Python using the gensim and sklearn packages in the next and last part of the Topic Modeling series.
- Dirichlet Process by Yee Whye Teh, University College London
Thank You for reading and Happy Learning! 🙂
Hi there! I am Neha Seth, a technical writer for AnalytixLabs. I hold a Postgraduate Program in Data Science & Engineering from the Great Lakes Institute of Management and a Bachelors in Statistics. I have been featured as Top 10 Most Popular Guest Authors in 2020 on Analytics Vidhya (AV).