Detect Cyberbullying Using Topic Modeling and Sentiment Analysis
With rising internet penetration across the world and the rapid growth of social media companies, users increasingly rely on social media platforms to interact with like-minded individuals and to follow their favourite celebrities and influencers. This increased use of social media has been accompanied by a significant rise in cyberbullying cases. According to the youth activism non-profit organization DoSomething, about 37% of teenagers between the ages of 12 and 17 have been bullied online, and 23% of students have said that they have done something cruel or mean to someone else. Given this rise, it is important to monitor and control such cases to avoid greater harm to the minds of young people.
In this article, we will cover an unsupervised learning method, Topic Modeling, and a supervised learning method, Sentiment Classification, and apply both to a dataset of tweets. Real-world text data comes with a large number of unique tokens, which can be complex to comprehend. Labeling textual instances for supervised classification is difficult and costly, whereas unsupervised learning methods need no labels. This article explores the advantages of Topic Modeling over supervised learning methods for large text corpora, with a hands-on project implementation. So let’s dive deep into the article.
This article was published as a part of the Data Science Blogathon.
What is Topic Modeling?
Topic Modeling is an unsupervised machine learning approach for extracting frequently discussed topics from a text corpus. Unlike a supervised learning method, an unsupervised learning method does not have any labels associated with the documents in the training corpus. Each topic is a composition of words that occur in the documents, and a text corpus usually contains multiple topics that depend on the context of the text data.
Next, we will look at the methods for Topic Modeling that are used in the industry. There are mainly two types of Topic Modeling techniques:
- Traditional Topic Modeling
- Neural Topic Modeling
Let’s look at the Traditional Topic Modeling techniques and their applications in industries.
1. Traditional Topic Modeling
These types of Topic Modeling techniques are based on statistics and probabilistic models. They assume that each document contains a set of topics and that each topic is a distribution over words. The models are trained using either Matrix Factorization or statistical inference.
Among Matrix Factorization techniques, we have the Non-Negative Matrix Factorization (NNMF) model, while among statistical and probabilistic methods we have Latent Dirichlet Allocation (LDA), which is widely used in topic modeling tasks.
Non-Negative Matrix Factorization reduces a high-dimensional dataset to a lower-dimensional representation composed of non-negative vectors. This helps capture the essential structure and variability of the dataset and identifies a set of topics and themes that can explain the word frequencies in the document-term matrix.
In contrast, Latent Dirichlet Allocation identifies hidden topics in a large text corpus using a probabilistic generative model. It assumes that each document is a mixture of topics and that each topic is a distribution over words, and it infers the probability of each topic from the words that appear in the document.
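To make the factorization idea concrete, here is a minimal pure-Python sketch of Non-Negative Matrix Factorization using the classic multiplicative update rules. The tiny document-term matrix, the term names, and the function names are hypothetical and chosen only for illustration; in practice one would typically use a library implementation rather than this loop.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]

def squared_error(v, w, h):
    """Squared Frobenius distance between V and its reconstruction W @ H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for v_row, wh_row in zip(v, wh)
               for x, y in zip(v_row, wh_row))

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    # strictly positive random start so the updates stay non-negative
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_terms)]
             for i in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_docs)]
    return w, h

# hypothetical toy document-term matrix: rows are documents, columns are
# counts of the terms ["game", "team", "score", "recipe", "oven", "flavor"]
doc_term = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 1, 2, 3],
]
W, H = nmf(doc_term, n_topics=2)
```

Each row of `H` can be read as a topic (a non-negative weighting over terms), and each row of `W` tells how strongly a document expresses each topic.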
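The generative story behind LDA can be sketched in a few lines of plain Python. This is only an illustration of how LDA assumes documents are produced, not the inference algorithm that libraries run; the two toy topics, their word probabilities, and the function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document following the LDA generative process."""
    # 1. draw the document's topic mixture
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. pick a topic for this word position
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. pick a word from that topic's word distribution
        vocab = list(topic_word[z])
        probs = list(topic_word[z].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# hypothetical topics: each maps words to probabilities
toy_topics = [
    {"game": 0.5, "team": 0.3, "score": 0.2},
    {"recipe": 0.4, "oven": 0.4, "flavor": 0.2},
]
rng = random.Random(42)
theta, document = generate_document(toy_topics, alpha=[0.5, 0.5], n_words=8, rng=rng)
```

Training an LDA model amounts to running this story in reverse: given only the observed words, estimate the topic mixtures and the per-topic word distributions that most plausibly generated them.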
Using the coherence metric to measure the performance of the LDA model
The coherence metric is used to measure how well the identified topics hang together in a given text corpus. When we talk about ‘Coherence’, we talk about how strongly the words of an identified topic support each other in a reference corpus.
Topic coherence assesses how well a topic is supported by a text corpus or a reference text. It uses statistics and probability to compare the distribution of words and topics in a given corpus. It then assigns a coherence score to each topic. Finally, it aggregates all the individual scores to give a single coherence score to the model.
The intuition behind the topic coherence metric
To understand topic coherence without heavy math and statistics: the method takes the selected topics and a reference corpus as input. It then segments the words of each topic into pairs and calculates the probabilities of those words in the text corpus. Next, it calculates confirmation measures, which tell us how well each word pair is supported by the text corpus. Finally, all the confirmation measures are aggregated into a topic coherence score. For the ‘c_v’ measure used later in this article, the score typically falls in the range of 0 to 1, and a score closer to 1 means better performance in Topic Modeling.
We will look at hands-on Topic Modeling with code examples, including the implementation of the coherence metric, in the hands-on project section.
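The segment, estimate, confirm, and aggregate steps above can be sketched in plain Python. This is a simplified, assumed variant that uses document-level co-occurrence and normalized pointwise mutual information (NPMI) as the confirmation measure, so its scores range from -1 to 1; it is not the exact ‘c_v’ computation, which is more elaborate. The toy documents and the function name are hypothetical.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs of one topic (document co-occurrence)."""
    if len(topic_words) < 2:
        raise ValueError("a topic needs at least two words to form a pair")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # fraction of documents that contain all the given words
        return sum(1 for d in doc_sets if all(w in d for w in words)) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):   # segmentation into pairs
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)                   # the pair never co-occurs
            continue
        denom = -math.log(p12)
        if denom == 0:
            scores.append(0.0)                    # both words are in every document
            continue
        scores.append(math.log(p12 / (p1 * p2)) / denom)
    return sum(scores) / len(scores)              # aggregation

# hypothetical tokenized reference corpus
toy_docs = [
    ["game", "team", "score"],
    ["team", "score", "coach"],
    ["recipe", "oven", "flavor"],
    ["oven", "flavor", "recipe", "team"],
]
coherent_score = npmi_coherence(["team", "score"], toy_docs)
mixed_score = npmi_coherence(["score", "oven"], toy_docs)
```

A topic whose words tend to appear in the same documents (`coherent_score`) scores higher than one that mixes unrelated words (`mixed_score`).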
2. Neural Topic Modeling
Neural Topic Modeling uses neural networks to capture complex relationships between words in the text corpus. Unlike Traditional Topic Modeling techniques, it does not rely solely on word frequencies or TF-IDF weights to identify topics. Neural network-based Topic Modeling techniques can capture the context of the text corpus, which Traditional Topic Modeling methods based on word counts largely ignore.
Types of Neural Topic Modeling Techniques
Two widely used Neural Topic Modeling techniques are:
- Contextualized Topic Modeling – It incorporates contextualized embeddings of the text, which reflect words together with their surrounding words, to better capture topics.
- BERTopic – It uses embeddings from pre-trained BERT-based models to represent the documents, clusters them, and extracts topics from large collections of documents.
Applications of Topic Modeling
Now that we have learned what topic modeling is, let’s look at some of its applications in industry.
- Marketing – Topic Modeling can be used to analyze customer reviews and feedback to discover what customers are talking about and to identify new trends in the text corpus.
- Healthcare – In the healthcare sector, Topic Modeling can be used to analyze medical records, identify patterns, and extract relevant information.
- Legal – Topic Modeling can be used to analyze legal documents, identify key issues, and extract relevant information.
What is Sentiment Classification — a Supervised Classification?
Sentiment Classification is a Natural Language Processing (NLP) technique used to classify text data according to the sentiment expressed in the text, such as positive, negative, or neutral. In the context of cyberbullying, Sentiment Classification can be used to identify whether the sentiment of a text is indicative of bullying behavior. We want to classify each tweet either as a positive tweet or as a negative tweet that indicates bullying behavior. We will look at the code examples to achieve this in a later section.
Applications of the Sentiment Classification
Sentiment analysis has many applications in the industry. Let’s look at some of them:
- Social Media Monitoring – Companies use social media to engage with customers and maintain their online presence. It is important to monitor customer engagement and conversation on social media platforms to measure how well companies’ products and services are received.
- Customer Support Ticket Analysis – Companies have online customer support ticket systems to manage queries and concerns. By analyzing support ticket conversations, companies can monitor feedback from customers and measure customer sentiment.
- Brand Monitoring and Management – Sentiment Analysis techniques can be used to monitor brands across social media platforms and other online channels.
Depending on the application, text samples need to be classified as very positive, positive, neutral, negative, or very negative.
Differences Between Topic Modeling and Supervised Sentiment Classification
One of the major differences between topic modeling and sentiment classification is their learning method itself. Topic Modeling is an unsupervised learning technique while Sentiment Classification is a supervised learning technique. Let’s look at some other differences:
|Topic Modeling|Sentiment Classification|
|There is no need to label the text documents|A large number of samples has to be labeled|
|It can identify complex word similarities within one document|It cannot identify similarities within one single document|
|It has a lower cost of modeling and inference due to its flexibility|It has a higher cost of modeling due to manual labeling of text samples|
While Sentiment Analysis is a popular approach used widely in industry, it has drawbacks that are hard to avoid. The cost of labeling every text document grows significantly with the size of the corpus, which might not be viable. In a large text corpus, each document may also touch on several different topics, which is impractical to label in a supervised learning approach. Topic Modeling can identify and capture such relationships within the documents and cluster the topics accordingly.
Hands-on Project Implementation Using Python
In this section, we will look at the implementation of Topic Modeling using the Gensim library of Python. We will also compare Topic Modeling with the Sentiment Classification technique.
Topic Modeling Using the ‘Gensim’ Library
First, we will load the dataset of cyberbullying tweets. The dataset is annotated with the categories ‘none’, ‘racism’, and ‘sexism’. Labels are assigned as ‘0’ for non-bullying tweets and ‘1’ for bullying tweets.
Let’s read the dataset and run a topic modeling pipeline on the textual data, interpreting the result with pyLDAvis. We will also measure performance using the coherence metric to find an optimal number of topics in the dataset.
As we can see in the dataset output, the text column is a series of tweets with annotations and labels. There are more than 16000 rows in the dataset, so labeling each tweet would have been a costly task. This increases the cost of the data science project, which needs to be taken into account. Topic Modeling does not require such labels, so it saves the company or client the cost of labeling when identifying the most prevalent topics in the dataset.
Let’s implement the Topic Modeling pipeline in the next step:
    # define a pre-processing function to model topics based on annotation
    def preprocess_topic(df, topic):
        """
        Preprocessing function to model text data based on a given topic.

        args:
            df = input dataframe
            topic = input topic: "none", "sexism", or "racism"

        returns:
            corpus of words under the given topic (a list of token lists)
        """
        corpus = []
        # topic-wise division: keep only the rows annotated with `topic`
        for doc in df[df['Annotation'] == topic]['cleaned_text']:
            # remove_stowords and lemma_clean_text are helper functions
            # from the pre-processing code, which is not shown here
            stop_word_removal = remove_stowords(doc)
            lemmatized_sample = lemma_clean_text(stop_word_removal)
            words = lemmatized_sample.split()
            corpus.append(words)
        return corpus

The above code takes user input to choose one of the annotations in our dataset and performs Topic Modeling on that subset of the text corpus.
(Note: The above code takes the cleaned, pre-processed text from the original data frame. I have linked the code repository at the end of this article with more details.)
    import gensim

    # corpus of the words
    corpus = preprocess_topic(ndf, 'sexism')

    # create BOW model from corpus
    dic = gensim.corpora.Dictionary(corpus)
    bow_corpus = [dic.doc2bow(doc) for doc in corpus]

    # create LDA model using gensim library
    lda_model = gensim.models.LdaMulticore(bow_corpus,
                                           num_topics=4,
                                           id2word=dic,
                                           passes=10,
                                           workers=2)
    lda_model.show_topics()

The above code stores the text corpus in the ‘corpus’ object and creates the dictionary of the text corpus. In the next step, we instantiate the ‘LdaMulticore’ class of the ‘gensim.models’ module in order to model the text data and generate 4 topics from the training dataset. Finally, we can call ‘lda_model.show_topics()’ to see the 4 topics.
As an output of the training pipeline, the model will generate a list of tuples containing the 4 most prevalent topics in the text corpus along with their word distributions.
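To see what the dictionary and bag-of-words step in the pipeline does, here is a minimal pure-Python sketch of the same idea. It only imitates the behavior of Gensim’s ‘Dictionary’ and ‘doc2bow’ and is not their implementation; the function names and the toy corpus are hypothetical, and Gensim may assign token ids in a different order.

```python
from collections import Counter

def build_vocabulary(corpus):
    """Map every distinct token in the corpus to an integer id."""
    tokens = sorted({tok for doc in corpus for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}

def doc_to_bow(doc, token2id):
    """Convert one tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# hypothetical tokenized corpus
toy_corpus = [["team", "game", "team"], ["game", "score"]]
token2id = build_vocabulary(toy_corpus)
toy_bow = [doc_to_bow(doc, token2id) for doc in toy_corpus]
```

Word order is discarded in this representation: each document is reduced to how often each vocabulary entry occurs, which is the input LDA works with.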
Visual Interpretation of Topic Modeling Output
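    import pyLDAvis
    # pyLDAvis 3.3 and later expose the gensim helper as `gensim_models`;
    # older releases used `pyLDAvis.gensim`
    import pyLDAvis.gensim_models

    # visualizing the topics
    def plot_lda_vis(lda_model, bow_corpus, dic):
        pyLDAvis.enable_notebook()
        vis = pyLDAvis.gensim_models.prepare(lda_model, bow_corpus, dic)
        return vis

    plot_lda_vis(lda_model, bow_corpus, dic)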
In the above visualization, each topic is shown on an intertopic distance map which explains how far each topic is from the others. On the right side, a bar chart of word frequency is shown with the most salient terms occurring in the text corpus.
Using ‘pyLDAvis’ we can visualize the distribution of the topics and words in the text corpus to make it more interpretable for the stakeholders.
Calculating the Coherence Metric of the Model
    # assessing coherence metric of the model
    from gensim.models.coherencemodel import CoherenceModel

    topics = [['prophet', 'slavery', 'violence', 'fear'],
              ['people', 'religion', 'slave', 'hate', 'like'],
              ['like', 'murder', 'people', 'prophet'],
              ['war', 'humanity', 'religion', 'salon', 'world']]

    # Coherence model
    cm = CoherenceModel(topics=topics,
                        texts=corpus,
                        coherence='c_v',
                        dictionary=dic)
    coherence_per_topic = cm.get_coherence_per_topic()
    coherence_per_topic

    --------------------------------[output]--------------------------------------
    [0.24646713695437958, 0.17976752238536964, 0.32051023235616505, 0.33402730347565524]

To calculate the coherence metric of our topic model, we can use the ‘CoherenceModel’ class of the ‘gensim.models.coherencemodel’ module. By setting the parameters as shown above, we get the coherence score of each topic in our corpus. Under the hood, the class implements the coherence metric pipeline that we saw in the earlier section.
Now let’s visualize the coherence score of each topic using the seaborn library.
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    # plotting coherence score
    topics_str = ['\n '.join(t) for t in topics]
    data_topic_score = pd.DataFrame(data=zip(topics_str, coherence_per_topic),
                                    columns=['Topic', 'Coherence'])
    data_topic_score = data_topic_score.set_index('Topic')

    # plotting using seaborn heatmap
    fig, ax = plt.subplots(figsize=(2, 6))
    ax.set_title("Topics coherence\n $C_v$")
    sns.heatmap(data=data_topic_score, annot=True, square=True,
                cmap='Reds', fmt='.2f', linecolor='black', ax=ax)
    plt.yticks(rotation=0)
    ax.set_xlabel('')
    ax.set_ylabel('')
    plt.show()
In the above example, topic coherence is still low. So to improve the model performance one can try a different number of topics to train the topic model and find the optimal number of topics in the dataset.
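One way to search for a better number of topics is sketched below. The helper names `pick_num_topics` and `gensim_cv_score` are hypothetical, and the Gensim-based scoring function assumes that the `bow_corpus`, `dic`, and `corpus` objects from the pipeline above are available; treat it as a starting point rather than a tuned procedure.

```python
def pick_num_topics(candidates, score_fn):
    """Score every candidate topic count and return the best one with all scores."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

def gensim_cv_score(k, bow_corpus, dic, corpus):
    """Train an LDA model with k topics and return its overall c_v coherence."""
    # imported lazily so the search helper above works without gensim installed
    import gensim
    from gensim.models.coherencemodel import CoherenceModel

    lda = gensim.models.LdaMulticore(bow_corpus, num_topics=k, id2word=dic,
                                     passes=10, workers=2, random_state=42)
    cm = CoherenceModel(model=lda, texts=corpus, dictionary=dic, coherence='c_v')
    return cm.get_coherence()

# possible usage with the objects built earlier in this article:
# best_k, scores = pick_num_topics(
#     range(2, 11),
#     lambda k: gensim_cv_score(k, bow_corpus, dic, corpus),
# )
```

Fixing `random_state` keeps the comparison between topic counts repeatable, although the scores can still vary with pre-processing choices and the number of passes.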
Sentiment Classification using TF-IDF vectorization
To detect cyberbullying in a text corpus of tweets, Sentiment Classification can be used to classify each tweet as either containing or not containing bullying behavior. This can be achieved by training supervised learning algorithms like Multinomial Naive Bayes or Support Vector Machines. We will implement the Naive Bayes algorithm to classify the sentiment of each tweet.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    X = ndf['correct_text']
    y = ndf['oh_label']

    # train and test split the dataset
    X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, random_state=42)

    # tfidf object
    tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=5000)

    # vectorization using tfidf
    X_trn_vect = tfidf.fit_transform(X_trn)
    X_tst_vect = tfidf.transform(X_tst)

    # converting sparse matrices into pandas dataframes
    # (get_feature_names_out replaces get_feature_names in scikit-learn 1.0 and later)
    x_t1 = pd.DataFrame(X_trn_vect.toarray(), columns=tfidf.get_feature_names_out())
    x_t2 = pd.DataFrame(X_tst_vect.toarray(), columns=tfidf.get_feature_names_out())

    # applying MultinomialNB algorithm
    clf = MultinomialNB()
    clf.fit(x_t1, y_trn)
    pred = clf.predict(x_t2)

    # LOG LOSS of the model
    # note: `pred` holds hard 0/1 labels; log loss is normally computed on the
    # probabilities from clf.predict_proba(x_t2), which would give a different value
    print("logloss: %0.3f " % log_loss(y_tst.values, pred))

    -------------------------------[Output]-------------------------------------
    logloss: 8.241

The code performs Sentiment Classification using the Multinomial Naive Bayes algorithm on a dataset consisting of two columns: the first contains the text data (X) and the other contains the corresponding labels (y).
The dataset is split into train and test data, followed by TF-IDF vectorization using the sklearn library. We then use the Multinomial Naive Bayes classifier to build a classification model and evaluate it on the test data using log loss.
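To show what the TF-IDF step computes, here is a minimal pure-Python sketch that follows the default weighting of scikit-learn’s ‘TfidfVectorizer’ (smoothed inverse document frequency followed by L2 normalization) on already tokenized text. The function name and the toy documents are hypothetical, and the sketch leaves out tokenization, stop-word removal, n-grams, and the feature cap used in the article’s code.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {token: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # document frequency: in how many documents does each token occur?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        raw = {tok: count * idf[tok] for tok, count in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({tok: w / norm for tok, w in raw.items()} if norm else {})
    return vectors

# hypothetical tokenized tweets
toy_tweets = [
    ["you", "are", "kind"],
    ["you", "are", "rude"],
    ["you", "win"],
]
toy_vectors = tfidf_vectors(toy_tweets)
```

Tokens that occur in fewer documents receive larger weights, so in the first toy tweet the word "kind" outweighs the word "you", which appears in every tweet.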
GitHub code repository: Sentiment classification
Topic Modeling and Sentiment Analysis are two crucial Natural Language Processing techniques for analyzing large text corpora. While both techniques are used to extract insights from text data, they differ in their approach and goals.
Sentiment Classification is a technique used to classify the sentiment expressed in a piece of text as positive, neutral, or negative. This is achieved using supervised learning algorithms, such as Naive Bayes.
Topic Modeling, in contrast, is a technique used to identify the underlying topics in a large corpus of text. This is achieved using unsupervised learning algorithms, such as Latent Dirichlet Allocation (LDA). Topic Modeling is useful for applications such as content analysis, trend analysis, and document clustering. Let’s look at the key takeaways from this article.
- Topic Modeling is an unsupervised learning technique for identifying patterns and relationships within the data.
- Sentiment Analysis is limited to identifying sentiment polarity, whereas Topic Modeling can identify complex themes and subtopics within the data. This makes Topic Modeling well suited to the analysis of large text corpora.
- We learned about the coherence metric to measure the performance of the model.
- We also implemented a Topic Modeling pipeline using the Gensim library of Python.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.