Prateek Joshi — Published On July 30, 2018 and Last Modified On August 26th, 2021

Introduction

Natural Language Processing (NLP) is a hotbed of research in data science these days and one of the most common applications of NLP is sentiment analysis. From opinion polls to creating entire marketing strategies, this domain has completely reshaped the way businesses work, which is why this is an area every data scientist must be familiar with.

Thousands of text documents can be processed for sentiment (and other features including named entities, topics, themes, etc.) in seconds, compared to the hours it would take a team of people to manually complete the same task.

In this article, we will learn how to solve the Twitter Sentiment Analysis Practice Problem.

We will do so by following a sequence of steps needed to solve a general sentiment analysis problem. We will start with preprocessing and cleaning of the raw text of the tweets. Then we will explore the cleaned text and try to get some intuition about the context of the tweets. After that, we will extract numerical features from the data and finally use these feature sets to train models and identify the sentiments of the tweets.

This is one of the most interesting challenges in NLP so I’m very excited to take this journey with you!

 

Table of Contents

  1. Understand the Problem Statement
  2. Tweets Preprocessing and Cleaning
  3. Story Generation and Visualization from Tweets
  4. Extracting Features from Cleaned Tweets
  5. Model Building: Sentiment Analysis
  6. What’s Next

1. Understand the Problem Statement

Let’s go through the problem statement once as it is very crucial to understand the objective before working on the dataset. The problem statement is as follows:

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label ‘1’ denotes the tweet is racist/sexist and label ‘0’ denotes the tweet is not racist/sexist, your objective is to predict the labels on the given test dataset.

Note: The evaluation metric for this practice problem is F1-Score.

Personally, I quite like this task. Hate speech, trolling, and social media bullying have become serious issues these days, and a system that is able to detect such text would surely be of great use in making the internet and social media a better, bully-free place. Let’s look at each step in detail now.

 

2. Tweets Preprocessing and Cleaning

Imagine two scenarios of an office space – one untidy and the other clean and organized.

You are searching for a document in this office space. In which scenario are you more likely to find the document easily? Of course, in the less cluttered one because each item is kept in its proper place. The data cleaning exercise is quite similar. If the data is arranged in a structured format then it becomes easier to find the right information.

Preprocessing the text data is an essential step as it makes the raw text ready for mining, i.e., it becomes easier to extract information from the text and apply machine learning algorithms to it. If we skip this step, there is a higher chance that we end up working with noisy and inconsistent data. The objective of this step is to remove the noise that is less relevant for finding the sentiment of tweets, such as punctuation, special characters, numbers, and terms which don’t carry much weight in the context of the text.

In one of the later stages, we will be extracting numeric features from our Twitter text data. This feature space is created using all the unique words present in the entire data. So, if we preprocess our data well, then we would be able to get a better quality feature space.

Let’s first read our data and load the necessary libraries. You can download the datasets from here.

import re
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import string
import nltk
import warnings 
warnings.filterwarnings("ignore", category=DeprecationWarning)

%matplotlib inline
train  = pd.read_csv('train_E6oV3lV.csv')
test = pd.read_csv('test_tweets_anuFYb8.csv')

Let’s check the first few rows of the train dataset.

train.head()

The data has 3 columns id, label, and tweet. label is the binary target variable and tweet contains the tweets that we will clean and preprocess.
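
Since the evaluation metric is F1-score, it is also worth taking a quick look at how the two classes are distributed in the label column. This is an optional check, not part of the original walkthrough:

# proportion of racist/sexist (1) vs. other (0) tweets in the train set
train['label'].value_counts(normalize=True)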

Initial data cleaning requirements that we can think of after looking at the top 5 records:

  • The Twitter handles are already masked as @user due to privacy concerns, so they hardly give any information about the nature of the tweet.
  • We can also think of getting rid of the punctuations, numbers and even special characters since they wouldn’t help in differentiating different kinds of tweets.
  • Most of the smaller words do not add much value. For example, ‘pdx’, ‘his’, ‘all’. So, we will try to remove them as well from our data.
  • Once we have executed the above three steps, we can split every tweet into individual words or tokens which is an essential step in any NLP task.
  • In the 4th tweet, there is a word ‘love’. We might also have terms like loves, loving, lovable, etc. in the rest of the data. These terms are often used in the same context. If we can reduce them to their root word, which is ‘love’, then we can reduce the total number of unique words in our data without losing a significant amount of information.

 

A) Removing Twitter Handles (@user)

As mentioned above, the tweets contain plenty of Twitter handles (@user), which is how a Twitter user is acknowledged on Twitter. We will remove all these handles from the data as they don’t convey much information.

For our convenience, let’s first combine train and test set. This saves the trouble of performing the same steps twice on test and train.

# DataFrame.append has been removed from recent pandas versions; pd.concat is the equivalent
combi = pd.concat([train, test], ignore_index=True)

Given below is a user-defined function to remove unwanted text patterns from the tweets. It takes two arguments, one is the original string of text and the other is the pattern of text that we want to remove from the string. The function returns the same input string but without the given pattern. We will use this function to remove the pattern ‘@user’ from all the tweets in our data.

def remove_pattern(input_txt, pattern):
    # find every substring in the tweet that matches the given pattern
    r = re.findall(pattern, input_txt)
    # remove each matched substring (re.escape treats the match as literal text)
    for i in r:
        input_txt = re.sub(re.escape(i), '', input_txt)

    return input_txt

Now let’s create a new column tidy_tweet which will contain the cleaned and processed tweets. Note that we have passed “@[\w]*” as the pattern to the remove_pattern function. It is a regular expression that will match any word starting with ‘@’.

# remove twitter handles (@user)
combi['tidy_tweet'] = np.vectorize(remove_pattern)(combi['tweet'], r"@[\w]*")  # raw string avoids an invalid-escape warning
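
As a quick check of what the function does, here is a made-up example (the tweet text below is hypothetical, not from the dataset):

remove_pattern("@user thanks for the #dinner @friend", r"@[\w]*")
# returns ' thanks for the #dinner ' (the handles are gone; the leftover spaces are cleaned up in the next steps)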

 

B) Removing Punctuations, Numbers, and Special Characters

As discussed, punctuations, numbers, and special characters do not help much. It is better to remove them from the text, just as we removed the Twitter handles. Here we will replace everything except letters and the ‘#’ character with spaces.

# remove special characters, numbers, punctuations
combi['tidy_tweet'] = combi['tidy_tweet'].str.replace(r"[^a-zA-Z#]", " ", regex=True)  # regex=True is required in newer pandas, where the default is a literal replace

 

C) Removing Short Words

We have to be a little careful here in selecting the length of the words which we want to remove. So, I have decided to remove all the words having length 3 or less. For example, terms like “hmm”, “oh” are of very little use. It is better to get rid of them.

combi['tidy_tweet'] = combi['tidy_tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))

Let’s take another look at the first few rows of the combined dataframe.

combi.head()

You can see the difference between the raw tweets and the cleaned tweets (tidy_tweet) quite clearly. Only the important words in the tweets have been retained and the noise (numbers, punctuations, and special characters) has been removed.

 

D) Tokenization

Now we will tokenize all the cleaned tweets in our dataset. Tokens are individual terms or words, and tokenization is the process of splitting a string of text into tokens.

tokenized_tweet = combi['tidy_tweet'].apply(lambda x: x.split())
tokenized_tweet.head()

 

E) Stemming

Stemming is a rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word. For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”.
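
Before applying it to the tweets, here is a quick way to see what NLTK’s Porter stemmer actually returns for these variations (an illustrative check):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["play", "player", "played", "plays", "playing"]:
    print(word, "->", stemmer.stem(word))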

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

tokenized_tweet = tokenized_tweet.apply(lambda x: [stemmer.stem(i) for i in x]) # stemming
tokenized_tweet.head()

Now let’s stitch these tokens back together.

for i in range(len(tokenized_tweet)):
    tokenized_tweet[i] = ' '.join(tokenized_tweet[i])

combi['tidy_tweet'] = tokenized_tweet
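
If you want to see the raw tweets and the fully cleaned tweets side by side at this point, a quick optional look:

# compare the original tweets with the cleaned, stemmed versions
combi[['tweet', 'tidy_tweet']].head()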

 

 

3. Story Generation and Visualization from Tweets

In this section, we will explore the cleaned tweet text. Exploring and visualizing data, whether it is text or any other kind of data, is an essential step in gaining insights. Do not limit yourself to the methods described in this tutorial; feel free to explore the data as much as possible.

Before we begin exploration, we must think and ask questions related to the data in hand. A few probable questions are as follows:

  • What are the most common words in the entire dataset?
  • What are the most common words in the dataset for negative and positive tweets, respectively?
  • How many hashtags are there in a tweet?
  • Which trends are associated with my dataset?
  • Which trends are associated with either of the sentiments? Are they compatible with the sentiments?

 

A) Understanding the common words used in the tweets: WordCloud

Now I want to see how well the given sentiments are distributed across the train dataset. One way to accomplish this is to look at the most common words by plotting wordclouds.

A wordcloud is a visualization wherein the most frequent words appear in large size and the less frequent words appear in smaller sizes.

Let’s visualize all the words in our data using the wordcloud plot.

all_words = ' '.join([text for text in combi['tidy_tweet']])
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

We can see most of the words are positive or neutral, with happy and love being the most frequent ones. However, this plot doesn’t give us any idea about the words associated with the racist/sexist tweets. Hence, we will plot separate wordclouds for both the classes (racist/sexist and non racist/sexist) in our train data.

B) Words in non racist/sexist tweets

normal_words =' '.join([text for text in combi['tidy_tweet'][combi['label'] == 0]])

wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(normal_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

We can see most of the words are positive or neutral, with happy, smile, and love being the most frequent ones. Hence, most of the frequent words are compatible with the sentiment, i.e., non racist/sexist tweets. Similarly, we will plot the word cloud for the other sentiment and expect to see negative, racist, and sexist terms.

C) Racist/Sexist Tweets

negative_words = ' '.join([text for text in combi['tidy_tweet'][combi['label'] == 1]])
wordcloud = WordCloud(width=800, height=500,
                      random_state=21, max_font_size=110).generate(negative_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

As we can clearly see, most of the words have negative connotations. So, it seems we have pretty good text data to work on. Next, we will look at the hashtags/trends in our Twitter data.

 

D) Understanding the impact of Hashtags on tweet sentiment

Hashtags on Twitter are synonymous with the ongoing trends at any particular point in time. We should check whether these hashtags add any value to our sentiment analysis task, i.e., whether they help in distinguishing tweets of different sentiments.

For instance, one of the tweets in our dataset seems sexist in nature, and the hashtags it contains convey the same feeling.

We will store all the trend terms in two separate lists — one for non-racist/sexist tweets and the other for racist/sexist tweets.

# function to collect hashtags
def hashtag_extract(x):
    hashtags = []
    # Loop over the words in the tweet
    for i in x:
        ht = re.findall(r"#(\w+)", i)
        hashtags.append(ht)

    return hashtags
# extracting hashtags from non racist/sexist tweets
HT_regular = hashtag_extract(combi['tidy_tweet'][combi['label'] == 0])

# extracting hashtags from racist/sexist tweets
HT_negative = hashtag_extract(combi['tidy_tweet'][combi['label'] == 1])

# unnesting list
HT_regular = sum(HT_regular,[])
HT_negative = sum(HT_negative,[])
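
Note that hashtag_extract returns one list of hashtags per tweet, which is why the lists are “unnested” above. A small illustrative example with made-up tweets (not from the dataset):

sample = hashtag_extract(["i love #nlp and #python", "no tags here", "#happy days"])
print(sample)            # [['nlp', 'python'], [], ['happy']]
print(sum(sample, []))   # ['nlp', 'python', 'happy']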

Now that we have prepared our lists of hashtags for both the sentiments, we can plot the top n hashtags. So, first let’s check the hashtags in the non-racist/sexist tweets.

Non-Racist/Sexist Tweets

a = nltk.FreqDist(HT_regular)
d = pd.DataFrame({'Hashtag': list(a.keys()),
                  'Count': list(a.values())})
# selecting top 10 most frequent hashtags     
d = d.nlargest(columns="Count", n = 10) 
plt.figure(figsize=(16,5))
ax = sns.barplot(data=d, x= "Hashtag", y = "Count")
ax.set(ylabel = 'Count')
plt.show()

All these hashtags are positive and it makes sense. I am expecting negative terms in the plot of the second list. Let’s check the most frequent hashtags appearing in the racist/sexist tweets.

Racist/Sexist Tweets

b = nltk.FreqDist(HT_negative)
e = pd.DataFrame({'Hashtag': list(b.keys()), 'Count': list(b.values())})
# selecting top 10 most frequent hashtags
e = e.nlargest(columns="Count", n = 10)   
plt.figure(figsize=(16,5))
ax = sns.barplot(data=e, x= "Hashtag", y = "Count")
ax.set(ylabel = 'Count')
plt.show()

As expected, most of the terms are negative with a few neutral terms as well. So, it’s not a bad idea to keep these hashtags in our data as they contain useful information. Next, we will try to extract features from the tokenized tweets.

 

4. Extracting Features from Cleaned Tweets

To analyze preprocessed data, we need to convert it into features. Depending upon the usage, text features can be constructed using assorted techniques – Bag-of-Words, TF-IDF, and Word Embeddings. In this article, we will be covering only Bag-of-Words and TF-IDF.

 

Bag-of-Words Features

Bag-of-Words is a method to represent text as numerical features. Consider a corpus (a collection of texts) called C of D documents {d1, d2, ..., dD} and N unique tokens extracted from the corpus C. The N tokens (words) form a list, and the size of the bag-of-words matrix M will be D x N. Each row of the matrix M contains the frequency of the tokens in document D(i).

Let us understand this using a simple example. Suppose we have only 2 documents:

D1: He is a lazy boy. She is also lazy.

D2: Smith is a lazy person.

The list created would consist of all the unique tokens in the corpus C.

['He', 'She', 'lazy', 'boy', 'Smith', 'person']

Here, D=2, N=6

The matrix M of size 2 x 6 will be represented as:

        He   She   lazy   boy   Smith   person
  D1     1     1      2     1       0        0
  D2     0     0      1     0       1        1

Now the columns in the above matrix can be used as features to build a classification model. Bag-of-Words features can be easily created using sklearn’s CountVectorizer function. We will set the parameter max_features = 1000 to select only top 1000 terms ordered by term frequency across the corpus.
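
As an illustration only (not part of the pipeline), the toy corpus above can be pushed through CountVectorizer to see the document-term counts. Note that CountVectorizer lowercases by default and its default tokenizer drops single-character tokens such as ‘a’, so the learned vocabulary will differ slightly from the hand-built 6-token list. On scikit-learn versions older than 1.0, use get_feature_names() instead of get_feature_names_out().

from sklearn.feature_extraction.text import CountVectorizer

docs = ["He is a lazy boy. She is also lazy.",
        "Smith is a lazy person."]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(docs)

print(toy_vectorizer.get_feature_names_out())  # learned vocabulary
print(toy_counts.toarray())                    # term counts per document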

from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
# bag-of-words feature matrix
bow = bow_vectorizer.fit_transform(combi['tidy_tweet'])
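
If you want to sanity-check the resulting feature matrix and the 1,000-term vocabulary, something like this works (again, get_feature_names_out() needs scikit-learn 1.0 or newer):

print(bow.shape)                                    # (number of tweets, up to 1000 features)
print(bow_vectorizer.get_feature_names_out()[:20])  # first 20 vocabulary terms (alphabetical)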

TF-IDF Features

This is another frequency-based method, but it differs from the bag-of-words approach in that it takes into account not just the occurrence of a word in a single document (or tweet) but in the entire corpus.

TF-IDF works by penalizing the common words by assigning them lower weights while giving importance to words which are rare in the entire corpus but appear in good numbers in few documents.

Let’s have a look at the important terms related to TF-IDF:

  • TF = (Number of times term t appears in a document)/(Number of terms in the document)
  • IDF = log(N/n), where, N is the number of documents and n is the number of documents a term t has appeared in.
  • TF-IDF = TF*IDF
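
For example, with illustrative numbers: if the term ‘love’ appears 3 times in a tweet of 10 terms, TF = 3/10 = 0.3. If the corpus has 10,000 tweets and ‘love’ appears in 100 of them, IDF = log(10000/100) = 2 (using log base 10), so TF-IDF = 0.3 * 2 = 0.6. A word that appears in 5,000 of the tweets would instead get IDF = log(2) ≈ 0.3 and hence a much lower weight, which is exactly the penalty on common words described above.
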
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
# TF-IDF feature matrix
tfidf = tfidf_vectorizer.fit_transform(combi['tidy_tweet'])

 

5. Model Building: Sentiment Analysis

We are now done with all the pre-modeling stages required to get the data in the proper form and shape. Now we will build predictive models on the dataset using the two feature sets — Bag-of-Words and TF-IDF.

We will use logistic regression to build the models. It predicts the probability of occurrence of an event by fitting data to a logit function.

The following equation is used in Logistic Regression:

p(y = 1 | x) = 1 / (1 + e^-(b0 + b1*x1 + b2*x2 + ... + bk*xk))

where p is the predicted probability that a tweet is racist/sexist, the x’s are the features, and the b’s are the coefficients learned from the data.

Read this article to know more about Logistic Regression.

Note: If you are interested in trying out other machine learning algorithms like RandomForest, Support Vector Machine, or XGBoost, then we have a free full-fledged course on Sentiment Analysis for you.

 

A) Building model using Bag-of-Words features

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# the first 31962 rows of the feature matrix correspond to the train set, the rest to the test set
train_bow = bow[:31962,:]
test_bow = bow[31962:,:]

# splitting data into training and validation set
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], random_state=42, test_size=0.3)

lreg = LogisticRegression()
lreg.fit(xtrain_bow, ytrain) # training the model

prediction = lreg.predict_proba(xvalid_bow) # predicting on the validation set
prediction_int = prediction[:,1] >= 0.3 # if predicted probability is greater than or equal to 0.3 then 1, else 0
prediction_int = prediction_int.astype(int) # np.int is removed in newer NumPy; the built-in int works here

f1_score(yvalid, prediction_int) # calculating f1 score

Output: 0.53
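
The 0.3 probability threshold above is a modelling choice that the F1-score is quite sensitive to. If you want to check other thresholds on the validation set, a small sweep like this is one option (optional, not part of the original walkthrough):

# try a range of probability thresholds and report the validation F1 for each
probs = lreg.predict_proba(xvalid_bow)[:, 1]
for t in np.arange(0.1, 0.65, 0.05):
    print("threshold = %.2f  F1 = %.3f" % (t, f1_score(yvalid, (probs >= t).astype(int))))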

We trained the logistic regression model on the Bag-of-Words features and it gave us an F1-score of 0.53 for the validation set. Now we will use this model to predict for the test data.

test_pred = lreg.predict_proba(test_bow)
test_pred_int = test_pred[:,1] >= 0.3
test_pred_int = test_pred_int.astype(int)
test['label'] = test_pred_int
submission = test[['id','label']]
submission.to_csv('sub_lreg_bow.csv', index=False) # writing data to a CSV file

The public leaderboard F1 score is 0.567. Now we will again train a logistic regression model but this time on the TF-IDF features. Let’s see how it performs.

 

B) Building model using TF-IDF features

train_tfidf = tfidf[:31962,:]
test_tfidf = tfidf[31962:,:]

# reuse the same train/validation split as before by selecting the same row indices
xtrain_tfidf = train_tfidf[ytrain.index]
xvalid_tfidf = train_tfidf[yvalid.index]

lreg.fit(xtrain_tfidf, ytrain)

prediction = lreg.predict_proba(xvalid_tfidf)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(int)

f1_score(yvalid, prediction_int)

Output: 0.544

The validation score is 0.544 and the public leaderboard F1 score is 0.564. So, by using the TF-IDF features, the validation score has improved and the public leaderboard score is more or less the same.
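
To summarize the scores reported so far:

Features        Validation F1    Public Leaderboard F1
Bag-of-Words    0.53             0.567
TF-IDF          0.544            0.564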

 

6. What’s Next?

If you are interested in learning more techniques for Sentiment Analysis, we have a well laid out video course on NLP for you. This course is designed for people who are looking to get into the field of Natural Language Processing, and it provides everything you need to know to become an NLP practitioner.

Key topics covered in the course:

  • Extracting named entities from the text
  • Topic Modelling
  • Feature engineering for text
  • Text classification
  • Deep Learning for NLP
  • 3 real life projects

 

End Notes

In this article, we learned how to approach a sentiment analysis problem. We started with preprocessing and exploration of data. Then we extracted features from the cleaned text using Bag-of-Words and TF-IDF. Finally, we were able to build a couple of models using both the feature sets to classify the tweets.

Did you find this article useful? Do you have any useful trick? Did you use any other method for feature extraction? Feel free to discuss your experiences in comments below or on the discussion portal and we’ll be more than happy to discuss.

Full Code: https://github.com/prateekjoshi565/twitter_sentiment_analysis/blob/master/code_sentiment_analysis.ipynb

About the Author

Prateek Joshi

Data Scientist at Analytics Vidhya with multidisciplinary academic background. Experienced in machine learning, NLP, graphs & networks. Passionate about learning and applying data science to solve real world problems.


58 thoughts on "Comprehensive Hands on Guide to Twitter Sentiment Analysis with dataset and code"

tom mcgrory
tom mcgrory says: July 31, 2018 at 2:44 am
Thanks you for your work on the twitter sentiment in the article is, there any way to get the article in PDF format? I am new to NLTP / NLTK and would like to work through the article as I look at my own dataset but it is difficult scrolling back and forth as I work. Reply
Nicholas Kemp
Nicholas Kemp says: July 31, 2018 at 7:49 am
Hello I can't seem to find the data Reply
Prateek Joshi
Prateek Joshi says: July 31, 2018 at 10:41 am
Hi Nicholas, You can download the datasets from here. Reply
Ateeque Shaikh
Ateeque Shaikh says: July 31, 2018 at 11:54 am
Can you post R code as well Reply
T G Harsha Vardhan
T G Harsha Vardhan says: August 03, 2018 at 9:52 am
Hi,Good article.How the raw tweets are given a sentiment(Target variable) and made it into a supervised learning.Is it done by polarity algorithms(text blob)??..In twitter analysis,how the target variable(sentiment) is mapped to incoming tweet is more crucial than classification. Isn't it?? Reply
Prateek Joshi
Prateek Joshi says: August 03, 2018 at 11:40 am
Thanks for appreciating. The raw tweets were labeled manually. Reply
NAGA PRUDHVI
NAGA PRUDHVI says: August 03, 2018 at 7:21 pm
Hi this was good explination. But how can our model or system knows which are happy words and which are racist/sexist words. Reply
Lilya
Lilya says: August 06, 2018 at 7:44 am
Hi, Thank you for your kind information, but I have one question that in this part, you just analyze the sentiment of single rather than the whole sentence, so some bad circumstance may happen such as racialism with negative word, this may generate the opposite meaning. Reply
Prateek Joshi
Prateek Joshi says: August 09, 2018 at 5:31 pm
Hi Lilya, I am not considering sentiment of a single word, but the entire tweet. For example, word2vec features for a single tweet have been generated by taking average of the word2vec vectors of the individual words in that tweet. Reply
Prateek Joshi
Prateek Joshi says: August 09, 2018 at 5:37 pm
Hi, Glad you liked it. I guess you are referring to the wordclouds generated for positive and negative sentiments. Please note that I have used train dataset for ploting these wordclouds wherein the data is labeled. Reply
Mayank
Mayank says: August 09, 2018 at 5:54 pm
Importing module nltk.tokenize.moses is raising ModuleNotFound error. Also, it doesn't seems to be there in NLTK3.3. Can anybody confirm? Reply
Prateek Joshi
Prateek Joshi says: August 09, 2018 at 6:54 pm
Thanks Mayank for pointing it out. I have updated the code. Reply
Jash
Jash says: August 27, 2018 at 11:17 am
Still, I cannot find the data file. Please help. Reply
Aishwarya Singh
Aishwarya Singh says: August 27, 2018 at 12:12 pm
Hi Jash, Prateek has provided the link to the practice problem on datahack. Please register in the competition using the link provided. Once you do that, you will be able to download the dataset (train, test and submission files will be available after the problem statement at the bottom of the page). If you still face any issue, please let us know. Reply
Prateek Gupta
Prateek Gupta says: August 28, 2018 at 3:08 pm
Hi Prateek, I am getting NameError: name 'train' is not defined in this line- xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], random_state=42, test_size=0.3) I think you missed to mention how you separated and store the target variable. Reply
Prateek Joshi
Prateek Joshi says: August 29, 2018 at 4:39 pm
Hi, I am not getting this error. Make sure you have not missed any code. Thanks Reply
Ravinder Ahuja
Ravinder Ahuja says: September 20, 2018 at 1:15 pm
Dear ITS NICE ARTICLE WITH GOOD EXPLANATION BUT I AM GETTING ERROR: ValueError: empty vocabulary; perhaps the documents only contain stop words. PLEASE HELP ME TO RESOLVE THIS. Thanks Reply
Prateek Joshi
Prateek Joshi says: September 20, 2018 at 2:29 pm
Hi Ravinder, Which part of the code is giving you this error? Regards, Prateek Reply
Ravinder
Ravinder says: September 21, 2018 at 1:03 am
resolved....Thanks Reply
Ravinder Ahuja
Ravinder Ahuja says: September 21, 2018 at 1:20 am
Hi not able to print word cloud showing error ValueError: We need at least 1 word to plot a word cloud, got 0. Reply
Dilip Kumar
Dilip Kumar says: September 22, 2018 at 10:56 pm
very nice explaination sir,this is really helpful sir Reply
Muhammad Younus
Muhammad Younus says: September 23, 2018 at 11:14 am
Best article, you explain everything very nicely,Thanks Reply
Gaurav Rai
Gaurav Rai says: October 05, 2018 at 7:53 pm
Hi Prateek, I am getting error for the sttiching together of tokens section: for i in range(len(tokenized_tweet)): tokenized_tweet[i] = ' '.join(tokenized_tweet[i]) combi['tidy_tweet'] = tokenized_tweet I indented the code in the loop but still i am getting below error: for i in range(len(tokenized_tweet)): tokenized_tweet[i] = ' '.join(tokenized_tweet[i]) combi['tidy_tweet'] = tokenized_tweet Reply
Gaurav Rai
Gaurav Rai says: October 05, 2018 at 9:57 pm
Hi, For my previous comment i tried this and it worked: for i in range(len(tokenized_tweet)): s = "" for j in tokenized_tweet.iloc[i]: s += ''.join(j)+' ' tokenized_tweet.iloc[i] = s.rstrip() Thanks for your time on this Reply
tejeshwari
tejeshwari says: October 06, 2018 at 11:46 am
Hi, I am registered on https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/#data_dictionary, but still unable to download the twitter dataset. Reply
Prateek Joshi
Prateek Joshi says: October 06, 2018 at 5:15 pm
Great! Reply
Prateek Joshi
Prateek Joshi says: October 06, 2018 at 5:23 pm
Hi Tejeshwari, you can find the download links just above the solution checker at the contest page. Reply
Suriya
Suriya says: October 06, 2018 at 6:03 pm
can u send me the full code please. Reply
Aishwarya Singh
Aishwarya Singh says: October 08, 2018 at 10:45 am
The code is present in the article itself Reply
humera hassan
humera hassan says: October 21, 2018 at 10:43 pm
for i in range(len(tokenized_tweet)): s = “” for j in tokenized_tweet.iloc[i]: s += ”.join(j)+’ ‘ tokenized_tweet.iloc[i] = s.rstrip() i am getting error for this code as : File "", line 2 s = “” ^ IndentationError: expected an indented block Reply
Prateek Joshi
Prateek Joshi says: October 22, 2018 at 2:47 pm
Hi, you have to indent after `for j in tokenized_tweet.iloc[i]:` Reply
Ziza
Ziza says: November 03, 2018 at 5:04 pm
Hi, In the beginning when you perform this step # remove twitter handles (@user) combi['tidy_tweet'] = np.vectorize(remove_pattern)(combi['tweet'], "@[\w]*") Do you need to convert combi['tweet'] pandas.Series to string or byte-like object? I couldn't pass in a pandas.Series without converting it first! Reply
Prateek Joshi
Prateek Joshi says: November 03, 2018 at 5:15 pm
Hi Ziza, The code is working fine at my end. I didn't convert combi[‘tweet’] to any other type. Reply
Ziza
Ziza says: November 04, 2018 at 11:49 am
Yeah, when I used your dataset everything worked just fine. I was actually trying that on another dataset, I guess I should pre-process those data. Thanks for your reply! Reply
Jingmiao Shen
Jingmiao Shen says: November 08, 2018 at 12:21 am
Great Great article!!! Reply
pablo
pablo says: November 19, 2018 at 6:34 pm
Hi, excellent job with this article. I have started to learn machine learning to implement it in my django projects and this helped so much. I just have one thing to add. The stemmer that you used is behaving weird, i.e. changing 'this' to 'thi'. I have checked in the official repository and it is a known issue. So my advice would be to change it to stemming. It can be installed from pip, and you just use it like: `from stemming.porter2 import stem` stem('this') 'this' After changing to that stemmer the wordcloud started to look more accurate. Thanks again for the article! Reply
Prateek Joshi
Prateek Joshi says: November 20, 2018 at 3:18 pm
Thanks Pablo for the feedback. Reply
Zarief Marzuki Lao
Zarief Marzuki Lao says: December 08, 2018 at 8:30 am
I was facing the same problem and was in a 'newbie-stuck' stage, where has all the s, i, e, y gone !!? Now I can proceed and continue to learn. This step by step tutorial is awesome. Many thanks to both Prateek & Pablo Reply
Vishwa Dadhania
Vishwa Dadhania says: December 31, 2018 at 1:39 pm
Hi, Even after logging in I am not finding any link to download the dataset anywhere on the page. Is it because the practice problem competition is already over? Thanks & Regards Reply
Sharik
Sharik says: January 02, 2019 at 4:30 pm
Hey, Prateek Even I am getting the same error. NameError: name 'train' is not defined And, even if you have a look at the code provided in the step 5 A) Building model using Bag-of-Words features. There is no variable declared as "train" it is either "train_bow" or "test_bow". So while splitting the data there is an error when the interpreter encounters "train['label']". Please look into it. Thanks :) Reply
Prateek Joshi
Prateek Joshi says: January 15, 2019 at 11:35 am
Thanks Jingmiao Reply
Prateek Joshi
Prateek Joshi says: January 15, 2019 at 11:40 am
Hi Sharik, I have read the train data in the beginning of the article. Please run the entire code. Reply
Eesha Chinchwadkar
Eesha Chinchwadkar says: January 22, 2019 at 5:30 pm
Hi Prateek, train_bow = bow[:31962, :] test_bow = bow[31962:, :] What is 31962 here? I am actually trying this on a different dataset to classify tweets into 4 affect categories. The length of my training set is 3960 and that of testing set is 3142. Reply
Prateek Joshi
Prateek Joshi says: January 22, 2019 at 5:57 pm
Hi Eesha, Here 31962 is the size of the training set. You may use 3960 instead. Regards Reply
Nidhi Sandilya
Nidhi Sandilya says: February 04, 2019 at 7:41 pm
Hi Prateek, This is wonderfully written and carefully explained article, it is a very good read. Thank you for penning this down. Reply
Prateek Joshi
Prateek Joshi says: February 06, 2019 at 11:24 am
Thanks Nidhi :-) Reply
Anant Vignesh
Anant Vignesh says: February 11, 2019 at 2:21 am
Hi Prateek, I just wanted to know where are you getting the label values? Where are you calculating it? Because if you are scrapping the tweets from twitter it does not come with that field. So how are you determining whether it is a positive or a negative tweet? Reply
Prateek Joshi
Prateek Joshi says: February 11, 2019 at 10:53 am
Hi Anant, This dataset was manually labeled. Reply
Caroline
Caroline says: February 12, 2019 at 4:01 pm
Sir ..This was a good article i've gone through....Could you please share me the entire code so that i could use it as reference for my project..... Reply
Rathna Priya
Rathna Priya says: February 12, 2019 at 4:05 pm
Can you share your full working code with all the datasets needed Reply
Prateek Joshi
Prateek Joshi says: February 12, 2019 at 4:34 pm
I have already shared the link to the full code at the end of the article. Please check. Reply
Prateek Joshi
Prateek Joshi says: February 12, 2019 at 4:37 pm
Hi Caroline, I have already shared the link to the full code at the end of the article. Please check. Reply
Prateek Joshi
Prateek Joshi says: February 19, 2019 at 1:44 pm
Hi Tom, The entire code has been shared in the end. Feel free to use it. Regards, Prateek Joshi Reply
Sk
Sk says: March 03, 2019 at 9:03 am
Beautiful article with great explanation! Thank you for your effort. Reply
Unmesh kadam
Unmesh kadam says: March 03, 2019 at 3:13 pm
Sir this is wonderful article, excellent work. Can we increase the F1 score?..plz suggest some method Reply
Prateek Joshi
Prateek Joshi says: March 03, 2019 at 5:22 pm
I am glad you liked it. Reply
Gnanaprakash
Gnanaprakash says: March 03, 2019 at 5:40 pm
WOW!!! Such a great article.. can you tell me how to categorize health related tweets like fever,malaria,dengue etc. instead of hate speech Reply
Prateek Joshi
Prateek Joshi says: March 03, 2019 at 7:12 pm
Hi, You have to arrange health-related tweets first on which you can train a text classification model. That model would then be useful for your use case. Thanks and regards, Prateek Reply
