Using Data Science to Identify Top Twitter Influencers

Deepanshi Bajpai 21 Jul, 2023 • 11 min read


Influencer marketing on Twitter can deliver real benefits for businesses. In this article, we’ll explore how data science and Python can be used to find the top Twitter influencers. This approach helps businesses make informed choices about whom to partner with. By applying data-driven methods and Python’s libraries, businesses can identify influencers who are likely to bring meaningful brand exposure and engagement.

The article covers a range of influencer marketing topics, including the factors for selecting influencers, collecting and organizing Twitter data, analyzing data using data science techniques, and utilizing machine learning algorithms to assess and rank influencers.


Learning Objectives

The article aims to help readers achieve specific learning objectives. By the end of this piece, readers will:

  1. Grasp the significance of influencer marketing on Twitter and how it benefits businesses.
  2. Acquire knowledge about using data science and Python to find suitable influencers.
  3. Learn the factors and aspects to consider when identifying influencers on Twitter.
  4. Discover techniques to collect and organize Twitter data using Python and related tools.
  5. Develop skills in analyzing Twitter data using data science techniques and Python libraries like Pandas.
  6. Explore the usage of machine learning algorithms for influencer identification and ranking.
  7. Master the art of assessing influencers based on relevant metrics and qualitative factors.
  8. Understand the limitations and challenges tied to identifying influencers on Twitter.
  9. Gain insights from real-world influencer marketing case studies and learn key lessons.
  10. Apply the acquired knowledge and skills to identify the best influencers for their own business on Twitter using Python.

This article was published as a part of the Data Science Blogathon.

Project Description

The objective of the project is to empower readers with the skills and knowledge required to navigate the intricate domain of influencer marketing on Twitter. We will delve into several components, such as establishing the selection criteria for influencers, gathering and preparing pertinent Twitter data, analyzing the data using data science techniques, and employing machine learning algorithms to assess and rank influencers. The systematic approach provided in this article will equip readers with valuable insights and practical strategies to streamline their marketing endeavours.

Through this article, readers will come to understand the influencer identification process and its role in amplifying brand visibility and engagement on Twitter. By the end of the project, readers should be able to apply what they have learned to their own businesses, improving their marketing tactics and connecting with their desired audience through influential figures on Twitter.

Problem Statement


Identifying relevant and impactful influencers for businesses on Twitter can be a complex problem. Businesses often struggle to find the right influencers due to the overwhelming amount of data and the ever-changing social media landscape. It becomes even more challenging to identify influencers with genuine engagement.

Businesses face obstacles when manually sifting through large volumes of Twitter data to find influencers who align with their target audience and brand values. Determining the authenticity and influence of influencers can be a subjective and time-consuming task. These challenges often result in missed opportunities and ineffective partnerships, wasting resources and compromising marketing strategies.

Thankfully, data science techniques provide a solution. By using data-driven approaches, businesses can analyze extensive datasets and extract valuable insights to identify influencers based on important metrics like follower count, engagement rate, and topic relevance. Machine learning algorithms further simplify the process by automating influencer evaluation and ranking.

Adopting data science techniques enables businesses to overcome the challenges of finding relevant and impactful influencers on Twitter. This empowers them to make informed choices, optimize their marketing efforts, and collaborate with influencers who can genuinely enhance brand exposure and foster authentic engagement.

Understanding Influencer Marketing

Gaining a clear understanding of influencer marketing is vital in the modern digital landscape. Influencer marketing involves collaborating with people who have a large following and a strong influence on their audience. These influencers assist businesses in promoting their products or services on Twitter, leading to increased brand awareness, engagement, and sales.

The significance of influencer marketing lies in the concept of social proof. When consumers witness influencers endorsing a product or sharing their experiences, it builds trust and reliability. Influencers have amassed a devoted and engaged following, providing businesses with access to a specific group of people.

Employing influencers on Twitter offers several benefits. Firstly, it enables businesses to leverage the existing audience of influencers, saving the time and energy required to build their own following. Secondly, influencers possess a deep understanding of their audience’s preferences, allowing them to create content that resonates well and boosts the chances of successful promotion. Lastly, influencers can offer genuine and relatable recommendations that heavily impact consumers’ purchasing decisions.

Selecting the appropriate influencers is pivotal for businesses to maximize the impact of influencer marketing. By choosing influencers who share the brand’s values, businesses can ensure authenticity and establish a strong connection with the intended audience. Moreover, considering factors like reach, engagement, and relevance to the industry or niche helps businesses find influencers who can effectively convey the brand’s message and generate favourable outcomes.

The right influencers possess the ability to expand a business’s reach, enhance brand visibility, and foster customer engagement. Having a solid comprehension of influencer marketing and capitalizing on the influence of influencers on Twitter can prove transformative for businesses aiming to grow their online presence and connect with their desired audience.

Defining the Criteria for Identifying Influencers

Let’s imagine a scenario with Editech, a provider of professional academic writing services that has been serving clients across India for several years. Their services include crafting statements of purpose, letters of recommendation, and academic essays, building resumes, and providing writing consultation. Now they’re searching for an influencer to boost their brand on Twitter. Identifying the right influencer involves several considerations.



The first point to ponder is the influencer’s relevance. The influencer’s content should resonate with what Editech offers. For example, an influencer who often talks about academic writing or overseas education from India would be a suitable match.


Engagement is another important factor. An influencer with a high level of engagement suggests that their followers are actively participating in their content. High levels of likes, comments, and retweets indicate that the influencer’s audience pays attention and reacts, making their endorsement more impactful. Editech should seek influencers with an engagement rate of at least 1-3% to ensure that the influencer can spark interest and dialogue among their followers.
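As a rough illustration of the engagement threshold, the sketch below computes an engagement rate as retweets plus likes divided by follower count, which matches the formula used in the data-collection code later in this article. The function name and the sample numbers are hypothetical, and replies are left out because the snippet in this article does not collect them.

```python
def engagement_rate(retweets: int, likes: int, followers: int) -> float:
    """Return (retweets + likes) / followers, or 0.0 for accounts with no followers."""
    if followers <= 0:
        return 0.0
    return (retweets + likes) / followers


# Hypothetical tweet: 120 retweets and 480 likes from an account with 20,000 followers
rate = engagement_rate(120, 480, 20_000)
print(f"{rate:.1%}")  # 3.0%
```

An account with these numbers would sit right at the upper end of the 1-3% range mentioned above.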


The reach of the influencer’s audience also matters. Editech should aim for influencers with a substantial following to expand the reach and exposure of their brand. The influencer’s follower count can predict the potential exposure of Editech’s services. However, it is essential to strike a balance. Micro-influencers with a smaller following but a highly engaged audience can also be valuable, particularly in specific markets. For our purposes, a reasonable benchmark would be influencers with at least 10,000 followers.


Authenticity plays a significant role in selecting influencers. Editech should prioritize influencers who genuinely believe in their services and can present authentic endorsements. This would help to establish trust and credibility among their audience, increasing the chances of conversions. This can be assessed through the influencer’s previous endorsements and personal branding.

The factors of relevance, engagement, reach, and authenticity significantly contribute to the success of a marketing campaign. By selecting influencers who are relevant to Editech’s industry, have an engaged audience, possess a wide reach, and maintain authenticity, Editech enhances the chances of capturing their target audience’s attention, increasing brand awareness, and ultimately converting potential customers.
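The three measurable criteria (relevance, engagement, and reach) can be turned into a simple screening rule; authenticity still needs a human review. The sketch below is one possible way to express that rule. The `Candidate` class, the sample accounts, and the choice of 3% as the engagement cutoff are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    username: str
    followers: int
    engagement_rate: float  # e.g. 0.03 means 3%
    sample_text: str        # a representative tweet


# Keywords describing Editech's niche, matched case-insensitively
KEYWORDS = ["statement of purpose", "academic writing", "university admission"]
KEYWORD_PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def passes_screen(c: Candidate, min_followers: int = 10_000,
                  min_engagement: float = 0.03) -> bool:
    """True when the candidate meets the reach, engagement, and relevance checks."""
    return (
        c.followers >= min_followers
        and c.engagement_rate >= min_engagement
        and KEYWORD_PATTERN.search(c.sample_text) is not None
    )


# Hypothetical candidates
candidates = [
    Candidate("writer_a", 25_000, 0.04, "Tips for your Statement of Purpose"),
    Candidate("writer_b", 50_000, 0.005, "Academic writing advice"),
    Candidate("foodie_c", 80_000, 0.06, "Best street food in Delhi"),
]
shortlist = [c.username for c in candidates if passes_screen(c)]
print(shortlist)  # ['writer_a']
```

Here `writer_b` fails on engagement and `foodie_c` fails on relevance, even though both have a larger following than `writer_a`.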

Gathering & Preparing Twitter Data

Gathering and preparing Twitter data is a crucial step in the identification of influencers for your business. The Twitter API serves as a vital tool for collecting the data necessary for influencer identification.

The Twitter API enables developers to access and retrieve data from Twitter’s extensive database. To access Twitter data using the API, it is necessary to go through an authentication process. This process entails creating a Twitter Developer account, generating an application, and acquiring the requisite access tokens and API keys. These tokens and keys are essential for establishing a secure connection and obtaining permission to access Twitter data.
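Because these keys and tokens grant access to your developer account, it is usually safer to keep them out of source code. The sketch below shows one way to read them from environment variables and fail early if any are missing. The variable names are hypothetical; use whatever names your deployment defines.

```python
import os

# Hypothetical environment variable names for the four credentials
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, raising if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


# Demonstration with a stand-in mapping instead of the real environment
fake_env = {key: "placeholder" for key in REQUIRED_KEYS}
creds = load_credentials(fake_env)
print(sorted(creds))
```

In a real script you would call `load_credentials()` with no argument and pass the resulting values to the authentication step.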

Python provides several libraries that facilitate working with the Twitter API. One popular library is Tweepy. Tweepy simplifies the process of interacting with the Twitter API by handling authentication and providing convenient methods to retrieve data.

To start using Tweepy, install the library with pip, Python’s package manager. Here’s an example Python code snippet demonstrating how to authenticate and retrieve data using Tweepy:

import tweepy
import pandas as pd

# Set up your Twitter API credentials
consumer_key = "your_consumer_key"
consumer_secret = "your_consumer_secret"
access_token = "your_access_token"
access_token_secret = "your_access_token_secret"

# Authenticate with Twitter API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Create an API object
api = tweepy.API(auth)

# Search for influencers talking about statement 
# of purpose or academic writing
query = "statement of purpose OR academic writing"
influencers = []

# Iterate through search results.
# NOTE: the search method and the item cap were lost from the original
# snippet. api.search_tweets (Tweepy v4; api.search in v3) and items(100)
# are assumptions.
for tweet in tweepy.Cursor(api.search_tweets, q=query,
                           tweet_mode='extended').items(100):
    # A retweet keeps its complete text on retweeted_status
    if hasattr(tweet, 'retweeted_status'):
        text = tweet.retweeted_status.full_text
    else:
        text = tweet.full_text
    influencers.append({
        'username': tweet.user.screen_name,
        'text': text,
        'created_at': tweet.created_at,
        'retweet_count': tweet.retweet_count,
        'favorite_count': tweet.favorite_count
    })

# Create a DataFrame from the influencer data
influencer_df = pd.DataFrame(influencers)

# Calculate the follower count and engagement rate
influencer_df['follower_count'] = influencer_df['username'].apply(
    lambda username: api.get_user(screen_name=username).followers_count)
influencer_df['engagement_rate'] = (
    (influencer_df['retweet_count'] + influencer_df['favorite_count'])
    / influencer_df['follower_count'])

# Filter influencers based on reach, 
# engagement rate, and topic relevance
min_follower_count = 10000
min_engagement_rate = 0.03
relevant_keywords = ['statement of purpose', 
'academic writing', 'university admission']

# .copy() lets us add columns to the filtered frame later without warnings
filtered_influencers = influencer_df[
    (influencer_df['follower_count'] >= min_follower_count) &
    (influencer_df['engagement_rate'] >= min_engagement_rate) &
    (influencer_df['text'].str.contains('|'.join(relevant_keywords), case=False))
].copy()

# Display the filtered influencers
print(filtered_influencers)
After authenticating, we use the Twitter API’s search functionality to find influencers who are talking about the statement of purpose or academic writing. The query variable holds the search query with the desired keywords. We create an empty list called influencers to store the extracted influencer data. We use a for loop with tweepy.Cursor to iterate through the search results. The parameter tweet_mode='extended' ensures that we retrieve the full text of tweets, including any extended content.

If a tweet is a retweet, we access the full text using retweeted_status.full_text. Otherwise, we access the full text directly with tweet.full_text. We then append the username, text, creation time, retweet count, and favorite count of each tweet to the influencers list as a dictionary.
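The retweet check matters because, in extended mode, a retweet’s own full_text can still be cut off with an ellipsis, while the complete text lives on the nested retweeted_status object. The helper below isolates that logic so it can be tried without API access. The function name and the stand-in objects built with SimpleNamespace are hypothetical; real code would receive Tweepy status objects instead.

```python
from types import SimpleNamespace


def full_text_of(tweet):
    """Return the original tweet's text for a retweet, else the tweet's own text."""
    original = getattr(tweet, "retweeted_status", None)
    if original is not None:
        return original.full_text
    return tweet.full_text


# Stand-ins that mimic the two shapes a status object can take
plain = SimpleNamespace(full_text="How to structure an academic essay")
retweet = SimpleNamespace(
    full_text="RT @someone: How to write a statement of pur…",
    retweeted_status=SimpleNamespace(
        full_text="How to write a statement of purpose that stands out"
    ),
)

print(full_text_of(plain))
print(full_text_of(retweet))
```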

Analyzing Twitter Data

To enhance the analysis of the filtered influencers, we will perform topic analysis, sentiment analysis, and influence scoring. These steps help us gain deeper insights into the influencers’ characteristics and assess their potential impact.

For topic analysis, we examine the text of each tweet in the filtered influencers’ dataset. Using the TextBlob library, we extract part-of-speech tags, which give a rough picture of the words and subjects each tweet discusses. These tags help us categorize and analyze the content of the tweets. We then add the extracted tags to the ‘topics’ column in the filtered influencers’ dataset.

Next, we focus on sentiment analysis. Leveraging the TextBlob library, we analyze the sentiment expressed in the text of each tweet. This process assigns a sentiment polarity score, indicating whether the sentiment is positive, negative, or neutral. These sentiment scores offer valuable insights into the influencers’ overall sentiment towards the subject matter. We store the sentiment polarity scores in the ‘sentiment’ column of the filtered influencers’ dataset.
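TextBlob reports polarity as a number between -1 and 1, so turning it into a positive, negative, or neutral label requires choosing cutoffs. The sketch below shows one way to do that. The function name and the width of the neutral band (0.05) are assumptions, not values from TextBlob or from this article.

```python
def label_polarity(polarity: float, neutral_band: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to 'positive', 'negative', or 'neutral'."""
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# Hypothetical polarity scores
for score in (0.6, 0.0, -0.4):
    print(score, label_polarity(score))
```

A wider neutral band treats more mildly worded tweets as neutral; the right width depends on how noisy the scores are for your data.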

Influence scoring is a critical aspect of the analysis. To quantify the influencers’ impact, we use scikit-learn’s MinMaxScaler to normalize the ‘follower_count’, ‘engagement_rate’, and ‘sentiment’ columns to a common 0-to-1 range, so that each feature contributes proportionally to the overall influence score. By averaging the normalized values across these columns, we calculate an influence score for each influencer. These influence scores are stored in the ‘influence_score’ column of the filtered influencers’ dataset.

Finally, we have the dataset of filtered influencers, highlighting the outcomes of the additional analysis.

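from textblob import TextBlob
from sklearn.preprocessing import MinMaxScaler

# Perform topic analysis: collect part-of-speech tags for each tweet
# (the append lines are reconstructed from the description above)
topics = []
for tweet in filtered_influencers['text']:
    blob = TextBlob(tweet)
    topics.append(blob.tags)
filtered_influencers['topics'] = topics

# Perform sentiment analysis: polarity ranges from -1 to 1
sentiments = []
for tweet in filtered_influencers['text']:
    blob = TextBlob(tweet)
    sentiments.append(blob.sentiment.polarity)
filtered_influencers['sentiment'] = sentiments

# Perform influence scoring: normalize each column, then average per row
scaler = MinMaxScaler()
filtered_influencers['influence_score'] = scaler.fit_transform(
    filtered_influencers[['follower_count', 'engagement_rate', 'sentiment']]
).mean(axis=1)

# Display the filtered influencers with the additional analysis
print(filtered_influencers)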
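To make the scoring step concrete, here is a small worked example of min-max normalization followed by averaging, written in plain Python so the arithmetic is visible. The helper name and the three sample influencers are hypothetical. A column whose values are all equal is mapped to zeros here, which is an assumption chosen to avoid dividing by zero.

```python
def min_max_scale(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical data for three influencers
followers = [10_000, 40_000, 100_000]
engagement = [0.03, 0.05, 0.04]
sentiment = [0.2, 0.5, -0.1]

# Normalize each column, then average across columns for each influencer
scaled = [min_max_scale(col) for col in (followers, engagement, sentiment)]
scores = [sum(row) / len(row) for row in zip(*scaled)]
print([round(s, 3) for s in scores])  # [0.167, 0.778, 0.5]
```

The second influencer scores highest despite a mid-sized following, because they lead on both engagement and sentiment.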

Applying Machine Learning Algorithms

To determine the top 3 influencers from the given dataset, we can utilize machine learning techniques. By creating a predictive model that takes into account various factors such as follower count, engagement rate, sentiment, and other relevant information, we can generate scores that quantify the influence of each influencer. These scores can then be used to rank the influencers and identify the top performers.

In order to achieve this, we will employ a machine learning algorithm known as linear regression. This algorithm will be trained on the available dataset, with the influencer’s influence score serving as the target variable. The features, including follower count, engagement rate, sentiment, and other relevant attributes, will be used as inputs to the model.

Training the Model

After training the model, we can utilize it to predict the influence scores for all the influencers in the dataset. These predicted scores will then be used to rank the influencers in descending order, with the highest predicted scores representing the most influential individuals.

To implement this approach, we will first split the dataset into training and testing sets. The training set will be used to train the linear regression model, while the testing set will be utilized to evaluate the model’s performance. We can calculate metrics such as mean squared error (MSE) and R-squared to assess the accuracy of the model.

Finally, we can generate the top 3 influencers by selecting the influencers with the highest predicted influence scores. These individuals are expected to have the most significant impact and are likely to be the most effective choices for collaborations.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Split the dataset into features (X) and target variable (y)
X = filtered_influencers[['follower_count', 'engagement_rate', 'sentiment']]
y = filtered_influencers['influence_score']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}, R-squared: {r2:.4f}")

# Rank the influencers based on the predicted influence scores
filtered_influencers['predicted_score'] = model.predict(X)
top_influencers = filtered_influencers.nlargest(3, 'predicted_score')

# Display the top influencers
print(top_influencers)

In this code, we split the dataset into features (follower count, engagement rate, sentiment) and the target variable (influence score). The dataset is further divided into training and testing sets. We then create a linear regression model and train it using the training data. The model is used to make predictions on the testing data, and metrics such as mean squared error (MSE) and R-squared are calculated to evaluate the model’s performance. Next, we apply the trained model to the entire dataset and predict the influence scores for each influencer. Finally, we select the top 3 influencers with the highest predicted influence scores using the nlargest() function and display the results.
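For readers who want to see what the two evaluation metrics measure, the sketch below computes them by hand in plain Python. The function names and sample values are hypothetical, and the R-squared helper assumes the true values are not all identical.

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_by_hand(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical true and predicted influence scores
y_true = [0.2, 0.5, 0.8]
y_pred = [0.25, 0.45, 0.8]

print(round(mse_by_hand(y_true, y_pred), 4))
print(round(r2_by_hand(y_true, y_pred), 4))
```

One point to keep in mind when reading these metrics for this project: the target here is itself an average of rescaled features, so a linear model can reproduce it almost exactly, and a near-perfect R-squared would be expected rather than evidence of predictive power on new data.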


Limitations

Understanding the limitations of the methods and techniques discussed in this article is crucial for readers planning to apply these approaches to their own projects. Being aware of these limitations helps manage expectations and overcome potential challenges that may arise during the implementation process.

  1. One significant limitation is related to data availability and quality. The effectiveness of influencer identification relies heavily on the data collected from Twitter. However, limitations may arise due to factors like rate limits or restrictions imposed by Twitter’s API. Additionally, the accuracy and reliability of the collected data can be influenced by the presence of spam accounts or inaccurate user information.
  2. Another limitation pertains to the selection of relevant keywords and criteria for filtering influencers. Defining the optimal thresholds for criteria like follower count, engagement rate, and topic relevance can be subjective and context-dependent. Different businesses may have diverse requirements and objectives, making it challenging to find the right balance.
  3. Furthermore, the methods employed for topic analysis and sentiment analysis, which rely on natural language processing techniques, have inherent limitations. Automated methods may not capture all nuances and complexities of language, including contextual understanding, sarcasm, and cultural references.
  4. The machine learning model used for influence scoring and ranking influencers has its own set of limitations. The model’s performance heavily relies on the quality and representativeness of the training data. Biases present in the data, such as demographic or sampling biases, can impact the model’s predictions and lead to biased rankings. Careful curation and preprocessing of the training data are necessary to mitigate such biases.
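The rate-limit issue in the first point is often handled by retrying with increasing delays. The sketch below shows the general pattern in plain Python. The `RateLimitError` class and `flaky_lookup` function are hypothetical stand-ins for a client library’s exception and an API call. Tweepy’s API class also accepts a `wait_on_rate_limit` flag, which may be enough for simple scripts.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a client library's rate-limit exception."""


def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func, doubling the wait after each rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)


# Demonstration: a fake lookup that is rate-limited on its first two calls
calls = {"count": 0}


def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("slow down")
    return {"followers_count": 12_345}


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_lookup, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the helper easy to test, since the demonstration records the delays rather than pausing.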


Conclusion

In conclusion, this article has discussed the process of identifying suitable influencers for businesses on Twitter using Python and data science techniques. By leveraging the Twitter API, data preprocessing, topic analysis, sentiment analysis, and machine learning algorithms, businesses can improve their influencer marketing strategies and make informed decisions.

Key Takeaways

Some of the key learnings from this project include:

  1. An understanding of Twitter’s developer API and how it can be used to collect tweets, user profiles, and engagement metrics.
  2. Exposure to Python libraries like Tweepy, Pandas, and TextBlob, which enable efficient collection, preprocessing, and analysis of Twitter data.
  3. How to perform topic analysis, which helps categorize and analyze the content of influencers’ tweets, offering insights into their areas of expertise.
  4. How to perform sentiment analysis, which allows businesses to gauge influencers’ sentiment towards specific subjects and check compatibility with brand values.
  5. How to use machine learning algorithms, such as linear regression, to score and rank influencers based on factors like follower count, engagement rate, and sentiment.

By employing Python and data science techniques, businesses can optimize their influencer marketing, increase brand exposure, encourage authentic engagement, and drive business growth on Twitter.

Frequently Asked Questions

Q1. How can I use Twitter’s API in Python to gather data for influencer identification?

A. Python’s Tweepy library offers functionalities for connecting to Twitter’s API and retrieving relevant data. Tweepy simplifies the authentication process and provides methods for collecting tweets, user profiles, and engagement metrics required for influencer identification.

Q2. What data science techniques are useful for identifying influencers on Twitter?

A. Data science techniques like topic analysis and sentiment analysis can be applied. Topic analysis helps categorize and understand influencers’ tweet content, while sentiment analysis gauges their sentiment towards specific subjects, ensuring alignment with brand values and target audience.

Q3. How can data science help determine an influencer’s relevance and influence?

A. Analyzing factors such as follower count, engagement rate, sentiment, and topic relevance can provide insights into an influencer’s relevance and influence. Machine learning algorithms can be employed to score and rank influencers based on these factors, aiding in the identification of influential individuals.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
