Welcome to our comprehensive data analysis blog that delves deep into the world of Netflix. As one of the leading streaming platforms globally, Netflix has revolutionized how we consume entertainment. With its vast library of movies and TV shows, it offers an abundance of choices for viewers around the world.
Netflix has experienced remarkable growth and expanded its presence to become a dominant force in the streaming industry.
In this blog, we embark on an exciting journey to explore the intriguing patterns, trends, and insights hidden within Netflix’s content landscape. Leveraging the power of Python and its data analysis libraries, we dive into the vast collection of Netflix’s offerings to uncover valuable information that sheds light on content additions, duration distributions, genre correlations, and even the most commonly used words in titles and descriptions.
Through detailed code snippets and visualizations, we peel back the layers of Netflix’s content ecosystem to provide a fresh perspective on how the platform has evolved. By analyzing release patterns, seasonal trends, and audience preferences, we aim to better understand the content dynamics within Netflix’s vast universe.
This article was published as a part of the Data Science Blogathon.
The data used in this case study is sourced from Kaggle, a popular platform for data science and machine learning enthusiasts. The dataset, titled “Netflix Movies and TV Shows,” is publicly available on Kaggle and provides valuable information about the movies and TV shows on the Netflix streaming platform.
The dataset is tabular, with columns that describe different aspects of each movie or TV show. The following table summarizes the columns and their descriptions:
Column Name | Description |
---|---|
show_id | Unique ID for every Movie / TV Show |
type | Identifier – A Movie or TV Show |
title | Title of the Movie / TV Show |
director | Director of the Movie |
cast | Actors involved in the Movie / Show |
country | Country where the Movie / Show was produced |
date_added | Date it was added on Netflix |
release_year | Actual Release Year of the Movie / Show |
rating | TV Rating of the Movie / Show |
duration | Total Duration – in minutes or number of seasons |
In this section, we will perform data preparation tasks on the Netflix dataset to ensure its cleanliness and suitability for analysis. We will handle missing values and duplicates and perform data type conversions as needed. Let’s dive into the code and explore each step.
To begin, we import the necessary libraries for data analysis and visualization: pandas, numpy, matplotlib.pyplot, and seaborn. They provide essential functions and tools to manipulate and visualize the data effectively.
# Importing necessary libraries for data analysis and visualization
import pandas as pd # pandas for data manipulation and analysis
import numpy as np # numpy for numerical operations
import matplotlib.pyplot as plt # matplotlib for data visualization
import seaborn as sns # seaborn for enhanced data visualization
Next, we load the Netflix dataset using the pd.read_csv() function. The dataset is stored in the ‘netflix.csv’ file. Let’s look at the first five records of the dataset to understand its structure.
# Loading the dataset from a CSV file
df = pd.read_csv('netflix.csv')
# Displaying the first few rows of the dataset
df.head()
It is crucial to understand the dataset’s overall characteristics through descriptive statistics. We can gain insights into the numerical attributes such as count, mean, standard deviation, minimum, maximum, and quartiles.
# Computing descriptive statistics for the dataset
df.describe()
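By default, describe() summarizes only numeric columns, and most columns in this dataset are text. As a minimal sketch on a tiny made-up frame (the values are illustrative and not taken from the real file), passing exclude="number" asks pandas to summarize the text columns instead:

```python
import pandas as pd

# Toy stand-in for the Netflix data (illustrative values only)
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "rating": ["TV-MA", "TV-14", "TV-MA", None],
    "release_year": [2019, 2020, 2018, 2021],
})

numeric_summary = toy.describe()                 # numeric columns only
text_summary = toy.describe(exclude="number")    # count, unique, top, freq per text column

print(numeric_summary)
print(text_summary)
```

The text summary reports, for each column, how many values are present, how many are distinct, and which value is most frequent.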
To get a concise summary of the dataset, we use the df.info() function. It provides information about the number of non-null values and the data types of each column. This summary helps identify missing values and potential issues with data types.
# Obtaining information about the dataset
df.info()
Missing values can hinder accurate analysis. We explore the missing values in each column using df.isnull().sum(). We aim to identify the columns with missing values and determine how much data is missing in each column.
# Checking for missing values in the dataset
df.isnull().sum()
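The call above returns raw counts. To also see the share of missing data per column, one possible sketch is shown here; the helper name missing_report and the toy data are our own assumptions for illustration:

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    percent = (frame.isnull().mean() * 100).round(2)
    report = pd.DataFrame({"missing": counts, "percent": percent})
    return report.sort_values("percent", ascending=False)

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "director": ["A", None, None, "B"],
    "country": ["India", "Japan", None, "Spain"],
    "title": ["w", "x", "y", "z"],
})
report = missing_report(toy)
print(report)
```

Sorting by percentage puts the columns that need the most attention at the top.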
To handle missing values, we employ a different strategy for each column, as the following steps show. First, though, we check for duplicate records.
Duplicates can distort analysis results, so it’s essential to address them. We count fully duplicated rows using df.duplicated().sum(); a non-zero result would call for removing them, for example with df.drop_duplicates().
# Checking for duplicate rows in the dataset
df.duplicated().sum()
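The check above only counts duplicates. If the count is non-zero, the rows still have to be dropped. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data with one fully duplicated row
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3"],
    "title": ["A", "B", "B", "C"],
})

n_duplicates = int(toy.duplicated().sum())   # later repeats of an identical row are flagged
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```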
For the ‘director’ and ‘cast’ columns, we replace missing values with ‘No Data’ to maintain data integrity and avoid any bias in the analysis.
# Replacing missing values in the 'director' column with 'No Data'
# (assigning the result back avoids relying on in-place changes to a column selection)
df['director'] = df['director'].fillna('No Data')
# Replacing missing values in the 'cast' column with 'No Data'
df['cast'] = df['cast'].fillna('No Data')
In the ‘country’ column, we fill in missing values with the mode (most frequently occurring value) to ensure consistency and minimize data loss.
# Filling missing values in the 'country' column with the mode value
df['country'] = df['country'].fillna(df['country'].mode()[0])
For the ‘rating’ column, we fill in missing values based on the ‘type’ of the show. We assign the mode of ‘rating’ for movies and TV shows separately.
# Finding the mode rating for movies and TV shows
movie_rating = df.loc[df['type'] == 'Movie', 'rating'].mode()[0]
tv_rating = df.loc[df['type'] == 'TV Show', 'rating'].mode()[0]
# Filling missing rating values based on the type of content
df['rating'] = df.apply(lambda x: movie_rating if x['type'] == 'Movie' and pd.isna(x['rating'])
                        else tv_rating if x['type'] == 'TV Show' and pd.isna(x['rating'])
                        else x['rating'], axis=1)
For the ‘duration’ column, we fill in missing values based on the ‘type’ of the show. We assign the mode of ‘duration’ for movies and TV shows separately.
# Finding the mode duration for movies and TV shows
movie_duration_mode = df.loc[df['type'] == 'Movie', 'duration'].mode()[0]
tv_duration_mode = df.loc[df['type'] == 'TV Show', 'duration'].mode()[0]
# Filling missing duration values based on the type of content
df['duration'] = df.apply(lambda x: movie_duration_mode if x['type'] == 'Movie'
                          and pd.isna(x['duration'])
                          else tv_duration_mode if x['type'] == 'TV Show'
                          and pd.isna(x['duration'])
                          else x['duration'], axis=1)
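The rating and duration steps share the same idea: fill a gap with the most frequent value among rows of the same type. As a hedged alternative to the row-wise apply, the sketch below computes the mode per group once and maps it back; the helper fill_with_group_mode and the toy values are our own assumptions:

```python
import pandas as pd

def fill_with_group_mode(frame, column, by):
    """Fill missing values in `column` with the most frequent value of its `by` group."""
    def group_mode(values):
        modes = values.mode()          # mode() ignores missing values
        return modes.iloc[0] if not modes.empty else None

    modes_per_group = frame.groupby(by)[column].agg(group_mode)
    return frame[column].fillna(frame[by].map(modes_per_group))

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "rating": ["TV-MA", "TV-MA", None, "TV-14", None, "TV-14"],
})
toy["rating"] = fill_with_group_mode(toy, "rating", "type")
print(toy)
```

The same helper could be reused for any other column that should be imputed per content type.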
After handling missing values in specific columns, we drop any remaining rows with missing values to ensure a clean dataset for analysis.
# Dropping rows with missing values
df.dropna(inplace=True)
We convert the ‘date_added’ column to datetime format using pd.to_datetime() to enable further analysis based on date-related attributes.
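# Converting the 'date_added' column to datetime format
# (stripping stray leading/trailing whitespace first so every value parses the same way)
df["date_added"] = pd.to_datetime(df['date_added'].str.strip())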
We extract additional attributes from the ‘date_added’ column to enhance our analysis capabilities. We extract the month and year values to analyze trends based on these temporal aspects.
# Extracting month, month name, and year from the 'date_added' column
df['month_added'] = df['date_added'].dt.month
df['month_name_added'] = df['date_added'].dt.month_name()
df['year_added'] = df['date_added'].dt.year
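Date columns in scraped datasets often contain stray whitespace or unparseable entries. The sketch below assumes the dates are written like "September 25, 2021" (an assumption about the file, not something stated above) and shows one defensive way to parse them:

```python
import pandas as pd

# Toy values: one clean date, one with a leading space, one invalid, one missing
raw_dates = pd.Series(["September 25, 2021", " August 4, 2017", "not a date", None])

# Strip whitespace first; errors="coerce" turns unparseable entries into NaT
parsed = pd.to_datetime(raw_dates.str.strip(), format="%B %d, %Y", errors="coerce")

years = parsed.dt.year
month_names = parsed.dt.month_name()
print(parsed)
```

Rows that end up as NaT can then be inspected or dropped explicitly instead of failing the whole conversion.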
To analyze categorical attributes more effectively, we transform them into separate dataframes, allowing for easier exploration and analysis.
For the ‘cast,’ ‘country,’ ‘listed_in,’ and ‘director’ columns, we split the values on the comma separator and create a separate row for each value. This transformation enables us to analyze the data at a more granular level.
# Splitting and expanding the 'cast' column
df_cast = df['cast'].str.split(',', expand=True).stack()
df_cast = df_cast.reset_index(level=1, drop=True).to_frame('cast')
df_cast['show_id'] = df['show_id']
# Splitting and expanding the 'country' column
df_country = df['country'].str.split(',', expand=True).stack()
df_country = df_country.reset_index(level=1, drop=True).to_frame('country')
df_country['show_id'] = df['show_id']
# Splitting and expanding the 'listed_in' column
df_listed_in = df['listed_in'].str.split(',', expand=True).stack()
df_listed_in = df_listed_in.reset_index(level=1, drop=True).to_frame('listed_in')
df_listed_in['show_id'] = df['show_id']
# Splitting and expanding the 'director' column
df_director = df['director'].str.split(',', expand=True).stack()
df_director = df_director.reset_index(level=1, drop=True).to_frame('director')
df_director['show_id'] = df['show_id']
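The four blocks above repeat the same pattern. A possible alternative, sketched here with a hypothetical helper explode_column and toy data, uses DataFrame.explode and also strips the space that remains after each comma:

```python
import pandas as pd

def explode_column(frame, column, sep=","):
    """Return one row per (show_id, value) pair for a separator-delimited column."""
    exploded = (
        frame[["show_id", column]]
        .assign(**{column: frame[column].str.split(sep)})
        .explode(column)
    )
    exploded[column] = exploded[column].str.strip()  # drop the space left after each comma
    return exploded.reset_index(drop=True)

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "cast": ["Actor A, Actor B", "Actor B"],
})
cast_long = explode_column(toy, "cast")
print(cast_long)
```

Without the strip, "Actor B" and " Actor B" would be counted as two different people.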
After completing these data preparation steps, we have a clean and transformed dataset ready for further analysis. These initial data manipulations set the foundation for exploring the Netflix dataset and uncovering insights into the streaming platform’s data-driven strategies.
To determine the distribution of content in the Netflix library, we can calculate the percentage distribution of content types (movies and TV shows) using the following code:
# Calculate the percentage distribution of content types
x = df.groupby(['type'])['type'].count()
y = len(df)
r = ((x/y) * 100).round(2)
# Create a DataFrame to store the percentage distribution
mf_ratio = pd.DataFrame(r)
mf_ratio.rename({'type': '%'}, axis=1, inplace=True)
# Plot the 3D-effect pie chart
plt.figure(figsize=(12, 8))
colors = ['#b20710', '#221f1f']
explode = (0.1, 0)
plt.pie(mf_ratio['%'], labels=mf_ratio.index, autopct='%1.1f%%',
        colors=colors, explode=explode, shadow=True, startangle=90,
        textprops={'color': 'white'})
plt.legend(loc='upper right')
plt.title('Distribution of Content Types')
plt.show()
The pie chart shows that approximately 70% of the content on Netflix consists of movies, while the remaining 30% are TV shows.
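The same percentages can be obtained more compactly. As a small sketch on made-up values, value_counts(normalize=True) returns each category's share directly:

```python
import pandas as pd

# Toy 'type' column standing in for df['type']
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

share = (types.value_counts(normalize=True) * 100).round(2)
print(share)
```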
Next, to identify the top 10 countries by number of titles produced, which we use as a rough proxy for where Netflix is popular, we can use the following code:
# Remove leading and trailing white spaces from 'country' column
df_country['country'] = df_country['country'].str.strip()
# Find value counts
country_counts = df_country['country'].value_counts()
# Select the top 10 countries
top_10_countries = country_counts.head(10)
# Plot the top 10 countries
plt.figure(figsize=(16, 10))
colors = ['#b20710'] + ['#221f1f'] * (len(top_10_countries) - 1)
bar_plot = sns.barplot(x=top_10_countries.index, y=top_10_countries.values, palette=colors)
plt.xlabel('Country')
plt.ylabel('Number of Titles')
plt.title('Top 10 Countries Where Netflix is Popular')
# Add count values on top of each bar
for index, value in enumerate(top_10_countries.values):
    bar_plot.text(index, value, str(value), ha='center', va='bottom')
plt.show()
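The bar chart reveals that the United States accounts for the largest number of titles. Note that the chart counts titles by country of production, so it is a proxy for, rather than a direct measure of, where Netflix is popular.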
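Several charts in this article follow the same recipe: highlight the first bar and print the count above each bar. A hedged sketch of a reusable helper is shown below. The name plot_top_n is hypothetical, the data is made up, and it uses plain matplotlib with Axes.bar_label (available in matplotlib 3.4 and later) instead of the manual text loop:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def plot_top_n(counts, title, xlabel, ylabel, highlight="#b20710", base="#221f1f"):
    """Bar chart of a value_counts() result with the first bar highlighted and counts on top."""
    fig, ax = plt.subplots(figsize=(16, 8))
    colors = [highlight] + [base] * (len(counts) - 1)
    bars = ax.bar(list(counts.index.astype(str)), counts.values, color=colors)
    ax.bar_label(bars)  # writes each bar's height above it
    ax.set(title=title, xlabel=xlabel, ylabel=ylabel)
    return ax

# Toy counts standing in for a real value_counts() result
toy_counts = pd.Series({"Country A": 5, "Country B": 3, "Country C": 1})
ax = plot_top_n(toy_counts, "Top Countries by Title Count", "Country", "Number of Titles")
```

In a notebook, calling plt.show() after the helper displays the figure.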
To identify the top 10 actors with the highest number of appearances in movies and TV shows, you can use the following code:
# Count the occurrences of each actor, ignoring the 'No Data' placeholder
cast_names = df_cast['cast'].str.strip()
cast_counts = cast_names[cast_names != 'No Data'].value_counts()
# Select the top 10 actors
top_10_cast = cast_counts.head(10)
plt.figure(figsize=(16, 8))
colors = ['#b20710'] + ['#221f1f'] * (len(top_10_cast) - 1)
bar_plot = sns.barplot(x=top_10_cast.index, y=top_10_cast.values, palette=colors)
plt.xlabel('Actor')
plt.ylabel('Number of Appearances')
plt.title('Top 10 Actors by Movie/TV Show Count')
# Add count values on top of each bar
for index, value in enumerate(top_10_cast.values):
    bar_plot.text(index, value, str(value), ha='center', va='bottom')
plt.show()
The bar chart shows that Anupam Kher has the most appearances in movies and TV shows.
To identify the top 10 directors who have directed the highest number of movies or TV shows, you can use the following code:
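# Count the occurrences of each director, ignoring the 'No Data' placeholder
director_names = df_director['director'].str.strip()
director_counts = director_names[director_names != 'No Data'].value_counts()
# Select the top 10 directors
top_10_directors = director_counts.head(10)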
plt.figure(figsize=(16, 8))
colors = ['#b20710'] + ['#221f1f'] * (len(top_10_directors) - 1)
bar_plot = sns.barplot(x=top_10_directors.index, y=top_10_directors.values, palette=colors)
plt.xlabel('Director')
plt.ylabel('Number of Movies/TV Shows')
plt.title('Top 10 Directors by Movie/TV Show Count')
# Add count values on top of each bar
for index, value in enumerate(top_10_directors.values):
    bar_plot.text(index, value, str(value), ha='center', va='bottom')
plt.show()
The bar chart displays the top 10 directors with the most movies or TV shows. Rajiv Chilaka seems to have directed the most content in the Netflix library.
To analyze the distribution of content in different categories, you can use the following code:
df_listed_in['listed_in'] = df_listed_in['listed_in'].str.strip()
# Count the occurrences of each category
listed_in_counts = df_listed_in['listed_in'].value_counts()
# Select the top 10 categories
top_10_listed_in = listed_in_counts.head(10)
plt.figure(figsize=(12, 8))
bar_plot = sns.barplot(x=top_10_listed_in.index, y=top_10_listed_in.values, palette=colors)
# Customize the plot
plt.xlabel('Category')
plt.ylabel('Number of Movies/TV Shows')
plt.title('Top 10 Categories by Movie/TV Show Count')
plt.xticks(rotation=45)
# Add count values on top of each bar
for index, value in enumerate(top_10_listed_in.values):
    bar_plot.text(index, value, str(value), ha='center', va='bottom')
# Show the plot
plt.show()
The bar chart shows the top 10 categories of movies and TV shows based on their count. “International Movies” is the most dominant category, followed by “Dramas.”
To analyze the addition of movies and TV shows over time, you can use the following code:
# Split the DataFrame into Movies and TV Shows
# (.copy() gives independent frames that can be modified later without warnings)
df_movies = df[df['type'] == 'Movie'].copy()
df_tv_shows = df[df['type'] == 'TV Show'].copy()
# Group the data by year and count the number of Movies and TV Shows
# added in each year
movies_count = df_movies['year_added'].value_counts().sort_index()
tv_shows_count = df_tv_shows['year_added'].value_counts().sort_index()
# Create a line chart to visualize the trends over time
plt.figure(figsize=(16, 8))
plt.plot(movies_count.index, movies_count.values, color='#b20710',
         label='Movies', linewidth=2)
plt.plot(tv_shows_count.index, tv_shows_count.values, color='#221f1f',
         label='TV Shows', linewidth=2)
# Fill the area under the line charts
plt.fill_between(movies_count.index, movies_count.values, color='#b20710')
plt.fill_between(tv_shows_count.index, tv_shows_count.values, color='#221f1f')
# Customize the plot
plt.xlabel('Year')
plt.ylabel('Count')
plt.title('Movies & TV Shows Added Over Time')
plt.legend()
# Show the plot
plt.show()
The line chart illustrates the number of movies and TV shows added to Netflix over time. It visually represents the growth and trends in content additions, with separate lines for films and TV shows.
Netflix’s real growth started in 2015, and it has added more movies than TV shows over the years.
Also, it is interesting that the content addition dropped in 2020. This could be due to the pandemic situation.
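To read the yearly numbers behind the chart side by side, one option is a cross-tabulation of year against content type. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy data standing in for the 'year_added' and 'type' columns
toy = pd.DataFrame({
    "year_added": [2019, 2019, 2020, 2020, 2020],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
})

# Rows are years, columns are content types, cells are title counts
additions = pd.crosstab(toy["year_added"], toy["type"])
print(additions)
```

A table like this makes year-over-year drops, such as the one noted for 2020, easy to quantify.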
Next, we explore the distribution of content additions across different months. This analysis helps us identify patterns and understand when Netflix introduces new content.
To investigate this, we extract the month from the ‘date_added’ column and count the occurrences of each month. Visualizing this data as a bar chart allows us to quickly identify the months with the highest content additions.
# Extract the month from the 'date_added' column
df['month_added'] = pd.to_datetime(df['date_added']).dt.month_name()
# Define the order of the months
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December']
# Count the number of shows added in each month
monthly_counts = df['month_added'].value_counts().loc[month_order]
# Determine the maximum count
max_count = monthly_counts.max()
# Set the color for the highest bar and the rest of the bars
colors = ['#b20710' if count == max_count else '#221f1f' for count in monthly_counts]
# Create the bar chart
plt.figure(figsize=(16, 8))
bar_plot = sns.barplot(x=monthly_counts.index, y=monthly_counts.values, palette=colors)
# Customize the plot
plt.xlabel('Month')
plt.ylabel('Count')
plt.title('Content Added by Month')
# Add count values on top of each bar
for index, value in enumerate(monthly_counts.values):
    bar_plot.text(index, value, str(value), ha='center', va='bottom')
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
# Show the plot
plt.show()
The bar chart shows that July and December are the months when Netflix adds the most content to its library. This information can be valuable for viewers who want to anticipate new releases during these months.
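Selecting the months with .loc raises a KeyError if any month has no titles at all, which can happen on a filtered subset. A small sketch on toy data of a more forgiving variant using reindex:

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# Toy month names standing in for df['month_added']
months = pd.Series(["July", "December", "July", "March"])

# reindex keeps calendar order and fills months without titles with 0
monthly_counts = months.value_counts().reindex(month_order, fill_value=0)
print(monthly_counts)
```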
Another crucial aspect of Netflix’s content analysis is understanding the distribution of ratings. By examining the count of each rating category, we can determine the most prevalent types of content on the platform.
We start by calculating the occurrences of each rating category and visualize them using a bar chart. This visualization provides a clear overview of the distribution of ratings.
# Count the occurrences of each rating
rating_counts = df['rating'].value_counts()
# Create a bar chart to visualize the ratings
plt.figure(figsize=(16, 8))
colors = ['#b20710'] + ['#221f1f'] * (len(rating_counts) - 1)
sns.barplot(x=rating_counts.index, y=rating_counts.values, palette=colors)
# Customize the plot
plt.xlabel('Rating')
plt.ylabel('Count')
plt.title('Distribution of Ratings')
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
# Show the plot
plt.show()
Upon analyzing the bar chart, we can observe the distribution of ratings on Netflix. It helps us identify the most common rating categories and their relative frequency.
Genres play a significant role in categorizing and organizing content on Netflix. Analyzing the correlation between genres can reveal interesting relationships between different types of content.
We create a genre data DataFrame to investigate genre correlation and fill it with zeros. By iterating over each row in the original DataFrame, we update the genre data DataFrame based on the listed genres. We then create a correlation matrix using this genre data and visualize it as a heatmap.
# Extracting unique genres from the 'listed_in' column
genres = df['listed_in'].str.split(', ', expand=True).stack().unique()
# Create a new DataFrame to store the genre data
genre_data = pd.DataFrame(index=genres, columns=genres, dtype=float)
# Fill the genre data DataFrame with zeros
genre_data.fillna(0, inplace=True)
# Iterate over each row in the original DataFrame and update the genre data DataFrame
for _, row in df.iterrows():
    listed_in = row['listed_in'].split(', ')
    for genre1 in listed_in:
        for genre2 in listed_in:
            genre_data.at[genre1, genre2] += 1
# Create a correlation matrix using the genre data
correlation_matrix = genre_data.corr()
# Create the heatmap
plt.figure(figsize=(20, 16))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm')
# Customize the plot
plt.title('Genre Correlation Heatmap')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
# Show the plot
plt.show()
The heatmap demonstrates the correlation between different genres. By analyzing the heatmap, we can identify strong positive correlations between specific genres, such as between TV Dramas and International TV Shows, and between Romantic TV Shows and International TV Shows.
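The approach above correlates the columns of a co-occurrence count matrix. A different technique, sketched here on made-up rows, builds one 0/1 indicator column per genre with str.get_dummies and correlates those indicators directly, so each value reflects how often two genres appear on the same title:

```python
import pandas as pd

# Toy 'listed_in' values standing in for the real column
toy = pd.DataFrame({
    "listed_in": [
        "Dramas, International Movies",
        "Dramas, International Movies",
        "Comedies",
        "Comedies, Dramas",
    ]
})

# One indicator column per genre: 1 if the title carries that genre, else 0
indicators = toy["listed_in"].str.get_dummies(sep=", ")
genre_corr = indicators.corr()
print(genre_corr.round(2))
```

The resulting matrix can be passed to sns.heatmap in the same way as correlation_matrix. Both methods are exploratory, and the numbers they produce are not interchangeable.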
Understanding the duration of movies and TV shows provides insights into the content’s length and helps viewers plan their watching time. By examining the distribution of movie lengths and TV show durations, we can better understand the content available on Netflix.
To achieve this, we extract the movie lengths (in minutes) and the TV show season counts from the ‘duration’ column. The code keeps the variable name tv_show_episodes, although the values are season counts. We then plot histograms to visualize both distributions.
# Extract the movie lengths (minutes) and TV show season counts
movie_lengths = df_movies['duration'].str.extract(r'(\d+)', expand=False).astype(int)
tv_show_episodes = df_tv_shows['duration'].str.extract(r'(\d+)', expand=False).astype(int)
# Plot the histogram
plt.figure(figsize=(10, 6))
plt.hist(movie_lengths, bins=10, color='#b20710', label='Movies')
plt.hist(tv_show_episodes, bins=10, color='#221f1f', label='TV Shows')
# Customize the plot
plt.xlabel('Duration/Episode Count')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Lengths and TV Show Episode Counts')
plt.legend()
# Show the plot
plt.show()
Analyzing the histograms, we can observe that most movies on Netflix have a duration of around 100 minutes. On the other hand, most TV shows on Netflix have only one season.
Additionally, by examining the box plots, we can see that movies longer than approximately 2.5 hours are considered outliers. For TV shows, finding those with more than four seasons is uncommon.
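Because the ‘duration’ column mixes minutes and seasons, it can help to keep the unit next to the number. The sketch below assumes values shaped like "90 min" and "2 Seasons", which is an assumption about the file's formatting, and uses named groups to capture both parts:

```python
import pandas as pd

# Toy 'duration' values (assumed format)
durations = pd.Series(["90 min", "1 Season", "3 Seasons", "125 min"])

# Capture the number and the unit in one pass
parsed = durations.str.extract(r"(?P<value>\d+)\s+(?P<unit>min|Season)")
parsed["value"] = parsed["value"].astype(int)

movie_minutes = parsed.loc[parsed["unit"] == "min", "value"]
tv_seasons = parsed.loc[parsed["unit"] == "Season", "value"]
print(parsed)
```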
We can plot line charts to understand how movie lengths and TV show season counts have evolved over the years. Analyzing these trends helps identify patterns or shifts in content duration.
We start by extracting the movie lengths and TV show season counts from the ‘duration’ column. Then, we create line plots to visualize how they change over the years.
import seaborn as sns
import matplotlib.pyplot as plt
# Extract the movie lengths and TV show season counts from the 'duration' column
movie_lengths = df_movies['duration'].str.extract(r'(\d+)', expand=False).astype(int)
tv_show_episodes = df_tv_shows['duration'].str.extract(r'(\d+)', expand=False).astype(int)
# Create line plots for movie lengths and TV show episodes
plt.figure(figsize=(16, 8))
plt.subplot(2, 1, 1)
sns.lineplot(data=df_movies, x='release_year', y=movie_lengths, color=colors[0])
plt.xlabel('Release Year')
plt.ylabel('Movie Length')
plt.title('Trend of Movie Lengths Over the Years')
plt.subplot(2, 1, 2)
sns.lineplot(data=df_tv_shows, x='release_year', y=tv_show_episodes,color=colors[1])
plt.xlabel('Release Year')
plt.ylabel('TV Show Episodes')
plt.title('Trend of TV Show Episodes Over the Years')
# Adjust the layout and spacing
plt.tight_layout()
# Show the plots
plt.show()
Analyzing the line charts, we observe interesting patterns. Movie length initially increased until around 1963-1964 and then gradually dropped, stabilizing around an average of 100 minutes. This may reflect a shift in audience preferences over time.
Regarding TV shows, we notice a consistent trend since the early 2000s: most TV shows on Netflix have one to three seasons. This points to a preference for shorter or limited series formats among viewers.
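When several titles share a release year, seaborn's lineplot draws the mean for that year by default. To inspect those yearly averages as numbers, a groupby gives the same aggregation. A minimal sketch on made-up values, where the column name minutes is our own:

```python
import pandas as pd

# Toy data: movie lengths in minutes by release year
toy = pd.DataFrame({
    "release_year": [1960, 1960, 2000, 2000, 2000],
    "minutes": [120, 140, 90, 100, 110],
})

mean_length_by_year = toy.groupby("release_year")["minutes"].mean()
print(mean_length_by_year)
```

Years with only a handful of titles produce noisy averages, so it is worth checking the group sizes before reading too much into early peaks.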
Analyzing the most common words used in titles and descriptions can provide insights into the themes and content focus on Netflix. We can generate word clouds to uncover these patterns based on the titles and descriptions of Netflix’s content.
from wordcloud import WordCloud
# Concatenate all the titles into a single string
text = ' '.join(df['title'])
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      min_font_size=10).generate(text)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
# Concatenate all the descriptions into a single string
text = ' '.join(df['description'])
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      min_font_size=10).generate(text)
# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Examining the word cloud for titles, we observe that terms like “Love,” “Girl,” “Man,” “Life,” and “World” are frequently used, indicating the presence of romantic, coming-of-age, and drama genres in Netflix’s content library.
Analyzing the word cloud for descriptions, we notice dominant words such as “life,” “find,” and “family,” suggesting themes of personal journeys, relationships, and family dynamics prevalent in Netflix’s content.
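A word cloud is easy to scan but hard to quote. To back the visual impression with numbers, word frequencies can be counted directly. A minimal sketch using only the standard library, with made-up titles and a tiny illustrative stopword list (not the list the wordcloud package uses):

```python
import re
from collections import Counter

# Toy titles standing in for df['title']
titles = ["Love Is Blind", "A Love Story", "The World of Love"]
stopwords = {"a", "is", "of", "the"}  # illustrative only

words = [
    word
    for title in titles
    for word in re.findall(r"[a-z']+", title.lower())
    if word not in stopwords
]
top_words = Counter(words).most_common(3)
print(top_words)
```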
Analyzing the duration distribution for movies and TV shows allows us to understand the typical length of content available on Netflix. We can create box plots to visualize these distributions and identify outliers or standard durations.
# Extracting and converting the duration for movies
# (assign returns a new DataFrame instead of writing into a slice of df)
df_movies = df_movies.assign(
    duration=df_movies['duration'].str.extract(r'(\d+)', expand=False).astype(int))
# Creating a boxplot for movie duration
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_movies, x='type', y='duration')
plt.xlabel('Content Type')
plt.ylabel('Duration')
plt.title('Distribution of Duration for Movies')
plt.show()
# Extracting and converting the duration for TV shows
# (assign returns a new DataFrame instead of writing into a slice of df)
df_tv_shows = df_tv_shows.assign(
    duration=df_tv_shows['duration'].str.extract(r'(\d+)', expand=False).astype(int))
# Creating a boxplot for TV show duration
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_tv_shows, x='type', y='duration')
plt.xlabel('Content Type')
plt.ylabel('Duration')
plt.title('Distribution of Duration for TV Shows')
plt.show()
Analyzing the movie box plot, we can see that most movies fall within a reasonable duration range, with few outliers exceeding approximately 2.5 hours. This suggests that most movies on Netflix are designed to fit within a standard viewing time.
For TV shows, the box plot reveals that most shows have one to four seasons, with very few outliers having longer durations. This aligns with the earlier trends, indicating that Netflix focuses on shorter series formats.
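A box plot conventionally flags points beyond 1.5 times the interquartile range from the quartiles, which is also seaborn's default whisker setting. The sketch below computes that upper limit on made-up movie lengths. The helper name upper_whisker_limit is our own:

```python
import pandas as pd

def upper_whisker_limit(values: pd.Series, k: float = 1.5) -> float:
    """Upper outlier threshold from the conventional box plot rule: Q3 + k * IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)

# Toy movie lengths in minutes
minutes = pd.Series([80, 90, 95, 100, 105, 110, 120, 210])
limit = upper_whisker_limit(minutes)
outliers = minutes[minutes > limit]
print(limit, outliers.tolist())
```

Applied to the real movie durations, the same calculation shows where the outlier cutoff seen in the box plot comes from.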
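With the help of this article, we have learned how to clean and prepare the Netflix dataset and how to explore its content types, countries, cast, directors, categories, ratings, durations, and the timing of content additions.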
A. Netflix is a data-driven company as it relies on extensive data collection and analysis to make informed decisions about content creation, recommendation algorithms, user experience, and business strategies. Data guides their understanding of user preferences, viewing habits, and market trends to drive innovation and personalized recommendations.
A. The Big Data strategy of Netflix involves leveraging large volumes of data from user interactions, streaming patterns, content metadata, and demographic information. This data is processed, analyzed, and utilized to enhance content discovery, optimize user experience, and inform decision-making across the organization.
A. Netflix employs various methods for data collection, including tracking user interactions on their platform, analyzing streaming data, conducting surveys and experiments, utilizing social media sentiment analysis, and gathering demographic information through user profiles.
A. Netflix’s competitive advantage in big data lies in their ability to harness vast amounts of user data to personalize content recommendations, optimize content production decisions, and create a seamless and tailored user experience. This data-driven approach enables them to deliver highly engaging and relevant content, increasing customer satisfaction and retention.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.