Visualizing Netflix viewership data
It is a Saturday night, and you are getting ready to start a new Netflix session and watch your current favorite show. This time you even have popcorn. Suddenly, a moment of epiphany: how many sessions did you have last year? Among those who share the account with you, what is the most-watched show? Luckily, Netflix is a data company as much as it is a streaming company, so they know it all: the time you started, where you are watching from, how long each session lasts, how you interact with the Netflix client, and obviously, which show you are watching.
Today, I am analyzing my own Netflix viewership data. There are two basic steps:
- Acquire the data
- Analyze and visualize
Acquiring the data
Netflix allows us to download all the data they have gathered about our accounts. If you want to access your own data, log in to your account, visit this page, and click on the ‘Submit Request’ button. Netflix may take some time to compile your information, but they will email you as soon as they are finished.
They will send you lots of information, but we are interested in two files only: Cover sheet.pdf and ViewingActivity.csv. The former has a detailed description of all columns in the spreadsheets; the latter has the viewership information we need.
Analyzing and Visualizing your data
Once we have the data, we can start exploring! For this task, I will use a Jupyter Notebook. Before jumping into the code, I defined a few questions I want to answer with this data.
- Which is the most common device for watching shows?
- What was the most-watched show?
- Who spent more time watching shows on the platform?
- What was the month in which the users watched the longest?
With these questions in mind, we can proceed. The first step is to import the necessary libraries and load the data.
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Importing dataset
df = pd.read_csv('ViewingActivity.csv')
We begin our exploratory analysis by looking at the data to get a sense of it. First, we check the first five rows using the method head().
df.head()
Next, we check the type of each column.
df.dtypes
When we ask for the types, pandas reports every column as object, which here means that all values were read as strings.
Profile Name object
Start Time object
Duration object
Attributes object
Title object
Supplemental Video Type object
Device Type object
Bookmark object
Latest Bookmark object
Country object
dtype: object
Some columns describe specific user interactions with the video player and are not relevant to the analysis we are performing here. The column Attributes, for instance, gives us details about how the user interacted with the content (e.g., the user viewed a series page). Again, information regarding each column is available in the file Cover sheet.pdf. Because these columns will not help us answer our target questions, we will drop them.
# Dropping columns
df.drop(['Attributes', 'Supplemental Video Type', 'Bookmark', 'Latest Bookmark'], axis=1, inplace=True)
A new call to head() shows that the operation was successful.
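As an alternative to dropping columns after loading, the columns can be selected while reading the file. The sketch below uses a small in-memory CSV with one made-up row instead of the real export; the column names follow the dtypes listing above, and the row values are assumptions for illustration only.

```python
import io
import pandas as pd

# In-memory stand-in for ViewingActivity.csv (the single row is made up).
csv_text = """Profile Name,Start Time,Duration,Attributes,Title,Supplemental Video Type,Device Type,Bookmark,Latest Bookmark,Country
Profile 1,2020-03-01 20:00:00,00:45:00,,Some Show: Season 1: Pilot (Episode 1),,Smart TV,00:45:00,00:45:00,BR (Brazil)
"""

# Columns we actually need for the four questions.
keep = ['Profile Name', 'Start Time', 'Duration', 'Title', 'Device Type', 'Country']

# usecols tells read_csv to load only the listed columns.
sample = pd.read_csv(io.StringIO(csv_text), usecols=keep)
print(sample.columns.tolist())
```

With the real file, io.StringIO(csv_text) would be replaced by the path to ViewingActivity.csv.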
We also check for null values.
# Checking null values
df.isnull().sum()
Profile Name 0
Start Time 0
Duration 0
Title 0
Device Type 0
Country 0
dtype: int64
Answering our questions
1. Which is the most common device for watching shows?
The column Device Type shows us on which device a user watched a show.
Notice that there are several types of devices; we can classify them into main categories: TV, browser, Xbox, PC, and Android. To change the current texts, we locate the strings where the word Chrome appears, for example, and replace them with the appropriate category.
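One possible way to do this replacement is sketched below. The device strings and the keyword-to-category mapping are made up for illustration; the values in your own export will differ, so the mapping has to be adapted.

```python
import pandas as pd

# Toy data: these device strings are hypothetical examples.
devices = pd.DataFrame({
    'Device Type': [
        'Chrome PC (Cadmium)',
        'Samsung 2018 Smart TV',
        'Android DefaultWidevineL3Phone',
        'Xbox One',
        'Chrome PC (Cadmium)',
    ]
})

# Hypothetical keyword -> category mapping; the first matching keyword wins.
categories = {
    'Chrome': 'Browser',
    'TV': 'TV',
    'Xbox': 'Xbox',
    'Android': 'Android',
}

def categorize(device):
    """Return the category of the first keyword found in the device string."""
    for keyword, category in categories.items():
        if keyword in device:
            return category
    return 'Other'  # anything we did not anticipate

devices['Device Type'] = devices['Device Type'].apply(categorize)
counts = devices['Device Type'].value_counts()
print(counts)
```

Keeping an explicit 'Other' category makes it easy to spot device strings that the mapping does not cover yet.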
The column looks more comprehensible now. Yet, we still do not have our answer. To obtain it, we can look at the value counts of the Device Type column.
More than 7k Netflix sessions occurred on the TV. It looks like we used the TV the most, followed by our mobile devices. We can also visualize this information through a graph.
sns.set_style('whitegrid') # style configuration
sns.set_context('talk') # figure overall size setup
fig, ax = plt.subplots(figsize=(6,6))
sns.countplot(ax=ax, x='Device Type', data=df, order=df['Device Type'].value_counts().index)
ax.set_ylabel('Netflix Sessions')
ax.set_xlabel('Device Type')
ax.set_title('Most used devices to watch Netflix')
plt.show()
2. Who spent more time watching shows on the platform?
Before verifying which profiles spent the most time on the platform by summing all values of the column Duration, we have to convert these values to a format we can work with. In Python, we can sum time using timedelta.
import datetime
def convert_to_deltatime(time):
    ''' This function converts a string to a timedelta and returns it'''
    hour, minute, second = time.split(':')
    return datetime.timedelta(hours=int(hour), minutes=int(minute), seconds=int(second))
With the help of the conversion function, we can convert each value of the column Duration to the format we want.
df['Duration'] = df['Duration'].apply(convert_to_deltatime)
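As an alternative to a row-by-row conversion, pandas can parse a whole column at once with pd.to_timedelta. A minimal sketch, assuming every value has the form HH:MM:SS (the sample values below are made up):

```python
import pandas as pd

# Made-up durations in the same HH:MM:SS text format as the Duration column.
raw = pd.Series(['00:05:30', '01:20:00'])

# Vectorized conversion from strings to timedelta values.
durations = pd.to_timedelta(raw)

# timedelta values can be summed directly.
total = durations.sum()
print(total)
```

The result is a timedelta column, so the grouping and summing steps work the same way with either conversion.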
Then it becomes easy: we just have to group the data by user and sum their respective times.
watch_time_days = df.groupby(by=['Profile Name'])['Duration'].sum().dt.days
A graph will show the information we need.
fig, ax = plt.subplots(figsize=(6,6))
watch_time_days.plot(kind='bar', ax=ax) # draw on the axes created above
ax.set_ylabel('Total time of Netflix sessions (in days)')
ax.set_xlabel('User')
ax.set_title('Time (in days) spent watching Netflix content per user')
plt.show()
Profile 2 spent the most time on the platform, approaching 70 days of content watched. For the exact numbers, we can print watch_time_days.
3. What was the month in which the users watched the longest?
The column Start Time tells us when the user started watching a show. In its current format, it is a string. We want to convert it to the proper date format.
df['Start Time'] = pd.to_datetime(df['Start Time'], format='%Y-%m-%d %H:%M:%S', utc=True)
We will add another column to accommodate the specific month and year in which each Netflix session occurred.
df['Session_Month'] = df['Start Time'].dt.to_period('M')
Then we group the data by period of time and sum the Duration, as we did before.
monthly_watchtime = df.groupby(by=['Session_Month'])['Duration'].sum().dt.days
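Note that .dt.days keeps only the whole-day part of each total, which can hide a lot of detail in monthly sums. The sketch below, on made-up sessions, compares it with a fractional value computed from total_seconds(); the DataFrame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy sessions (made-up values) standing in for the cleaned DataFrame.
sessions = pd.DataFrame({
    'Start Time': pd.to_datetime([
        '2020-03-01 20:00:00', '2020-03-02 21:00:00', '2020-04-10 19:30:00',
    ]),
    'Duration': pd.to_timedelta(['10:00:00', '20:00:00', '06:00:00']),
})
sessions['Session_Month'] = sessions['Start Time'].dt.to_period('M')

# Total watch time per month, still as timedelta values.
monthly_total = sessions.groupby('Session_Month')['Duration'].sum()

# Whole days only: 30 hours becomes 1 day, 6 hours becomes 0 days.
whole_days = monthly_total.dt.days

# Fractional days: 30 hours becomes 1.25 days, 6 hours becomes 0.25 days.
fractional_days = monthly_total.dt.total_seconds() / 86400

print(whole_days)
print(fractional_days)
```

Plotting the fractional values (or hours) may give a smoother picture of monthly viewership than whole days.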
We can plot the graph.
fig, ax = plt.subplots(figsize=(6,6))
# plot first, so the labels set below are not overwritten by pandas
monthly_watchtime.plot(ax=ax)
ax.set_ylabel('Viewership (in days)')
ax.set_xlabel('Period (in months)')
ax.set_title('Number of days watched per month')
plt.show()
Observe the peak in the first months of 2020, when the pandemic began.
4. What was the most-watched show?
Because titles may also indicate the season and episode to which a show belongs, we risk counting a show more than once. Therefore, we will remove the words after the first colon.
df['Title'] = df['Title'].str.split(':').str[0]
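To see what this split does, here is a small sketch on made-up titles. The title format is an assumption based on the description above, and all show names are hypothetical. Note the caveat in the last row: a show whose own name contains a colon is cut as well.

```python
import pandas as pd

# Made-up titles in a "Show: Season: Episode" style format.
titles = pd.Series([
    'Some Show: Season 1: Pilot (Episode 1)',
    'Some Show: Season 1: The Return (Episode 2)',
    'A Film Without Colon',
    'Saga: Origins: Season 1: First Steps (Episode 1)',
])

# Keep only the text before the first colon.
show_names = titles.str.split(':').str[0]

print(show_names.tolist())
print(show_names.value_counts())
```

Titles without a colon are left unchanged, while 'Saga: Origins' is reduced to 'Saga', so a few shows may be merged or shortened by this step.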
Observe that some rows have a Duration of less than five minutes. This may have happened for several reasons, but we assume that the show was not actually watched; perhaps the user was only exploring to get a feeling for it. In any case, we will remove such short sessions using an arbitrary threshold of ten minutes.
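The column Duration was already converted to timedelta values when we answered question 2, so we can compare it directly against the threshold; no further conversion is needed. Comparing the full duration also keeps sessions longer than one hour, which a check on the minutes component alone would misjudge.
We keep the sessions longer than ten minutes and get the value counts for the column Title.
most_watched = df.loc[df['Duration'] > pd.Timedelta(minutes=10), 'Title'].value_counts()
most_watched.head(10)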
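The sketch below, on made-up rows, shows why filtering on the full duration is safer than filtering on the minutes component: a session of 1 h 05 min has a minutes component of 5, although it clearly exceeds a ten-minute threshold. The DataFrame and the show names are hypothetical.

```python
import pandas as pd

# Made-up sessions; Duration is already a timedelta column.
views = pd.DataFrame({
    'Title': ['Show A', 'Show A', 'Show B', 'Show A'],
    'Duration': pd.to_timedelta(['00:02:10', '00:45:00', '01:05:00', '00:25:30']),
})

# The minutes component ignores hours: 01:05:00 yields 5.
minute_part = views['Duration'].dt.components['minutes']

# Comparing whole timedelta values takes hours into account.
threshold = pd.Timedelta(minutes=10)
long_sessions = views[views['Duration'] > threshold]
session_counts = long_sessions['Title'].value_counts()

print(minute_part.tolist())
print(session_counts)
```

Here the two-minute session is dropped, while the session of 1 h 05 min for 'Show B' is kept.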
It seems Vikings was the most-watched show. The count is interesting because this series does not have 360 episodes; we have to remember, though, that the count spans multiple profiles.
This comparison is also unfair, because series require more sessions than films simply because they are longer and have multiple episodes. To find the most-watched film, for example, we would have to compare shows within the same category. Unfortunately, Netflix does not provide the category to which a show belongs. In any case, we have an overall idea of which shows are watched the most.
Thank you for reading.
Please contact me if you have any questions: https://www.linkedin.com/in/tsantosfigueira/
You can find the code here: https://github.com/TSantosFigueira/Netflix_Viewership_Analysis