Guide to Churn Prediction: Part 2 — Exploring missing values

First published on December 29, 2021

 

12 minute read

Jahnavi C.

TLDR

We will explore the Telco Customer Churn dataset to understand different types of features and find missing values in the dataset.

Outline

  • Recap

  • Exploratory Data Analysis

    • Dependent and Independent features

    • Missing values and its mechanisms

  • Conclusion

Recap

In part 1 of the series,

, we’ve gone over and implemented the first

3

steps (Define the objective, Data gathering, Data cleaning) required to build a machine learning model on a

Telco Customer Churn

dataset and saved the

for further analysis.

Step 4: Exploratory Data Analysis (EDA)

We will perform Exploratory Data Analysis (EDA) on the

cleaned 

Telco Customer dataset to gain

insights

from the data and to

identify patterns

in the data.

I heard many people saying that, in order

to analyze and understand the data, one should think like a detective 

🕵️‍♀️

it’s

 

because

 

the more you understand the data, the better the machine learning models will be. So, let’s start exploring and try to get as much information as possible from the data by asking some basic questions like:

To skim the dataset

1. What are Dependent and Independent features? 2. Is the dependent feature categorical or numerical? 3. Are there missing values/data in the features? If so, what might be the reason?

Now without further ado, let’s start our investigation to find answers to the above questions.

Import libraries and load dataset

Let’s start with importing the necessary libraries and loading the cleaned dataset. Check out the link to

to see how we cleaned the dataset.

1
2
3
import pandas as pd
df = pd.read_csv('cleaned_dataset.csv')
df 

Cleaned dataset

4.1. What are Dependent and Independent features?

To identify dependent and independent features we’ll take a look at few records of dataset and meta information.

1
df.head() # displays first five rows and all the columns in the dataset

First 5 records of cleaned dataset

Meta information

Note

: Please read

to know more about

objective

and

meta information

.

Observations:

  1. Dependent feature

    : This is a feature that represents the objective, and the objective of this project is to predict customer churn (i.e., Yes or No). Based on the dataset and meta information, “Churn Label” is the only feature that indicates customer churn with values of “Yes” and “No.” Therefore, we can consider “Churn Label” as a dependent feature.

  2. Independent features

    : These features help us predict the output (i.e., “Churn Label”). Therefore, all features except for “Churn Label” will be considered as independent features or inputs.

4.2. Is the dependent feature categorical or numerical?

If the dependent feature is of an

object 

data type, then it’s

categorical

, otherwise it’s

numerical

.

Note

: Sometimes categorical features are in the form of numbers. In such cases, we first check whether the numerical data is in continuous or discrete form. If the data is in discrete form and has unique values (1,2,3,4,5 or 0 and 1, etc.) or categories, we call that feature categorical.

So, let’s check data type of the dependent feature, i.e., “Churn Label”.

1
df['Churn Label'].dtype 

Datatype of “Churn Label”

Observations:

The “Churn Label” column is of an object data type. This means the dependent feature is

categorical

.

Missing values

4.3. Are there missing values/data in the features? If so, what might be the reason?

If the feature values are NaN or missing, then those values are known as missing values or missing data. The pattern of occurrence of these missing values is known as

missing data mechanism

. Missing values can be categorized into different types based on the missing data mechanisms.

There are 3 types of

missing data mechanisms

:

  • Missing completely at random

    (MCAR) : This indicates that there is no specific reason for why the values are missing. For eg: Human errors (i.e., data is missed, as the operators accidentally skipped adding the values.)

  • Missing at random

    (MAR) : In this case, data will not be recorded due to certain known reasons i.e., the data is intentionally not provided by the data provider. For eg: People with higher or lower salaries generally don’t disclose their details.

  • Missing not at random

    (MNAR) : In this case, data will not be recorded because data doesn’t exist due to some hidden reasons i.e., the data is not provided by the data collector. For eg: a. People didn’t fill in the salary details because they are not working. b. Temperature values are not recorded as the sensors are worn out over time.

Here’s a dataset with

Designation

and

Salary

as features displayed in a tabular form to show different missing data mechanisms.

Missing data mechanism

Steps to check for missing values:

  • Calculate the total count of missing values in each column using Pandas.

  • Calculate the percentage of missing values for the columns.

  • Discover which missing data mechanisms match.

Total count of missing values

Let’s first check if there are missing values in the columns by using

isnull() 

and

any() 

methods. If there are missing values in a column, then display the total count of missing values in those columns by using the

sum()

method.

1
2
3
nan_col = df.columns[df.isnull().any()]
for i in nan_col:
    print(i, df[i].isnull().sum())

Display the columns and count of missing values

Observations:

1. In the “Total Charges” column, there are 11 missing values. 2. In the “Churn Reason” column, there are 5174 missing values.

Percentage of missing values

Calculate the percentage of missing values in Total Charges and Churn Reason features.

1
2
print(f"There are {df['Total Charges'].isnull().sum()/len(df['Total Charges']) * 100}% missing values in Total Charges feature")
print(f"There are {df['Churn Reason'].isnull().sum()/len(df['Churn Reason']) * 100}% missing values in Churn Reason feature")

Percentage of missing values

Discover which missing data mechanisms match

Identifying the missing data mechanism helps us choose the right techniques required to replace the NaN values with some data.

Discover which missing data mechanism matches the missing data in the “Total Charges” column.

Let’s observe all the records/rows of the dataset where the “Total Charges” column has missing data.

1
df[df['Total Charges'].isnull()]

Records where the “Total Charges” column is NaN

Observations:

When “Tenure Months” is 0, the “Total Charges” has a NaN value. So let’s take a closer look to see if the “Total Charges” values are influenced by the “Tenure Months”.

Create a new dataset called

df_new

with only “Tenure Months” and “Total Charges” as features.

1
2
df_new = df[['Tenure Months','Total Charges']]
df_new

A new dataset with features (“Tenure Months” and “Total Charges”)

Now, let’s check the count of each unique value in the “Tenure Months” column by using the

value_counts()

method.

1
df_new['Tenure Months'].value_counts() 

Unique values and their count in the “Tenure Months” column

Observations:

There are 11 records/rows with value 0 in the “Tenure Months” column.

Let’s display all the rows/records where the “Tenure Months” column is 0.

1
df_new[df_new['Tenure Months']==0] 

“Tenure Months” records with value 0

Observations:

The 11 records/rows that have NaN values in the “Total Charges” column have a corresponding value of 0 months in the “Tenure Months” column.

Reason: “Total Charges” column values are missing because the data doesn’t exist. There is a possibility that customers began using their services by the end of the quarter, or that they’re in the middle of a free trial period. Therefore, we can conclude that the missing values in the “Total Charges” column are following the

MNAR

(Missing Not At Random)

mechanism

.

Discover which missing data mechanism matches the missing data in the “Churn Reason” column.

We know that there are 5174 missing values in the “Churn Reason” column. Now, let’s inspect the missing data mechanism in the “Churn Reason” column. For this, let’s check the values of the “Churn Label” column when “Churn Reason” is NaN.

The reason for only looking at the “Churn Label” and “Churn Reason” columns is that we can deduce from the meta info that “Churn Reason” entries will only be available if a customer leaves the company. For reference, see the image below. The “Churn Reason” has data about the reasons why a customer left the company, and the data about the customers who have left the company is available in the “Churn Label” feature.

Meta information

Create a new dataset with the features “Churn Label” and “Churn Reason” for further analysis.

1
2
df_new = df[['Churn Label','Churn Reason']] 
df_new

Display all the records of “Churn Label” and “Churn Reason”

Check the count of unique values in the “Churn Label” column.

1
df_new['Churn Label'].value_counts()

The number of unique values in “Churn Label”

Display only the rows that have “Churn Label” equal to “No”.

1
df_new[df_new['Churn Label']=='No'] 

“Churn Label” with value as “No”

Observations:

  1. From the count of unique values in “Churn Label”, there are 5174 records equal to “No”.

  2. The number of records in “Churn Label” with “No” is the same as the number of records in “Churn Reason” with “NaN”.

From the observations, we can conclude that the presence of missing values in the “Churn Reason” column is because

the data didn’t exist

, and the data didn’t exist because people didn’t churn (i.e., people are still using their services). Therefore, we can say that these missing values are following the

MNAR

(Missing Not At Random)

mechanism

.

Conclusion

Identifying the dependent and independent features in a dataset makes it easier when we analyze the features in depth. It’s also important to find the missing data and their mechanisms, as missing data affects the performance of the machine learning model.

That’s it for this blog. Next in this series, we’ll continue to perform EDA to analyze the features and dive deep into the dataset by answering questions like:

  1. What are the summary statistics of numerical features?

  2. What are the summary statistics of categorical features?

  3. Are the numerical features normally distributed or is there any skewness?

  4. How many categories are there in each categorical feature?

  5. Are there any outliers in the features? If so, can we really treat them as outliers?

  6. Which independent features are correlated with the dependent feature?

  7. Is there any correlation between the independent features?

Thanks for reading!!