Churn or not (Churn Analytics) — My Thought Process

5 min readAug 3, 2022

Please note that the technique used in this project is experimental.

This article is a breakdown of how I arrived at my second Excel project from start to finish. To read details on the insights, please read this report HERE.

The dataset was gotten from the Maven Analytics data playground.

The Problem

According to the challenge description, I’m expected to answer the big question: “Identify high-value customers and churn risks.”

If you noticed, this big question looks like two big questions: “Identify high-value customers” and “Identity churn risks.” Answering those two questions in one dashboard or report will be complicated.

Why did I say so?

Churn risk is not the opposite of a high-value customer. Churn risk is the likelihood that a customer will stop patronizing a business. In comparison, a high-value customer has contributed so much to a customer’s bottom line for the survival and profitability of the company. This is different from an existing customer. Not all existing customers are high-value customers.

It would make sense and be a lot easier to put contrasting characters side by side to compare them. But we can’t do that here because these two terms are not precisely contradictory.

So what did I do?

I chose to answer one question. If I were to answer both questions, I would do different dashboards and reports for them, but I wanted to do only one report, so I chose to answer one question.

I chose to identify churn risks.

Understanding the data

The dataset contained 7043 rows and 38 columns. It was void of duplicate values. However, it had null values. These null values were due to the non-association of the variables to the kind of service (Phone or Internet service) that the customers subscribed to.

For example, if the customer subscribed to Phone service only, the variable “Streaming TV” will have a null value for this customer. This is because it’s only those that have internet service that stream TV.

I needed to handle these null values so they won’t affect my analysis. So for every null value like this, I inputted the word “None.” This helped me to distinguish them easily.

Going forward, The columns were made up of these variables:

I conducted intensive research on churn analysis and noticed that when identifying churn risks, it is more effective to divide them into different segments. The major segments are demographics, finance, and location.

It would have been quicker to jump into the dataset, categorize my variables into these three segments, and start my visualizations, but I didn’t do that. Instead, I figured out the necessary variables. This was possible using Machine learning.

Applying Machine Learning

It wasn’t enough to know the distribution of my dataset. I needed to know the variables that affected the likelihood of a customer churning. With Entropy and Information gain, I could identify these relevant variables.

Entropy is the degree of uncertainty in a dataset. While Information Gain is the magnitude of information, an independent variable can be used to determine a dependent variable. The more the Entropy, the less the Information gain.

Entropy formula(E):

Source

Information gain formula:

The Entropy and Information Gain of each variable were calculated.

The work isn’t over yet. After calculating the information gain of all the variables in my dataset, I sorted them in descending order to know the variables to prioritize over the other.

Below is a snapshot of these variables and their information gain.

The gender, phone service, and multiple lines had zero information gain. This was a revelation that these variables were contributing nothing to determining the churn risks of customers. Three variables were out. But what about the other 21? There was still too much to visualize.

To still cut down these 21 variables, I used a decision tree.

Starting from the variables with the highest information gain, I checked for variables where one class out of the two classes (Churned and Stayed) had significantly greater proportions than the other class. Only two categorical variables met this condition: Contract type and Offer type.

Notice the clear classifications in this snapshot. The month-to-month contract and offer E had the highest churn proportion than the stayed.

Unlike in this snapshot below, where the “Married” variable had more significant proportions of both the “yes” and “no” on one side(stayed).

I treated the continuous variables that had information gain greater than zero differently. For the continuous variables, I checked for the numbers with the highest churn probability. This was easier to identify, and three variables were selected: Total Revenue, Refunds, and Tenure in Months.

I concluded this with the decision tree below:

The decision tree has a hierarchy where more essential variables are placed at the top.

Explanatory Visualization

After selecting the relevant variables, it was time to visualize. I chose the histogram for the continuous variables and the normalized stacked chart for the categorical variables. These were charts that would make it easier to communicate to my target audience. I also used a people chart to show my audience what would happen if they didn’t act.

To see all these, including the insights and recommendations for this project, you can read about it HERE.

Thank you for dedicating your time to go through my thought process in “identifying churn risks” for the Maven Telecoms company. I’m super grateful.

Do you think there’s something else I should have done? Or are you satisfied with how I analyzed this dataset and filtered the relevant metrics from the irrelevant ones? Please comment below. I would love to know what you think about this.

Thank you!

Churn or not (Churn Analytics) — My Thought Process

The Problem

Understanding the data

Applying Machine Learning

Explanatory Visualization

Written by Nancy Amandi