Customer churn is a common problem across businesses in many industries. Companies often invest substantial amounts in attracting new clients, so every client who leaves represents a significant lost investment; both time and effort then need to be channelled into replacing them. Being able to predict when a client is likely to leave, and to offer them incentives to stay, can deliver huge savings to a business. This is the essence of customer churn prediction: how can we quantify if and when a customer is likely to churn?
One way we can make these predictions is by the application of machine learning techniques. In this blog we will step through a simple approach to building an effective model. Although it can seem intimidating, with the correct approach effective results can be obtained with relative ease in a short time frame.
The classic use case for churn prediction is the telecoms industry; we can try this ourselves using a publicly available dataset, which can be downloaded here. To make our predictions we will be coding in Python and using the scikit-learn library, which contains a host of common machine learning algorithms. For our simple example we will use one of the most popular and easy-to-understand algorithms, the random forest. For a more in-depth tutorial on tree-based models such as random forests (this time in both R and Python), look here.
If you're not a coder, just want a quick summary of the steps, or perhaps are looking for a more visual approach to solving this problem, skip towards the end of the blog.
The first thing we need to do is import all of the relevant Python libraries that we will need for our analysis. Libraries such as numpy, pandas, statsmodels, and scikit-learn are frequently utilised by the data science community.
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display, HTML
```
Now we are ready to begin! After defining our question of interest, the next stage of any data science project is to extract our data. Luckily for us, we have our dataset available in an easily accessible CSV, and we can use the convenient pandas method read_csv() to load it into our environment.
Once our dataset is loaded we can inspect the data using the head() method to have a quick look at what columns and what kind of data we have available to work with.
```python
df = pd.read_csv("path_to_our_data/churn.csv")
display(df.head(5))
```
Although in this case we have our data in a simple .csv file, at this stage we might have connected to SQL databases such as internal CRM or ERP systems, loaded data from Excel or log files, scraped data from the web, or extracted data from a variety of other different sources. As a general rule of thumb, we would like to get our hands on as varied a set of features as possible at the start of our project to be able to start determining what might be useful for our purposes / help us solve our question of interest.
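To illustrate the database route, pandas can read straight from a SQL connection. The sketch below uses an in-memory SQLite table as a stand-in for a real CRM system; the table and columns are invented for illustration:

```python
import sqlite3
import pandas as pd

# Toy in-memory database standing in for a real CRM system; in practice
# the connection and the table name would come from your own systems.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, churn INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, 0), (2, 1)])

# read_sql returns the query result as a DataFrame, just like read_csv
df = pd.read_sql("SELECT * FROM customers", conn)
conn.close()
print(df.shape)  # (2, 2)
```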
Now we can perform some basic exploratory analysis to get a better understanding of what is in our data. For example, we would like to know how many rows and columns we have, what kind of data each column holds, and whether any values are missing.
We could also take this opportunity to plot some charts to help us get an idea of which variables / features will prove useful. For example, if we were thinking of doing some regression analysis, scatter charts could give us a visual indication of correlation between features.
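As a sketch, a scatter chart of two numeric features takes only a line with pandas; the column names and values here are invented, and you would substitute columns from your own dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented data; in the telecoms dataset, pairs like day minutes vs.
# day charge would be natural candidates to plot against each other.
df = pd.DataFrame({"Day Mins": [120, 200, 260, 310],
                   "Day Charge": [20.4, 34.0, 44.2, 52.7]})

# A tight diagonal of points suggests a strong linear correlation
ax = df.plot.scatter(x="Day Mins", y="Day Charge")
plt.show()
```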
The pandas library has plenty of built-in functions to help us quickly understand summary information about our dataset. Below we use the shape attribute to check how many rows are in our dataset and the describe() method to confirm whether or not any columns have missing values.
```python
print("Number of rows: ", df.shape[0])

# The first row of describe() holds the non-null count for each column
counts = df.describe().iloc[0]
display(
    pd.DataFrame(
        counts.tolist(),
        columns=["Count of values"],
        index=counts.index.values
    ).transpose()
)
```
At this stage we would normally begin the process of cleaning our data set, which could involve handling missing values, removing duplicate rows, correcting data types, and standardising inconsistent labels or formats.
Fortunately, our data set has already been pre-cleaned prior to us downloading it.
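Had cleaning been needed, a typical first pass might have looked like the sketch below; the toy frame and its problems are invented for illustration:

```python
import pandas as pd

# Toy frame with the kinds of problems a raw extract might contain:
# a duplicated row, a missing value, and untidy label strings.
raw = pd.DataFrame({"Account Length": [128, 107, 107, None],
                    "Churn": ["False.", "True.", "True.", "False."]})

cleaned = (raw
           .drop_duplicates()                           # remove repeated rows
           .dropna(subset=["Account Length"])           # drop rows missing key fields
           .assign(Churn=lambda d: d["Churn"].str.rstrip(".") == "True"))  # tidy labels
print(cleaned.shape)  # (2, 2)
```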
After cleaning and inspecting our data we might come to the conclusion that certain columns are not going to be useful for prediction. In this example we will not be using the phone number of the client or geographical information about the client, because our assumption is that these shouldn't affect churn. In a more in-depth exercise we would first test this assumption before dropping the columns, but for our purposes we will take it as a given.
Often during a data science project, this is the point where we would enrich our data with additional sources (social media feeds, weather and location data, 3rd party data) and perform any transformations we needed such as aggregations, normalization, imputation, etc. New features can also be created from existing features, for instance, the log() of a feature might be more suitable than the original, or a categorical feature could be encoded.
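As a small illustration of such feature engineering, here is a sketch of a log transform and a one-hot encoding on an invented frame; the column names are assumptions, not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Invented example columns: one skewed numeric, one categorical
df = pd.DataFrame({"Intl Calls": [3, 1, 5],
                   "Intl Plan": ["yes", "no", "yes"]})

# A log transform can tame a skewed numeric feature
# (log1p handles zero values gracefully)
df["Log Intl Calls"] = np.log1p(df["Intl Calls"])

# A categorical feature can be one-hot encoded so the model
# receives numeric inputs
df = pd.get_dummies(df, columns=["Intl Plan"])
print(list(df.columns))
```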
The cleaning of the data and feature generation is often the most important step in achieving good results; frequently it will consume the most time and effort of any part of a data science project.
```python
# Drop the columns that we have decided won't be used in prediction
df = df.drop(["Phone", "Area Code", "State"], axis=1)
features = df.drop(["Churn"], axis=1).columns
```
At this point we can construct our model. The first thing to do is to split our dataset into training and test sets. We will take a simple approach: a 75:25 randomly sampled split.
df_train, df_test = train_test_split(df, test_size=0.25)
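One refinement worth noting: fixing random_state makes the split reproducible, and stratifying on the label keeps the churn ratio the same in both halves. A sketch on a toy frame (the real code would pass the churn dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 50:50 label balance, standing in for the churn data
df = pd.DataFrame({"x": range(100), "Churn": [True, False] * 50})

# random_state pins the shuffle so reruns give the same split;
# stratify preserves the class proportions in both halves
df_train, df_test = train_test_split(
    df, test_size=0.25, random_state=42, stratify=df["Churn"])
print(len(df_train), len(df_test))  # 75 25
```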
Once we have obtained our split we can use the RandomForestClassifier() from the sklearn library as our model. We initialise our model, fit it to our dataset using the fit() method, then simply make our predictions using the predict() method.
```python
# Set up our RandomForestClassifier instance and fit to data
clf = RandomForestClassifier(n_estimators=30)
clf.fit(df_train[features], df_train["Churn"])

# Make predictions
predictions = clf.predict(df_test[features])
probs = clf.predict_proba(df_test[features])
display(predictions)
```
Given the ease of setting up a basic model, a common approach is to initialise and train a variety of different models and pick the most performant one as a starting point. For example, we might also choose to run a support vector machine and a neural network alongside our random forest and then select the best performing of them to refine.
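A minimal sketch of that model bake-off, using synthetic stand-in data and scikit-learn's SVC and MLPClassifier as the support vector machine and neural network (the particular settings are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the blog this would be the churn dataset
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "random forest": RandomForestClassifier(n_estimators=30, random_state=0),
    "svm": SVC(),
    "neural network": MLPClassifier(max_iter=1000, random_state=0),
}

# Fit each candidate and record its accuracy on the held-out test set
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in candidates.items()}
print(max(scores, key=scores.get), scores)
```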
If we display the results we can see we have a list of booleans (0s and 1s) representing whether or not our model thinks each customer has churned. Now we can compare this to whether they actually churned in order to evaluate our model. We could also compute the actual probability of each customer churning using predict_proba() rather than a simple yes / no, and then use these probabilities as thresholds for driving business decisions around which customers we need to target for retention, and how strong an incentive we need to offer them.
We can achieve the comparison mentioned above using the score() method; displaying the result, we can see that we have achieved an accuracy of over 90%, which is not bad for a first attempt.
```python
score = clf.score(df_test[features], df_test["Churn"])
print("Accuracy: ", score)
```
We can also construct a confusion matrix and a ROC curve to dig further into the quality of our results. In a more rigorous exercise, part of this stage would be to determine the most suitable scoring metric(s) for our situation, undertake more robust checks of our chosen metrics, and attempt to reduce or avoid issues such as overfitting by using methods such as k-fold cross-validation.
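As a sketch of the k-fold idea, here is scikit-learn's cross_val_score on synthetic stand-in data; the real exercise would of course use the churn dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=300, random_state=0)
clf = RandomForestClassifier(n_estimators=30, random_state=0)

# 5-fold cross-validation: five train/validate splits and five scores,
# giving a more robust estimate than a single hold-out set
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())
```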
```python
get_ipython().magic('matplotlib inline')

# Use a distinct variable name so we don't shadow the imported
# confusion_matrix function (which would break if the cell were rerun)
conf_matrix = pd.DataFrame(
    confusion_matrix(df_test["Churn"], predictions),
    columns=["Predicted False", "Predicted True"],
    index=["Actual False", "Actual True"]
)
display(conf_matrix)

# Calculate the fpr and tpr for all thresholds of the classification
fpr, tpr, threshold = roc_curve(df_test["Churn"], probs[:, 1])

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
```
We can also plot feature importance to gain some insight as to what features were most useful in our model, which will be of great help when refining our model in the future.
```python
fig = plt.figure(figsize=(20, 18))
ax = fig.add_subplot(111)

df_f = pd.DataFrame(clf.feature_importances_, columns=["importance"])
df_f["labels"] = features
df_f.sort_values("importance", inplace=True, ascending=False)
display(df_f.head(5))

index = np.arange(len(clf.feature_importances_))
bar_width = 0.5

rects = plt.barh(index, df_f["importance"], bar_width, alpha=0.4, color='b', label='Main')
plt.yticks(index, df_f["labels"])
plt.show()
```
The output of this exercise is a corresponding score (representing churn propensity) for each client, as can be seen below. For insight into how these scores could be leveraged by your business, keep reading the next section.
```python
df_test["prob_true"] = probs[:, 1]
df_risky = df_test[df_test["prob_true"] > 0.9]
display(df_risky.head(5)[["prob_true"]])
```
So we can see how, with just a small amount of code, we can produce a well-performing model given good data. In a real-world application, of course, our data would never be so tidy or easy to work with, and we would have to undertake a much more rigorous process for evaluating our predictions. But hopefully we have shown how easy it can be to get started and deliver value.
Data science is an iterative process by nature. Once we have our initial results / predictions, the next stage would be to iterate over the previous steps (cleaning the data, feature selection, modelling, etc.) in an effort to boost the accuracy (or whatever scoring metric you use) of our model. The addition or removal of features, the tuning of hyperparameters, or simply the use of a more complete or larger dataset may all improve our model.
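As one example of hyperparameter tuning, a grid search over a random forest might be sketched as follows; the synthetic data and the grid values are illustrative assumptions, not the blog's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=200, random_state=0)

# A small, illustrative grid; real searches would cover more values
param_grid = {"n_estimators": [10, 30, 100], "max_depth": [None, 5]}

# GridSearchCV cross-validates every combination and keeps the best one
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```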
Once a sufficient model has been trained, the next stage would be to take this model and apply it to new data (clients not used in the training / test sets), and use these predictions to drive the behaviour of our business users. Part of any data science project's life cycle is to get the model into a production environment where it can be properly maintained, updated, and have its value leveraged by the business.
Clients could either be scored in batch (during an overnight process), or in real time by converting the model into a scoring API. These results could then be integrated into a CRM system or similar, so customer service representatives could easily evaluate whether a client who has called in is likely to churn, and act accordingly. Once this system is in place, and enough data has accumulated about whether incentives managed to convince clients to stay, a secondary model to optimise incentive costs could then be developed.
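A batch-scoring job of the kind described above might be sketched as follows, assuming the model is persisted with joblib; the file name and the synthetic stand-in data are illustrative:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train and persist the model once (file name is a hypothetical choice)
X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)
joblib.dump(clf, "churn_model.joblib")

# Then, in the overnight batch job, reload it and score new clients
model = joblib.load("churn_model.joblib")
new_clients = X[:5]                      # stand-in for a fresh CRM extract
scores = model.predict_proba(new_clients)[:, 1]
print(scores)
```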
To summarise, the steps taken to predict the churn propensity of our telecoms customers were the following: loading and inspecting the data, exploring and cleaning it, dropping the features we judged irrelevant, splitting into training and test sets, training a random forest classifier, evaluating it with accuracy, a confusion matrix, and a ROC curve, and finally assigning each client a churn probability.
And for further steps, we looked at iterating over the process to improve the model, deploying it to a production environment, scoring clients in batch or in real time, and eventually building a secondary model to optimise incentive costs.
While in this instance we have taken a code-based approach, a variety of visual tools - such as our partner Dataiku's DSS - exist to enable users to develop data science applications and workflows using intuitive visual interfaces, helping to accelerate their development and time to market. Below you can see the flow of a project which segments telecoms customers by their churn propensity; a similar yet slightly more detailed approach than the one presented in this post.
If you are interested in learning more about churn analysis, data science, and their applications, then feel free to join Keyrus UK at our next webinar on Predicting Churn Propensity in Telecoms. We will be joined by Dataiku to demonstrate how their DSS platform can be used to accelerate data science projects and encourage collaboration among both developers and analysts.