Group 5

  • BARBEAU Amandine
  • CHAU Wai Ping
  • LAARIF Rim
  • LU Shurui
  • RAO Mingzi
  • WANG Yifan
    Introduction

    In the digital age, businesses rely heavily on technology and are very data-rich. Ironically, it remains difficult for companies to find customers and to keep them. Customers' needs, expectations, behaviours, values and even tastes are evolving ever faster. Competition is also growing stronger in every area, as creating a company has become easier and new opportunities appear every day.

    Nowadays, loyalty is the result of a company not only responding to customer needs but anticipating them. Loyal customers are more valuable than new customers, who cost more to acquire and do not spend as much. The Pareto principle suggests that 80% of a business is driven by 20% of its customers. Loyal customers constitute the customer base that allows a business to grow and maintain high profits.

    In 2021, the average churn rate in the telecom business was 22%, and it has been increasing every year, mainly because of strong competition. The telecommunication market has a large number and variety of service providers, and customers switch easily from one to another. Individual customer retention is difficult because most telecom companies have too many customers to devote more time to each of them than necessary.

    Our agency has been hired by a telecom company, as an AI expert, to build a model that will enable it to predict which customers have a high churn probability, based on two years of historical customer data. The company would like to put in place strategies to retain customers with a high probability of churn, in order to concentrate its efforts and optimize its expenses.

    What is our topic?

    Our goal is to enable the telecom company to predict, with good precision, the customers who have a high probability of churn, to help it activate the right retention strategies. Customer churn is one of the most important metrics a growing business needs to evaluate.


    Definition "Customer churn is the percentage of customers that stopped using your company's product or service during a certain time frame" (HubSpot).

    Harnessing the power of data and machine learning, it helps the company identify customers who are going to churn and understand the reasons behind it. It also enables the relevant teams to leverage the knowledge inherent in the data and to develop tactics to achieve customer retention.

    Our methodology

    We have a database with demographic and account-related information about the customers and the services they subscribed to with the telecommunication company. The identified problem has customer input variables and an output variable, which is our target: "Customer churn". We will therefore use supervised learning models to make predictions. First, we will clean the dataset and transform the features. We will explore our data by visualizing and understanding it, to know which information will be most useful for our analysis and whether some adjustments are needed (outliers, categories, etc.). Then we will explore the following prediction models:


  • Logistic Regression
  • Random Forests
  • Support Vector Machines
  • Decision Tree Regressor

    Then, we will test these models and evaluate their performance for predicting churn. Finally, we will select the best model and fine-tune its hyperparameters.

    Objectives

    We will focus our report and our analysis on the following questions:


    Dataset

    From the dataset's descriptive information, we already know that there are 7043 rows, each corresponding to a customerID, and 21 columns or features. The columns describe customer account information, demographic information, the services customers subscribed to, and whether the customer churned versus the previous month:






    1. Data Pre-Processing

    Data Cleaning

    The dataset has 7043 rows (each corresponding to a unique customer) and 21 columns (features). It is composed of 2 integer, 2 float and 17 object (categorical) columns. Looking closer, it appears that the TotalCharges column contains numeric values but its type is object, so we converted this column to float. We then found 11 NaN values which had to be dealt with. Even if they correspond to loyal customers who did not churn (one- to two-year contracts), we cannot keep them, because TotalCharges is not equal to (MonthlyCharges * contract duration) and TotalCharges depends on the customers' behaviour.

    The best way to handle this missing data (and avoid introducing bias into our model) is to drop these rows. We now have a total of 7032 rows.

    Afterwards, we checked whether any customer appeared in two rows. Fortunately, there are no duplicate customerIDs.
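
    As an illustration, the cleaning steps above can be reproduced with a few lines of pandas (a minimal sketch; the file name telecom_churn.csv is a placeholder for our source file):

```python
import pandas as pd

# Load the raw data (placeholder file name).
df = pd.read_csv("telecom_churn.csv")

# TotalCharges is stored as an object: coerce it to float
# (the 11 blank entries become NaN).
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Drop the rows with a missing TotalCharges: 7043 -> 7032 rows.
df = df.dropna(subset=["TotalCharges"])

# Verify there are no duplicate customers.
assert df["customerID"].duplicated().sum() == 0
```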

    Data Exploration

    Our target is the Churn column, so we want to know the proportion, or overall churn rate, for the company. The company churn rate is 26.6%, versus a retention rate of 73.4%. This number provides a baseline against which to compare each feature's impact on customer churn and retention.

    Around 27% of customers churned versus the previous month. It is normal to have some churn, i.e. customers who discontinue the service.

    The objective is to be able to predict the customers who will churn and to provide the concerned departments with a retention strategy based on the features identified during the exploratory phase.

    As a first step to understand our dataset, we will visualize the numeric and categorical variables before converting categorical data into numeric data in preparation for machine learning. Indeed, machine learning models handle numeric data better than categorical data.

    First, we defined the categorical features and dropped customerID, because it does not make sense to keep it for predictions. By isolating the categorical columns (including "Churn"), we could then analyse the correlation between all the columns. Some of them contain values with the same meaning that are not grouped together:



    We needed to correct the categories or values for these columns by simply replacing all 'No internet service' and 'No phone service' values with 'No'.
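
    A minimal sketch of this regrouping with pandas (applied to the whole dataframe, which only affects the columns containing these labels):

```python
# Regroup 'No internet service' / 'No phone service' into plain 'No'
# for the service-related columns.
df = df.replace({"No internet service": "No", "No phone service": "No"})
```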



    2. Data Visualization & exploring the correlations

    For the Categorical Variables

    To compare categorical variables, we used bar charts to visualize the relationships between them (or the lack thereof).
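
    For instance, a bar chart of churn broken down by contract type can be drawn as below (a sketch using seaborn; Contract stands here for any of the categorical columns):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Count of churners vs non-churners for each contract type.
sns.countplot(data=df, x="Contract", hue="Churn")
plt.title("Churn by contract type")
plt.show()
```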



    The comments below are based on visual observations only. They should only be carried into the recommendation stage for the company after building our machine learning model and confirming the correlations between the features and the churn rate.


    Below are the features which appear to favor churn and which would need specific further investigation in a phase 2 and/or by the customer service teams:


    It is the same with the StreamingTV and StreamingMovies services, where we can notice that more customers using these services churn than customers who do not.
    We would need to dig a little deeper into the numbers and, if this is confirmed, recommend that the customer service manager responsible for each product (or service) and the marketing teams follow up on the customer journey and put the right tactics in place to adjust their operations to customers' expectations and needs.


    Some other feature characteristics listed below favor customer retention:



    Nothing specific to declare:



    For the Numeric Variables

    We first defined the numerical features: SeniorCitizen, tenure, MonthlyCharges and TotalCharges. Then, we also transformed the Churn feature into numeric values, in order to examine the correlations in a second step. First, we plotted histograms for all numeric attributes.
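
    A sketch of this step with pandas plotting:

```python
import matplotlib.pyplot as plt

num_cols = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]

# Histograms of the numeric attributes.
df[num_cols].hist(bins=30, figsize=(10, 6))
plt.show()
```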




    As we can see, our customers are well distributed across the tenure graph and present two spikes: 25% have a tenure under 9.0 (recently acquired customers) and 25% are above 55.0 (loyal customers); the remaining 50% vary between 9.0 and 55.0, meaning that they are still satisfied with the services provided... until some point in time or a triggering event.

    At some point in between, the telecom company loses 25% of customers who could have stayed a few more months and who decided to discontinue the services and leave the company.

    Our model will consist in determining the features which will help the telecom company identify these churners and support it in identifying (then investigating potential churn reasons on a one-on-one basis and deploying) the right sales, marketing, or customer support strategy to retain these customers.

    We also notice that the TotalCharges distribution has a fat tail and that 75% of the customers' TotalCharges are below 4000 in local currency. But 25% of the customers are above that, some of them reaching total charges of about 8000 in local currency, and many of these values occur only once.

    This raises an important step in our data preparation for building the machine learning model: some total charges go far beyond 4000. This is the reason why we will handle everything above 4000 as a single category in a step below.

    It is important to have a sufficient number of instances in our dataset for each stratum, in order to have a good representation of all our customer segments; otherwise the estimation of the most profitable and the newly acquired customers will be biased. This means that we need to limit the number of TotalCharges categories. We will divide the TotalCharges attribute by 2266, its standard deviation (to limit the number of categories), round it up using ceil, and then merge the categories corresponding to TotalCharges greater than 4000 into a single category.

    The company will certainly need to dig deeper into its data, identify these highly profitable 25% of customers, and provide them with a different quality of service to retain the most profitable ones. A customer segmentation analysis is recommended as a second-phase analysis (it is not the subject of this research) to fine-tune the company's customer relationship strategy and adapt it to each customer segment.



    Searching for correlation

    In order to identify correlations within the data, we transformed the target variable (Churn) into numeric data. Secondly, we drew a heatmap using the Kendall method to observe the correlations. This method measures the ordinal association between the variables.
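
    A sketch of this step (Churn is mapped to 0/1 so it can enter the correlation matrix):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Numeric churn indicator for the correlation matrix.
df["Churn_num"] = df["Churn"].map({"No": 0, "Yes": 1})

num_cols = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges", "Churn_num"]
corr = df[num_cols].corr(method="kendall")

sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Kendall correlations")
plt.show()
```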

    Unsurprisingly, TotalCharges and MonthlyCharges are correlated (0.65), and TotalCharges is correlated with tenure (0.82).

    The numeric variables correlated with churn are:


    • Tenure: negatively correlated, strongest correlation (-0.35)
    • TotalCharges: second strongest correlation, negatively correlated (around -0.2)
    • MonthlyCharges and SeniorCitizen are positively correlated with churn.

    The fact that TotalCharges and tenure are negatively correlated with churn is interesting, because it reminds us of the customer lifetime value (CLV = frequency * average basket * length of the relationship). We recommend a second-phase analysis in which we segment the customers by their CLV, in order to enable specific targeting and invest more in retaining customers with a high CLV.










    3. Data preparation for ML algorithm - Creating a test set

    Rebuilding our dataset

    First, we need to recreate our full dataset with our numeric and categorical features as per our data cleaning and pre-processing steps.

    To do this, we first need to make sure that d_cats and df_nums have the same number of rows. Once this is done, we need to merge the two parts of the dataset. We matched the rows corresponding to the same customerID with an inner join, then removed the duplicate Churn column coming from d_cats and renamed Churn_y to Churn, to avoid having duplicate columns and data.
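
    A sketch of this merge (assuming both intermediate frames, d_cats and df_nums, still carry the customerID column):

```python
# Inner join of the categorical and numeric parts on customerID.
df_full = d_cats.merge(df_nums, on="customerID", how="inner")

# Churn is present on both sides, so pandas suffixes it; keep a single copy.
df_full = df_full.drop(columns=["Churn_x"]).rename(columns={"Churn_y": "Churn"})
```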





    Limiting the TotalCharges categories

    Now, the TotalCharges categories need to be limited and ceiled. A value_counts() confirms what we already knew: the customers' TotalCharges are very widely distributed and many values occur only once. We indeed have a fat-tailed distribution.

    The Pareto principle looks fully in action here. An interesting follow-up would be to see how it applies and to search for our 20% most profitable customers (generating 80% of our revenues), the services they subscribed to and/or their characteristics and behaviours.




    So, we chose to divide our data by 2266 (the standard deviation) to limit the number of TotalCharges categories, and to put a cap at 4000: every TotalCharges value above 4000 is merged into a single top category.

    This gives a better categorization for analysing our customer segments' characteristics and ensures sufficient proportions of each customer group in both the training and test sets.
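
    One possible reading of this categorization, as a sketch (divide by the standard deviation, round up, then collapse the sparse tail above 4000 into one bucket):

```python
import numpy as np

# Stratification category: TotalCharges divided by its standard deviation (~2266),
# rounded up with ceil.
df_full["TotalCharges_cat"] = np.ceil(df_full["TotalCharges"] / 2266)

# Collapse every value above 4000 into a single top category.
top_cat = df_full["TotalCharges_cat"].max()
df_full.loc[df_full["TotalCharges"] > 4000, "TotalCharges_cat"] = top_cat
```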



    Using stratified sampling to create a test split

    We will use Scikit-Learn’s StratifiedShuffleSplit to perform stratified sampling and ensure that the different categories are properly represented in both the training and test datasets.
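
    A minimal sketch of the split, stratified on the TotalCharges category built above:

```python
from sklearn.model_selection import StratifiedShuffleSplit

# 80/20 split that preserves the proportions of each TotalCharges category.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(df_full, df_full["TotalCharges_cat"]):
    strat_train_set = df_full.iloc[train_idx]
    strat_test_set = df_full.iloc[test_idx]
```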





    Building a pipeline to pre-process the categorical and numeric input features

    We defined the list of numeric and categorical attributes for each column. As machine learning algorithms do not perform well on numeric features with different scales, we standardized the numeric features: standardization subtracts the mean value and then divides by the standard deviation, resulting in a distribution with zero mean and unit variance.

    In order to ensure future performance and avoid data leakage, we wrapped the preprocessing (a ColumnTransformer) and the model in a pipeline, so that the preprocessing is re-applied at every step of model building: for instance, on each fold during cross-validation and for every GridSearchCV combination of model and preprocessing hyperparameters. In other words, it ensures that we have no data leakage and safeguards our future model's performance.
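
    A sketch of such a pipeline (the attribute lists are illustrative, and the one-hot encoding of the categorical features is an assumption about the preprocessing):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_attribs = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]
cat_attribs = ["gender", "Partner", "Dependents", "Contract",
               "PaperlessBilling", "PaymentMethod", "InternetService"]  # illustrative subset

# Scale the numeric features and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_attribs),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_attribs),
])

# Chaining preprocessing and model: the transformer is re-fitted on each
# cross-validation fold, which prevents data leakage.
log_reg_pipeline = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
```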






    4. Selecting and training a model

    Stratified cross-validation

    In order not to waste too much of our dataset, nor to contaminate it and risk overfitting our model, we used 5- or 10-fold cross-validation, trying to find a trade-off between variance and bias in our model selection. We thus determined the best predictive model for our dataset and built our model while transforming our data so as to contaminate it as little as possible.
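
    A sketch of the cross-validation step (X_train, y_train and log_reg_pipeline are assumed names from the previous steps; we compute both the default accuracy and the RMSE used in the interpretations below):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation accuracy.
acc_scores = cross_val_score(log_reg_pipeline, X_train, y_train, cv=10)

# The same pipeline scored with negative MSE, turned into an RMSE.
mse_scores = cross_val_score(log_reg_pipeline, X_train, y_train, cv=10,
                             scoring="neg_mean_squared_error")
rmse_scores = np.sqrt(-mse_scores)

print(acc_scores.mean(), rmse_scores.mean(), rmse_scores.std())
```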


    Logistic Regression Model



    Interpretation of the logistic regression results: the scores are consistent across the 10 folds, the mean score is about -0.25 and the RMSE is 0.52 (could be better).
    The training score and the testing score are equivalent, respectively 0.76 and 0.73, which is good.


    Support Vector Machines




    Classification report: now, let's look at the classification report to check the business applicability of this model for the telecom company. The recall is 48%, so 52% of the churners are not identified. The precision is about 63%, meaning that among the customers flagged as churners, only about one third are false positives. This is acceptable from an expenses perspective, since about two thirds of the flagged customers really are churners and a strategy can be put in place to retain them.
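
    A sketch of how this report is obtained (svc_pipeline, X_test and y_test are assumed names for the fitted SVC pipeline and the held-out data):

```python
from sklearn.metrics import classification_report

# Precision / recall of the fitted SVC pipeline on the held-out data.
y_pred = svc_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```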

    In other words, the company can use this model to predict churn, and we recommend that the customer service, sales or marketing departments develop retention strategies based on it.



    Interpretation of the SVM model results: with the SVC model, the RMSE across the cross-validation folds has a mean of 0.44 and a standard deviation of 0.013. The training score is 0.82 versus 0.79 for the testing score, so the model generalizes well to new data (the test data), with an accuracy of 0.79. The performance of the SVC is therefore good.


    Decision Tree Regressor

    The training score is 0.99, while the RMSE remains constant across the folds at 0.52 with a standard deviation of 0.024. But the test score is catastrophic, at -0.41. The model is clearly overfitting the training data.

    Interpretation of the decision tree regressor results: the model overfits, and we will not use it to make churn predictions.


    Random Forest model





    Interpretation and estimation of the random forest predictions: with the random forest model, the training score and the test score are very close and good, respectively 0.82 and 0.79. The accuracy score is 0.795, almost equal to the one obtained with the SVC (0.79) and close to 1, which is a pretty good performance. The RMSE remains stable across the folds during cross-validation and is equal to the one obtained with the SVC, with a mean score of 0.44 and a standard deviation of 0.018.





    Classification report analysis: let's analyse the classification report to consider whether this model can be used for a business application (or not). The SVC predicted fairly well, with a precision of 0.63 and a recall of 0.48.
    For the random forest, the classification report shows a precision of 0.66 and a recall of 0.45, so the model performs about as well as the SVC on both precision and recall, probably with a few more false negatives but fewer false positives.
    We cannot really discriminate between one and the other at this stage. This is why we will run a grid search to fine-tune both models and find the best hyperparameters before comparing their performance and deciding which one is better.




    5. Fine-tuning the model


    Both the random forest and the SVM have the same accuracy scores and close numbers in the classification report. This is why we need to apply a grid search cross-validation to both models, in order to fine-tune them and find the best hyperparameters for each.

    Thus, we will be able to discriminate between both models and select only one to be used for Telecom Churn prediction.


    GridSearchCV for SVC model

    We selected the SVM as a predictive model for potential churning customers and used grid search to optimize its hyperparameters. The best hyperparameter combination found was C=100, kernel='rbf' and gamma=0.001, which gave us an RMSE score of 0.4448220618029941.
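
    A sketch of this grid search (the grid values are illustrative; the scoring mirrors the RMSE figures reported above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

svc_pipeline = Pipeline([("prep", preprocess), ("clf", SVC())])

# Illustrative hyperparameter grid around the values found above.
param_grid = {
    "clf__C": [1, 10, 100],
    "clf__kernel": ["rbf", "linear"],
    "clf__gamma": [0.001, 0.01, 0.1],
}

grid_search = GridSearchCV(svc_pipeline, param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)  # best found: C=100, kernel='rbf', gamma=0.001
```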







    GridSearchCV for Random Forest Model

    We will now apply grid search cross-validation to the random forest model in order to select its best hyperparameters.
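
    A sketch of the random forest grid search (the grid values are illustrative, and a RandomForestClassifier is assumed here):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rf_pipeline = Pipeline([
    ("prep", preprocess),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Illustrative grid around the best combination reported below.
param_grid = {
    "clf__n_estimators": [10, 30, 100],
    "clf__max_features": [4, 6, 8],
}

rf_search = GridSearchCV(rf_pipeline, param_grid, cv=5,
                         scoring="neg_mean_squared_error")
rf_search.fit(X_train, y_train)
print(rf_search.best_params_)
```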




    The best parameters for the random forest model are: 'max_features': 8, 'n_estimators': 30.



    Analysing the best models, their errors and selecting the best model


    With the Fine-tuned SVC Model, we made predictions:


    • Best score: 0.7953091684434968
    • Best RMSE: 0.4524277086524467

    This accuracy score of 0.79 enables the telecom company to predict churning customers efficiently and to optimize its retention strategy.




    For the random forest model, we found an RMSE score of 0.3888695770388026.
    But the best model's performance is only 0.22 on the test data, meaning that the model does not perform well on new data. We do not recommend using this model for churn predictions.



    6. Evaluating our best estimator on the test set and conclusion

    With a testing accuracy of 0.795, the SVC model confirms that it is the best model for the requested business application. Its training accuracy (0.79) and test accuracy (0.795) are similar and show that the model generalizes very well to the test set (new data).

    Besides, its classification report shows a reasonable trade-off between precision (0.63) and recall (0.48), optimizing the investments compared with the other models in order to retain churning customers.

    The fine-tuned random forest model gave us a score of only 0.22, so it is not retained.

    But the model has some limitations:
    The first one is the size of our dataset: we only have 7032 rows of customer data, which is small considering the business application of the predictions, particularly since the most profitable customer segment we need to retain is also the smallest (395 customers, i.e. 5.6% of our dataset). Besides, we identified two attributes as important in our model's predictions: tenure and monthly charges. As mentioned, these are linked to the customer lifetime value (CLV). We therefore recommend a second study using unsupervised models to cluster customer segments and link the business retention strategy to the CLV.

    We could further analyze some attribute combinations or add features (if we can obtain more data) in search of better model performance.

    From a business perspective, we recommend that the telecom company run customer surveys and gather customer feedback on the services favoring churn or favoring customer retention.

    Indeed, we have noticed that some services are prone to churn:




    On the contrary, other services favor customer retention or loyalty:



    We would also like to analyze our data further, for instance by modifying the feature attributes with the following adjustments:


    We need to push this first-pass analysis further and modify our features by trying different attribute combinations.


    A direction for even further analysis would be to investigate:





    If you want to see the code used throughout our analysis, you can find it by clicking on this link.






    Meme's source: https://www.pinterest.fr/pin/686658274414696260/