Welcome to my GitHub Page!
This is my portfolio on data science and machine learning. It contains some of the projects I used to hone my knowledge and skills.
In this project, I designed a predictive model that estimates the probability that a customer of a telco company will leave the service (churn) or continue to use it (retain). The model achieves a sensitivity score of 80%.
For this project, I followed a workflow based on the CRISP-DM model: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Dataset Information
Attribute Information
- customerID: ID number of the customer
- Churn: Churn status, whether the customer churned or not
- gender: Whether the customer is a male or a female
- SeniorCitizen: Whether the customer is a senior citizen or not
- Partner: Whether the customer has a partner or not
- Dependents: Whether the customer has dependents or not
- tenure: Number of months the customer has used the service
- Contract: The contract term of the customer
- PaperlessBilling: Whether the customer has paperless billing or not
- PaymentMethod: The customer’s payment method
- MonthlyCharges: The amount charged to the customer monthly
- TotalCharges: The total amount charged to the customer
- PhoneService: Whether the customer has a phone service or not
- MultipleLines: Whether the customer has multiple lines or not
- InternetService: Customer’s internet service provider
- OnlineSecurity: Whether the customer has online security or not
- OnlineBackup: Whether the customer has online backup or not
- DeviceProtection: Whether the customer has device protection or not
- TechSupport: Whether the customer has tech support or not
- StreamingTV: Whether the customer has streaming TV or not
- StreamingMovies: Whether the customer has streaming movies or not

Note: Since this dataset is using CamelCase format for the column names, for this project, I will convert it to snake_case format.
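The column renaming mentioned in the note can be sketched as follows. This is a minimal illustration, not the exact code used in the project: the helper name `camel_to_snake` and the two regular expressions are assumptions, chosen so that names such as `customerID` and `StreamingTV` keep their acronyms together.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a CamelCase or mixedCase column name to snake_case.

    Hypothetical helper for illustration only.
    """
    # Separate an acronym from a following capitalised word, e.g. "TVShow" -> "TV_Show".
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Separate a lowercase letter or digit from a following capital,
    # e.g. "customerID" -> "customer_ID".
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return name.lower()


# A few of the column names listed above.
columns = ["customerID", "SeniorCitizen", "PaperlessBilling", "StreamingTV", "tenure"]
renamed = [camel_to_snake(c) for c in columns]
print(renamed)
# ['customer_id', 'senior_citizen', 'paperless_billing', 'streaming_tv', 'tenure']
```

With a pandas DataFrame (here assumed to be called `df`), the same function could be applied to all columns at once with `df = df.rename(columns=camel_to_snake)`.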
Company Goals
Increasing profit! But how can we achieve it? Some of the ways to increase profit are:
Problems
Objectives
27% of customers leave us!
I did some data preprocessing, such as:
I tried several machine learning algorithms, such as:
Overall, boosting methods performed well. I then compared several feature selection methods and hyperparameter tuning settings to see whether the performance of the boosting methods could be improved.
My tuning strategy focuses on optimizing recall for the positive (churn) class, rather than the average recall across classes, to minimize false negatives: customers who actually churn but are predicted as non-churn. This is because acquiring new customers is more expensive than retaining existing ones. I still pay attention to the accuracy score as well.
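To make the distinction between positive-class recall and average recall concrete, here is a small pure-Python sketch. The labels are made-up toy data, not results from the project, and the function names are hypothetical; the label `1` is assumed to mean churn.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn), treating `positive` as the churn class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1  # a churner we predicted as non-churn (the costly mistake)
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def positive_recall(y_true, y_pred, positive=1):
    """Recall (sensitivity) of the churn class: tp / (tp + fn)."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def macro_recall(y_true, y_pred, positive=1):
    """Unweighted average of the recall of the churn and non-churn classes."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    negative_recall = tn / (tn + fp) if (tn + fp) else 0.0
    return (positive_recall(y_true, y_pred, positive) + negative_recall) / 2


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels: 4 churners (1) and 6 non-churners (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

print(positive_recall(y_true, y_pred))         # 0.75
print(round(macro_recall(y_true, y_pred), 3))  # 0.792
print(accuracy(y_true, y_pred))                # 0.8
```

In scikit-learn, the positive-class value corresponds to `recall_score(y_true, y_pred, pos_label=1)`; whether the project used that exact call is an assumption.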
For model selection, I use the harmonic mean of accuracy and recall, reported as fbeta in the table below.
Model | accuracy | recall | fbeta |
---|---|---|---|
Gradient Boosting Classifier | 0.775 | 0.766 | 0.771 |
AdaBoost Classifier | 0.759 | 0.783 | 0.770 |
CatBoost Classifier | 0.761 | 0.765 | 0.763 |
Hist Gradient Boosting | 0.756 | 0.781 | 0.768 |
XGBoost | 0.761 | 0.779 | 0.770 |
LightGBM | 0.762 | 0.791 | 0.777 |
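The fbeta column can be checked against the other two columns with a few lines of Python. This is only a sketch: the function name `harmonic_mean` is hypothetical, and because the table values are rounded to three decimals, the recomputed scores can differ from the reported ones by about 0.001.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two scores: 2ab / (a + b)."""
    return 2 * a * b / (a + b) if (a + b) else 0.0


# (accuracy, recall, reported fbeta), copied from the table above.
scores = {
    "Gradient Boosting Classifier": (0.775, 0.766, 0.771),
    "AdaBoost Classifier": (0.759, 0.783, 0.770),
    "CatBoost Classifier": (0.761, 0.765, 0.763),
    "Hist Gradient Boosting": (0.756, 0.781, 0.768),
    "XGBoost": (0.761, 0.779, 0.770),
    "LightGBM": (0.762, 0.791, 0.777),
}

# Recompute the combined score from the rounded accuracy and recall.
recomputed = {
    name: harmonic_mean(acc, rec) for name, (acc, rec, _) in scores.items()
}
for name, value in recomputed.items():
    reported = scores[name][2]
    print(f"{name}: recomputed {value:.3f}, reported {reported:.3f}")

# Pick the model with the highest reported score.
best = max(scores, key=lambda name: scores[name][2])
print(best)  # LightGBM
```

The harmonic mean penalizes a large gap between the two scores, so a model cannot rank first by excelling at accuracy while neglecting recall, or the reverse.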
Final Model
The final model is LightGBM with feature selection using a filter method, which gives:
Recommendation and Request
SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model.
See papers for details and citations.
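To give a feel for the game-theoretic idea behind SHAP, the sketch below computes exact Shapley values by brute force for a tiny toy model. It is not the `shap` library and not the project's model: the scoring function `toy_churn_score`, its coefficients, and the choice to represent "missing" features by a baseline value are all assumptions made for illustration. Enumerating every feature subset is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature of `x` relative to `baseline`.

    A coalition's value is the prediction when the features in the coalition
    take their values from `x` and all other features come from `baseline`.
    """
    n = len(x)

    def value(subset):
        mixed = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):  # coalition sizes 0 .. n-1
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi


def toy_churn_score(features):
    """Hypothetical score with an interaction between the first two features."""
    a, b, c = features
    return 0.3 * a + 0.5 * b - 0.2 * c + 0.4 * a * b


x = [1.0, 1.0, 1.0]          # the instance to explain
baseline = [0.0, 0.0, 0.0]   # reference input standing in for "feature absent"

phi = shapley_values(toy_churn_score, x, baseline)
print([round(p, 3) for p in phi])  # [0.5, 0.7, -0.2]

# Efficiency property: contributions sum to prediction minus baseline prediction.
print(round(sum(phi), 3), round(toy_churn_score(x) - toy_churn_score(baseline), 3))
```

The interaction term `0.4 * a * b` is split equally between the first two features, which is why they receive 0.5 and 0.7 rather than their standalone coefficients 0.3 and 0.5.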
I deployed my model as a web app using Flask and Heroku. You can try it here