LendingClub - Loan Credit Risk Prediction
This is my internship project as a data scientist at id/x partners
In this project, I built a model that predicts the probability that a borrower of a lending company will default, and it achieved a 98% accuracy score.
Project Notebooks
I split the work into separate notebooks because of my limited computing resources
Dataset & Business Understanding
Dataset Information
- This dataset contains borrower information from a lending company named LendingClub (LC for short), covering 2007 to 2014
- This company has various offerings such as loans, banking, and investments
Attribute Information
- Identifier
  - `id`: a unique LC-assigned ID for the loan listing
  - `member_id`: a unique LC-assigned ID for the borrower member
- Target Variable
  - `loan_status`: current status of the loan, either good or bad (risky)
- More detailed attribute information can be found here
Company Goals
Increasing profit! But how can we achieve it? Some ways to increase profits are:
- Accepting applicants who are likely to pay off their loans
- Declining applicants who are unlikely to pay off their loans (potential defaulters)
Problems
- Credit risk is the possibility of a loss resulting from a borrower’s failure to repay a loan or meet contractual obligations (source)
- When a lending company receives a loan application, it has to decide whether to accept or decline it based on the applicant’s profile
- If the applicant is likely to pay off the loan but we don’t approve their application, it may result in a loss of income for the company
- If the applicant is not likely to pay off the loan but we approve their application, it may result in financial loss for the company
Objectives
- Predict whether the borrower will pay off the loan or not
- Understand borrower behavior:
  - What makes a borrower pay off the loan
  - What makes a borrower fail to pay off the loan
Exploratory Data Analysis
What Happened?
A loan has a `Good` status when its loan status is either `Current` or `Fully Paid`; otherwise its status is `Bad` (risky)
- About 12% of borrowers have a risky loan status
- Technically speaking, this dataset is an imbalanced dataset
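The labeling rule above can be sketched in a few lines. This is a minimal illustration rather than the notebook's actual code: the function name `label_loan` and the sample statuses are made up for the example.

```python
from collections import Counter

# Statuses treated as "good" according to the rule above;
# every other status is treated as "bad" (risky).
GOOD_STATUSES = {"Current", "Fully Paid"}


def label_loan(status: str) -> str:
    """Map a raw loan_status value to the binary target."""
    return "good" if status in GOOD_STATUSES else "bad"


# Hypothetical sample of raw statuses, for illustration only.
raw_statuses = [
    "Current",
    "Fully Paid",
    "Charged Off",
    "Current",
    "Late (31-120 days)",
]

labels = [label_loan(s) for s in raw_statuses]
counts = Counter(labels)

# Share of risky loans; a small share signals class imbalance.
bad_share = counts["bad"] / len(labels)
print(counts, f"bad share = {bad_share:.0%}")
```

Checking the share of the `bad` class this way is a quick test for imbalance before choosing metrics or resampling.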
Who are The Borrowers?
- Many borrowers have the words `Manager`, `Service`, `Director`, `Assistant`, `Sale`, `Teacher`, or `Nurse` in their employment title
- Many borrowers didn’t write their employment title, so it’s marked as `Unknown`
Why Did They Apply for a Loan?
- Most borrowers apply for loans for the purpose of debt consolidation
What is Their Grade?
- Most borrowers have grade B or C
Do Grades Matter?
- This feature seems to have a natural order based on the loan status probability
- Grade A has the highest probability of a good loan status
- Grade G has the lowest probability of a good loan status
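Because the grades appear to be ordered, one option is to encode them as integers instead of one-hot columns. The sketch below assumes the grades run from A to G, as in the text above. Whether the notebooks use this exact encoding is not stated, so treat `GRADE_ORDER` and `encode_grade` as hypothetical names.

```python
# Grades A-G mapped to 0-6, so a larger number means a riskier grade.
GRADE_ORDER = {grade: rank for rank, grade in enumerate("ABCDEFG")}


def encode_grade(grade: str) -> int:
    """Return the ordinal rank of a grade letter (A -> 0, ..., G -> 6)."""
    return GRADE_ORDER[grade.strip().upper()]


# Hypothetical sample values, for illustration only.
grades = ["B", "C", "A", "G"]
encoded = [encode_grade(g) for g in grades]
print(encoded)
```

An ordinal encoding keeps the ordering information in a single column, which tree-based models can split on directly.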
Loan Credit Risk Probability by Date Features
Issue Date
- The earlier the issue date, the higher the probability that a borrower has a bad loan status
Last Payment Date
- The longer ago the last payment was made, the higher the probability that a borrower has a bad loan status
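Date columns like these are usually turned into a numeric "months since" feature before modeling. The sketch below is one way to do that, under two assumptions that are not confirmed by this README: dates are stored as strings such as `Dec-2011`, and a fixed reference month is chosen by the analyst. The function name `months_since` is hypothetical.

```python
# Month abbreviations mapped to month numbers, avoiding locale-dependent parsing.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def months_since(date_str: str, ref_year: int, ref_month: int) -> int:
    """Whole months between a 'Mon-YYYY' date and a reference month."""
    month_name, year = date_str.split("-")
    return (ref_year - int(year)) * 12 + (ref_month - MONTHS[month_name])


# Hypothetical reference month (January 2016), for illustration only.
print(months_since("Dec-2011", 2016, 1))
print(months_since("Jan-2015", 2016, 1))
```

A larger value means an older issue date or a longer gap since the last payment, which matches the direction of the risk patterns described above.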
Do Interest Rates Matter?
- Borrowers with high interest rates are more likely to have a bad loan status than those with low interest rates
Attribute Associations to Loan Status
I did some feature selection based on:
- Feature cardinality
  - Features with high cardinality were dropped
- Feature associations
  - Features with a very low association (almost zero) with loan status were dropped
- Multicollinearity & redundant values
  - One (or more) feature from each group of highly correlated features was dropped
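The cardinality criterion could be checked along these lines. This is a simplified sketch, not the notebook's code: the threshold `max_ratio`, the function name, and the toy table are all assumptions made for the example.

```python
def high_cardinality_columns(table: dict, max_ratio: float = 0.5) -> list:
    """Return the columns whose share of distinct values exceeds max_ratio."""
    flagged = []
    for name, values in table.items():
        distinct_ratio = len(set(values)) / len(values)
        if distinct_ratio > max_ratio:
            flagged.append(name)
    return flagged


# Toy table (column name -> values), for illustration only.
table = {
    "id": [1, 2, 3, 4, 5, 6],
    "grade": ["A", "B", "B", "C", "A", "B"],
    "emp_title": ["Nurse", "Teacher", "Manager", "Driver", "Clerk", "Nurse"],
}

to_drop = high_cardinality_columns(table)
print(to_drop)
```

Columns that are nearly unique per row, such as identifiers, carry little generalizable signal, which is the usual reason for dropping them.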
Below are the attribute associations with loan status after feature selection
Data Preprocessing
I applied several data preprocessing steps:
- Imputing missing values
- Removing redundant features
- Reducing feature skewness
- Feature extraction
- Feature transformation (encoding, scaling)
- Oversampling with SMOTE
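To show what the SMOTE step does, here is a simplified pure-Python sketch of its core idea: each synthetic minority sample is a random interpolation between a minority point and one of its nearest minority neighbours. It is not the implementation used in the notebooks (a library such as imbalanced-learn would normally be used), and the function name, parameters, and sample points are hypothetical.

```python
import random


def smote_sketch(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x), by squared distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to the neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(x, neighbour))
        )
    return synthetic


# Hypothetical minority-class points (two scaled features), for illustration only.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority)
print(new_points)
```

Oversampling should be applied to the training split only, so that the validation and test sets keep the original class distribution.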
Model Development
I use XGBoost and LightGBM for model development. Below are the metric scores for the models with default hyperparameters. For simplicity, I only show the accuracy, the F1 score, and the harmonic mean of the two.
| Model | Features | Accuracy | F1 Score | Harmonic Mean |
|---|---|---|---|---|
| XGBoost | Using All Features | 0.950 | 0.809 | 0.874 |
| XGBoost | Using 75% Features | 0.943 | 0.792 | 0.861 |
| XGBoost | Using 50% Features | 0.922 | 0.740 | 0.821 |
| XGBoost | Using 25% Features | 0.905 | 0.700 | 0.789 |
| LightGBM | Using All Features | 0.974 | 0.882 | 0.926 |
| LightGBM | Using 75% Features | 0.974 | 0.883 | 0.926 |
| LightGBM | Using 50% Features | 0.969 | 0.865 | 0.914 |
| LightGBM | Using 25% Features | 0.955 | 0.822 | 0.884 |
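The harmonic-mean column can be reproduced from the accuracy and F1 values. The short check below uses two rows from the default-hyperparameter results; the function name `harmonic_mean` is just for the example.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two positive scores."""
    return 2 * a * b / (a + b)


# (accuracy, F1) pairs taken from the "Using All Features" rows.
xgboost_all = harmonic_mean(0.950, 0.809)
lightgbm_all = harmonic_mean(0.974, 0.882)

print(round(xgboost_all, 3))
print(round(lightgbm_all, 3))
```

The harmonic mean is pulled toward the lower of the two scores, so a model cannot rank well on it through high accuracy alone.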
Overall, the LightGBM model performs better than the XGBoost model. What if we do some tuning for the hyperparameters?
Model Optimization
I use Optuna for hyperparameter tuning. My tuning strategy follows business goals:
- I want to keep false negatives for the risky class low to minimize financial loss, so I have to maximize the recall score
- I also want to keep false positives for the risky class low to minimize lost income, so I have to maximize the precision score as well
- To balance both goals, I optimize the F1 score, which is the harmonic mean of precision and recall
- I use the F1 score of the negative class because I care most about the metrics for bad loan status
- I still track the accuracy score as well, since this metric is easier to interpret
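The metric being optimized can be written out from confusion-matrix counts, with the bad (risky) class as the class of interest. The counts below are invented for illustration and are not results from this project.

```python
def bad_class_scores(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 for the bad class.

    tp: bad loans predicted as bad
    fp: good loans predicted as bad (lost income)
    fn: bad loans predicted as good (financial loss)
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
scores = bad_class_scores(tp=80, fp=10, fn=20)
print({name: round(value, 3) for name, value in scores.items()})
```

With scikit-learn, the same quantity should be available as `f1_score(y_true, y_pred, pos_label=0)`, assuming bad loans are encoded as 0. That encoding is an assumption here, not something this README states.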
| Model | Features | Accuracy | F1 Score | Harmonic Mean |
|---|---|---|---|---|
| XGBoost | Using All Features | 0.971 | 0.875 | 0.921 |
| XGBoost | Using 75% Features | 0.971 | 0.876 | 0.921 |
| XGBoost | Using 50% Features | 0.969 | 0.867 | 0.915 |
| XGBoost | Using 25% Features | 0.955 | 0.826 | 0.886 |
| LightGBM | Using All Features | 0.975 | 0.891 | 0.931 |
| LightGBM | Using 75% Features | 0.975 | 0.890 | 0.931 |
| LightGBM | Using 50% Features | 0.972 | 0.877 | 0.922 |
| LightGBM | Using 25% Features | 0.963 | 0.850 | 0.903 |
Conclusion
Final Model
LightGBM using 75% of the features, which achieves:
- Accuracy: 98%
- F1 Score: 89%
Recommendation and Request
- We should pay more attention to borrowers who meet the criteria below
  - Earlier issue date
  - High interest rate
- Evaluate and adjust the interest rate, for example by setting it based on each borrower’s default risk probability
- Use targeted ads for potential new borrowers based on their needs and occupations
Explainable AI
SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model.
See papers for details and citations.