LendingClub - Loan Credit Risk Prediction
This is my internship project as a data scientist at id/x partners
In this project, I built a predictive model that estimates a borrower’s probability of default for a lending company. The final model reaches about 98% accuracy (0.975).
Project Notebooks

I split the work into separate notebooks because of my limited computing resources.
Dataset & Business Understanding
Dataset Information
- This dataset contains borrower information from a lending company named LendingClub (LC for short), covering 2007 to 2014
- This company has various offerings such as loans, banking, and investments
Attribute Information
- Identifier
  - id: a unique LC-assigned ID for the loan listing
  - member_id: a unique LC-assigned ID for the borrower member
- Target Variable
  - loan_status: current status of the loan, treated here as either good or bad (risky)
- More detailed attribute information can be found here
Company Goals
The goal is to increase profit. Two ways to do that are:
- Accepting applicants who are likely to pay off their loans
- Declining applicants who are unlikely to pay off their loans (potential defaulters)
Problems
- Credit risk is the possibility of a loss resulting from a borrower’s failure to repay a loan or meet contractual obligations (source)
- When a lending company receives a loan application, it has to decide whether to accept or decline it based on the applicant’s profile
- If the applicant is likely to pay off the loan but we don’t approve their application, it may result in a loss of income for the company
- If the applicant is not likely to pay off the loan but we approve their application, it may result in financial loss for the company
Objectives
- Predict whether the borrower will pay off the loan or not
- Understand borrower behavior:
  - What makes a borrower pay off the loan
  - What makes a borrower fail to pay off the loan
Exploratory Data Analysis
What Happened?
The Good status is when the loan status is either Current or Fully Paid, otherwise the status is Bad (risky)

- About 12% of borrowers have a risky loan status
- Technically speaking, this dataset is an imbalanced dataset
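The labeling rule above can be sketched in a few lines. This is a minimal illustration, assuming the raw status is a plain string; the helper name `label_loan` and the sample statuses are made up for the example and are not taken from the notebooks.

```python
from collections import Counter

# Statuses treated as "good", as stated in the text above.
GOOD_STATUSES = {"Current", "Fully Paid"}

def label_loan(status: str) -> str:
    """Return 'good' for Current / Fully Paid loans, otherwise 'bad' (risky)."""
    return "good" if status.strip() in GOOD_STATUSES else "bad"

# Illustrative sample only, not rows from the real dataset.
sample_statuses = ["Current", "Fully Paid", "Charged Off", "Current", "Late (31-120 days)"]

labels = [label_loan(s) for s in sample_statuses]
counts = Counter(labels)
bad_share = counts["bad"] / len(labels)  # share of risky loans in the sample
```

Checking `bad_share` on the full dataset is a quick way to confirm the class imbalance before modeling.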
Who are The Borrowers?

- Many borrowers have the words Manager, Service, Director, Assistant, Sale, Teacher, or Nurse in their employment title
- Many borrowers didn’t write their employment title, so it’s marked as Unknown
Why Did They Apply for a Loan?

- Most borrowers apply for loans for the purpose of debt consolidation
What is Their Grade?

- Most borrowers have grade B or C
Do Grades Matter?

- This feature seems to have a natural order based on the loan status probability
- Grade A has the highest probability of a good loan status
- Grade G has the lowest probability of a good loan status
Loan Credit Risk Probability by Date Features
Issue Date

- The earlier the issue date, the higher the probability that a borrower has a bad loan status
Last Payment Date

- The longer ago the last payment was made, the higher the probability that a borrower has a bad loan status
Do Interest Rates Matter?

- Borrowers with high interest rates have a higher probability of a bad loan status than those with low interest rates
Attribute Associations to Loan Status
I selected features based on:
- Feature cardinality
  - Features with high cardinality were dropped
- Feature associations
  - Features with very low (almost zero) association with loan status were dropped
- Multicollinearity & redundant values
  - One (or more) of each group of highly correlated features was dropped
The attribute associations with loan status after feature selection are shown below
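As a rough illustration of the cardinality criterion, the sketch below flags text columns whose share of distinct values is above a threshold. The function name, the 0.5 threshold, the `note` column, and the sample rows are all assumptions made for this example; the notebooks may use a different rule.

```python
def high_cardinality_columns(rows, max_unique_ratio=0.5):
    """Return text columns whose share of distinct values exceeds the threshold."""
    if not rows:
        return []
    n = len(rows)
    flagged = []
    for col in rows[0]:
        values = [r[col] for r in rows]
        # Only text columns are checked; numeric columns are expected to vary.
        if all(isinstance(v, str) for v in values):
            if len(set(values)) / n > max_unique_ratio:
                flagged.append(col)
    return flagged

# Hypothetical rows: "note" is a made-up free-text column.
rows = [
    {"note": "car repair", "grade": "A", "int_rate": 7.9},
    {"note": "wedding", "grade": "B", "int_rate": 11.5},
    {"note": "pay off cards", "grade": "B", "int_rate": 12.1},
    {"note": "new roof", "grade": "C", "int_rate": 15.3},
    {"note": "medical bill", "grade": "B", "int_rate": 10.2},
    {"note": "moving costs", "grade": "A", "int_rate": 6.8},
]
to_drop = high_cardinality_columns(rows)  # candidates for removal
```

Here every `note` value is distinct, so the column is flagged, while `grade` has only three distinct values across six rows and is kept.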

Data Preprocessing
The preprocessing steps include:
- Imputing missing values
- Removing redundant features
- Reducing feature skewness
- Feature extraction
- Feature transformation (encoding, scaling)
- Oversampling with SMOTE
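The core idea behind SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. The sketch below shows that idea in plain Python; it is a simplified illustration with made-up data and a hypothetical `smote_sketch` helper, not the implementation used in the notebooks (in practice, the `SMOTE` class from imbalanced-learn with `fit_resample` is the usual choice).

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Hypothetical 2-D feature vectors: 6 majority (good) and 3 minority (risky) samples.
majority = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.5, 0.3), (0.4, 0.6), (0.6, 0.5)]
minority = [(2.0, 2.0), (2.5, 2.2), (2.2, 2.8)]

new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

Oversampling is normally applied to the training split only, so that the validation and test scores are measured on the original class distribution.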
Model Development
I use XGBoost and LightGBM for model development. Below are the metric scores for the models with default hyperparameters. For simplicity, I show only the accuracy, the F1 score, and the harmonic mean of accuracy and F1 score.
| Model | Features | Accuracy | F1 Score | Harmonic Mean |
|---|---|---|---|---|
| XGBoost | All features | 0.950 | 0.809 | 0.874 |
| XGBoost | 75% of features | 0.943 | 0.792 | 0.861 |
| XGBoost | 50% of features | 0.922 | 0.740 | 0.821 |
| XGBoost | 25% of features | 0.905 | 0.700 | 0.789 |
| LightGBM | All features | 0.974 | 0.882 | 0.926 |
| LightGBM | 75% of features | 0.974 | 0.883 | 0.926 |
| LightGBM | 50% of features | 0.969 | 0.865 | 0.914 |
| LightGBM | 25% of features | 0.955 | 0.822 | 0.884 |
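The harmonic mean column can be reproduced from the accuracy and F1 columns. A minimal sketch, using the two "all features" rows from the table above as inputs (the function name is chosen for this example):

```python
def harmonic_mean(accuracy: float, f1: float) -> float:
    """Harmonic mean of two scores; returns 0.0 if both are zero."""
    total = accuracy + f1
    return 2 * accuracy * f1 / total if total else 0.0

# Accuracy and F1 values taken from the "all features" rows of the table.
xgb_all = round(harmonic_mean(0.950, 0.809), 3)   # XGBoost
lgbm_all = round(harmonic_mean(0.974, 0.882), 3)  # LightGBM
```

Because the harmonic mean is pulled toward the lower of the two scores, a model cannot score well on it through high accuracy alone.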
Overall, the LightGBM model performs better than the XGBoost model. What if we do some tuning for the hyperparameters?
Model Optimization
I use Optuna for hyperparameter tuning. My tuning strategy follows business goals:
- I want to avoid high false negatives in the risky class to minimize financial loss, therefore I have to maximize the recall score
- However, I also want to avoid high false positives in the risky class to minimize the loss of income, therefore I have to maximize the precision score as well
- To balance both goals, I optimize the F1 score, which is the harmonic mean of precision and recall
- I use the F1 score of the negative class (bad loan status) because that is the class I most want to predict well
- I also keep track of the accuracy score because it is easier to interpret
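To make the metric concrete, the sketch below computes precision, recall, and F1 with the risky class treated as the class of interest. The label encoding (0 = bad, 1 = good), the function name, and the sample predictions are assumptions for illustration; with scikit-learn, `f1_score(y_true, y_pred, pos_label=0)` gives the same F1 under that encoding.

```python
def risky_class_scores(y_true, y_pred, risky=0):
    """Precision, recall, and F1 with the risky label as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == risky and p == risky)  # risky, caught
    fp = sum(1 for t, p in pairs if t != risky and p == risky)  # good, wrongly declined
    fn = sum(1 for t, p in pairs if t == risky and p != risky)  # risky, wrongly approved
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 0 = bad (risky), 1 = good.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
precision, recall, f1 = risky_class_scores(y_true, y_pred)
```

In this sample, one risky loan is approved (a false negative, which costs money) and one good borrower is declined (a false positive, which costs income), so both errors lower the F1 score.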
| Model | Features | Accuracy | F1 Score | Harmonic Mean |
|---|---|---|---|---|
| XGBoost | All features | 0.971 | 0.875 | 0.921 |
| XGBoost | 75% of features | 0.971 | 0.876 | 0.921 |
| XGBoost | 50% of features | 0.969 | 0.867 | 0.915 |
| XGBoost | 25% of features | 0.955 | 0.826 | 0.886 |
| LightGBM | All features | 0.975 | 0.891 | 0.931 |
| LightGBM | 75% of features | 0.975 | 0.890 | 0.931 |
| LightGBM | 50% of features | 0.972 | 0.877 | 0.922 |
| LightGBM | 25% of features | 0.963 | 0.850 | 0.903 |
Conclusion
Final Model
The final model is the tuned LightGBM trained on 75% of the features, which achieves:
- Accuracy: 98% (0.975)
- F1 Score: 89% (0.890)
Recommendation and Request
- We should pay more attention to borrowers who meet the criteria below
  - Earlier issue date
  - High interest rate
- Evaluate and adjust the interest rate, for example by setting it according to each borrower’s default risk probability
- Use targeted ads for potential new borrowers based on their needs and occupations
Explainable AI
SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model.
See papers for details and citations.
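For intuition, Shapley values can be computed exactly for a tiny model by averaging each feature's marginal contribution over all feature orderings. The sketch below does this for a made-up two-feature risk score; the toy function, feature names, and baseline are assumptions for illustration and are unrelated to the project's LightGBM model. The SHAP library uses much faster algorithms (for example `shap.TreeExplainer` for tree models) rather than this brute-force enumeration.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance; absent features take baseline values."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # reveal feature i
            new = predict(current)
            phi[i] += new - prev       # marginal contribution in this ordering
            prev = new
    return [v / len(orders) for v in phi]

# Hypothetical toy risk score with an interaction term (not the project's model).
def toy_risk(features):
    int_rate, is_low_grade = features
    return 0.02 * int_rate + 0.1 * is_low_grade + 0.01 * int_rate * is_low_grade

x = [20.0, 1.0]         # instance to explain
baseline = [10.0, 0.0]  # reference input
phi = shapley_values(toy_risk, x, baseline)
```

The contributions in `phi` add up to the difference between the prediction for `x` and the prediction for the baseline, which is the additivity property that gives SHAP its name.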
