LendingClub - Loan Credit Risk Prediction

This is my internship project as a data scientist at id/x partners

In this project, I built a predictive model that estimates a borrower’s probability of default for a lending company. The final model reaches an accuracy of about 98% (0.975) and an F1 score of 0.89 on the bad-loan class.

Project Notebooks

I split the work into separate notebooks because of my limited computing resources

Dataset & Business Understanding

Dataset Information

This dataset contains borrower information from a lending company named LendingClub (LC for short), covering 2007 to 2014
The company offers products such as loans, banking, and investments

Attribute Information

Identifier
- id - A unique LC assigned ID for the loan listing
- member_id - A unique LC assigned ID for the borrower member
Target Variable
- loan_status - Current status of the loan, used to label the loan as good or bad (risky)
More detailed attribute information can be found here

Company Goals
Increasing profit! But how can we achieve it? Some ways to increase profits are:

Accepting applicants who are likely to pay off their loans
Declining applicants who are unlikely to pay off their loans (potential defaulters)

Problems

Credit risk is the possibility of a loss resulting from a borrower’s failure to repay a loan or meet contractual obligations (source)
When a lending company receives a loan application, it has to decide whether to accept or decline it based on the applicant’s profile
If the applicant is likely to pay off the loan but we don’t approve their application, it may result in a loss of income for the company
If the applicant is not likely to pay off the loan but we approve their application, it may result in financial loss for the company

Objectives

Predict whether the borrower will pay off the loan or not
Understand borrower behavior:
- What makes a borrower pay off the loan
- What makes a borrower fail to pay off the loan

Exploratory Data Analysis

What Happened?

The Good status is when the loan status is either Current or Fully Paid, otherwise the status is Bad (risky)
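The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration and not the notebook code: the function name `label_loan` and the example status strings other than Current and Fully Paid are assumptions made for the example.

```python
# Statuses that count as a good loan, as stated in the rule above
GOOD_STATUSES = {"Current", "Fully Paid"}


def label_loan(loan_status: str) -> str:
    """Return 'good' for Current / Fully Paid loans, otherwise 'bad' (risky)."""
    return "good" if loan_status.strip() in GOOD_STATUSES else "bad"


# Illustrative status values (assumed, not an exhaustive list from the dataset)
statuses = ["Current", "Fully Paid", "Charged Off", "Late (31-120 days)"]
labels = [label_loan(s) for s in statuses]
print(labels)
```

Every status outside the two good ones falls into the bad class, so the rule stays valid if the dataset contains statuses not listed here.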

Target Distribution

About 12% of borrowers have a risky loan status
This makes the dataset imbalanced

Who are The Borrowers?

Employment Title Wordcloud

Many borrowers have the words Manager, Service, Director, Assistant, Sale, Teacher, or Nurse in their employment title
Many borrowers didn’t write their employment title, so it’s marked as Unknown

Why Did They Apply for a Loan?

Loan Purpose Distribution

Most borrowers apply for loans for the purpose of debt consolidation

What is Their Grade?

Grade Distribution

Most borrowers have grade B or C

Do Grades Matter?

Loan Status Probability by Grade

This feature seems to have a natural order based on the loan status probability
Grade A has the highest probability of having a good loan status
Grade G has the lowest probability of having a good loan status
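A per-grade probability like the one plotted here is simply the share of good loans within each grade. The sketch below shows that computation on made-up rows, together with an ordinal encoding that follows the grade order. The helper name `good_rate_by_grade`, the toy rows, and the choice of ordinal encoding are assumptions for illustration and may differ from what the notebooks do.

```python
from collections import defaultdict


def good_rate_by_grade(rows):
    """rows: iterable of (grade, is_good) pairs -> {grade: share of good loans}."""
    counts = defaultdict(lambda: [0, 0])  # grade -> [good count, total count]
    for grade, is_good in rows:
        counts[grade][0] += int(is_good)
        counts[grade][1] += 1
    return {grade: good / total for grade, (good, total) in counts.items()}


# Made-up toy rows, not taken from the dataset
rows = [("A", True), ("A", True), ("A", True), ("A", False),
        ("G", True), ("G", False), ("G", False), ("G", False)]
rates = good_rate_by_grade(rows)
print(rates)

# Ordinal encoding that preserves the natural order A < B < ... < G
GRADE_ORDER = {grade: rank for rank, grade in enumerate("ABCDEFG")}
print(GRADE_ORDER["A"], GRADE_ORDER["G"])
```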

Loan Credit Risk Probability by Date Features

Issue Date

Loan Status Probability by Issue Date

The earlier the issue date, the higher the probability that the loan has a bad status

Last Payment Date

Loan Status Probability by Last Payment Date

The longer ago the last payment was made, the higher the probability that the loan has a bad status

Do Interest Rates Matter?

Loan Status Probability by Interest Rate

Borrowers with high interest rates have a higher probability of having a bad loan status than those with low interest rates

Attribute Associations to Loan Status

I did some feature selection based on:

Feature cardinality: features with high cardinality were dropped
Feature associations: features with very low association (almost zero) to loan status were dropped
Multicollinearity & redundant values: one (or more) of each group of highly correlated features was dropped
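To make the cardinality criterion concrete, here is a minimal sketch that flags columns whose share of distinct values is high. The helper name, the `max_unique_ratio` threshold of 0.5, and the toy records are all assumptions for illustration. The actual threshold used in the notebooks is not stated here.

```python
def high_cardinality_columns(records, max_unique_ratio=0.5):
    """Return column names whose share of distinct values exceeds the ratio.

    records: list of dicts that all share the same keys.
    max_unique_ratio: hypothetical threshold, chosen only for this example.
    """
    if not records:
        return []
    n_rows = len(records)
    flagged = []
    for column in records[0]:
        distinct = {row[column] for row in records}
        if len(distinct) / n_rows > max_unique_ratio:
            flagged.append(column)
    return flagged


# Toy records: 'id' is unique per row, the other columns repeat
records = [
    {"id": 1, "grade": "A", "purpose": "car"},
    {"id": 2, "grade": "B", "purpose": "car"},
    {"id": 3, "grade": "A", "purpose": "debt_consolidation"},
    {"id": 4, "grade": "B", "purpose": "debt_consolidation"},
    {"id": 5, "grade": "A", "purpose": "car"},
    {"id": 6, "grade": "B", "purpose": "car"},
]
print(high_cardinality_columns(records))
```

Identifier-like columns such as id are the typical case: every row has its own value, so the column cannot generalize to new borrowers.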

Below are the attribute associations to loan status after feature selection

Attribute Associations to Loan Status

Data Preprocessing

I applied the following preprocessing steps:

Imputing missing values
Removing redundant features
Reducing feature skewness
Feature extraction
Feature transformation (encoding, scaling)
Oversampling with SMOTE
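The idea behind SMOTE is to create synthetic minority samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below is a simplified, pure-Python illustration of that idea, not the imbalanced-learn implementation and not the notebook code. The function name `smote_like`, the value of `k`, and the toy points are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared Euclidean distance to base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority-class points with two numeric features (made up)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=3)
print(new_points)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies. Oversampling is usually applied to the training split only, so that the evaluation data keeps the original class balance.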

Model Development

I use XGBoost and LightGBM for model development. Below are the metric scores for the models with default hyperparameters. For simplicity, I only show the accuracy, the F1 score, and the harmonic mean of the two.

Model	Features	Accuracy	F1 Score	Harmonic Mean
XGBoost	Using All Features	0.950	0.809	0.874
XGBoost	Using 75% Features	0.943	0.792	0.861
XGBoost	Using 50% Features	0.922	0.740	0.821
XGBoost	Using 25% Features	0.905	0.700	0.789
LightGBM	Using All Features	0.974	0.882	0.926
LightGBM	Using 75% Features	0.974	0.883	0.926
LightGBM	Using 50% Features	0.969	0.865	0.914
LightGBM	Using 25% Features	0.955	0.822	0.884
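The harmonic mean column can be reproduced from the other two columns. The short sketch below uses the standard harmonic mean of two numbers and checks it against the rows that use all features. The function name is an assumption for the example.

```python
def harmonic_mean(accuracy: float, f1: float) -> float:
    """Harmonic mean of two scores: 2ab / (a + b)."""
    if accuracy + f1 == 0:
        return 0.0
    return 2 * accuracy * f1 / (accuracy + f1)


# Rows taken from the table above (all features)
print(round(harmonic_mean(0.950, 0.809), 3))  # XGBoost  -> 0.874
print(round(harmonic_mean(0.974, 0.882), 3))  # LightGBM -> 0.926
```

The harmonic mean is pulled toward the lower of the two scores, so a model cannot look good on this summary by having high accuracy alone.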

Overall, the LightGBM model performs better than the XGBoost model. What if we do some tuning for the hyperparameters?

Model Optimization

I use Optuna for hyperparameter tuning. My tuning strategy follows business goals:

I want to keep false negatives in the risky class low to minimize financial loss, so I have to maximize recall
I also want to keep false positives in the risky class low to minimize the loss of income, so I have to maximize precision as well
To balance both, I optimize the F1 score, which is the harmonic mean of precision and recall
I use the F1 score of the negative class (bad loan status), because that is the class I care most about getting right
I still keep an eye on the accuracy score, since this metric is easier to interpret
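The reasoning above can be illustrated with a small worked example. The sketch below computes the F1 score for one chosen class and compares a trivial model with a more useful one on a toy sample that has 12% bad loans, the same share as in the dataset. The helper names, the label strings, and the toy predictions are assumptions made for illustration.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score where `positive` is the class of interest (here: 'bad')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy sample: 88 good loans followed by 12 bad loans
y_true = ["good"] * 88 + ["bad"] * 12

# Trivial model: predicts 'good' for everyone
always_good = ["good"] * 100

# Toy model: 3 false alarms on good loans, catches 9 of the 12 bad loans
y_pred = ["good"] * 85 + ["bad"] * 3 + ["bad"] * 9 + ["good"] * 3

print(accuracy(y_true, always_good), f1_for_class(y_true, always_good, "bad"))
print(accuracy(y_true, y_pred), f1_for_class(y_true, y_pred, "bad"))
```

The trivial model reaches 0.88 accuracy but an F1 of 0 on the bad class, which is why accuracy alone is a weak tuning target on imbalanced data. In scikit-learn, the same per-class score is available through `f1_score` with its `pos_label` argument.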

Model	Features	Accuracy	F1 Score	Harmonic Mean
XGBoost	Using All Features	0.971	0.875	0.921
XGBoost	Using 75% Features	0.971	0.876	0.921
XGBoost	Using 50% Features	0.969	0.867	0.915
XGBoost	Using 25% Features	0.955	0.826	0.886
LightGBM	Using All Features	0.975	0.891	0.931
LightGBM	Using 75% Features	0.975	0.890	0.931
LightGBM	Using 50% Features	0.972	0.877	0.922
LightGBM	Using 25% Features	0.963	0.850	0.903

Conclusion

Final Model
The final model is LightGBM using 75% of the features, which achieves:

Accuracy: 98%
F1 Score: 89%

Recommendation and Request

We should pay more attention to borrowers who meet the criteria below
- Earlier issue date
- High interest rate
Evaluate and adjust the interest rate, for example based on each borrower’s default risk probability
Use targeted ads for potential new borrowers based on their needs and occupations

Explainable AI

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model.

See papers for details and citations.
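To give a feel for the game-theoretic idea, the sketch below computes exact Shapley values for a tiny toy scoring function by averaging each feature's marginal contribution over all feature orderings. This is only an illustration of the underlying concept: the SHAP library relies on much faster model-specific and approximate algorithms, and the toy function, its numbers, and the feature names `int_rate`, `grade`, and `issue_d` are hypothetical.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            totals[feature] += after - before
    return {feature: total / len(orderings) for feature, total in totals.items()}


def toy_risk(known):
    """Hypothetical risk score given the set of features that are 'known'."""
    score = 0.10  # baseline risk when nothing is known
    if "int_rate" in known:
        score += 0.20
    if "grade" in known:
        score += 0.10
    if "int_rate" in known and "grade" in known:
        score += 0.06  # interaction, shared equally between the two features
    return score


features = ["int_rate", "grade", "issue_d"]
phi = shapley_values(toy_risk, features)
print(phi)

# The contributions add up to the gap between full and baseline score
gap = toy_risk(frozenset(features)) - toy_risk(frozenset())
print(round(sum(phi.values()), 6), round(gap, 6))
```

The additive property shown in the last lines is what lets a SHAP plot read as a path from the baseline prediction to the final prediction, one feature contribution at a time.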

SHAP Multiple Decision Plot