Adhang Muntaha


Welcome to my GitHub Page!

This is my data science and machine learning portfolio. It contains some of the projects I have used to hone my knowledge and skills.

LendingClub - Loan Credit Risk Prediction

This is my internship project as a data scientist at id/x partners.

In this project, I designed a predictive model to estimate a borrower's risk of default for a lending company and achieved a 98% accuracy score.

Project Notebooks


I created separate notebooks because of my limited computing resources.

Dataset & Business Understanding

Dataset Information

Attribute Information

Company Goals
Increasing profit! But how can we achieve it? There are several ways to increase profit.

Problems

Objectives

Exploratory Data Analysis

What Happened?

A loan's status is labeled Good when it is either Current or Fully Paid; otherwise it is labeled Bad (risky).
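As a minimal sketch of this labeling (assuming the raw data is loaded into a pandas DataFrame with a `loan_status` column; the file name and column names are assumptions):

```python
import pandas as pd

# Hypothetical path/column names for illustration
df = pd.read_csv("loan.csv")

# A loan is Good if its status is Current or Fully Paid; otherwise Bad (risky)
good_statuses = ["Current", "Fully Paid"]
df["target"] = df["loan_status"].isin(good_statuses).map({True: "Good", False: "Bad"})

print(df["target"].value_counts(normalize=True))
```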

Target Distribution

Who Are the Borrowers?

Employment Title Wordcloud

Why Did They Apply for a Loan?

Loan Purpose Distribution

What is Their Grade?

Grade Distribution

Do Grades Matter?

Loan Status Probability by Grade

Loan Credit Risk Probability by Date Features

Issue Date

Loan Status Probability by Issue Date

Last Payment Date

Loan Status Probability by Last Payment Date

Do Interest Rates Matter?

Loan Status Probability by Interest Rate

Attribute Associations to Loan Status

I did some feature selection based on several criteria; the details are in the notebooks.

Below are the attribute associations to loan status after feature selection.

Attribute Associations to Loan Status
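One common way to measure the association between a categorical attribute and the Good/Bad target is Cramér's V. The sketch below illustrates that approach; it is an assumption, not necessarily the exact measure used in the notebooks:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical variables (0 = none, 1 = perfect)."""
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, k = confusion.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

# e.g. how strongly grade is associated with the Good/Bad target
print(cramers_v(df["grade"], df["target"]))
```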

Data Preprocessing

I performed several data preprocessing steps, detailed in the notebooks; a sketch of a typical pipeline is shown below.
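This is a hedged sketch of steps commonly used with tree-based models on this kind of tabular data; the column names and specific choices are assumptions:

```python
from sklearn.model_selection import train_test_split

y = (df["target"] == "Bad").astype(int)            # 1 = risky loan
X = df.drop(columns=["target", "loan_status"])

# Ordinal-encode categorical columns (sufficient for tree-based models)
for col in X.select_dtypes(include="object").columns:
    X[col] = X[col].astype("category").cat.codes

# Impute remaining missing values with the column median
X = X.fillna(X.median(numeric_only=True))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```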

Model Development

I use XGBoost and LightGBM for model development. Below are the metric scores for each model with default hyperparameters. For simplicity, I show only the accuracy, the F1 score, and the harmonic mean of accuracy and F1 score.

| Model    | Features     | Accuracy | F1 Score | Harmonic Mean |
|----------|--------------|----------|----------|---------------|
| XGBoost  | All features | 0.950    | 0.809    | 0.874         |
| XGBoost  | 75% features | 0.943    | 0.792    | 0.861         |
| XGBoost  | 50% features | 0.922    | 0.740    | 0.821         |
| XGBoost  | 25% features | 0.905    | 0.700    | 0.789         |
| LightGBM | All features | 0.974    | 0.882    | 0.926         |
| LightGBM | 75% features | 0.974    | 0.883    | 0.926         |
| LightGBM | 50% features | 0.969    | 0.865    | 0.914         |
| LightGBM | 25% features | 0.955    | 0.822    | 0.884         |
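The harmonic mean column is 2·(accuracy·F1)/(accuracy+F1); for example, the first LightGBM row gives 2·0.974·0.882/(0.974+0.882) ≈ 0.926. A minimal sketch of training and scoring one default model (the train/test split from the preprocessing sketch is assumed):

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, f1_score

def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two scores, as reported in the tables."""
    return 2 * a * b / (a + b)

model = LGBMClassifier(random_state=42)   # default hyperparameters
model.fit(X_train, y_train)

pred = model.predict(X_test)
acc, f1 = accuracy_score(y_test, pred), f1_score(y_test, pred)
print(f"accuracy={acc:.3f}  f1={f1:.3f}  harmonic mean={harmonic_mean(acc, f1):.3f}")
```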



Overall, the LightGBM model performs better than the XGBoost model. What if we tune the hyperparameters?

Model Optimization

I use Optuna for hyperparameter tuning. My tuning strategy follows the business goals described above.
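A minimal Optuna sketch for the LightGBM model follows. The search space and the harmonic-mean objective are assumptions for illustration; the actual strategy in the notebooks may differ:

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, f1_score

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space; not necessarily the one used in the notebooks
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }
    model = LGBMClassifier(random_state=42, **params).fit(X_train, y_train)
    pred = model.predict(X_test)
    acc, f1 = accuracy_score(y_test, pred), f1_score(y_test, pred)
    return 2 * acc * f1 / (acc + f1)   # maximize the harmonic mean

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```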



| Model    | Features     | Accuracy | F1 Score | Harmonic Mean |
|----------|--------------|----------|----------|---------------|
| XGBoost  | All features | 0.971    | 0.875    | 0.921         |
| XGBoost  | 75% features | 0.971    | 0.876    | 0.921         |
| XGBoost  | 50% features | 0.969    | 0.867    | 0.915         |
| XGBoost  | 25% features | 0.955    | 0.826    | 0.886         |
| LightGBM | All features | 0.975    | 0.891    | 0.931         |
| LightGBM | 75% features | 0.975    | 0.890    | 0.931         |
| LightGBM | 50% features | 0.972    | 0.877    | 0.922         |
| LightGBM | 25% features | 0.963    | 0.850    | 0.903         |



Conclusion

Final Model
The final model is LightGBM using 75% of the features, which achieves 0.975 accuracy, 0.890 F1 score, and a 0.931 harmonic mean (see the table above).

Recommendation and Request

Explainable AI

SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model.

See papers for details and citations.
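A minimal sketch of producing a decision plot like the one below with the `shap` package, assuming `model` is the fitted LightGBM classifier from earlier:

```python
import shap

explainer = shap.TreeExplainer(model)      # works for XGBoost and LightGBM
shap_values = explainer.shap_values(X_test)
expected = explainer.expected_value

# Some binary classifiers return one array per class; keep the risky class
if isinstance(shap_values, list):
    shap_values, expected = shap_values[1], expected[1]

# Decision plot over the first 20 test samples
shap.decision_plot(expected, shap_values[:20], X_test.iloc[:20])
```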

SHAP Multiple Decision Plot