Introduction to the Dataset for the Course
Explore the loan approval dataset designed for binary classification to predict customer eligibility for home loans. Learn about the dataset features and how this real-world data will be used throughout the course to develop machine learning models and apply hyperparameter optimization techniques.
Problem statement
A company called Dream Housing Finance offers a wide variety of home loans and operates across urban, semi-urban, and rural areas of the country. A customer first submits a home loan application. The company then cross-checks the information provided in the application and verifies the customer’s eligibility for the loan.
The company wants to be able to automatically determine, in real-time, if a customer is eligible for the loan they’ve applied for based on the information they provide in their online loan applications.
To automate this process, the company has provided a dataset that can be used to identify the customer segments that are eligible for a loan, so that it can specifically target these customers.
The loan approval dataset
In this course, we’ll utilize the loan dataset, which is a binary classification dataset consisting of loan details and the status of different customers. The aim is to develop an ML model that predicts if a customer’s request for a loan can be approved or not.
Binary classification is a type of supervised learning in ML where the goal is to classify input data into one of two possible categories. The categories are typically represented as:
0 and 1
True and false
Positive and negative
Note: The two categories can also be represented in other ways, for example, as Y and N in the Loan_Status column of this dataset.
The ML algorithm is trained on a labeled dataset, where each data point is linked to the correct category label. The objective of the ML algorithm is to learn a decision boundary that separates the two classes. Once the ML model is trained, it can be used to predict the category of new, unseen data points.
Binary classification has many applications, such as spam detection, fraud detection, and medical diagnosis.
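The train-then-predict workflow described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the model built in this course: the toy data is made up, and `fit_threshold` and `predict` are hypothetical helper names. The "decision boundary" here is a single threshold on one numeric feature.

```python
# Toy labeled data (made up for illustration): one numeric feature per
# data point, each linked to a category label of 0 or 1.
train_x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
train_y = [0, 0, 0, 1, 1, 1]


def fit_threshold(xs, ys):
    """Learn a 1-D decision boundary: predict 1 when x > threshold."""
    points = sorted(set(xs))
    # Candidate boundaries are the midpoints between neighboring values.
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]

    def accuracy(threshold):
        correct = sum((x > threshold) == bool(y) for x, y in zip(xs, ys))
        return correct / len(xs)

    # Keep the candidate that classifies the training data best.
    return max(candidates, key=accuracy)


def predict(threshold, x):
    """Classify a new, unseen data point with the learned boundary."""
    return 1 if x > threshold else 0


threshold = fit_threshold(train_x, train_y)
new_points = [2.5, 9.0]
predictions = [predict(threshold, x) for x in new_points]
print(threshold, predictions)  # 4.5 [0, 1]
```

Real ML models learn far more flexible boundaries over many features, but the pattern of fitting on labeled data and then predicting on unseen data is the same.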
Here are a few sample rows of the loan approval dataset:
Loan Approval Dataset
Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
LP001002 | Male | No | 0 | Graduate | No | 5849 | 0 | 267 | 360 | 1 | Urban | Y |
LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508 | 128 | 360 | 1 | Rural | N |
LP001013 | Female | No | 0 | Graduate | No | 3510 | 0 | 76 | 360 | 0 | Urban | N |
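To show how rows like these might be handled in code, the sketch below parses the three sample rows with Python's standard `csv` module. It assumes the data is stored as comma-separated text with a header row, which the lesson does not state, and it spells the co-applicant column `CoapplicantIncome`, as in the feature list. The summed income is just an arithmetic illustration.

```python
import csv
import io

# The three sample rows from the table above, written as CSV text.
# Assumption: the real file is comma-separated with this header row.
SAMPLE_CSV = """\
Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,5849,0,267,360,1,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N
LP001013,Female,No,0,Graduate,No,3510,0,76,360,0,Urban,N
"""

# Each row becomes a dict keyed by column name.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# csv reads every value as a string, so numeric columns need converting.
NUMERIC_COLUMNS = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term"]
for row in rows:
    for column in NUMERIC_COLUMNS:
        row[column] = float(row[column])

# Applicant plus co-applicant income for each sample row.
total_income = [row["ApplicantIncome"] + row["CoapplicantIncome"] for row in rows]
print(len(rows), len(rows[0]), total_income)  # 3 13 [5849.0, 6091.0, 3510.0]
```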
Features of the dataset
The dataset has the following columns:
Loan_ID: Unique loan ID
Gender: Male/Female
Married: Applicant married (Y/N)
Dependents: Number of dependents
Education: Applicant’s education (Graduate/Undergraduate)
Self_Employed: Self-employed (Y/N)
ApplicantIncome: Applicant income
CoapplicantIncome: Coapplicant income
LoanAmount: Loan amount in thousands
Loan_Amount_Term: Term of the loan in months
Credit_History: Credit history meets guidelines
Property_Area: Urban/Semi urban/Rural
Loan_Status: Loan approved (Y/N)
The loan status column has two classes: Y or N.
Y: “Yes,” the loan is approved.
N: “No,” the loan is not approved.
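Most ML libraries expect numeric targets, so the Y/N labels are usually converted to numbers before training. The sketch below shows one common convention, assumed here rather than taken from the lesson: Y maps to 1 and N maps to 0. `LABEL_MAP` and `encode_loan_status` are hypothetical names used only for illustration.

```python
# Assumed convention: 1 = loan approved (Y), 0 = loan not approved (N).
LABEL_MAP = {"Y": 1, "N": 0}


def encode_loan_status(values):
    """Convert Loan_Status strings to 1 (approved) or 0 (not approved)."""
    # An unexpected label raises KeyError instead of being silently encoded.
    return [LABEL_MAP[value] for value in values]


statuses = ["Y", "N", "N"]  # Loan_Status of the three sample rows
encoded = encode_loan_status(statuses)
print(encoded)  # [1, 0, 0]
```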
Throughout this course, we’ll use this dataset to develop an ML model that classifies whether a customer’s loan request can be approved or not. We’ll also apply different hyperparameter optimization techniques to improve the performance of the ML model.
Note: This course uses only the “labeledTrainData” dataset, which contains 614 customer loan records and the 13 columns presented above.