...

Solution: Classifier Model Feature Engineering

Understand how to add new features to help predict loan status.

The magic unfolds!

As we discussed in the previous lessons, feature engineering is crucial for building robust, effective, and high-performing machine learning models. In the classifier challenge, our task was to beat the AUC threshold of 0.9085 on the unseen test dataset by performing feature engineering on the input features.

Click the “Run” button below to see the impact of a new set of features (lines 14–22) that significantly enhances the model’s performance and beats the required threshold.

import h2o
import pandas as pd
from h2o.estimators import H2OGradientBoostingEstimator as H2OGBM
from sklearn.metrics import auc
import numpy as np
h2o.init(port=8080)
filepath = "Data/Classification/"
data = h2o.import_file(filepath+'Lending_Club_Loans.csv')
# Converting the H2O Frame to a Pandas DataFrame
data = data.as_data_frame()
# Interaction feature "PaymentToIncome" based on domain knowledge:
# annualized installment (monthly payment x 12) relative to annual income
data['PaymentToIncome'] = (data['installment']*12.0)/data['annual_inc']
# New features: transforming a few continuous variables into quartile-based
# categorical ones (see the standalone qcut sketch at the end of the lesson)
data['dti_cat'] = pd.qcut(data['dti'], q=[0, .25, .5, .75, 1.], labels=['Low','Moderate','Medium','High'])
data['int_rate_cat'] = pd.qcut(data['int_rate'], q=[0, .25, .5, .75, 1.], labels=['Low','Moderate','Medium','High'])
# Changing the existing revol_util feature from continuous to categorical
data['revol_util'] = pd.qcut(data['revol_util'], q=[0, .25, .5, .75, 1.], labels=['Low','Moderate','Medium','High'])
data = h2o.H2OFrame(data)
# Checking the set of input features once again (new columns included)
X = list(set(data.names) - set(["loan_status"]))  # removing the target from the column list
y = "loan_status"
splits = data.split_frame(ratios = [0.70, 0.15], seed = 1)
train = splits[0]
valid = splits[1]
test = splits[2]
# Set up the H2O GBM parameters
gbm = H2OGBM(max_runtime_secs=40, nfolds=0, seed=42,
             stopping_rounds=5,
             ntrees=1000,
             max_depth=5,
             sample_rate=0.64,
             col_sample_rate=0.67,
             col_sample_rate_per_tree=0.71,
             col_sample_rate_change_per_level=1.01,
             score_tree_interval=5,
             stopping_metric='AUC')
# Train the model
gbm.train(x=X, y=y, training_frame=train, validation_frame=valid)
print(gbm.model_performance(test_data=test))
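
The printed report includes the test AUC among other metrics. If you want to check the threshold programmatically instead of reading it off the report, a minimal sketch along the following lines works with H2O's binomial metrics object (the 0.9085 cutoff is the challenge threshold quoted above):

perf = gbm.model_performance(test_data=test)  # metrics on the held-out test split
test_auc = perf.auc()  # scalar AUC from the binomial metrics object
print(f"Test AUC: {test_auc:.4f}")
print("Threshold beaten!" if test_auc > 0.9085 else "Still below 0.9085; keep iterating on features.")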

Let’s understand the whole implementation step by step:

  • Importing libraries: We start by importing the necessary libraries in lines 1–5 ...
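
To make the quartile-binning step concrete, here is a minimal, self-contained pd.qcut sketch on toy data (the values below are made up and only stand in for the real dti column):

import pandas as pd

# Toy debt-to-income values standing in for the real 'dti' column
dti = pd.Series([3.2, 7.8, 12.5, 18.1, 24.9, 31.0, 9.4, 15.6])
# Same call pattern as the solution: quartile edges with four ordered labels
dti_cat = pd.qcut(dti, q=[0, .25, .5, .75, 1.], labels=['Low', 'Moderate', 'Medium', 'High'])
print(dti_cat.value_counts())  # each bucket holds roughly a quarter of the rows

Because qcut bins by quantiles rather than fixed edges, each category stays roughly balanced even when the underlying distribution is skewed, which is why it suits features such as dti and int_rate.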