Regression Model and Prediction

Learn how to build and tune parameters of regression models using H2O.

H2OXGBoost is an implementation of the XGBoost algorithm within the H2O framework. It is based on tree-based gradient boosting, which combines an ensemble of weak learners into a final model. H2OXGBoost is well known for its high performance and scalability across a wide range of machine learning tasks.

Regression model: H2OXGBoostEstimator

In this lesson, we’re going to train the H2OXGBoostEstimator regression model to predict flight fares from the airline dataset. The H2OXGBoostEstimator algorithm uses gradient boosting to train ensemble models. Gradient boosting is a powerful machine learning technique that combines an ensemble of weak prediction models, typically decision trees, into a single, more accurate model.
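
Conceptually, boosting builds the model in stages: each new weak learner h_m is fit to the errors of the current ensemble F_{m-1}, and its contribution is scaled by a learning rate ν (shrinkage) before being added:

F_m(x) = F_{m-1}(x) + ν · h_m(x)

After many stages, the sum of many shallow trees can capture patterns that no single tree models well on its own.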

H2OXGBoostEstimator is based on the XGBoost library, which is a popular implementation of gradient boosting. The key advantage of H2OXGBoostEstimator is that it can handle a large number of features and data points, making it well suited for big data problems. It also supports parallel processing, which allows it to scale up to large datasets. Some of its important parameters include:

  • ntrees: Controls the number of trees. Increasing the number of trees can improve the model’s performance but may also lead to overfitting.

  • max_depth: Controls the maximum depth of each tree. Increasing this parameter can make the model more complex and fit the training data better, but it also increases the risk of overfitting.

  • min_rows: Sets the minimum number of observations required in a terminal node of a tree. Setting this parameter higher can reduce the risk of overfitting, but it may also reduce the model’s ability to capture complex patterns in the data.

  • learn_rate: Controls the learning rate (also known as shrinkage or step size), which scales the contribution of each new tree to the ensemble. A higher learning rate can help the model converge in fewer trees but may also overshoot the optimal solution.

  • subsample: Controls the fraction of observations used for each tree. Setting this parameter lower can reduce overfitting, but it may also reduce the model’s ability to capture complex patterns in the data.

  • col_sample_rate_per_tree: Controls the fraction of features used for each tree. Setting this parameter lower can reduce overfitting, but it may also reduce the model’s ability to capture complex patterns in the data.

  • reg_lambda: Controls the strength of the L2 regularization term in the objective function (sketched after this list). Increasing this parameter can reduce overfitting, but it may also make the model less able to capture complex patterns in the data.
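
For reference, reg_lambda corresponds to the λ coefficient in XGBoost’s regularized training objective, a standard formulation from the XGBoost paper. For each tree f_k, a complexity penalty Ω is added to the training loss, where T is the tree’s number of leaves, w its vector of leaf weights, and γ a separate per-leaf penalty:

Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k),   where   Ω(f) = γ·T + (λ/2)·‖w‖²

A larger λ shrinks leaf weights toward zero, smoothing the model’s predictions and reducing overfitting.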

In our previous lesson, we discovered that H2O AutoML selected the H2OXGBoostEstimator as the winning model for predicting flight fares on our airline dataset. We’ll now focus on tuning the H2OXGBoostEstimator model to further improve its performance.

Let’s define the parameters that we’ll use in our H2OXGBoost model:
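
The lesson’s actual values aren’t reproduced here, so the following is a minimal sketch of how these parameters might be defined with H2O’s Python API. The hyperparameter values, the train frame, and the fare response column are all placeholders, not the lesson’s real settings:

```python
import h2o
from h2o.estimators import H2OXGBoostEstimator

h2o.init()  # start or connect to an H2O cluster

# Placeholder hyperparameters covering the knobs discussed above;
# real values for the airline dataset would come from tuning.
xgb_params = {
    "ntrees": 200,                    # number of boosted trees
    "max_depth": 6,                   # maximum depth of each tree
    "min_rows": 10,                   # minimum observations per leaf
    "learn_rate": 0.1,                # shrinkage applied to each new tree
    "subsample": 0.8,                 # fraction of rows sampled per tree
    "col_sample_rate_per_tree": 0.8,  # fraction of columns sampled per tree
    "reg_lambda": 1.0,                # L2 regularization strength
    "seed": 42,                       # for reproducibility
}

xgb_model = H2OXGBoostEstimator(**xgb_params)

# Hypothetical training call: 'train' and 'fare' stand in for the
# lesson's actual H2OFrame and response column.
# xgb_model.train(y="fare", training_frame=train)
```

Keeping the parameters in a dictionary makes it easy to adjust one knob at a time and compare runs when tuning.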
