ML algorithms 1.09: Gradient Boosting Machines

Anupam Misra · Published in Geek Culture · May 24, 2021


Source: Shapelined

Introduction

Gradient Boosting Machines (GBMs) are a boosting ensemble technique. Boosting algorithms tend to perform well because both variance and bias can be controlled through careful hyperparameter tuning. GBMs use shallow decision trees rather than the stumps used in AdaBoost, and they sit between AdaBoost and XGBoost in prediction/estimation performance: the advantage of GBM is better performance than AdaBoost with simpler mechanics than XGBoost.

Model

Let X be the feature matrix with m samples and n features, and let y be the response (continuous for regression, a class label for classification).

The model and its parameters (the trees and their leaf values) can be written as follows.

For regression, the prediction after T trees is the additive ensemble

F_T(x) = F_0(x) + \nu \sum_{t=1}^{T} h_t(x)

where F_0(x) is the initial (root-only) prediction, h_t is the t-th shallow tree and \nu is the learning rate.

For classification, one such score F_k(x) is maintained for each class k = 1, \dots, K.

Here the probability of class k predicted through the softmax function is given by:

p_k(x) = \frac{\exp(F_k(x))}{\sum_{l=1}^{K} \exp(F_l(x))}

In GBM classification we therefore build K parallel trees per boosting round, one for each of the K classes.
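
As a quick illustration, here is a minimal NumPy sketch of the softmax above (the function name and sample scores are my own):

import numpy as np

def softmax(scores):
    """Turn per-class raw scores F_k(x) into class probabilities p_k(x)."""
    # Subtract the max score before exponentiating for numerical stability.
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

# Raw scores for one sample over K = 3 classes -> probabilities summing to 1
print(softmax(np.array([0.5, 1.2, -0.3])))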

When we create the first tree, it has only the root node: for regression it predicts the mean of all responses, and for classification the initial probability of class k is simply the fraction of samples belonging to class k. Each subsequent tree t is fit to the residuals. In classification the residuals are calculated the same way as in regression trees because, in effect, we are building a regression tree for each class k to match the probability of class k. The residual is calculated as:

r_i^{(t)} = y_i - F_{t-1}(x_i) for regression, and r_{ik}^{(t)} = y_{ik} - p_k^{(t-1)}(x_i) for classification,

where y_{ik} is 1 if sample i belongs to class k and 0 otherwise.
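
A minimal NumPy sketch of this initialization and of the first residuals (all variable names and values here are illustrative, not from the article):

import numpy as np

# Regression: tree 0 is just the root node, predicting the mean response.
y_reg = np.array([3.0, 5.0, 7.0, 9.0])
F0 = np.full_like(y_reg, y_reg.mean())       # initial prediction F_0
residual_reg = y_reg - F0                    # residuals that tree 1 will be fit to

# Classification: the initial probability of class k is its fraction in the data.
y_cls = np.array([0, 1, 1, 2, 2, 2])         # labels for K = 3 classes
K = 3
y_onehot = np.eye(K)[y_cls]                  # one-hot targets y_ik, shape (m, K)
p0 = y_onehot.mean(axis=0)                   # class fractions, shape (K,)
residual_cls = y_onehot - p0                 # per-class residuals y_ik - p_k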

We fit tree t to the residuals using MSE or some other splitting criterion, for both regression and classification. The splits differ from those of the previous tree because the tree is fit to the residuals, which shrink as the predictions approach the actual y with each tree constructed. After that we have to find, for each leaf, the value \gamma that minimizes the loss function L:

\gamma_{jt} = \arg\min_{\gamma} \sum_{x_i \in R_{jt}} L(y_i, F_{t-1}(x_i) + \gamma)

where R_{jt} denotes leaf j of tree t.
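
A sketch of one boosting step for regression with scikit-learn's DecisionTreeRegressor (the depth, learning rate and toy data are illustrative; for squared-error loss the mean residual in each leaf, which the fitted tree already returns, is the optimal \gamma):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 4))                               # m = 100 samples, n = 4 features
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(100)

F = np.full(len(y), y.mean())                          # current prediction F_{t-1}
residual = y - F                                       # pseudo-residuals

tree = DecisionTreeRegressor(max_depth=3)              # shallow boosted tree
tree.fit(X, residual)                                  # splits chosen by MSE on the residuals

learning_rate = 0.1
F = F + learning_rate * tree.predict(X)                # updated prediction F_t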

Boosted trees are shallow. Each observation, depending on its n feature values, lands in a particular leaf of tree t. In total m observations (after bootstrapping, if row subsampling is used) reach the leaves, so generally more than one observation falls into each leaf. We need to find the optimum value for each leaf, tree by tree, in a sequential manner.

For regression we need the gradient of the loss function L. The loss function for regression is the squared error (MSE):

L(y_i, F(x_i)) = \frac{1}{2} (y_i - F(x_i))^2

whose negative gradient with respect to F(x_i) is exactly the residual y_i - F(x_i).

For classification we need the gradient of the loss function L for class k. The loss function for the GBM classification problem is the multinomial deviance:

L(y_i, F(x_i)) = -\sum_{k=1}^{K} y_{ik} \log p_k(x_i)

whose negative gradient with respect to F_k(x_i) is y_{ik} - p_k(x_i), i.e. the residual defined above.
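
A small sketch of both losses and of the pseudo-residuals (negative gradients) they produce, assuming one-hot labels; the function names are mine:

import numpy as np

def mse_loss(y, F):
    """Squared-error loss; its negative gradient w.r.t. F(x_i) is y_i - F(x_i)."""
    return 0.5 * np.mean((y - F) ** 2)

def multinomial_deviance(y_onehot, proba):
    """Multinomial deviance (cross-entropy) over K classes."""
    eps = 1e-12                              # guard against log(0)
    return -np.mean(np.sum(y_onehot * np.log(proba + eps), axis=1))

def pseudo_residuals(y_onehot, proba):
    """Negative gradient of the deviance w.r.t. F_k(x_i): y_ik - p_k(x_i)."""
    return y_onehot - proba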

Roughly speaking, at tree t for class k, each leaf holds a value \gamma that nudges the model's score, and hence its softmax probability, towards or away from class k. By solving the above equation we get the optimum \gamma for each leaf. With the construction of each tree we move in the direction of the negative gradient of the loss function L, approaching the actual y.
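
For reference, the closed-form leaf value commonly used in the multinomial-deviance case is Friedman's one-step Newton approximation; a sketch (the function name is mine, and this formula is not spelled out in the article):

import numpy as np

def leaf_gamma_multiclass(residuals_in_leaf, K):
    """Approximate optimal gamma for one leaf of the class-k tree.

    residuals_in_leaf holds r_ik = y_ik - p_k(x_i) for the samples falling
    into this leaf; K is the number of classes (Friedman's Newton step).
    """
    numerator = residuals_in_leaf.sum()
    denominator = np.sum(np.abs(residuals_in_leaf) * (1.0 - np.abs(residuals_in_leaf)))
    if denominator == 0.0:
        return 0.0
    return (K - 1.0) / K * numerator / denominator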

After tree t is built, the model estimate is updated with the tree's output:

F_t(x) = F_{t-1}(x) + \nu \sum_{j} \gamma_{jt} \, \mathbb{1}(x \in R_{jt})

where \nu is the learning rate; for classification the same update is applied to F_k(x) for each class k.

For classification, the class having the highest probability is given as the output.

The residual for the next tree, t + 1, is then recalculated against the updated model:

r_i^{(t+1)} = y_i - F_t(x_i) for regression, and r_{ik}^{(t+1)} = y_{ik} - p_k^{(t)}(x_i) for classification.

In this way trees are added until the maximum number of trees is reached or the residuals fall below some threshold ε.
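
Putting the pieces together, a minimal end-to-end GBM regressor built on scikit-learn's DecisionTreeRegressor (all hyperparameter values are illustrative, and the early-stop tolerance eps mirrors the threshold ε mentioned above):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_regressor(X, y, n_trees=100, learning_rate=0.1, max_depth=3, eps=1e-6):
    """Minimal gradient boosting for regression with squared-error loss."""
    F0 = y.mean()                                 # tree 0: root node only
    F = np.full(len(y), F0)
    trees = []
    for _ in range(n_trees):
        residual = y - F                          # negative gradient of the MSE loss
        if np.max(np.abs(residual)) < eps:        # residuals below threshold: stop
            break
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                     # fit a shallow tree to the residuals
        F = F + learning_rate * tree.predict(X)   # additive update F_t = F_{t-1} + nu * h_t
        trees.append(tree)
    return F0, trees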

Prediction

The final prediction of the boosted machine is the initial prediction plus the sum of the predictions of all the trees, each weighted by the learning rate:

\hat{y}(x) = F_0(x) + \nu \sum_{t=1}^{T} h_t(x)
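
As a usage sketch, the same prediction rule via scikit-learn's reference implementation (data and hyperparameter values are illustrative):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(200)

gbm = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
gbm.fit(X, y)
y_pred = gbm.predict(X)        # initial estimate plus learning-rate-scaled tree outputs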

References:

I have referred to some material presented by Josh Starmer in his Gradient Boost videos. The rest of the content is from the course material taught at school.
