Gradient Boosting

Krishna
6 min read · Feb 22, 2022


We are only as strong as we are united, as weak as we are divided.

— J.K. Rowling, Harry Potter and the Goblet of Fire

This quote explains boosting very well: each learner may be weak and small, but united and constantly learning, together they become strong. The same principle is applied in the boosting methodology.

Boosting is a methodology applied on top of an existing machine learning algorithm, most often on decision trees (DTs) because they produce the best results, although it can be used with other ML models as well. Boosting methods are also referred to as meta-learning.

People are often confused into thinking that boosting is an algorithm, but it is just a methodology; you can think of it as a catalyst for an experiment. Gradient boosting is one such boosting method that can be applied to a machine learning algorithm, most often to decision trees.

There are two types of boosting methods:

  1. Adaptive boosting
  2. Gradient boosting.

What is Gradient?

The gradient is nothing but the slope; it is a measure of the steepness of a line. In machine learning we use gradients to find the minimum possible error, and gradient boosting works on a similar idea.
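As a quick illustration of that idea, here is a minimal Python sketch (not part of the article's example; the target value and step size are made up) showing how repeatedly stepping against the gradient of a squared error drives a prediction toward the minimum-error value:

```python
# Minimal sketch: use the gradient (slope) of a squared error to find the
# prediction with the minimum possible error. Values are illustrative.
y = 10.0     # true value (made up for this demo)
p = 0.0      # initial prediction
step = 0.1   # step size

for _ in range(100):
    gradient = -2 * (y - p)   # derivative of (y - p)^2 with respect to p
    p -= step * gradient      # move against the slope

print(round(p, 4))  # ends up very close to 10.0, where the error is smallest
```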

Unlike Adaptive boosting (where small stumps are built and then connected in series), gradient boosting allows us to build larger trees, and the learning rate is the same for all the weak learners. Here each learner (a single DT) is trained on the residuals of the previous model.

How does it work?

Let's start with an example. Consider a dataset with three features, 'Mileage', 'Engine volume' and 'Year', from which we need to predict the Price; the five prices used in the example below are 22345, 20321, 16903, 18907 and 30212. (This is just a basic example; in reality we would need more features to predict the price.)

Step 1: Find the loss.

For a regression problem the loss function is the MSE, which is also called the cost function, so in the first step we are going to find the loss for the base model.

[Note: The base model is the initial model, built without considering any features; in general, its output will be the mean of all the target values.]

F(x) = argmin over P of (i = 1 to n)∑ L(y, P) ----> (1)

Here, P is the value predicted by the base model (in some books or websites it is denoted as gamma),
L is the loss function,
y is the true value.

This is the value the base model will predict when the training data is passed through it.

Loss function = 1/2 (y - yp)^2

Here, y is the true value and ‘yp’ is the predicted value.

Gradient boosting is all about minimizing this loss for each prediction, so to minimize the loss we take the derivative of the loss function with respect to yp.

Step 2: Find the derivative of the loss function.

dl/dyp = 2/2 (y - yp)(-1)
dl/dyp = yp - y

So, if we try to find out the loss derivative for each data row in the data set,

dl/dyp (first row) = yp - 22345
dl/dyp (second row) = yp - 20321
dl/dyp (third row) = yp - 16903
dl/dyp (fourth row) = yp - 18907
dl/dyp (fifth row) = yp - 30212

Step 3: Add them up to find the new outputs.

In equation (1), F(x) = argmin over P of (i = 1 to n)∑ L(y, P), the minimum is reached where the sum of these derivatives is zero, so

=> (yp - 22345) + (yp - 20321) + (yp - 16903) + (yp - 18907) + (yp - 30212) = 0

=> 5yp - 108688 = 0

5yp = 108688

yp = 21737.6

This will be the value predicted by the base model, which is simply the mean of the prices.
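As a sanity check, here is a small sketch (using only the five prices already quoted above) confirming that the value which makes the summed derivatives zero is just the mean of the targets:

```python
# The five true prices from the example rows above.
prices = [22345, 20321, 16903, 18907, 30212]

# Setting sum(yp - y_i) = 0 and solving for yp gives the mean of the prices.
yp = sum(prices) / len(prices)
print(yp)  # 21737.6 -> the base model's prediction
```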

Step 4: Calculate the Residuals.

Residuals are the differences between the actual values and the predicted values. They are used to assess the quality of the model; sometimes they are also referred to as errors.
R = yt - yp
Here, yt is the true value and yp is the predicted value.

Step 5: Update the residuals by multiplying them by some learning rate (Lr).

New residuals = Lr * R

Consider the Lr = 0.1

Here the 'learning rate' is the step size. (In gradient descent, to find the minimum cost we iteratively compute the partial derivatives of the cost with respect to the parameters 'm' and 'b' for the data and store them as a new gradient. The gradient tells us the slope of the cost function at the current position, and according to this value we update our parameters. Every update takes a step whose size is the learning rate: if the learning rate is too high, the model can cover the distance too quickly and overshoot the optimal value, whereas a learning rate that is too small can cause the process to get stuck.)
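Here is a minimal sketch of one such update, continuing the numbers above; since the article's feature table is not reproduced here, the weak learner's residual predictions are stand-ins (assumed perfect) purely to show the arithmetic:

```python
# One boosting update, continuing the worked example.
prices = [22345, 20321, 16903, 18907, 30212]   # true values
yp = sum(prices) / len(prices)                  # base prediction = 21737.6
lr = 0.1                                        # learning rate

# Step 4: residuals of the current model.
residuals = [y - yp for y in prices]

# In the real algorithm a small decision tree is fitted on the features to
# predict these residuals; here we assume it predicts them exactly (stand-in).
predicted_residuals = residuals

# Step 5: scale by the learning rate and add to the previous prediction.
new_predictions = [yp + lr * r for r in predicted_residuals]
print([round(p, 2) for p in new_predictions])
```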

This whole process is continued until the residual values reach a minimum.

To summarise, the following process is followed:

  1. The base model predicts the outputs.
  2. The residuals are calculated.
  3. The residuals are multiplied by a learning rate (ranging from 0 to 1).
  4. The base model's output and the residuals (after multiplying by the learning rate) are added.
  5. This value will be the new prediction for the next decision tree.

These steps are repeated for every new tree until the residuals reach a minimum; essentially, we are computing the slope at each prediction to decrease the loss. The same behaviour can be seen in gradient descent.
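Putting those steps together, here is a hedged from-scratch sketch of the whole loop using shallow scikit-learn decision trees as the weak learners; the feature values for Mileage, Engine volume and Year are stand-ins (the original table is not reproduced here), while the prices are the ones from the example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Stand-in features (Mileage, Engine volume, Year) -- illustrative values only.
X = np.array([[45000, 1.6, 2015],
              [60000, 1.4, 2014],
              [90000, 1.2, 2011],
              [75000, 1.6, 2012],
              [20000, 2.0, 2018]], dtype=float)
y = np.array([22345, 20321, 16903, 18907, 30212], dtype=float)  # prices from the example

lr = 0.1          # learning rate
n_trees = 100     # number of weak learners to add

# Step 1: the base model predicts the mean of the targets.
prediction = np.full_like(y, y.mean())

trees = []
for _ in range(n_trees):
    residuals = y - prediction                  # Steps 2-4: negative gradient of the loss
    tree = DecisionTreeRegressor(max_depth=2)   # a small tree (larger than a stump)
    tree.fit(X, residuals)                      # train the weak learner on the residuals
    prediction += lr * tree.predict(X)          # Step 5: scaled update of the prediction
    trees.append(tree)

print(np.round(prediction, 1))  # moves toward the true prices as trees are added
```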

As the residuals get smaller we move closer to the actual prices, so gradient boosting keeps trying to predict the next, smaller set of residuals. Remember that if the residuals get too close to the actual targets, the model is probably overfitting, so this is one issue that needs to be taken care of.

[Note: For classification, the loss function used is log loss. We can use other loss functions but log loss is widely used.]
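For classification the usage looks much the same; here is a minimal sketch with scikit-learn (the toy data is generated purely to make the example runnable, and the classifier's default objective is the log loss mentioned in the note):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy data, generated only to make the example runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# By default the classifier optimises a log-loss objective.
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(X, y)
print(clf.predict(X[:5]))
```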

Conclusion:

Gradient boosting is a very efficient methodology that can be applied to decision trees (it can be applied to any ML model, but it works best with DTs). There are, however, some issues with this method. One is that the trees are no longer stumps as in AdaBoost; they can grow larger, which increases computation time and makes the algorithm slow, especially on very large datasets. Another is overfitting: since the individual DTs can grow deep, after a series of trees there is a risk of overfitting, which can mislead the results. These are downsides to keep in mind while using gradient boosting. In comparison to AdaBoost, gradient boosting is generally more effective, as it tends to produce more accurate results.
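In practice, scikit-learn's implementation exposes the knobs mentioned above (tree depth, learning rate, number of trees); a hedged usage sketch with illustrative, untuned values:

```python
from sklearn.ensemble import GradientBoostingRegressor

# max_depth limits how large each tree grows (computation time and overfitting),
# learning_rate scales each tree's contribution, n_estimators sets how many
# weak learners are added, and subsample can further reduce overfitting.
model = GradientBoostingRegressor(n_estimators=200,
                                  learning_rate=0.05,
                                  max_depth=3,
                                  subsample=0.8)
# model.fit(X_train, y_train)     # X_train / y_train are placeholders
# preds = model.predict(X_test)
```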
