Introduction to Ensembling Techniques: Boosting

Madhu Ramiah
5 min read · Aug 5, 2019


In my previous blog, we talked about an ensembling technique called bagging. In this blog we will talk about another method: boosting. Boosting is essentially building a strong model from a weak learning algorithm.

What is a weak learner?

When an algorithm gives an accuracy of about 55% or 60%, only slightly better than a random guess at about 50%, we call it a weak learner. Just like any other algorithm, this weak learning algorithm will also have a loss function, which we can define as below.

Loss function
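For a binary classifier H on points (xᵢ, yᵢ) with labels yᵢ ∈ {-1, +1}, a common choice (and the one the rest of this derivation is consistent with) is the average misclassification loss:

L(H) = (1/n) Σᵢ 1[H(xᵢ) ≠ yᵢ]

where 1[·] is 1 when the prediction is wrong and 0 otherwise, and n is the number of training points.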

With boosting techniques, we will increase the accuracy to about 90% or 95%. Let’s look into how we can achieve this.

Weak learner example

A decision tree with just 1 level, called a decision stump, is a classic example of a weak learner. If the data set has 60% labels of class A and 40% labels of class B, a decision tree with just 1 level would divide the data based on the single feature that gives the best split. In this case, the accuracy of the model would be quite low, not much better than a random guess. These types of weak learners can be converted into strong learners using boosting algorithms.
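As a quick illustration, here is a minimal sketch of such a weak learner, assuming scikit-learn is available and using a synthetic data set with a roughly 60/40 class split:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly 60% of class A and 40% of class B
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.6, 0.4], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)  # a decision stump: a single split on one feature
stump.fit(X_train, y_train)
print("stump accuracy:", stump.score(X_test, y_test))  # typically only modestly better than chance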

Adaboost:

Adaboost stands for Adaptive Boosting. It was the first practical boosting algorithm and has proven to be very effective. Adaboost has 2 main steps.

  1. We need to consider a distribution of the data, i.e. a weighting (probability) assigned to the training points, which you can think of as a weighted subset of the existing data. In each iteration we consider a different distribution, deriving d(t+1) from d(t), where ‘t’ is the iteration number.
  2. Suppose we ran the algorithm for T iterations; at each iteration we generated a model H(1), H(2), …, H(T) on its respective data distribution d(1), d(2), …, d(T). Now we need to combine all these T models to give us one strong model. That is done using the formula below to compute the final model.
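In the standard Adaboost formulation, the final strong model is a weighted vote of the weak models, where 𝞪ₜ is the weight given to model Hₜ (how 𝞪ₜ is computed is covered in Step 2):

H(x) = sign( Σₜ 𝞪ₜHₜ(x) ), with the sum running over t = 1, …, T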

Step 1:

We need to redefine the above loss function because we are working with a different distribution of the data in each iteration. At the end of each iteration, we want the loss to be as small as possible.
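Writing Pₜ(i) for the probability (weight) that the distribution d(t) assigns to point i (notation introduced here for clarity), the redefined loss at iteration t is the weighted misclassification error:

Lₜ = Σᵢ Pₜ(i) · 1[Hₜ(xᵢ) ≠ yᵢ]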

We introduce the probability term here because, depending on the predicted values, we either increase or decrease a point’s probability accordingly: we give more weight to the points where errors occur. To calculate the probability at each iteration we use the below formula,
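In the standard Adaboost update, the probability of point i for the next iteration is

Pₜ₊₁(i) = Pₜ(i) · exp(-𝞪ₜyᵢHₜ(xᵢ)) / zₜ

where 𝞪ₜ is the model weight computed in Step 2 below.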

Here yᵢ is the actual value and Hₜ(xᵢ) is the predicted value at iteration ‘t’. If the predicted value is 1 and the actual is 1, or the predicted value is -1 and the actual value is -1, then yᵢHₜ(xᵢ) = 1, i.e. yᵢHₜ(xᵢ) > 0. If the point is not correctly predicted, then yᵢHₜ(xᵢ) < 0. The formula for zₜ is below (it acts as a normalization constant),
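zₜ is simply the constant that makes the updated probabilities sum to 1:

zₜ = Σᵢ Pₜ(i) · exp(-𝞪ₜyᵢHₜ(xᵢ))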

There are 2 possible cases,

  1. In the case where a point is correctly classified, yᵢHₜ(xᵢ) > 0 and, since 𝞪 > 0, the whole term 𝞪yᵢHₜ(xᵢ) > 0, so exp(-𝞪yᵢHₜ(xᵢ)) is less than 1 and the probability value shrinks, meaning for correct classifications we give less weight
  2. In the case where a point is incorrectly classified, yᵢHₜ(xᵢ) < 0, so the whole term 𝞪yᵢHₜ(xᵢ) < 0 and exp(-𝞪yᵢHₜ(xᵢ)) is greater than 1, so the probability value grows, meaning for incorrect classifications we give more weight

Step 2:

Now, we need to calculate 𝞪. This is given by the formula below.
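In the standard Adaboost derivation this is

𝞪ₜ = ½ · ln( (1 - 𝝐ₜ) / 𝝐ₜ )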

where 𝝐ₜ is the error at time ‘t’.

How is 𝝐ₜ calculated?

It is calculated using the below formula,
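𝝐ₜ is the total probability mass of the misclassified points under the current distribution:

𝝐ₜ = Σᵢ Pₜ(i) · 1[Hₜ(xᵢ) ≠ yᵢ]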

We sum the probabilities of only those points where the prediction is incorrect. Since even a weak learner does better than random guessing, this total error should be less than 50%. When 𝝐ₜ < 0.5, the value of 𝞪ₜ > 0.

This is how every iteration in Adaboost proceeds, finally giving us a strong model. It has proven to be effective even for regression problems.
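To make the two steps concrete, here is a minimal from-scratch sketch, assuming scikit-learn decision stumps as the weak learners and labels in {-1, +1}; the function names adaboost_fit and adaboost_predict are my own:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Minimal Adaboost sketch. y must contain labels -1 and +1."""
    y = np.asarray(y)
    n = len(y)
    P = np.full(n, 1.0 / n)                       # distribution d(1): uniform probabilities
    models, alphas = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=P)          # weak learner trained on distribution d(t)
        pred = stump.predict(X)
        eps = P[pred != y].sum()                  # epsilon_t: probabilities of misclassified points
        if eps >= 0.5:                            # weak learner no better than random, stop
            break
        eps = max(eps, 1e-10)                     # avoid division by zero when eps is 0
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t
        P = P * np.exp(-alpha * y * pred)         # raise weight of errors, lower correct points
        P = P / P.sum()                           # divide by z_t so probabilities sum to 1
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Final strong model: sign of the alpha-weighted vote of the weak models
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))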

XGBoost: Gradient Boosting

This boosting technique is commonly used for regression models. The name comes from the fact that it uses partial derivatives, as in gradient descent. In any regression model, we have some function ‘F’ that is used for predicting the output variable. This function may be sigmoid, linear, quadratic, etc. For the gradient boosting model, we start from a weak learner and follow the steps below.

The steps are:

  1. Initialize a crude function F₀ (for example, a constant prediction)
  2. For t = 0 to T-1 (where T is the number of trees), repeat steps 3-6:
  3. Calculate the loss L(yᵢ,Fₜ(xᵢ)) between the actual values and the current predictions
  4. Compute the negative gradients, -∂L(yᵢ,Fₜ(xᵢ))/∂Fₜ(xᵢ) (for squared loss, these are the residuals, i.e. actual minus predicted)
  5. Fit a weak model hₜ₊₁ on these negative gradients as the new target values
  6. Compute the next function Fₜ₊₁ = Fₜ + λₜhₜ₊₁
  7. The function obtained after the last iteration is the final model

In short, at every iteration the model adds an incremental model that fits the negative gradient of the loss. This incremental model can be a linear regression or a decision tree. Similarly, the loss function can be any differentiable loss function.
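Here is a minimal sketch of the loop above, assuming squared loss, regression trees as the incremental models, and a constant learning rate standing in for λₜ; the function names are my own:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, T=100, lam=0.1):
    """Minimal gradient boosting sketch for squared loss."""
    y = np.asarray(y, dtype=float)
    F0 = y.mean()                       # crude initial function F0: predict the mean
    F = np.full(len(y), F0)
    trees = []
    for t in range(T):
        neg_grad = y - F                # negative gradient of squared loss = residuals
        h = DecisionTreeRegressor(max_depth=3)
        h.fit(X, neg_grad)              # fit the incremental model on the residuals
        F = F + lam * h.predict(X)      # F_{t+1} = F_t + lambda * h_{t+1}
        trees.append(h)
    return F0, trees

def gradient_boost_predict(F0, trees, X, lam=0.1):
    # Sum the initial prediction and the scaled contributions of all trees
    return F0 + lam * sum(h.predict(X) for h in trees)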

Conclusion:

Boosting algorithms have proven to be very effective at boosting weak learners in both classification and regression problems. They are also far less complex than neural network models. XGBoost has been used widely in many applications. If you check the Kaggle leaderboards, you will find that many of the top entries used some boosting technique to get the best results.

Thanks for reading my blog! Post any questions in the comments or reach out to me via LinkedIn
