Machine Learning Ensembling Techniques: Bagging

Madhu Ramiah
4 min read · May 31, 2019


Many of us have come across the name Random Forest while reading about machine learning techniques. It is one of the most popular machine learning algorithms, and it relies on an ensemble technique called bagging. In this blog we will discuss what ensemble methods are, what bagging is, why bagging is beneficial, what Random Forest is, and its advantages.

What are ensemble methods?

In ensemble methods, we build a number of ML models such as Logistic Regression, SVM, KNN, Decision Trees, etc., and combine these models to produce the final prediction.

There are certain rules that we need to follow while creating an ensemble model:

  1. Diversity- All the models we create should be diverse and independent of each other. Each model can be built on different features, but all of them should be independent.
  2. Acceptability- Every model should be acceptable, i.e., it should perform reasonably well on its own. We can ensure this by evaluating each model against a random model and checking that ours performs better.

How do we combine different models together?

If we consider 100 models M1, M2, M3, ..., M100, then for classification we go by a majority vote. Consider a binary classification model: if 65 of the models predict a data point as 1, that is the majority, so we go ahead with 1 as the predicted output. For a regression model, we take the mean (or median) of the predictions of all 100 models and use that as the predicted output.
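
As a small illustration, here is how majority voting and averaging can be done with NumPy (the prediction arrays below are made up, standing in for the outputs of already-trained models):

    import numpy as np

    # Predictions of 5 classifiers on 4 data points (binary classification)
    clf_preds = np.array([
        [1, 0, 1, 1],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
        [1, 0, 1, 0],
        [1, 0, 1, 1],
    ])

    # Majority vote: predict 1 wherever more than half of the models say 1
    majority = (clf_preds.sum(axis=0) > clf_preds.shape[0] / 2).astype(int)
    print(majority)                       # [1 0 1 1]

    # Predictions of 3 regression models on 3 data points
    reg_preds = np.array([
        [2.1, 3.0, 5.2],
        [1.9, 3.4, 4.8],
        [2.0, 2.9, 5.0],
    ])
    print(reg_preds.mean(axis=0))         # mean prediction per data point
    print(np.median(reg_preds, axis=0))   # median is a more robust alternative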

How will an ensemble model perform better than any individual model?

  1. In an ensemble model we ensure that all the models M1, M2, M3, ..., M100 are acceptable, which means each of them has an accuracy greater than 50%. If the models are also reasonably independent, the probability that the majority vote is correct is higher than the accuracy of any individual model, so the ensemble performs better (a rough calculation follows this list).
  2. It also reduces the possibility of overfitting. Even if one model overfits, the chance that the other models forming the majority overfit on the same points in the same way is very small.
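
As a rough back-of-the-envelope check of point 1 (assuming, purely for illustration, 100 fully independent models that are each correct 60% of the time), the majority vote is right far more often than any single model. In practice models are never fully independent, so the gain is smaller, but the direction of the effect is the same:

    from math import comb

    n, p = 100, 0.6      # 100 independent models, each 60% accurate

    # P(majority vote is correct) = P(at least 51 of the 100 models are correct)
    p_majority = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(51, n + 1))
    print(f"Single model accuracy:           {p:.2f}")
    print(f"Majority-vote ensemble accuracy: {p_majority:.3f}")   # ~0.97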

What is bagging?

Bagging comes from the words Bootstrap + AGGregatING. We have 3 steps in this process.

  1. Bootstrapping- We take 't' samples from our original training data, called D1, D2, ..., Dt, using row sampling with replacement (the same row, say row 2, can appear in more than one sample, and the samples are drawn independently of each other).
  2. Once we have those samples, we build a separate classification model on each of them: C1, C2, ..., Ct.
  3. Aggregating- We combine these classifiers by taking a majority vote for classification and the mean (or median) for regression problems (a minimal sketch follows the figure below).
[Figure: Bagging]
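
Below is a minimal sketch of these three steps with scikit-learn decision trees on a toy dataset (the dataset and the number of samples 't' are illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    n = X.shape[0]
    rng = np.random.default_rng(0)

    # Steps 1 and 2: draw t bootstrap samples and fit a tree on each
    t = 25
    trees = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)          # row sampling with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

    # Step 3: aggregate by majority vote across the t trees
    all_preds = np.array([tree.predict(X) for tree in trees])   # shape (t, n)
    bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
    print("Accuracy of the bagged ensemble:", (bagged_pred == y).mean())

scikit-learn also ships a ready-made BaggingClassifier (and BaggingRegressor) that wraps exactly these steps.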

How does bagging help with over fitting?

Suppose 'k' rows in the training data contribute to overfitting. When we split the training data into 't' bootstrapped data sets, the likelihood that all 'k' of those rows end up in all 't' data sets is very small. So it is unlikely that the combined classifier will be overfitted.
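
To make "very small" a little more concrete (with illustrative values for n and t): the chance that one specific row appears in a single bootstrap sample of size n is 1 - (1 - 1/n)^n, which is about 63.2% for large n, so the chance that it appears in all t samples falls off very quickly:

    n, t = 1000, 25
    p_in_one_sample = 1 - (1 - 1 / n) ** n
    print(f"P(row appears in one bootstrap sample): {p_in_one_sample:.3f}")        # ~0.632
    print(f"P(row appears in all {t} samples):      {p_in_one_sample ** t:.2e}")   # ~1e-05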

What is Random Forest?

Random Forest uses the bagging technique that we discussed earlier, with one difference: all the models M1, M2, ..., M100 are decision trees. We might now think that there is no 'diversity' among the models. This is how that problem is handled:

  1. In plain bagging we rely only on row sampling to achieve diversity, but here we sample a subset of the features (column sampling) as well.
  2. If our training data has 'n' rows and 'd' columns, then each of the bagged data sets has m < n rows and k < d columns, so each data set has size m*k. This makes all our models diverse from each other (see the sketch below).
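
Here is a minimal sketch of a Random Forest in scikit-learn on a toy dataset. One detail to note: scikit-learn samples the columns at every split (via max_features) rather than once per data set, but the idea of combining row and column sampling for diversity is the same. The data and parameter values are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=100,      # t trees
        max_features="sqrt",   # column sampling: k < d features per split
        max_samples=0.8,       # row sampling: m < n rows per tree (bootstrap)
        n_jobs=-1,             # trees are independent, so they train in parallel
        random_state=0,
    )
    rf.fit(X_train, y_train)
    print("Test accuracy:", rf.score(X_test, y_test))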

Advantages of Random Forest:

  1. All the trees in a Random Forest are diverse and independent of each other.
  2. It is very stable, because the final prediction is a majority vote that combines the results of all those trees.
  3. It is much less prone to overfitting than a single decision tree.
  4. It handles the curse of dimensionality well- since each tree is trained on only a subset of rows and columns, the feature space each tree sees is reduced considerably.
  5. It is parallelizable- since all the trees are independent of each other, each one can be trained separately, so training can run in parallel.
  6. Out of Bag Points- If we have 'n' rows and 'm' of them are part of one data set, then the remaining 'n-m' rows are out-of-bag points for that tree. We can use them as a validation set and need not create a new one (see the sketch after this list).
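
A minimal sketch of using the out-of-bag points as a free validation estimate in scikit-learn (toy data, illustrative parameters):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # oob_score=True evaluates each tree on the rows it never saw during training
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)
    print("Out-of-bag accuracy estimate:", rf.oob_score_)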

Run Time:

  1. Run time depends on 3 things- the number of trees (t), the depth of each tree (d) and the number of rows used for each tree (m)
  2. Training time grows roughly as the product of these factors, i.e., on the order of O(t * m * d), with an additional factor for the number of features evaluated at each split
