Introduction
Tree-based machine learning models are effective and widely used across many fields, and people often prefer random forests over other classification or regression models depending on the problem they want to solve. One thing to keep in mind before building an ML model is that every model has its own purpose; we cannot throw just any model at a problem, because some models perform well on training data but fail to perform as expected at test time. So, how do we decide which model to build for a particular problem?
- One way is to try out different models, compare their results, and pick one based on performance. This might sound convincing, but it is very time-consuming, and, as mentioned earlier, every model has its own purpose, so comparing them all is rarely practical. Unless the goal is to understand the math behind the models, we should not take this approach.
- The second way is to use the bagging technique, which ensembles a set of similar ML models and produces a solution trained across all of them. This raises two questions: what exactly does bagging do to achieve this, and how do we implement it to solve our problem?
Everyone who starts learning ML comes across the term bagging. The name is fairly self-explanatory: it bags together a set of models to produce a solution that is better than the one provided by a single ML model. So, let's understand how bagging works and why it was introduced.
Decision Tree
You might be wondering why we are talking about decision trees. A DT by itself does not use the bagging technique, but the motivation for bagging came from DTs. DTs are often prone to overfitting, which means they end up with high variance and low bias. If you are not familiar with these terms, here is a brief explanation.
Overfitting
Picture a model that tries to fit every single data point in the training data. You may ask, what is wrong with fitting all the data points? Imagine a new data point arrives tomorrow: the model would give a very high error on it, resulting in bad performance, and the testing error would be very high. This is what is called high variance. While building DTs, if the tree depth is too large, the model tries to fit every data point, which causes this overfitting issue.
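To make this concrete, here is a minimal sketch of mine (not from the original discussion, and assuming scikit-learn is available) that compares an unrestricted decision tree with a depth-limited one on synthetic data; the unrestricted tree typically fits the training set almost perfectly while scoring worse on held-out data.

```python
# Sketch: an unrestricted decision tree vs. a depth-limited one.
# The deep tree tends to memorize the training data (near-perfect train
# accuracy) while the shallow tree usually generalizes more evenly.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

deep = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

print("deep    train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("shallow train/test:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```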
Bagging was introduced to avoid exactly this issue. So what does bagging actually do to overcome it?
Imagine one of your friends runs a restaurant and wants to classify reviews as good or bad, so that they can analyze where mistakes are happening and improve upcoming sales. They contacted you, and you chose to build a DT. Since reviews keep coming in continuously, the model is fed new customer ratings every day. Initially the model performed fine, and based on the reviews classified as bad, your friend had a chance to correct the issues. As time passed, however, the model started to classify reviews wrongly: even when the ratings provided by customers were good, the model predicted them as bad reviews. Your friend wanted to fix this and contacted you again.

After analyzing the problem, you chose the bagging technique. You ensembled 3–4 DTs, provided each DT with randomly selected data (ratings) sampled with replacement, and let each DT make a prediction. You then took a vote over the predictions and chose the class that occurred the most times. You kept testing this approach by increasing and decreasing the number of DTs in the ensemble, and once you settled on the right number of DTs, you locked it in and deployed that model to solve your friend's problem.
This is exactly what bagging means: we use a certain number of models to address the classification problem, and because each model is trained on a randomly selected sample of the data, the ensemble is far less prone to overfitting, which solves the problem and also improves overall performance. One model that uses the bagging technique is the random forest, which can be used for classification as well as regression.
The process of randomly selecting rows from your dataset with replacement is called bootstrapping, and the process of combining the individual predictions and selecting the class with the most votes is called aggregation. (Note: selecting rows with replacement is discussed further in the OOB section.)
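As a rough sketch of what bootstrapping plus aggregation look like in code (my own illustration, not from the original article; it assumes NumPy arrays, scikit-learn, and non-negative integer class labels), each tree is fit on rows drawn with replacement and the final prediction is a majority vote:

```python
# Sketch: bagging by hand — bootstrap each tree's training rows,
# then aggregate the per-tree predictions by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_estimators=5, seed=0):
    rng = np.random.default_rng(seed)
    n_rows = len(X_train)
    per_tree_preds = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n_rows, size=n_rows)    # bootstrapping: rows with replacement
        tree = DecisionTreeClassifier(random_state=seed).fit(X_train[idx], y_train[idx])
        per_tree_preds.append(tree.predict(X_test))
    per_tree_preds = np.array(per_tree_preds)         # shape: (n_estimators, n_test_rows)
    # aggregation: for each test row, pick the class predicted most often
    majority = lambda col: np.bincount(col).argmax()  # assumes non-negative integer labels
    return np.apply_along_axis(majority, axis=0, arr=per_tree_preds)
```

In practice, scikit-learn's BaggingClassifier and RandomForestClassifier handle this bootstrapping and voting for you.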
How to select the number of estimators (estimators are the individual models):
If we keep adding decision trees, at some point the results they produce start to correlate. This means that if we have selected 120 estimators, the output of the first estimator and the output of, say, the 90th or 91st tree will effectively be the same (note: these numbers are just for illustration). To see this, let's plot a graph between the number of estimators and the error produced.
In such a plot, we can observe that at some point the error levels off, meaning it no longer changes as more trees are added. Based on this plot, we can select the number of estimators.
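One way to produce such a curve (a sketch of mine, assuming scikit-learn, matplotlib, and a synthetic dataset) is to train forests of increasing size and track the held-out error:

```python
# Sketch: error vs. number of estimators, to spot where the curve levels off.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

sizes = range(10, 201, 10)
errors = []
for n in sizes:
    forest = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
    errors.append(1 - forest.score(X_test, y_test))  # test error for this forest size

plt.plot(list(sizes), errors)
plt.xlabel("Number of estimators")
plt.ylabel("Test error")
plt.show()
```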
So far we have discussed only the first reason to use a random forest over a DT; there is a second reason too, namely feature selection. DTs perform automatic feature selection, meaning there is no need to hand-pick the important features in the dataset for classification; the DT is capable of selecting them on its own. But this comes with a disadvantage: missing out on some features. For instance, suppose we are working on a classification problem, use a DT, and, knowing the DT can select the important features on its own, skip feature selection beforehand. We build the model, and if in the near future a feature the model ignored starts impacting the classification, it becomes a problem. Because a bagging-based model like a random forest considers a different random subset of features at each split, this risk is reduced.
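If you want to see which features a bagging-based model actually relied on, a quick sketch (my own, assuming scikit-learn) is to inspect feature_importances_ after fitting a random forest:

```python
# Sketch: inspect which features the forest found important.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=1)

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
for i, importance in enumerate(forest.feature_importances_):
    print(f"feature {i}: importance {importance:.3f}")
```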
How to select the features?
If we have N features, there are generally two common starting points for how many features to consider:
- sqrt(N)
- N/3
Note that these are just starting points. For example, if there are 9 features in our dataset, sqrt(9) gives 3, so 3 is the minimum number of features to consider per split; depending on the number of estimators and using domain knowledge, we can increase this number.
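In scikit-learn terms (a sketch under the assumption that you are using RandomForestClassifier/RandomForestRegressor), these starting points map onto the max_features parameter; sqrt(N) is the usual default for classification forests, and roughly N/3 is a common choice for regression:

```python
# Sketch: the two rules of thumb expressed via max_features.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

n_features = 9  # N in the example above

clf = RandomForestClassifier(max_features="sqrt")          # sqrt(9) = 3 features per split
reg = RandomForestRegressor(max_features=n_features // 3)  # 9 // 3 = 3 features per split
```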
OOB (out-of-bag error):
People who have just started learning and implementing the random forest model are often confused by OOB. Out-of-bag error is a very simple yet powerful metric for understanding our model's performance. So let's understand what it means and how to use it.
We discussed that during bootstrapping, rows are randomly selected with replacement. So what does that mean?
For simplicity, consider a dataset of 25 rows and 6 columns, one of which is the target column. If we apply a bagging approach to this dataset for classification, the following happens:
Call the original dataset of 25 rows D. From these 25 rows, 5 rows are randomly selected with replacement, say rows 1, 2, 2, 3, 4, and assigned to DT1; next we select rows 4, 6, 7, 7, 8 and assign them to DT2. In this way we create 5 DTs, each given rows sampled with replacement; this is the bootstrapping process we already discussed, and "with replacement" is why a row such as 2 or 7 can appear twice in the same sample.

By the end, certain rows are never assigned to any DT, which means the model has never seen that data (note: this is done on the training data). Here is where OOB comes in: we calculate the error on these leftover rows and validate our model before the testing data is ever shown to it. This is called the out-of-bag error, and it lets us validate the model before providing testing data and keep improving its performance until the error cannot be reduced further. This is a great advantage of using random forests on our data.
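In practice you rarely compute this by hand. As a sketch (assuming scikit-learn), setting oob_score=True makes the forest evaluate each tree on the rows it never saw during bootstrapping and expose the combined result:

```python
# Sketch: out-of-bag validation without a separate hold-out set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

print("OOB score:", forest.oob_score_)        # accuracy on out-of-bag rows
print("OOB error:", 1 - forest.oob_score_)
```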
Conclusion:
Because RF comes with so many advantages over other classification models, ML engineers are often tempted to reach for it by default. But, as discussed, before applying any ML model we need to visualize our data and decide which model fits best, because every model has its advantages and disadvantages. Just because RF solves the problem of overfitting does not mean we can apply it to all kinds of data. If we do so without visualizing and analyzing the data, it will cost us in computational time, because bagging-based models take comparatively more computation than other classification models.